Do you make business decisions based on spreadsheets or siloed databases with non-standardized structures and formats? Do you see inconsistency in data across business units? How do you decide permissions and access levels to restricted data?
In this blog, we provide information about what a data warehouse is, what you may be missing if you don’t have one, and three questions to ask yourself when making the decision to invest in a data warehouse. And if this isn’t what you need, we provide alternatives to the traditional warehouse.
Businesses have applications that process and store thousands, even millions of transactions each day. The ability to create, retrieve, update, and delete this data is made possible by databases, also referred to as online transaction processing systems (OLTP). While these databases have traditionally been “relational” (SQL Server, Oracle, MySQL, DB2 etc.), in recent times “non-relational databases” (Cassandra, MongoDB, Redis etc.) or files systems (like Hadoop) have been adopted as an alternative for storing raw data.
A data warehouse, also commonly known as an online analytical processing system (OLAP), is a repository of data that is extracted, transformed, and loaded from one or more operational source systems and modeled to enable data analysis and reporting. There many types of data warehouses but these are the three most common:
The data stored in warehouses (OLAP) is de-normalized which allows easy aggregation, summarization, and data drill-down. Second, data warehouses enable business users to gain insights into what happened, why it happened, what will happen, and what to do about it.
On the other hand, databases (OLTP), are single applications built to quickly record specific business processes, like credit card transactions. Also, unlike the de-normalized nature of data warehouses, the data structure for databases is highly normalized to facilitate data atomicity, consistency isolation, and durability. Due to the complexity in writing queries for analysis in such applications, developers or subject matter experts are most often required for support.
Since data volumes are growing exponentially, a data warehouse becomes critical, and considerations should be made on the hardware that stores, processes, and provides a medium of data movement. Data warehouses can be stored on-premise, in the cloud, or a mixture of the two environments. Your decision may depend on requirements to keep organization-mission critical applications on premises. If you are looking into cloud solutions, take into consideration industrial regulations, security, visibility, accessibility, latency, and trustworthiness of the cloud providers. (Some top cloud providers include Amazon Web Services, Google Cloud Platform, Microsoft Azure, Oracle Cloud, Rackspace, Verizon Cloud, and VMware.)
The diagram below shows a high-level architecture of an end-to-end data warehouse solution. (click to enlarge)
The Gather layer consists of various data silos – these can include ERP systems, CRM, Excel spreadsheets, even Access databases housing corporate or divisional data. The storage subsystems used by these applications are typically not structured for easy querying or navigation (if direct access is available at all). Native reporting may be possible in some of these applications, however functionality is typically very limited and reporting is limited to only the data within the single system.
The Clean layer applies business logic and other calculations to data that is ultimately going to be made available in the Store layer (or data warehouse). Business logic can include custom KPI’s, business-specific calculations or rulesets, data hierarchies, or new derived columns that otherwise are not available from any source system. Data typically only temporarily exists in the Clean layer – this layer exists only to create these custom values and pass through to the data warehouse, and end user reporting or querying against the Clean layer is not allowed.
The Store layer represents the denormalized data warehouse that is described further throughout this blog post. While there are several design models, the Kimball approach is a leading design through which information is organized into dimension and fact tables and joined in star schemas for ease of use.
The Share layer isn’t a formal layer as much as a representation of all of the various ways data which resides in the data warehouse can be consumed. These uses include: querying via a BI reporting / dashboarding tool, direct SQL querying, or even automated extracts to feed other, unrelated systems.
Data warehouses will help you make better, more informed decisions for many reasons:
Let’s look at an example scenario: Company XYZ has three systems that are used to track leads as they move through the sales process to become clients:
A clear set of guidelines does not exist to govern what type of data should exist in each of the three systems and when data should move from application 1 to application 2. Because of the lack of consistency and rules, Company XYZ faces the following issues:
A data warehouse can be implemented to gather, clean, store, and share information and lessen the burden felt by the client services staff.
Take a closer look at how information is stored and shared across your enterprise. Inconsistencies in data and reporting, difficulty sharing data, and multiple data sources are all signs that a data warehouse might be the business solution for you.
Here are three basic questions to ask yourself if you are still considering:
1.) Do you store data in various source systems?
Gathering data that is structurally different from operational databases, flat files, and legacy systems can be challenging for many organizations. How can you integrate data from disparate systems with different structures? Sometimes it’s done by spending days and weeks pulling data to create reports. Also, operational systems are not modelled for analytical queries and most often contain data specific to a business area. The lack of integration across multiple business area might not give a full picture of the health of an organization. In addition, luck of human capital with the relevant or advanced skills and permissions to query data applications can be an impediment when dealing with various source systems.
A data warehouse can make this process efficient and automatic by extracting, transforming, and loading data from various source system in a standardized and consistent manner. The ability to access such data from a central location not only allows users to make quick and better business decisions, but also save time that could’ve been wasted trying to retrieve data from multiple information systems.
2.) Are you experiencing performance issues by reporting against operational systems?
Operational systems have volatile data that changes often. Consequently, running reports directly against such systems with almost real-time data can cause performance problems, and insights gathered might be inconsistent. A data warehouse can solve this problem, because they are usually optimized for read access, resulting in faster report generation.
3.) Do you have a single source of the truth?
Reporting on data that is stored and formatted differently across siloed enterprise information systems results in inconsistency across departments. Well-built data warehouses improve data quality by cleaning up data as it is imported, thus providing more accurate data. This means that one version of the truth can be provided for every department across the enterprise, providing consistency and assurance that each department is using the same data.
Sometimes data warehouses cannot solve all business problems due to their inherent dependence on the relational data structures. The adoption of new data sources, such as social media, IoT, logs, video, and audio has resulted in rapid changes in both content and volume. The down side of this has been the lack of internal check-points of data ownership which makes it difficult to apply data governance principles accustomed to traditional data warehouse projects.
As an alternative to the challenges brought about by the new ways of storing data, organizations have adopted emerging technologies such as data lakes, data virtualization, non-relational databases and perhaps polyglot persistence. Let’s have a look at what they offer:
A data lake is collection of unstructured, semi-structured, and structured data, copied from one or more source systems (technology independent). The data stored is an exact replica of the source. The goal is to make the raw data consumable by highly skilled analyst within an enterprise for future needs that are not known at the time of data capture.
The key difference in comparison to the data warehouse is that the data is not modeled to a predetermined schema of facts and dimensions. It is the lack of structure that empowers developers, analysts, or data scientists to create exploratory models, queries, and applications that can be refined endlessly on the fly. Here are three characteristics of data lakes:
Self-Service BI is an approach that gives freedom and responsibility to business users to create reports without relying on IT. Sometimes data warehouses lack the agility to scale to meet the need of quickly evolving companies. Self-service solutions allow companies to be nimble by giving departments access to data and information on demand. All skill levels can typically use these types of solutions:
Some of the most popular self-service BI tools available on the market are Qlik, Tableau, Power BI, Sisense, Ateryx, and Birst. See the tools we work with the most.
NoSQL is an architectural design approach to databases that does not rely on traditional representation of data. Relational database management systems organize data in tables, columns, rows, or schemas for CRUD (create, read, update and delete) operations. In comparison, NoSQL databases do not rely on relational structures, but more flexible data models that offer speed, scalability, and flexibly. These are key characteristics needed to deal with “Big Data”.
There are various types of NoSQL databases available on the market today and they fall into four main categories:
Whether your data resides in spreadsheets, various operational systems, “data landfills”, or you already have an enterprise data warehouse, Analytics8 can help you transform it into consistent and useful information so you can make better, more informed decisions. Read about our data warehousing services here.
Read about our data warehousing work with Guggenheim.