Data Warehouse Basics – Do You Need One?

Learn More About Data Warehousing

Do you make business decisions based on spreadsheets or siloed databases with non-standardized structures and formats? Do you see inconsistency in data across business units? How do you decide permissions and access levels to restricted data?

In this blog, we provide information about what a data warehouse is, what you may be missing if you don’t have one, and three questions to ask yourself when making the decision to invest in a data warehouse. And if this isn’t what you need, we provide alternatives to the traditional warehouse.

What is a Data Warehouse?

Businesses have applications that process and store thousands, even millions of transactions each day. The ability to create, retrieve, update, and delete this data is made possible by databases, also referred to as online transaction processing systems (OLTP). While these databases have traditionally been “relational” (SQL Server, Oracle, MySQL, DB2 etc.), in recent times “non-relational databases” (Cassandra, MongoDB, Redis etc.) or files systems (like Hadoop) have been adopted as an alternative for storing raw data.

A data warehouse, also commonly known as an online analytical processing system (OLAP), is a repository of data that is extracted, transformed, and loaded from one or more operational source systems and modeled to enable data analysis and reporting. There many types of data warehouses but these are the three most common:

  1. Enterprise data warehouse – Provides a central repository tailored for support decision-making for the entire enterprise
  2. Operational Data Store – Similar to the enterprise warehouse in terms of scope, but data is refreshed in near real time and can be used for operational reporting
  3. Data Mart – This is a subset of a data warehouse used to support a specific region, business unit or function area (i.e. Sales)

What are key differences between OLTP and OLAP systems?

The data stored in warehouses (OLAP) is de-normalized which allows easy aggregation, summarization, and data drill-down. Second, data warehouses enable business users to gain insights into what happened, why it happened, what will happen, and what to do about it.

On the other hand, databases (OLTP), are single applications built to quickly record specific business processes, like credit card transactions. Also, unlike the de-normalized nature of data warehouses, the data structure for databases is highly normalized to facilitate data atomicity, consistency isolation, and durability. Due to the complexity in writing queries for analysis in such applications, developers or subject matter experts are most often required for support.

Where are Data Warehouses Stored?

Since data volumes are growing exponentially, a data warehouse becomes critical, and considerations should be made on the hardware that stores, processes, and provides a medium of data movement. Data warehouses can be stored on-premise, in the cloud, or a mixture of the two environments. Your decision may depend on requirements to keep organization-mission critical applications on premises. If you are looking into cloud solutions, take into consideration industrial regulations, security, visibility, accessibility, latency, and trustworthiness of the cloud providers. (Some top cloud providers include Amazon Web Services, Google Cloud Platform, Microsoft Azure, Oracle Cloud, Rackspace, Verizon Cloud, and VMware.)

So what does a typical data warehouse implementation look like?

The diagram below shows a high-level architecture of an end-to-end data warehouse solution. (click to enlarge)

Gather

The Gather layer consists of various data silos – these can include ERP systems, CRM, Excel spreadsheets, even Access databases housing corporate or divisional data. The storage subsystems used by these applications are typically not structured for easy querying or navigation (if direct access is available at all).  Native reporting may be possible in some of these applications, however functionality is typically very limited and reporting is limited to only the data within the single system.

Clean

The Clean layer applies business logic and other calculations to data that is ultimately going to be made available in the Store layer (or data warehouse). Business logic can include custom KPI’s, business-specific calculations or rulesets, data hierarchies, or new derived columns that otherwise are not available from any source system.  Data typically only temporarily exists in the Clean layer – this layer exists only to create these custom values and pass through to the data warehouse, and end user reporting or querying against the Clean layer is not allowed.

Store

The Store layer represents the denormalized data warehouse that is described further throughout this blog post. While there are several design models, the Kimball approach is a leading design through which information is organized into dimension and fact tables and joined in star schemas for ease of use.

Share

The Share layer isn’t a formal layer as much as a representation of all of the various ways data which resides in the data warehouse can be consumed. These uses include: querying via a BI reporting / dashboarding tool, direct SQL querying, or even automated extracts to feed other, unrelated systems.

The Benefits of Having a Data Warehouse

Data warehouses will help you make better, more informed decisions for many reasons:

  • Improved business intelligence: When you integrate multiple sources, you make decisions based on ALL of your data.
  • Timely access to data: Quickly access critical data in one centralized location.
  • Enhanced data quality and consistency: Data throughout the organization is standardized and stored in the same format so all departments are making decisions based on uniform data.
  • Historical intelligence: Because a data warehouse stores large amounts of historical data, you can identify trends through year-over-year and month-over-month analysis.
  • Quick query response times: Most data warehouses are modeled, built, and optimized for read access, and that means fast report generation.
  • Data mining: Explore “Big Data” to predict future trends.
  • Security: A data warehouse makes provision easy by giving access to specific data to qualified end users, while excluding others.
  • Auditing: Data stored appropriately in a data warehouse provides a complete audit trail of exactly when data was loaded and from which source(s).
  • Analytical tool support: Analytical tools that offer drill-down ability work best when extracting data from a data warehouse.
  • Government regulation requirements: With a data warehouse, it is easier to comply with Sarbanes-Oxley and other related regulations than with some transactional systems.
  • Metadata creation: Descriptions of the data can be stored in the data warehouse so that users understand the data in the warehouse, making report creation much simpler.
  • Scalability: If you have volumes of historical data that need consolidation, a data warehouse makes for easy access in a common place, with the ability to scale in the future.
  • Real-time performance: A data warehouse can merge disparate data source with capabilities to preserve history as soon as the data is available.

What Could Happen if You Don’t Have a Data Warehouse

Let’s look at an example scenario: Company XYZ has three systems that are used to track leads as they move through the sales process to become clients:

  • Application 1, a web-based tool, is used to capture potential leads from website advertising.
  • Application 2 is used to capture leads for direct calling and is also used by client services for client management tasks.
  • Application 3 is used for client management and ongoing support of existing clients.

A clear set of guidelines does not exist to govern what type of data should exist in each of the three systems and when data should move from application 1 to application 2.  Because of the lack of consistency and rules, Company XYZ faces the following issues:

  • Time and money spent pursuing unqualified and marginal leads;
  • Inability to determine where to best spend marketing dollars;
  • Inconsistencies in contact data for existing customers;
  • Incomplete data feeds reporting;
  • Overextended call center and client services staff who must deal with bad data daily; and
  • Inefficient use of staff time correcting errors that could be better allocated to more profitable tasks.

A data warehouse can be implemented to gather, clean, store, and share information and lessen the burden felt by the client services staff.

3 Questions to Ask Yourself if Considering a Data Warehouse

Take a closer look at how information is stored and shared across your enterprise. Inconsistencies in data and reporting, difficulty sharing data, and multiple data sources are all signs that a data warehouse might be the business solution for you.

Here are three basic questions to ask yourself if you are still considering:

1.)  Do you store data in various source systems?

Gathering data that is structurally different from operational databases, flat files, and legacy systems can be challenging for many organizations. How can you integrate data from disparate systems with different structures? Sometimes it’s done by spending days and weeks pulling data to create reports. Also, operational systems are not modelled for analytical queries and most often contain data specific to a business area. The lack of integration across multiple business area might not give a full picture of the health of an organization. In addition, luck of human capital with the relevant or advanced skills and permissions to query data applications can be an impediment when dealing with various source systems.

A data warehouse can make this process efficient and automatic by extracting, transforming, and loading data from various source system in a standardized and consistent manner. The ability to access such data from a central location not only allows users to make quick and better business decisions, but also save time that could’ve been wasted trying to retrieve data from multiple information systems.

2.)  Are you experiencing performance issues by reporting against operational systems?

Operational systems have volatile data that changes often. Consequently, running reports directly against such systems with almost real-time data can cause performance problems, and insights gathered might be inconsistent.  A data warehouse can solve this problem, because they are usually optimized for read access, resulting in faster report generation.

3.)  Do you have a single source of the truth?

Reporting on data that is stored and formatted differently across siloed enterprise information systems results in inconsistency across departments. Well-built data warehouses improve data quality by cleaning up data as it is imported, thus providing more accurate data. This means that one version of the truth can be provided for every department across the enterprise, providing consistency and assurance that each department is using the same data.

Alternatives to a Traditional Data Warehouse

Sometimes data warehouses cannot solve all business problems due to their inherent dependence on the relational data structures. The adoption of new data sources, such as social media, IoT, logs, video, and audio has resulted in rapid changes in both content and volume. The down side of this has been the lack of internal check-points of data ownership which makes it difficult to apply data governance principles accustomed to traditional data warehouse projects.

As an alternative to the challenges brought about by the new ways of storing data, organizations have adopted emerging technologies such as data lakes, data virtualization, non-relational databases and perhaps polyglot persistence. Let’s have a look at what they offer:

Data Lakes

A data lake is collection of unstructured, semi-structured, and structured data, copied from one or more source systems (technology independent). The data stored is an exact replica of the source. The goal is to make the raw data consumable by highly skilled analyst within an enterprise for future needs that are not known at the time of data capture.

The key difference in comparison to the data warehouse is that the data is not modeled to a predetermined schema of facts and dimensions. It is the lack of structure that empowers developers, analysts, or data scientists to create exploratory models, queries, and applications that can be refined endlessly on the fly. Here are three characteristics of data lakes:

  1. All data is extracted and loaded from source system
  2. All data types are supported
  3. Data transformation and modelling is done for meet analysis requirements

Self-Service BI

Self-Service BI is an approach that gives freedom and responsibility to business users to create reports without relying on IT. Sometimes data warehouses lack the agility to scale to meet the need of quickly evolving companies. Self-service solutions allow companies to be nimble by giving departments access to data and information on demand. All skill levels can typically use these types of solutions:

  1. Casual users – possesses limited BI skillset and require simple requirements
  2. Power users – skilled BI users with the ability to analyze data and create new reports and dashboards from scratch
  3. Business analysts – advanced BI Skills in data exploration, modeling, and deployment of BI environments

Some of the most popular self-service BI tools available on the market are Qlik, Tableau, Power BI, Sisense, Ateryx, and Birst. See the tools we work with the most.

No-Relational (No SQL)

NoSQL is an architectural design approach to databases that does not rely on traditional representation of data. Relational database management systems organize data in tables, columns, rows, or schemas for CRUD (create, read, update and delete) operations. In comparison, NoSQL databases do not rely on relational structures, but more flexible data models that offer speed, scalability, and flexibly. These are key characteristics needed to deal with “Big Data”.

There are various types of NoSQL databases available on the market today and they fall into four main categories:

  1. Key-value databases – These databases use associations between elements as a data model such that a key is associated with only one value in a collection. Usually the value stored is a blob. This means that no one knows what the value of the data is until a key is provided as an identifier to provide access. Key-value database offerings include DynamoDB, Azure Table Storage, Riak, and Redis.
  2. Document databases – They store and retrieve documents in several formats, such as XML, JSON, and BSON. Popular providers of document stores are Elastic, MongoDB, CouchDB, Terrastore, RavenDB, and Azure DocumentDB.
  3. Wide-column databases – These databases stores data in tables with rows and columns much like traditional relational databases. However, the names and formats of columns can vary from row to row across the table. In other words, columns of related data are grouped together, enabling retrieval of related data in a single query operation. On the other hand, relational databases store rows in different places on disk that would require multiple disk I/O for data retrieval. Providers for such database include Hadoop, Cloudera, Cassandra, HBase, Amazon DynamoDB and Hypertable.
  4. Graph databases – These databases utilize graph structures to store, map, and query relationships between entities. Providers for such database include Sparksee, InfiniteGraph, New4J, and OrientDB.

Our Data Warehousing Services

Whether your data resides in spreadsheets, various operational systems, “data landfills”, or you already have an enterprise data warehouse, Analytics8 can help you transform it into consistent and useful information so you can make better, more informed decisions.  Read about our data warehousing services here.

Read about our data warehousing work with Guggenheim.

Renson is a Business Intelligence Professional with more than 15 years of experience in full SDLC projects in marketing/branding, manufacturing, supply chain, banking, healthcare, and insurance industries. He enjoys playing and watching soccer.

 

Contact Us

National Office Telephone | Mon-Fri 8:30am-5:30pm CT