The data lakehouse was created to combine the benefits of a data warehouse and a data lake. Learn about data lakehouse features, architecture, technology, and if it could be a good fit for your organization.

As the rate at which companies collect data continues to grow, the volume and variability of this data has proven difficult to manage using existing data warehouse and data lake architectures. Many organizations lack the ability to gain actionable insights from all their data because it is often spread across multiple systems—data warehouses, data lakes, or a combination of both. The need for a better approach to data management has led to new solutions like the data lakehouse.

But what makes a data lakehouse different, how can it fit into your existing architecture, and how can it help your organization do more with the data you own?

In this blog, we will explain what makes a data lakehouse different from a data warehouse and a data lake and discuss a pragmatic approach to data lakehouse architecture. We will also go over the steps you can take to better understand if it is the right solution for your organization and cover the technology you can use to build one.

What is a Data Lakehouse? 

A data lakehouse is a new data management architectural pattern that combines the low-cost, scalability, and flexibility of a data lake with the data management and data structure of a data warehouse. Key features of a data lakehouse include:

  • Handles structured and unstructured data, which allows for the collection of everything from traditional transactional data to images, video, and text.
  • Open source, which enables organizations to build on different tools from different vendors, effectively removing the all-too-common vendor lock-in that occurs with data management technologies. Open file formats in the data lake, such as Apache Parquet and ORC, are a part of this open-source approach. The other part is open APIs to access the data stored in the storage layer, allowing users—from business analysts to data scientists—the ability to access data using SQL, Python, Scala or R, to name a few.
  • ACID transaction support, which ensures data consistency across concurrent reads and writes, increasing the trustworthiness of the data. The ACID properties of a transaction ensure:
    • Atomicity: Each transaction—whether it be read, write, update, or delete—is treated as a single unit and must execute in its entirety or not at all. This property prevents data loss and corruption from occurring, increasing trust in the data.
    • Consistency: Transactions change tables only in predictable, valid ways, so a failure partway through doesn’t leave the table or related data in a corrupted state.
    • Isolation: Transactions—whether reads or writes—typically occur concurrently across many different users. Isolation ensures those transactions don’t interfere with or affect one another. From a user’s perspective, the transactions appear to execute one by one even when they run at the same time.
    • Durability: Once a transaction completes, its changes are saved permanently and are not lost even in the event of a system failure.
  • Schema enforcement and data governance, which ensures new data follows the schema of the target table to reduce data quality issues. This includes robust auditing of data and data access control.
  • Decoupling of storage and compute, which allows for flexibility in tools and ensures that storage and compute use different clusters. These systems can scale independent of each other depending on the needs of the business.
  • Business intelligence support that allows for BI tools to directly access the data to ensure freshness of data and data readiness. This also eliminates the need for data duplication as it can be accessed from the same system as other advanced analytic workloads.
  • Streaming support that allows for streaming data and enables real-time reporting.
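To make two of these guarantees concrete—schema enforcement and atomic (all-or-nothing) writes—here is a minimal, stdlib-only Python sketch. The `TinyTable` class and its methods are illustrative only, not any real lakehouse API; real lakehouses (e.g., Delta Lake) implement these guarantees with a transaction log over files, not in-memory lists:

```python
class SchemaViolation(Exception):
    pass

class TinyTable:
    """A toy table with schema enforcement and atomic batch writes."""

    def __init__(self, schema):
        self.schema = schema          # e.g., {"id": int, "amount": float}
        self.rows = []

    def _validate(self, row):
        # Schema enforcement: reject rows that don't match the target table.
        for col, col_type in self.schema.items():
            if col not in row or not isinstance(row[col], col_type):
                raise SchemaViolation(f"bad value for column {col!r}: {row}")

    def write_batch(self, rows):
        # Atomicity: validate every row first; commit all or nothing.
        for row in rows:
            self._validate(row)
        self.rows.extend(rows)        # only reached if the whole batch is valid

sales = TinyTable({"id": int, "amount": float})
sales.write_batch([{"id": 1, "amount": 9.99}])

try:
    # One bad row ("amount" is a string) rejects the entire batch.
    sales.write_batch([{"id": 2, "amount": 5.00}, {"id": 3, "amount": "oops"}])
except SchemaViolation:
    pass

print(len(sales.rows))  # 1 -- the failed batch left the table untouched
```

The key point of the sketch is the ordering inside `write_batch`: because validation finishes before any row is committed, a partial failure can never leave the table half-written.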

How is a Data Lakehouse Different from a Data Warehouse or Data Lake?

A data lakehouse melds the best elements of a data warehouse and a data lake into one entity with the goal of addressing the limitations of both.

From the data warehouse, it brings structure to the data using a metadata layer that enables data management and business intelligence features. From the data lake, it brings the low-cost storage of raw, open file formats for massive volumes of data. Together, these elements address the limitations of the data warehouse by allowing low-cost storage for raw data while allowing for managed and query optimized data not achievable in a data lake.


A data lakehouse is designed to address the limitations of the data warehouse by allowing low-cost storage for raw data while allowing for managed and query optimized data not achievable in a data lake. Photo credit: Databricks

Data Warehouses

Data warehouses have been around in some form since the 1980s. A data warehouse is a centralized relational database that combines data from various source systems into a structured schema. This data is used primarily for business intelligence and reporting. Data warehouses often contain “golden records” for reporting, as data from multiple sources is cleansed and combined to build a complete record.

Limitations of the Data Warehouse

Despite its strengths, there are several key limitations of a data warehouse, including:

  • High cost for massive data volumes
  • Inability to handle unstructured data and difficulty with semi-structured data
  • Inability to deal with streaming data
  • High level of work when introducing new data sources

Data lakes were created in the 2010s to attempt to resolve the limitations of existing data warehouse solutions by introducing low-cost, massive storage of raw data in an open file format. Unlike data warehouses, which enforce a table’s schema on write—rejecting data that does not match that schema—data lakes do not enforce schema on write. Instead, data lakes use schema on read: the schema is applied only when the data is read. This allows any type of data to be stored within a data lake, including video, text files, images, audio, and more.

Due to this architecture, data lakes excel at big data processing, artificial intelligence, and machine learning workloads.
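The schema-on-read idea can be sketched in a few lines of stdlib-only Python. The structures and function below are illustrative stand-ins, not a real data lake API: raw records land with no schema check at ingest, and each consumer applies its own schema when it reads:

```python
import json

# Stand-in for cheap object storage holding raw files; no schema is
# checked when data lands (schema-on-read, lake style).
raw_events = []
raw_events.append(json.dumps({"user": "a", "clicks": 3}))
raw_events.append(json.dumps({"user": "b", "clicks": "7", "device": "ios"}))

def read_with_schema(raw, schema):
    """Apply a schema while reading: keep known columns, coerce types."""
    out = []
    for line in raw:
        record = json.loads(line)
        out.append({col: cast(record.get(col)) for col, cast in schema.items()})
    return out

# Different consumers can read the same raw data with different schemas.
clicks = read_with_schema(raw_events, {"user": str, "clicks": int})
print(clicks)  # [{'user': 'a', 'clicks': 3}, {'user': 'b', 'clicks': 7}]
```

Notice that the second raw record has an extra field and a string-typed count; a schema-on-write system would have rejected it at load time, while the lake accepts it and lets the reader reconcile it later.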

Limitations of the Data Lake

Despite its openness, there are several key limitations with data lakes, including:

  • Data lakes need additional tools to allow for SQL queries for business intelligence and reporting
  • Poor data quality and data governance
  • Security and access control limitations
  • Disorganized data resulting in stale unused data

Many organizations combine approaches and extract, transform, and load a subset of data stored in their data lake into a data warehouse for business intelligence and reporting. The data lakehouse is an alternative to the two-tier architecture.
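That two-tier pattern can be sketched with the standard library alone. In this toy example (all names and data are illustrative), the “lake” is a dict of raw CSV blobs and the “warehouse” is an in-memory SQLite database; the loop in the middle is the ETL step every two-tier shop has to build and maintain:

```python
import csv
import io
import sqlite3

# Tier 1: the "lake" -- raw CSV files kept cheaply in their original form.
lake = {
    "orders_2024.csv": "order_id,region,amount\n1,EMEA,120.50\n2,APAC,80.00\n",
    "orders_2025.csv": "order_id,region,amount\n3,EMEA,45.25\n",
}

# Tier 2: the "warehouse" -- a structured, queryable copy of a curated subset.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)"
)

# The ETL step: extract from the lake, apply types, load into the warehouse.
for blob in lake.values():
    for row in csv.DictReader(io.StringIO(blob)):
        warehouse.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (int(row["order_id"]), row["region"], float(row["amount"])),
        )

total = warehouse.execute(
    "SELECT SUM(amount) FROM orders WHERE region = 'EMEA'"
).fetchone()[0]
print(total)  # 165.75
```

The lakehouse pitch, in these terms, is to run the SQL query directly against governed tables in the lake and delete the copy loop in the middle.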

How the Data Lakehouse Improves on the Data Warehouse and Data Lake

The data lakehouse architecture aims to deliver the following improvements:

  • Improved data quality and reliability: Schema enforcement and less need to transfer data ensure higher quality, reliability, and availability of data.
  • Lower storage costs: All of an organization’s data can exist in one location instead of spread across multiple systems, regardless of file type.
  • Enhanced data management: Using the open file format, data can be used for traditional business intelligence applications and advanced analytics, such as machine learning and artificial intelligence.
  • Limited data redundancy: Data is stored in one location instead of copied into multiple data warehouses and data marts.
  • Flexible, actionable data: Raw unstructured data can be quickly accessed for machine learning right beside data for more traditional analytic workloads, allowing for more flexible use of data to drive business insights.
  • Openness: Using open file formats such as Apache Parquet and ORC gives organizations the flexibility to change metadata layer solutions to fit their changing needs. Open APIs also give businesses the flexibility to change their consumption layer solution to fit their evolving analytic needs.

Defining a Data Lakehouse Architecture


The data lakehouse architecture consists of five distinct layers, including ingestion, storage, metadata, API, and consumption layers. Each play a key role in making the data lakehouse a modern data architecture. Image credit: Striim

  • The ingestion layer is responsible for pulling in data from various sources. From streaming data to transactional batch loads, this layer brings in structured, semi-structured, and unstructured data using various protocols and tools.
  • The storage layer stores massive volumes of data at low cost in an open file format such as Parquet. The data can be structured, semi-structured, or unstructured, and can be used directly in machine learning and artificial intelligence applications and model training.
  • The metadata layer is what provides the data lakehouse architecture the benefits that exist in modern-day data warehouses. It is a unified catalog of metadata on all objects that exist within the storage layer. This metadata allows for the following features most often found in relational database management systems (RDBMSs):
    • ACID transactions to ensure data integrity, consistency, and reliability.
    • Schema governance and enforcement to ensure data quality by preventing unwanted data from populating a table.
    • Indexing to improve data retrieval and performance.
    • Caching to store data in-memory for fast retrieval and increased performance for commonly used queries.
    • Time travel to enable access to historical data, allowing a user to restore a table to a previous version that may have been accidentally deleted or altered.
    • Access control to ensure database security by allowing only authorized users to access database objects and data.
  • The API layer allows for various tools and processes to access the data quickly and efficiently. Since the data stored in the data lakehouse is in an open format, machine learning and other advanced analytic applications and software can directly access the data.
  • The consumption layer allows for business intelligence tools such as Power BI and Tableau to access data while also allowing for SQL queries and machine learning jobs.
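Of the metadata-layer features above, time travel is perhaps the least familiar, so here is a minimal, stdlib-only Python sketch of the idea. The `VersionedTable` class is purely illustrative (real metadata layers such as Delta Lake record versions as entries in a transaction log over files, not as in-memory snapshots):

```python
import copy

class VersionedTable:
    """Toy metadata layer: every commit records a snapshot, enabling time travel."""

    def __init__(self):
        self.history = [[]]           # version 0 is the empty table

    @property
    def current(self):
        return self.history[-1]

    def commit(self, rows):
        # Each commit creates a new immutable version, like a log entry.
        snapshot = copy.deepcopy(self.current)
        snapshot.extend(rows)
        self.history.append(snapshot)

    def as_of(self, version):
        # Query the table as it existed at an earlier version.
        return self.history[version]

    def restore(self, version):
        # "Time travel": make an old version the new head of the table.
        self.history.append(copy.deepcopy(self.history[version]))

t = VersionedTable()
t.commit([{"id": 1}])                 # version 1
t.commit([{"id": 2}])                 # version 2
t.restore(1)                          # accidental write? roll back to version 1
print(t.current)  # [{'id': 1}]
```

Because `restore` appends rather than rewrites history, the rollback itself is versioned too, so even the “undo” can be audited or undone.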

Which Technologies Offer a Data Lakehouse and What are Other Options on the Market?

You can adopt a strict “by-the-book” data lakehouse with a technology like Databricks, or you can use other vendors you may already be invested in that offer the same benefits of a data lakehouse using slightly different architectures.

  • Databricks is leading the push in the data lakehouse space with Delta Lake—which is an open format storage layer that delivers reliability, security, and performance for a data lake. Due to the open nature of Delta Lake, it can be used with AWS, Azure, or GCP. It uses open-source technology which may be appealing to certain organizations.
  • Snowflake allows for many of the features and capabilities that make up the data lakehouse architecture. While Snowflake does not have a traditional data lake, it can act as one with its micro-partitioning, essentially providing similar capabilities. Snowflake does use proprietary technologies which makes it more difficult to move to another system should the need arise. This proprietary storage and metadata layer removes the open-source feature of a pure data lakehouse architecture, but it is, within the Snowflake environment, functionally the same.
  • Azure Synapse Analytics coupled with Azure Data Lake allows for many of the features and capabilities of the data lakehouse architecture. Azure Synapse Analytics is a fully managed, petabyte-scale, cloud-based data warehouse product designed for large-scale dataset storage and analysis. Synapse can connect to a data lake, providing many of the capabilities of the data lakehouse, but this removes the open-source feature of a pure data lakehouse architecture. Just like Snowflake, however, Synapse is functionally the same within its own environment.
  • Amazon Redshift coupled with Amazon S3 allows for many of the features and capabilities of the data lakehouse architecture. Amazon Redshift is a fully managed, petabyte-scale, cloud-based data warehouse product designed for large-scale dataset storage and analysis. Redshift can sit on top of an Amazon S3 instance and provide query capabilities for BI and advanced analytical needs. Like the other solutions, Amazon Redshift is built using proprietary technology; it is not open source and doesn’t follow the “rule book” definition, but it provides most of the key features found in the data lakehouse architecture. This makes it a great solution for any business or organization looking to build a data lakehouse.

This list isn’t exhaustive and there are other vendors that provide other solutions for the data lakehouse architecture. It’s also important to note that the architecture for a data lakehouse allows for mixing of technologies from different providers. For instance, you can have an Amazon S3 instance with a Databricks Delta Lake acting as the metadata layer.

Is The Data Lakehouse Solution Right for Your Business?

If you’re looking to achieve any of the following, then a data lakehouse solution is worth assessing for your business:

  • Perform both BI and advanced analytics. The data lakehouse enables the ability to analyze data that is structured optimally for traditional BI tools such as Power BI or Tableau, while also enabling data scientists the ability to use datasets structured optimally for machine learning and artificial intelligence.
  • Reduce data redundancy across the organization. Instead of replicating data across multiple data lakes, data warehouses, or data marts, the data lakehouse enables you to keep all your data in one place.
  • Simplify data observability by reducing the amount of data movement through data pipelines and other methodologies into multiple systems, all of which requires development, maintenance, and governance. You can instead have data move into just one central open-format data lakehouse.
  • Simplify your data management solution from a two (or more) tier architectural solution to a one-tier, one-stop-shop, architectural solution.
  • Improve data security, such as enabling access control on data used in machine learning and artificial intelligence workflows.
  • Add flexibility as your analytics needs evolve with your business data consumption needs with the ability to change storage, metadata and/or consumption layers based on the open-source file format and API.
  • Lower storage cost by capturing and storing all your data in its original raw format. Store everything from CSVs to audio and video at low cost. Regardless of how much data you have, the data lakehouse can scale to fit your business’ needs.

While this is not an exhaustive list, it helps to start the conversation on whether a data lakehouse is a good solution for your organization.

John Bemenderfer John is a Senior Consultant based out of our Dallas office. He has experience across the entire data stack, from data engineering to analytics, helping clients get the most value out of their data. He also helps lead the Power BI practice for Analytics8. Outside of work, John enjoys spending time with his daughter and wife, dungeons and dragons, and anything Star Wars related.