In this blog, we discuss Databricks, an established leader and continuously growing software-as-a-service (SaaS) platform. We go over exactly what it is and, just as importantly, what it is not. We provide pro tips on how it should be used for data engineering and machine learning, explain what differentiates Databricks from other tools that offer a similar promise, and share resources to help jumpstart your implementation.

The myriad of tools offered in the analytics space can be overwhelming, with each providing a new promise: simplifying your workflow, creating faster pipelines, unlocking insights in your data, or adding machine learning capabilities. With every new product introduced, there is a new 'silver bullet' offered to revolutionize your organization's data analytics department.

Even the nomenclature can trip up seasoned data professionals: data lake, data warehouse, data vault, delta lake, data lakehouse… if you are in the analytics space, you probably continued this list in your head. Where do you start so that you can understand which tool is best for your needs?

What is Databricks?

Databricks is a cloud-based data engineering and machine learning platform (named a Leader in Gartner's 2021 Magic Quadrant for the third year in a row). It is a cloud-agnostic platform for running Apache Spark workloads while simplifying the deployment of the underlying architecture, and its language-agnostic, notebook-style interface maximizes collaboration among its users. Let's break down its benefits:

  • Cloud-Agnostic: Databricks can run on top of Azure, AWS, and Google Cloud Platform (GCP), and it is easy to set up in any of those environments. If your enterprise is already operating in the cloud, chances are it is on one of the three, and you can easily add Databricks to your existing subscription. Databricks will not lock you into a single provider and can be migrated along with the rest of your cloud architecture without operational issues.
  • Large-Scale Processor: Databricks' core architecture runs on Apache Spark, an open-source analytics engine with a heavy focus on data parallelism (doing lots of things all at once). Spark uses a driver/worker system that lets you treat many servers as one: any number of worker (executor) nodes each process a piece of the same job, and as each finishes, it sends its output back to the driver (master) node, which assembles everything into the final result (see the short sketch after this list).
  • Language-Agnostic: We often refer to Databricks' notebook-style interface as "Google Docs for programming". Data scientists and data engineers can easily collaborate when writing code as a team, and each can write in the language of their choice: Python, SQL, Scala, or R. Databricks can also sync with GitHub, Azure DevOps, or other code repositories. Overall, it's an ideal environment for teams of any size to develop in.
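
To make the driver/worker model concrete, here is a minimal sketch as it might look in a Databricks notebook, where the SparkContext (sc) and spark session come pre-created; the data and partition count are arbitrary assumptions for illustration.

```python
# Minimal sketch of Spark's driver/worker model in a Databricks notebook.
# `sc` is the pre-created SparkContext; the data and partition count are arbitrary.
rdd = sc.parallelize(range(1_000_000), numSlices=8)  # split the work into 8 partitions

# Each worker squares the numbers in its own partitions in parallel...
squares = rdd.map(lambda x: x * x)

# ...and the driver assembles the partial results into one final answer.
total = squares.sum()
print(total)
```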

How Does Databricks Fit into a Modern Data Architecture?

A modern data architecture should have some flavor(s) of the following three elements:

  1. ELT (extract/load/transform): Ingestion of all data into a central staging area (a data lake). This is a raw copy of what you get from each source system, creating an un-curated data layer that can be accessed for lineage tracing or ad-hoc development.
  2. ETL (extract/transform/load): Transformation of each of the raw data sources into a dimensionally modeled format (a data warehouse). This creates a curated data layer—a source of truth.
  3. Analytics/Reporting: A platform for users to access the curated data. There are lots of players in this market, and while Databricks may bring the data to the table, data visualization is not its strength; it should not be viewed as a replacement for something like Power BI, Tableau, or Qlik.

Databricks has traditionally been used as a data engineering mechanism for ELT/ETL jobs. If you already have a cloud-based data warehouse, this is where Databricks would likely fit into your data architecture: it moves and transforms data from source to warehouse and, in some cases, on to the analytics/reporting layer (a simplified sketch of the pattern appears below). For those looking to pioneer Databricks' latest technology trend, the lakehouse, read on.
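
The sketch below shows that raw-to-curated flow as it might look in a Databricks notebook using PySpark. The paths, table names, and columns are hypothetical assumptions, not prescriptions.

```python
# Sketch of the ELT/ETL pattern: land raw source data, then curate it.
# All paths, schemas, and table names are hypothetical.
from pyspark.sql import functions as F

# 1. ELT: land an untouched copy of the source extract in the data lake (raw layer).
raw_orders = spark.read.json("/mnt/landing/erp/orders/")
raw_orders.write.mode("append").parquet("/mnt/datalake/raw/orders/")

# 2. ETL: shape the raw layer into a curated, dimensionally modeled table.
curated_orders = (
    spark.read.parquet("/mnt/datalake/raw/orders/")
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_ts"))
    .select("order_id", "customer_id", "order_date", "amount")
)
curated_orders.write.mode("overwrite").saveAsTable("warehouse.fact_orders")
```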

The Databricks Lakehouse

In the lakehouse architecture, Databricks is used both for executing compute tasks (ELT/ETL) and for storage (data lake/data warehouse). Databricks refers to this architecture as the "lakehouse", where the platform serves as both the data lake and the data warehouse. The lakehouse was created to let users do everything from business intelligence and SQL analytics to data science and machine learning on a single platform.

Diagram showing difference between data warehouse, data lake, and data lakehouse.

A new paradigm: the Databricks lakehouse blurs the lines between traditionally independent solutions (data warehouses, data lakes, and ETL processes), creating a one-platform solution for modern data architectures. Photo: Databricks

The main storage component of the lakehouse architecture is Delta Lake. Delta Lake addresses the reliability issues of traditional data lakes by supporting ACID transactions, enforcing data quality, and providing the consistency and isolation that raw data lakes lack.

Delta Lake on Databricks sits on top of your existing object storage, lets you configure your data lake based on your workload patterns, and provides optimized layouts and indexes for fast, interactive queries. Together, the format and the compute layer help simplify building big data pipelines and increase their overall efficiency.
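
Here is a minimal sketch of what that looks like from a Databricks notebook: writing a Delta table on top of object storage, querying it like a warehouse table, and using time travel. The paths and column names are hypothetical.

```python
# Minimal sketch of Delta Lake on object storage; paths and columns are hypothetical.
orders = spark.read.parquet("/mnt/datalake/raw/orders/")

# Writing in Delta format adds a transaction log (ACID guarantees, schema enforcement)
# on top of the underlying files in object storage.
orders.write.format("delta").mode("overwrite").save("/mnt/datalake/delta/orders")

# The same storage can now be queried like a warehouse table.
delta_orders = spark.read.format("delta").load("/mnt/datalake/delta/orders")
delta_orders.groupBy("order_date").count().show()

# Time travel: read an earlier version of the table for auditing or rollback.
orders_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/datalake/delta/orders")
)
```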

Diagram illustrating the Databricks Delta Lake.

Centralized Data Platform: Decoupling storage and compute is the foundation of the lakehouse approach. A Databricks Delta Lake utilizes your existing data lake to provide an organized, reliable, and efficient source of truth. Photo: Databricks

A single platform for massive compute and storage could grab a huge piece of the software market, but today it should be treated less as a best practice and more as an experiment in the art of the possible.

You do not need to adopt the lakehouse architecture in full to make use of the platform. Use Databricks in a way that best suits your needs and your internal abilities; you don't need a trail-blazing data architecture. Keep it simple: build a scalable architecture based on your specific needs and forget the rest.

How is Databricks Different from Other Tools?

If you need to integrate, clean, de-duplicate, or transform your data, or create machine learning applications, there are many products you might evaluate and a handful that emerge as leaders. Databricks sets itself apart from competitors with a lower barrier to entry, a primary focus on open-source technology, and a simple but effective UI/UX. Let's break it down:

  • Barrier to Entry: Compared to many other ETL tools on the market, most engineers can hit the ground running in Databricks because they can write code in the language of their choice. There is no proprietary 'Databricks scripting language' they might use for a handful of years only for it to become obsolete and fade into the ever-expanding tech graveyard. This adds flexibility when growing any development team.
  • Open Source: Databricks' emphasis on open-source technology enhances its value. Data engineers can access open-source software libraries like Koalas (think pandas, but distributed across a Spark cluster; see the short sketch after this list). Data scientists can use familiar packages like TensorFlow, PyTorch, or scikit-learn. Databricks will continue to grow with open-source libraries.
  • Interface: The notebook interface makes it effortless to collaborate with other developers on the same piece of code. Being able to work on the same notebook at the same time and see other people’s code in real-time makes pair-programming and debugging sessions far more efficient.
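
As a quick taste of that open-source point, here is a minimal Koalas sketch; the file path and column names are hypothetical.

```python
# Minimal sketch of Koalas: a pandas-style API whose work is executed by Spark.
# The file path and column names are hypothetical.
import databricks.koalas as ks

customers = ks.read_csv("/mnt/datalake/raw/customers.csv")  # distributed read, pandas-like call

# Familiar pandas operations, run across the cluster rather than on one machine.
by_country = customers.groupby("country")["customer_id"].count()
print(by_country.sort_values(ascending=False).head(10))
```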

 


Forewarnings From a Hardened Data Engineer

Let's check back in on our introduction. We've described an industry-leading platform for data engineering and machine learning, built to scale in the cloud, that will surely fit into your enterprise environment. To revisit what we said from the jump: "With every new product introduced, there is a new 'silver bullet' offered to revolutionize your organization's data analytics department." How hard did you bite the hook? Did you consider Databricks a be-all, end-all answer to revolutionize how your organization uses data? Sure, it's powerful, it's scalable, and it could unlock next-level insights, but keep a few things in mind before you drink too much Kool-Aid:

  • Do the Basic Stuff First: Don't be tempted by the allure of machine learning right away. Do you have a clearly defined curated layer of descriptive data? If you can't tell me exactly what your sales numbers are and where to find them, you certainly can't write a machine learning program to forecast future sales accurately. The operational value of descriptive data will often outweigh a predictive mechanism. Look forward and plan for advanced capability, but don't overlook the basics.
  • Data Flow: A wise colleague of mine once told me that data should only flow in one direction (as an English speaker/writer, let's say left to right: source to target). When data flows right to left, against the grain, you've introduced ambiguity about its lineage. If the flow of data can take on many directions and forms, you will lose control of the environment. Define the ideal data flow up front; otherwise, what starts as a 'data lake' can quickly become a 'data swamp' if you're not careful.
  • Process Control: So, your team has written a beautiful symphony of code. It performs all the tasks you need and always hits the right notes. But how can you be sure the data science team kicks off jobs in sync with the completion of data engineering jobs? Who's conducting this orchestra? In the same vein as data flow, you should understand how your Databricks scripts will be orchestrated. In some cases, using an external tool to coordinate all of the Databricks jobs can be beneficial (see the sketch after this list).
  • Enablement: This is not GUI-driven software. At the end of the day, your data engineers and data scientists will be writing code unique to your use cases; Databricks is simply the vehicle to create, store, and execute that code. Do you have the internal expertise to properly use Databricks? Are you enabling your data engineers to write clean and efficient code? Do you have documentation in place in the event of a transition? An investment in your talent will always have a greater return than any software. Invest in talent and Databricks; it's not an either/or.
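
To illustrate the orchestration point above, here is one hedged sketch of an external orchestrator (Airflow, Azure Data Factory, or even a simple script) triggering a Databricks job through the Jobs REST API; the workspace URL, token, and job_id are placeholders you would supply.

```python
# Sketch: an external orchestrator triggering a Databricks job via the Jobs REST API.
# Host, token, and job_id are placeholders, not real values.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123},  # the data engineering job that must finish first
)
response.raise_for_status()

run_id = response.json()["run_id"]
print(f"Triggered run {run_id}")  # poll this run's status before kicking off downstream jobs
```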

Getting Started with Databricks Resources

Interested in learning more? There are many places you can get started with Databricks, but we've cherry-picked some of our favorites:

  • Databricks Community Edition: The best place to get started is to create a free Databricks Community Edition account. This lets you spin up a small cluster and get familiar with the Databricks basics. Exploring on your own is great, but we recommend pairing it with some kind of online course (see below).
  • Free YouTube Databricks Overview: This video by Adam Marczak provides a clear and concise summary of how to get up and running with Databricks quickly.
  • Databricks Academy: The best resource we’ve found for learning Databricks is hands-down the Databricks Academy. These courses are put together by Databricks themselves and are designed to get you ready to use Databricks in a business environment. Note that while there are some free courses, much of the training available here is not cheap.
  • Databricks Community: The community page is an excellent resource to see what other users are doing with Databricks and can be a great first place to go with any technical questions.
  • Databricks Data Engineering Blog: Databricks’ own blog has a ton of excellent solutions laid out for how to get the most out of Databricks.


David Walborn is a Principal Consultant based out of our Chicago office. He leads high-impact analytics projects for enterprises, providing deep expertise in implementing cloud-based technologies including Microsoft Azure, Power BI, and more. When he's not developing data solutions, David closely follows his hometown sports teams, patiently watching and waiting for his Lions to bring their first Super Bowl to Detroit… maybe next year.