Data Cleaning: The Dirty Job You Can’t Ignore

Bad quality data results in bad decision-making. So, how do you ensure data is sufficiently clean for your specific business needs so that you can make better, more informed decisions?

We’ve all been there before—you run a report and realize half the data is garbage. It doesn’t inform good decision-making, and worse, you’ve spent valuable time chasing issues with erroneous data and putting in one-off fixes rather than doing effective analysis. It’s a painful experience.

So, what do you do? Many of our clients say data cleaning (or data cleansing) is the aspect of data and analytics that takes the longest. Some say they’re missing out on including certain data sets because the data is too dirty. Others say it just takes too long to prepare and clean their data. There’s some good news—data cleaning doesn’t need to be a complicated process.

In this blog post, we’ll share best practices for getting started with cleansing your data so that it is ready to provide valuable insights for your business users.

Data Cleaning in a Nutshell

Data cleaning—also known as data cleansing—is a subset of the practice of data quality management. It is the process of filling in missing data, fixing inaccurate data, and removing duplicate data. It’s important for every organization to actively manage and monitor the quality of their data because the downstream effect of bad quality data—which includes duplicate data, outdated data, incomplete data, insecure data, inaccurate data, or inconsistent data—can lead to bad decisions, missed opportunities, slowed workflows, and ultimately distrust in the data across the entire business.
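As a minimal sketch of those three operations, assume a small list of customer records. The field names (`email`, `country`), the sample values, and the `UNKNOWN` placeholder are all made up for illustration; this is not a recommended pipeline, just one way the idea could look in Python.

```python
# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"email": "Ann@Example.com ", "country": "US"},
    {"email": "ann@example.com", "country": "US"},   # duplicate once normalized
    {"email": "bob@example.com", "country": None},   # missing country
]

def clean(rows, default_country="UNKNOWN"):
    """Fix formatting, fill missing values, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Fix inconsistent data: trim whitespace and lowercase the email.
        email = row["email"].strip().lower()
        # Fill missing data with an explicit placeholder (an assumed policy).
        country = row["country"] or default_country
        # Remove duplicates: keep only the first record seen for each email.
        if email in seen:
            continue
        seen.add(email)
        cleaned.append({"email": email, "country": country})
    return cleaned

cleaned_records = clean(records)
```

Keeping the first record per email is itself a business decision; another organization might prefer the most recently updated record instead.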

What Are Best Practices to Get Started with Data Cleaning?

There are two sides to data quality management and ensuring your data is clean and trustworthy: 1) you need to improve data collection methods to ensure all data being captured is clean from the start, and 2) you need to define and apply data cleaning techniques to improve historical data.

The best way to manage future data quality is to focus on improving your data collection and data capture mechanisms so that bad data is never created in the first place. Data quality itself can’t be fixed with technology; it requires a closer look at your people and processes. If you do not address the source of your data quality issues, you will keep creating bad data, which will require more and more data cleansing downstream. Address the source of the issue in parallel with cleaning the data that has already been collected or created.

So, improving processes and training business users can help to solve data quality issues going forward, but what do you do when you have existing troves of bad data? Let’s face it, data cleansing is tedious, time-consuming, and overwhelming. Most people would probably rather do their taxes than clean a mountain of bad data. Here are some simple steps that you can take to greatly simplify and focus your data cleaning efforts.

STEP 1: Identify your expectation of quality.

Before you embark on a data cleansing project, it is critical that you start with the end in mind. Ask yourself, “How clean is clean enough?” The standard of data quality is relative, and data cleaning can be extremely costly, so what you do to cleanse your data should be informed by that relativity. Articulate your acceptable threshold of error before you begin. Ask yourself questions like the ones below to inform how much time, money, and energy you should invest in data cleansing:

How will this data be used, and by whom?
What would the impact of a poor or wrong decision based on this data be?
Is there another source of similar data of higher quality that we can use instead to inform our decisions?
What threshold of error are we willing to accept (1 in 100, 1 in 10,000, 1 in 1,000,000, etc.)?
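To make the threshold question concrete, here is a rough sketch that measures the observed error rate in a sample and compares it to the rate you have agreed to accept. The email check, the sample values, and the function names are assumptions for illustration; your own validity rule would come from how the data is actually used.

```python
import re

# Hypothetical validity rule: a deliberately loose check for an email shape.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_email(value):
    return bool(EMAIL_PATTERN.match(value))

def error_rate(values, is_valid):
    """Return the fraction of values that fail the validity check."""
    if not values:
        return 0.0
    failures = sum(1 for value in values if not is_valid(value))
    return failures / len(values)

def meets_threshold(values, is_valid, max_error_rate):
    """True when the observed error rate is within the accepted threshold."""
    return error_rate(values, is_valid) <= max_error_rate

# Illustrative sample: one of the four values fails the check.
sample = ["a@example.com", "b@example.com", "not-an-email", "c@example.com"]

rate = error_rate(sample, is_valid_email)
ok_at_1_in_100 = meets_threshold(sample, is_valid_email, 1 / 100)
ok_at_1_in_2 = meets_threshold(sample, is_valid_email, 1 / 2)
```

The same sample passes a lenient threshold and fails a strict one, which is the point: whether the data is "clean enough" depends on the threshold you set, not on the data alone.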

STEP 2: Identify your standard of truth.

What are you judging the data against to determine whether it is good or bad? Who or what can make a quality determination? Some data can’t be fixed without direct human intervention, which means that for some organizations it is cost-prohibitive to fix, whether in terms of time and resources or, in some cases, reputation. It might be acceptable for one organization to ask its customers to provide input to assist in the data cleaning process, but unacceptable for the brand and reputation of another to ask the same of its customers. Knowing all the options for measuring the quality of your data helps prevent you from inadvertently choosing the most expensive standard of measure.

STEP 3: Identify the appropriate approach to data cleansing.

Scale is an incredibly important factor in determining how you should proceed with a data cleansing project. You may not want to hear this, but if you have a relatively small amount of data to clean and it’s a one-and-done scenario, then you should probably assign it to the right people in your organization and get the job done manually. Again, no one wants to clean data, but it may very well be a great use of your people’s time if it improves the decisions made in your organization. I’ve seen creative ways of doing this efficiently when people are tasked with several hours of data cleansing. Investing hundreds of thousands of dollars in a technology to do the same work may not spare those same people from needing to spend time reviewing changes anyway.

Setting a vision that everyone in the organization is responsible for quality data is also key to creating a data-driven culture, and having everyone take part in cleaning up a mess can help them remember to be conscientious with data collection. A thousand rows of poor data is not a large-scale project; do not deceive yourself.

Obviously, human intervention is not always possible. Before you try to apply a technology to the problem, ask yourself whether you could envision rules being written to clean the data. If the repository is large, the data is valuable enough to clean, and you determine the work can be automated and systematically improved, then you can approach it with technology.
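If you can picture the rules, you can usually write them down. The sketch below is one hypothetical way to express cleaning rules as small functions and to log every change so that people can still review what the automation did. The rule names, the state aliases, and the sample rows are invented for illustration.

```python
def apply_rules(rows, rules):
    """Apply each (field, function) rule and log every change for review."""
    cleaned, changes = [], []
    for index, row in enumerate(rows):
        new_row = dict(row)  # copy so the original data is left untouched
        for field, fix in rules:
            before = new_row[field]
            after = fix(before)
            if after != before:
                # Record row number, field, old value, and new value.
                changes.append((index, field, before, after))
                new_row[field] = after
        cleaned.append(new_row)
    return cleaned, changes

# Hypothetical rule: collapse known spelling variants of a state into one code.
STATE_ALIASES = {"calif.": "CA", "california": "CA", "ca": "CA"}

def normalize_state(value):
    trimmed = value.strip()
    return STATE_ALIASES.get(trimmed.lower(), trimmed)

rules = [("state", normalize_state), ("name", str.strip)]
rows = [
    {"name": " Acme ", "state": "Calif."},
    {"name": "Globex", "state": "CA"},
]
cleaned_rows, change_log = apply_rules(rows, rules)
```

The change log is the part worth keeping even in a more sophisticated tool: it gives reviewers a short list of what was altered instead of asking them to re-read every row.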

STEP 4: Think outside the box with data augmentation.

At the end of the day, even with sophisticated algorithms, data cleaning involves someone or something writing rules to change and correct the data. Even if you have advanced technology in place that generates rules and recognizes patterns to help you with classification and cleansing, it cannot create information from nothing. If the information you need to improve the quality of existing data is missing, then you may need to look to augmentation in addition to your data cleansing tools.

Ask yourself, what other data exists that could help us assign the proper corrections to this data? Data augmentation can be very helpful to improve data cleansing efficiency. Some tools come with data to augment your data, but if not, do not assume that your data is the only data available in the organization. Be willing to ask other departments and functions for help. You may even find that someone else has already embarked on the same journey as you, and you can learn from their successes and failures.
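As a small illustration of augmentation, suppose another department maintains a lookup of postal codes to cities. The lookup, the field names, and the sample orders below are hypothetical; the sketch only shows the general pattern of filling gaps from a second source while leaving unmatched records alone.

```python
# Hypothetical reference data from another department: postal code -> city.
postal_reference = {"30301": "Atlanta", "60601": "Chicago"}

def augment_city(rows, reference):
    """Fill a missing city from the reference; leave unmatched rows as they are."""
    augmented = []
    for row in rows:
        new_row = dict(row)  # copy so the source rows are not modified
        if not new_row.get("city"):
            match = reference.get(new_row.get("postal_code"))
            if match is not None:
                new_row["city"] = match
        augmented.append(new_row)
    return augmented

orders = [
    {"postal_code": "30301", "city": ""},         # filled from the reference
    {"postal_code": "99999", "city": None},       # no match: stays missing
    {"postal_code": "60601", "city": "Chicago"},  # already populated
]
augmented_orders = augment_city(orders, postal_reference)
```

Records with no match are deliberately left missing rather than guessed, since the reference data cannot create information it does not contain.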

STEP 5: Establishing leadership is critical.

As you embark on any data cleansing project, do not underestimate the power of leadership in data quality management. Helping people understand the value of data cleansing, especially if it is a manual process, is critical to inspiring people to give their energy to do the important yet mundane work. If they understand that they are not just properly classifying products but embarking on a transformational journey of data-driven insight creation, you will get better outcomes. Consider showing those assisting with the program what the intended outcomes of the effort are, including answering questions such as:

What dashboard will this improve?
What process will this improve?
What outcome are we influencing?

Data cleaning is the first step to realizing those positive outcomes.

Remember, bad quality data results in bad decision-making. But so does approaching data cleaning without answering some key questions. Data quality management doesn’t need to be difficult, especially when you take the time to address key considerations at the very beginning.

Key Takeaways

Data cleaning is essential for trustworthy analytics, but it often takes the longest and is frequently overlooked or deprioritized.
Improving data collection processes and user training can reduce the need for downstream cleaning and prevent recurring issues.
Before beginning a data cleansing project, define what “clean enough” means by identifying acceptable error thresholds and intended data uses.
Choose a realistic approach based on the scope—some cleaning projects may be best handled manually, while others warrant automated tools.
Data augmentation can help fill in missing or incomplete records by tapping into other internal or external data sources.
Leadership and communication are critical for success—showing the impact of clean data can motivate teams to engage in otherwise tedious tasks.
Framing data cleaning as a step toward insight and transformation helps build a data-driven culture and encourages long-term improvements.