Feb 2, 2024

Don’t Fix Bad Data, Do This Instead

People don’t know what they mean when they talk about data quality.


Story From The Trenches

A few years ago, our data platform team aimed to pinpoint the primary concerns of our data users. We conducted a survey among individuals interacting with our data platform, and unsurprisingly, the main concern highlighted was data quality.

The initial response, characteristic of our engineering mindset, was to develop data quality tooling. We introduced an internal tool named Contessa. Despite being somewhat cumbersome and requiring significant manual configuration, Contessa facilitated checks for the standard dimensions of data quality: consistency, timeliness, validity, uniqueness, accuracy, and completeness. After running the tool for a couple of months with hundreds of data quality checks, we concluded that:

  • Data quality checks occasionally helped data users discover sooner that the data was compromised and could not be relied upon.
  • Despite the frequent execution of data quality checks, there was no noticeable improvement in the subjective perception of data quality.
  • For a significant portion of issues, particularly those identified through automated checks of dimensions such as consistency or validity, no corrective actions were ever taken.

Surveys and objective measurements are useful tools, but nothing can replace a discussion over coffee and cake, as Caroline Carruthers writes in her book, “The Chief Data Officer’s Playbook”. Indeed, I recommend this approach to anybody, as one-on-one conversations helped us discover another important angle of the situation. Some of these conversations unfolded as follows:

“Hey, you say that data quality is poor; what do you mean by that?”

#1 Pricing business analyst: 
“We are working on setting up the price for the ancillary product X. In the dataset we use, we are missing data on the actual revenue from product X for each order. We have this dataset, but it contains only the expected value of the revenue from X at the time of purchase. We can also see the actual revenue per product, but not at the order granularity.”

#2 Product analyst: 
“There are several touchpoints in our customer’s journey where we redirect customers to our partner’s website in case they are interested in purchasing the third-party product Y. We regularly download the transaction data from our partner and analyse it to optimise this offering. In 30% of cases, we are missing information on which touchpoint the customer left from to reach the partner’s site.”

#3 Analytics engineer: 
“We prepared a dataset containing a list of deep links: information on which offers our customers clicked on. However, for a certain group of deep links, the date of the offer is not in the format we expect it to be. We had to filter out these records, and we do not take them into account.”

There is a common denominator in these cases: our data quality tools could not resolve any of these issues; the solution lay elsewhere.

In case #1, the solution involved extending the actual ancillary revenue data model to the required granularity.

For case #2, frontend developers conducted a review of all the steps where redirection could occur and added a query parameter to the redirection URL.

In case #3, the issue arose because the API call, responsible for fetching the validated departure date before storing the data, was timing out in some instances. Implementing retry logic on the API call by the engineering team resolved the problem.
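The specifics of that fix are internal, but retry-with-backoff around a timing-out call is a common pattern. The sketch below is a minimal illustration in Python; the endpoint, field names, and timeout values are assumptions, not our actual service.

```python
import time

import requests


def fetch_departure_date(offer_id: str, retries: int = 3, backoff_s: float = 0.5) -> str:
    """Fetch the validated departure date, retrying on timeouts.

    Hypothetical endpoint and payload; the point is that a transient
    timeout no longer results in a record with a malformed date.
    """
    for attempt in range(retries):
        try:
            resp = requests.get(
                f"https://internal-api.example.com/offers/{offer_id}/departure-date",
                timeout=2.0,
            )
            resp.raise_for_status()
            return resp.json()["departure_date"]  # e.g. "2024-02-02"
        except requests.Timeout:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff before retrying
```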

The main problem was that the data either was not logged at all or was not put together, connected, or understood the way we expected.



In his 1861 novel “Great Expectations,” Charles Dickens tells the story of Pip, an orphan with great expectations for love, wealth, and societal status, who encounters numerous disillusionments. A renowned data quality framework has been named “Great Expectations,” a fitting title indeed.

This metaphor aptly describes the challenges faced by data users, including data scientists, analysts, data engineers, product managers, UX researchers, and business decision-makers. Our expectations of data are high, and this often leads to frustration when reality does not meet these expectations.

The underlying issue in cases like those above is that, even though data consumers expect certain outcomes, there are no guarantees that these expectations will be met. For instance, not every record may include the redirect parameter, and some records may not have the date filled in. Furthermore, the people capable of resolving these issues are often unaware of them.

  1. When the expectations of data consumers differ from reality, the gap manifests as a data quality issue.
  2. The expectations of data consumers are often implicit and not articulated.
  3. Data consumers can apply data cleansing techniques, but in general they have limited means to transform poor-quality data into high-quality data.
  4. Data quality checks serve as a safety net, aiding in the early identification of problems, but they do not prevent issues from occurring in the first place.

Fast Forward

We did not give up, and as we moved forward we understood that even seemingly simple problems require complex solutions. What led to improvements was a combination of technical (hard) and cultural (soft) measures.

The essential requirement for any of these measures was to establish a culture of data ownership. This involves setting up a system where every entity generating data is linked to a human organizational unit, such as a team, that takes accountability. This unit should be capable of explicitly committing to or rejecting expectations related to the data. This commitment is technically represented as a data contract.
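As an illustration of what such a commitment can look like in practice, the sketch below expresses a data contract as a machine-checkable schema using pydantic. The event name and fields are hypothetical, not our actual contract.

```python
from datetime import date

from pydantic import BaseModel, Field


class OrderPlaced(BaseModel):
    """A hypothetical data contract for an 'order placed' event.

    The owning team commits to publishing events that satisfy this schema;
    consumers can rely on these fields being present and well-formed.
    """

    order_id: str
    touchpoint: str        # which step of the journey the order came from
    departure_date: date   # a proper date, not a free-form string
    expected_revenue: float = Field(ge=0)
    currency: str = Field(min_length=3, max_length=3)  # e.g. "EUR"
```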

Ensuring data quality relies on two main pillars, soft measures and hard measures, which are equally important.

Hard measures

Data quality checks

Data quality checks identify data issues post hoc. They cannot prevent data quality issues from happening, but, especially when combined with lineage and alerting solutions, or even better with circuit breakers, they can shorten the detection time and increase the chances of mitigating the harm caused by “broken data”. The market is flourishing with tools like Soda.io, Monte Carlo, Great Expectations, Informatica, Talend …. We replaced our in-house Contessa with a tool from the market.
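For illustration, a post-hoc check of this kind might look like the sketch below, written as plain Python over a pandas DataFrame rather than against any particular vendor’s API; the table and column names are hypothetical.

```python
import pandas as pd


def check_orders(df: pd.DataFrame) -> dict[str, bool]:
    """Run a few standard data quality checks on an orders table.

    These checks detect broken data after the fact; they do not prevent it.
    """
    return {
        # completeness: every order should carry the touchpoint of origin
        "touchpoint_complete": bool(df["touchpoint"].notna().all()),
        # uniqueness: order_id is expected to be a primary key
        "order_id_unique": bool(df["order_id"].is_unique),
        # validity: departure_date should parse as a real date
        "departure_date_valid": bool(
            pd.to_datetime(df["departure_date"], errors="coerce").notna().all()
        ),
    }


# Results like these would typically feed an alerting or lineage tool.
results = check_orders(pd.DataFrame({
    "order_id": [1, 2, 3],
    "touchpoint": ["search", None, "checkout"],
    "departure_date": ["2024-02-02", "2024-13-01", "2024-03-15"],
}))
```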

Data integration tests

We’ve recently pioneered data integration tests, which differ significantly from standard data quality checks in that they are a proactive measure.

Essentially, these tests are designed to identify potential data quality issues before they occur. In our setup, data from the order processing production system is streamed via the Google Pub/Sub service and is then used for trading performance analytics.

Whenever a git merge request is made that alters the production application code, an integration test is automatically initiated. This ensures the data contract remains intact. These tests go beyond merely verifying the data schema; they also check the content of the data.

To illustrate, when a new order is placed with a specific price and other details, we expect a corresponding event to be published accurately, with the anticipated values. The integration test confirms this using sample data, thereby safeguarding the integrity and consistency of our data before changes impact the production system.

Integration tests validate, even before release, that the data contract will not be broken.
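The sketch below illustrates the idea in a simplified, pytest-style form: the Pub/Sub publisher is replaced by an in-memory fake, and the function, topic, and event fields are hypothetical rather than our actual production code.

```python
import json


class FakePublisher:
    """In-memory stand-in for the Pub/Sub client used in the real test setup."""

    def __init__(self):
        self.messages = []

    def publish(self, topic: str, data: bytes) -> None:
        self.messages.append((topic, json.loads(data)))


def place_order(publisher, order_id: str, price: float, currency: str) -> None:
    """Hypothetical slice of the production code path under test."""
    event = {"order_id": order_id, "expected_revenue": price, "currency": currency}
    publisher.publish("orders", json.dumps(event).encode("utf-8"))


def test_order_event_matches_data_contract():
    publisher = FakePublisher()
    place_order(publisher, order_id="A-123", price=42.0, currency="EUR")

    topic, event = publisher.messages[0]
    # The test checks content, not just schema: the published values must
    # match what the order was actually placed with.
    assert topic == "orders"
    assert event == {"order_id": "A-123", "expected_revenue": 42.0, "currency": "EUR"}
```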

Soft measures

Data collaboration process

In a nutshell, this measure ensures cooperation among data analysts, product managers, and engineers. Practically, this means:

  • Product managers involve the analytics team from the outset of product development and feature specification. They make data-informed decisions.
  • The domain analytics engineers and analysts assess whether additional data points need to be logged and how this impacts reporting, identify the need for new data publication, and signal any potential breaking changes. They stay in sync with domain product managers and domain engineers to familiarize themselves with the backlog and roadmap, and participate in defining and modifying events, datasets, and attributes.
  • Domain engineers understand the impact of technical changes on data structure and logic. They are responsible for keeping the data contract intact by defining data quality checks and integration tests that validate whether changes in business logic are reflected in the analytical plane. They fix the problem at the source when data quality checks reveal an application bug.
Data collaboration process among domain data players

Product thinking

We applied a shift in mindset that bears certain marks of product thinking.

  • Not all data are equally important. That is why we categorised them into several tiers, in the same way that manufacturers categorise products into product tiers. For instance, financial reporting data are in the top tier. On the other hand, data such as developer productivity or candidate interview metrics can be 5% off, and the world will not fall apart.
  • Clearly communicate the differences in features and quality among product tiers to manage data users’ expectations effectively. We sought to bring transparency about the intentional trade-offs to build trust with data consumers.
  • It is important to make an educated decision about the investment vs. gain of every new feature and clearly communicate the outcome. For instance, it might be possible to increase the accuracy of our expected net revenue model by 5%, but it would increase model development and compute costs by 7%. It is a product decision.
  • Another aspect is that what data users perceive as a data quality issue is often viewed from the analytics engineer’s perspective as either a new feature request or a different data product. Whether to implement it is once again a product decision.

Summary

No, we really can’t fix data quality ex post. What we do instead is:

  • Establish a process that enhances our ability to prevent data quality issues and detect them earlier if they do occur.
  • Put accountability in the hands of data owners. If you’re the owner of the data, you’re also responsible for its quality. This involves turning implicit expectations about data quality into explicit standards.
  • Apply product thinking to data. Be aware of investment vs. gain.

Unless otherwise noted, all images are by the author.
