Here is how to start with data quality
Why is data quality relevant
Did you know:
- 25% of critical data in large companies is incorrect or incomplete? (Dataflux/SAS)
- 2-3% of customer data becomes inaccurate within 1 month as it ages? (Butlergroup)
- 20% of turnover is lost due to cost of rework in operational processes and corrections in information reporting? (Gartner)
With years of data analysis experience, and now as an enterprise data architect, I have found that building a case for data quality management is challenging. When I speak with a data owner, he is convinced his data is in perfect shape. And he is right to be proud of his work. He is, however, unable to prove the quality of his data, because the processes and tools that would let him do so are missing.
When I finally get my hands on an extract and show him the data profile, he immediately starts making corrections. And this is how to get started with data quality:
Start with the data
What is data quality
Processes run efficiently only when the data they use meets their quality requirements. In practice, business processes have many exception flows that bypass the usual happy flows where data is created. These exceptions cause data quality gaps further down the process chain.
Monitoring these data quality gaps is essential for stable processes.
So first we need to understand what data quality is. Lean Data defines data quality as:
The degree to which data meets the requirements of the processes it is used in
Then the data management body of knowledge (DMBOK) gives us a hand and defines 6 dimensions that represent data quality. This sounds easy, and you may feel you can start right away. You’ll quickly run into the follow-up question of how to prioritize which datasets and attributes are important. The answer is to focus on those deviations in your data that affect your organization. Lean Data uses 4 categories to determine whether a deviation in one of the 6 dimensions is worth spending time and effort on:
When a deviation in your data affects one of these impact categories, make it a priority to document the business rules involved. From those business rules you can define data quality rules and implement them as mitigations. Implement these mitigations at the source, where the data is created, so that the deviations no longer affect your organization in the future.
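To make the step from business rule to data quality rule concrete, here is a minimal sketch in Python. It is not Lean Data's method or any vendor's API. The `Rule` class, the `evaluate` function, the sample customer records and the two rules are all made up for illustration, and the dimension labels ("completeness", "validity") are common examples rather than a list taken from this article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A hypothetical data quality rule: a named check tied to one dimension."""
    name: str
    dimension: str  # illustrative label, e.g. "completeness" or "validity"
    check: Callable[[dict], bool]  # returns True when a record passes


def evaluate(records, rules):
    """Return, per rule, the share of records that pass (0.0 to 1.0)."""
    scores = {}
    for rule in rules:
        passed = sum(1 for record in records if rule.check(record))
        # An empty dataset has no failing records, so it scores 1.0 here.
        scores[rule.name] = passed / len(records) if records else 1.0
    return scores


# Made-up sample data: record 2 misses an email, record 3 has a bad country code.
records = [
    {"id": 1, "email": "a@example.com", "country": "NL"},
    {"id": 2, "email": "", "country": "NL"},
    {"id": 3, "email": "c@example.com", "country": "Netherlands"},
    {"id": 4, "email": "d@example.com", "country": "DE"},
]

rules = [
    Rule("email_present", "completeness",
         lambda r: bool((r.get("email") or "").strip())),
    Rule("country_code_valid", "validity",
         lambda r: isinstance(r.get("country"), str)
         and len(r["country"]) == 2
         and r["country"].isalpha()
         and r["country"].isupper()),
]

scores = evaluate(records, rules)
for name, score in scores.items():
    print(f"{name}: {score:.0%} of records pass")
```

Keeping each rule as a small named check makes it easy for the process expert and the data analyst to review the rules one by one before anything is built into a reporting environment.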
How to maintain data quality
There are many software vendors that promise you the best data quality analysis & reporting solutions. At a notable price of course.
The real cost is the time and effort it takes to identify and define the rules that need to be applied to the data to meet these requirements. This requires close interaction between the process expert and the data analyst, and it is a lengthy process before data quality measures can be built into the data quality reporting environment.
What you need first is an easy-to-use tool
When you purchase one of the top magic quadrant solutions, you’ll notice how difficult it is to set up and connect a dataset, and how much technical skill is required to process the data. Furthermore, IT restrictions often mean you’re not allowed to just install new software. Here’s a screenshot with a question on usability of one of the major data quality solutions:
In my experience, this kind of complexity and these limitations have resulted in data owners not using the data quality tooling. That is why Lean Data divides data quality management into 2 phases:
- Setup through defining and designing data quality measurements
- Embedding through running and maintaining data quality measurements
Although the major vendors have great solutions for the second phase, the first phase is a precondition: it builds the foundation of data quality management. If you lose support there, you won’t get data quality processes embedded.
So which data should you begin with?
To start, you can collect important data objects such as your customers, suppliers and products from the source and start profiling their attributes. Obtain the ‘raw’ data from the source where it is created, to ensure that no adjustments were made in the ETL that would affect the validity of your analysis. With the dimensions and categories above, you will quickly notice that you’re able to define business rules.
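As a rough illustration of what profiling the attributes can look like without installing any tooling, the sketch below uses only the Python standard library. The `profile` function, the column names and the inline CSV extract are assumptions made up for this example. In practice you would read the raw extract from your own source system.

```python
import csv
import io
from collections import Counter


def profile(rows):
    """Per-column profile: row count, missing values, distinct values, most common value."""
    rows = list(rows)
    columns = list(rows[0].keys()) if rows else []
    report = {}
    for col in columns:
        # Treat None and whitespace-only values as missing.
        values = [(row.get(col) or "").strip() for row in rows]
        filled = [v for v in values if v]
        counts = Counter(filled)
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(filled),
            "distinct": len(counts),
            "top": counts.most_common(1)[0] if counts else None,
        }
    return report


# Made-up raw extract: one missing country and one duplicated customer.
RAW = """customer_id,name,country
1,Acme,NL
2,Globex,
3,Initech,NL
3,Initech,NL
"""

report = profile(csv.DictReader(io.StringIO(RAW)))
for col, stats in report.items():
    print(col, stats)
```

Even a profile this simple surfaces candidates for business rules: a `customer_id` column with fewer distinct values than rows hints at duplicates, and a missing `country` hints at a completeness gap.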
Need a hand? Get in touch!