As we all know, data quality traces its lineage to CRM (Customer Relationship Management) systems doing address correction; it later moved into MDM (Master Data Management). As businesses mature, Data Quality (DQ) is moving from reactive cleansing toward "data-in-motion".
Data Preparation, on the other hand, was traditionally part of the Data Mining process, done in batch mode and usually by a data scientist. It covers the steps taken to make data model-ready.
Now business is moving from an IT-driven world to a business-rule-driven world. What I mean is, data is important only if it maps to business needs. Structural data profiling (nulls, patterns, outliers and the like) has limited value on its own. Business is looking at DQ to validate complex business rules, and that work is handled by the business's data stewards rather than by IT managers.
The Data Science world is also changing. Models have to be business-driven rather than solutions to theoretical mathematical problems. Data scientists are working with business users and data stewards to understand the business and its data. With these developments, the boundary line between Data Quality and Data Preparation is getting blurred.
1.) Data Fishing vs. Entity Resolution: From the web, "Data dredging, sometimes referred to as data fishing, is a data mining practice in which large volumes of data are searched to find any possible relationships between data…"
In DQ, entity resolution is the process by which attributes from disparate sources are collected and an entity definition is created. As data management moves toward an integrated world, the two processes will merge into one.
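The entity-resolution step described above can be sketched in a few lines: attributes for the same customer arrive from two disparate sources and are merged into one "golden" record. The source names, fields, and first-value-wins survivorship rule are assumptions for illustration, not from the post.

```python
# Sketch of entity resolution: collect attributes for one key from
# multiple sources into a single entity record. Data is illustrative.
crm = {"C001": {"name": "Acme Corp", "phone": "555-0100"}}
billing = {"C001": {"name": "ACME Corporation", "address": "1 Main St"}}

def resolve(key, *sources):
    """Collect all attributes for `key`; the first source to supply an
    attribute wins on conflicts (a simple survivorship rule)."""
    entity = {"id": key}
    for source in sources:
        for attr, value in source.get(key, {}).items():
            entity.setdefault(attr, value)
    return entity

golden = resolve("C001", crm, billing)
print(golden)
# {'id': 'C001', 'name': 'Acme Corp', 'phone': '555-0100', 'address': '1 Main St'}
```

Real entity resolution also has to match records without a shared key (fuzzy name and address matching), which is where it starts to overlap with the relationship-hunting of data fishing.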
2.) Area of Mine vs. Data Ownership: In data mining, 'area of mine' defines the data or records a data scientist is interested in modeling. In the DQ world, data ownership defines who owns the data. In a metadata-driven data world, both will be decided by a single metadata management system.
3.) Missing Values: Missing values are problematic for both data preparation and data quality. The only difference is that an inferred value is good enough for data mining, while data quality either looks for 100% accuracy (as in geo-encoding), replaces the value with a default, or discards the record as dirty data. Whatever the mechanism, both try to mitigate missing values. I expect businesses will define their own missing-data strategy and how to interpret such values.
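The three mitigation styles named above can be shown side by side: data preparation infers a value (mean imputation here), while data quality either substitutes a declared default or discards the record as dirty. The field names and the -1 default are illustrative assumptions.

```python
# Three ways to handle a missing value, side by side. Data is illustrative.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # the missing value
    {"id": 3, "age": 28},
]

# Data-prep style: impute the mean of the observed values
observed = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(observed) / len(observed)
imputed = [dict(r, age=r["age"] if r["age"] is not None else mean_age)
           for r in records]

# DQ style 1: replace with a declared default sentinel
defaulted = [dict(r, age=r["age"] if r["age"] is not None else -1)
             for r in records]

# DQ style 2: discard the record as dirty
clean = [r for r in records if r["age"] is not None]

print(imputed[1]["age"])    # 31.0 (mean of 34 and 28)
print(defaulted[1]["age"])  # -1
print(len(clean))           # 2
```

A business-defined missing-data strategy would pick one of these per field and record that choice as metadata, so downstream users know whether an age of 31 is observed or inferred.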
4.) Noise Reduction vs. Data Scrubbing: Reducing noise is important for building a good model, and data mining has several techniques for it. Data Quality, for its part, has several ways to scrub and massage data. Both sets of techniques are run on the data to make it compliant with business rules.
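Here is a small scrubbing pass of the kind described above: trim whitespace, normalise case, and standardise phone numbers against a hypothetical business rule of "ten-digit phone, title-case name". The rule, field names, and sample data are all assumptions for illustration.

```python
# A minimal data-scrubbing pass against a hypothetical business rule:
# names must be title case, phones must be exactly ten digits.
import re

raw = [
    {"name": "  john SMITH ", "phone": "(555) 010-0199"},
    {"name": "jane doe",      "phone": "555.010.0200"},
]

def scrub(record):
    name = record["name"].strip().title()
    digits = re.sub(r"\D", "", record["phone"])  # keep digits only
    # Flag non-compliant records instead of silently passing them through
    valid = len(digits) == 10
    return {"name": name, "phone": digits, "valid": valid}

scrubbed = [scrub(r) for r in raw]
print(scrubbed[0])
# {'name': 'John Smith', 'phone': '5550100199', 'valid': True}
```

Whether you call this noise reduction (data mining) or scrubbing (DQ), the operation is the same, which is the post's point: the two disciplines are converging on shared, business-rule-driven transformations.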
Summary: As gaining insight increasingly becomes a business-function role, with IT acting only as facilitator, I expect the data quality and data preparation processes to merge into one, managed by data stewards.
Your thoughts!
About the Author: Vivek K Singh is a data architect and runs the world's first open-source data quality and data preparation tool: http://sourceforge.net/projects/dataquality/