Way back in 2006, when I started coding for world’s first open source data quality project (osDQ) http://sourceforge.net/projects/dataquality/ , data quality issues were prevalent. Years later, Businesses have matured, computing power has increased many folds, storage has become cheaper and algorithms have improved. Still, data quality issues are as prevalent, if not more. That requires a serious understanding of data quality issues – how it originates, how it is propagated and more importantly how it can be solved :-
1.) Technical Solution: You will be completely off-hook, if you try to solve data quality problems using brute computing force and advanced algorithms alone. Issues like fuzzy matches, record linking, golden data are best solved by using technology but like viruses, data quality issues mutate and keep coming in different forms. You will be only in reactive mode and never be free of viruses. As and when it comes, you will desperately look out for cure.
2.) Process based solution: Setting up data governance framework, enforcing data policies, modeling business entities, having stewards and an office of chief data officer, certainly help you reduce the data quality issues. Having ISO certification for “data in motion” also helps organisation to a large extent. Even then, most optimistic data practitioner will not certify you “free of data quality” issues.
3.) Enterprise solution : You broke the “data silos”, brought the data to lake, did metadata categorization, created semantic layer, defined ontology – indeed commendable job. Can you say, we are all free from data quality virus and it is not going to comeback ?
All the the above approach are right in their own way and they solve a subset of data quality issues. But they are reactive and not standardized. Let’s take a typical high tech good workflow – imaginary !!
Designed in USA, Manufactured in China, Curated and Tested in India, Assembled and Packaged in USA , Sold in UK. You can see the relevant data move across boundary, languages, enterprises and governments. A company which is doing testing in India, has not influence ( probably they even don’t know who is manufacture is) on the data the chip producing and they can’t loop back to manufacture. A change in data format by chip manufactures will break all quality testing. An enterprise can enforce processed within its premises but in global world, they are no takers.
Data Quality problems are so difficult to solve because it is global, temporal, mutable, non-standard and spanning across multi-agencies and countries.
Good news is, sincere steps are taken in right direction which will solve data quality issues in long run.
Open data Initiative : Governmental and Semi Governmental departments are making their data publicly available. It will enhance standard adoptions and technology based solutions.
Cross-Pollination of data : In the above example, let’s assume manufacturing company is sharing their data with testing companies. It will help to build all data foot prints of chips and also will decrease the data glitches between companies.
Data Monetization: Once Organizations start putting up their data for sale or 3rd party consumption, quality of internal and external data will improve. Metadata and datatype will be publicly available and data will go through many eyes.