Why Data Quality is so difficult to solve ?

Way back in 2006, when I started coding for world’s first open source data quality project (osDQ) http://sourceforge.net/projects/dataquality/ , data quality issues were prevalent. Years later, Businesses have matured, computing power has increased many folds, storage has become cheaper and algorithms have improved. Still, data quality issues are as prevalent, if not more. That requires a serious understanding of data quality issues – how it originates, how it is propagated and more importantly how it can be solved :-

1.) Technical Solution: You will be completely off-hook, if you try to solve data quality problems using brute computing force and advanced algorithms alone. Issues like fuzzy matches, record linking, golden data are best solved by using technology but like viruses, data quality issues mutate and keep coming in different forms. You will be only in reactive mode and never be free of viruses. As and when it comes, you will desperately look out for cure.

2.) Process based solution: Setting up data governance framework, enforcing data policies, modeling business entities, having stewards and an office of chief data officer, certainly help you reduce the data quality issues. Having ISO certification for “data in motion” also helps organisation to a large extent. Even then, most optimistic data practitioner will not certify you “free of data quality” issues.

3.) Enterprise solution : You broke the “data silos”, brought the data to lake, did metadata categorization, created semantic layer, defined ontology – indeed commendable job. Can you say, we are all free from data quality virus and it is not going to comeback ?

All the the above approach are right in their own way and they solve a subset of data quality issues. But they are reactive and not standardized. Let’s take a typical high tech good workflow – imaginary !!

Designed in USA, Manufactured in China, Curated and Tested in India, Assembled and Packaged in USA , Sold in UK. You can see the relevant data move across boundary, languages, enterprises and governments. A company which is doing testing in India, has not influence ( probably they even don’t know who is manufacture is) on the data the chip producing and they can’t loop back to manufacture. A change in data format by chip manufactures will break all quality testing. An enterprise can enforce processed within its premises but in global world, they are no takers.

Data Quality problems are so difficult to solve because it is global, temporal, mutable, non-standard and spanning across multi-agencies and countries.

Good news is, sincere steps are taken in right direction which will solve data quality issues in long run.

Open data Initiative : Governmental and Semi Governmental departments are making their data publicly available. It will enhance standard adoptions and technology based solutions.

Cross-Pollination of data : In the above example, let’s assume manufacturing company is sharing their data with testing companies. It will help to build all data foot prints of chips and also will decrease the data glitches between companies.

Data Monetization: Once Organizations start putting up their data for sale or 3rd party consumption, quality of internal and external data will improve. Metadata and datatype will be publicly available and data will go through many eyes.


Next Generation BI expectation

Let me start this topic by drawing a parallel from search domain – WWW has lots of information and search is a way to get the information  you are looking for. Similarly, a company has multitude of informations, stored in structured and unstructured form, and business intelligence tools are extracting the data for you.If you have followed the search evolution – First Yahoo search was very structured; it used to give information inside categories ( Metadata driven ), then search engines like askJeeves.com allowed you write natural sentences for search and then google optimized it when indexing and improving relevancy.

Business Intelligence companies are following the same pattern. Traditional BI tools are very structured – warehouse, cube, pivot. You can only look data that is inside the mart, and can navigate in very structured way – like roll up, drill down, record linking, dimension navigation. Next generation of BI tools are using big data technology to bring into large volume of data and also providing semantic layer to give a “google search” like interface. some companies call it “smart machine”. Next   generation BI tools will have :-

1.) Elastic Search and Spark / Big data technology: Scalability, Machine Learning, Fuzziness, Connectors, Statistical prediction, Classification will be for granted. Open sources embedded inside tool will make these features, commodity. They will be no more differentiator.

2.) Collaborative, Informative and engaging report : Today’s dull reports will become more collaborative.Think about looking a sales report, where report also embed a video where CEO making sales prediction, you also get your competitor public information, relevant 3rd party information. A report will transform into information portal which will be more engaging and social.

3.) Metadata Consolidation : Focus will shift to metadata from data because data processing will be taken care by platform. Data and metadata from different systems will come to data lake, which using namespace will decide and differentiate data. Business expertise will go into, making entity resolution automatic and data modeling dynamic.

4.) Interpreting business rules : In today’s system, we codify business rules but is not reusable for business intelligence systems. Today it a very cumbersome and time intensive to re-interpret business rules. Next generation BI tools, will extract business rules from CRM, transaction system and validate business rules against data. Business rules models will be more comprehensive and will not live in silos.

5.) Right Information : Certainly machine learning and artificial intelligence is overrated. They will not solve your business problem but certainly they will find out anomalies, outlier, abnormality, cluster, good data, bad data etc, to make you decide better. They will not replace you but will help you.

6.) Reusing existing Data warehouse : Lot of money has already flown into existing warehouse. New generation tools will provide wrapper around EDW to make it search friendly and integrate with datalake – using  indexing, elastic search, multi-facet search etc.

7.) User experience : In today’s world dashboard are personalized, but there is not much of freedom inside dashboard. New BI tools will be responsive in true sense, where entity hopping, 360 degree views, changing dimension centricity on the fly will be provided. Dashboards will also be mapped to User stories to

8.) Trust of data : In spite  of nice visualization, confidence in data is very low. BI tools are getting used to see the trend and bigger picture, but the value of data is taken only as indicative not for operation purpose. Data governance an Data Quality would a big push for next generation BI tools.

Disclaimer : Smart Machine is a term used by dataRPM.com ( a next generation BI tool) to describe their systems which uses advance algorithms to do above mentioned features.

About Author : Vivek Kumar Singh is Business Intelligence professional and manages open source data quality project at http://sourceforge.net/projects/dataquality/

Student Entiry

Why BI projects fail – and the role of Data Architect

During my long stint as business intelligence professional, I have seen many projects fail and of course, smelled some success. Let’s see what is common among successful projects and the unsuccessful ones.

Organisation Structure: Unlike other IT  engineering projects, Business Intelligence projects need strong business interface – a make or break for BI project. Business is divided into groups or sliced by business functions (BU) – but data is not. A typical BI project will run into multiple Business Functions – which means working with two or more VPs and their organisations. If BI project is sponsored by IT it will hit bottleneck with Business. A typical reply will be – a.) we already have this information in XLS b.) Our processes are different , c.) we can not wait so long for data d.) this is done at vendor site

And business is right. Every business functions or unit is different – Granularity is different, business focus is different, workflow is different. IT sponsored BI projects treats all business units same and hence BI projects lose relevance for business. I have seen high rate of success with  MIS related BI projects – because IT is both consumer and sponsor of project. Otherwise IT does not business rules and processes to make a report relevant – that has to come from Business. So it is important business sponsors it and business unit heads or VPs sponsor it.

Enterprise Data warehouse (EDW) : A typical approach to a BI project is, create an enterprise data warehouse and all BU ( business Unit) will take data from there and have their  data mart. Business is complex and putting all rules in Enterprise warehouse has a very high degree of failure. Enterprise data warehouse become too cumbersome to use and invariably business rule will change overtime. Cost of changing ETL jobs are very high and it takes long time to propagate the change from source, to staging area, to target , to reports, to analytic. Business get frustrated and figures out a way to get data in xls sheet and then does it own analysis –  EDW fails.

A leaner and less complex data model is better. Report Developer, ETL developer or Data Integration are not business analyst. They are bound to do mistakes in putting all the rules in all encompassing warehouse. A good amount of business rules can be pushed to report engine and analytic engine. A focused warehouse is more successful and a multi-purpose or generic warehouse.

Data Quality:  In real business, dirty data or incomplete data do come in. If they are feed into warehouse as is, report is faulty and unfortunately it happens most of time. Technical people can understand data type, data structure and raw data quality – like null values, duplicate values, negative values, outliner etc. But they do not understand business implication of that and do not what should be right value. A typically ETL tool will discard those records and report will not be updated. Let’s take an imaginary business rule – where if an existing ( from other business unit) customer walks-in you do not ask him fill-in personal information form and only take customer id. ( The idea is you will collect all the data from other business unit and save time  of customer). When desktop operation feed the data in – he or she feeds only with customer id. If you business processes are not real time ( in most cases they are not ) only customer id goes into CRM system and most probably nightly ETL load will ignore data as dirty data and your report will show one less count.

Data Analyst and Business Analyst have to sit together and profile their data and look into conditions using tools like osDQ – http://sourceforge.net/projects/dataquality/  to get a good understanding of data , before moving onto project.

Role of Data Architect:  Probably he is the first person to know if the project is on track but he has limited visibility. Most of time, data architects belong to IT group and has very limited saying in Business Unit. Unfortunately, today we do not have a process or framework which tell how data architect should talk / show artifacts to business. TOGAF has tried to give some framework, but it very limited.

A good start can be starting from IT Domain architecture where business unit and high level functionality is mapped. Let take an example of company which creates, tests, scores and report K-12 tests.

IT Domain Architecture    Once the business domain is identifies, data architect should create DFD ( data flow document – like image below ) which says which data moves across business domains and which is the data lying within a domain. The data is flowing across business domains ( or units) are the once which are more prone to error – as it changes values across domain.TCMDataFlow

Once the DFD is created , Entity reference model can be created as below where steward can be identified.

Student Entiry

Student Entiry

Once we have formal steward in place from business side, the success rate increases . In my next post I will write in detail about roles and responsibilities of data architect and the process to find out steward.

Good Luck !!