Why AI/ML models fail in Production

In my many consulting assignments, I have rarely seen an AI/ML model in production. With the renewed interest in AI/ML (artificial intelligence and machine learning), and rightly so, enterprises are embracing smart applications powered by models, but their frustration is visible when models cannot make it to production to work in tandem with applications.

Applications and models have to work together, in real time, to feed each other. That means a model has to learn incrementally from smaller, near-real-time data sets and provide output to applications that are already in production.
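To make that concrete, here is a minimal sketch of incremental learning with scikit-learn's partial_fit, where the same model object keeps serving predictions while it trains on fresh mini-batches. The batch source, feature shape, and label rule are illustrative assumptions, not a real pipeline:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# A model that supports incremental learning via partial_fit, so it can
# keep training on small, near-real-time batches while already serving
# the production application.
model = SGDClassifier()
classes = np.array([0, 1])  # must be declared on the first partial_fit call

rng = np.random.default_rng(42)
for batch in range(10):                      # stand-in for a stream of mini-batches
    X = rng.normal(size=(100, 5))            # 100 fresh rows, 5 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # illustrative label rule
    model.partial_fit(X, y, classes=classes)

    # The same, continuously updated model answers the application in real time.
    print(f"batch {batch}: prediction for a new row ->", model.predict(X[:1])[0])
```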

Here is a summary of the issues I see:

Dirty and Stale Data: Typically, the data given to data scientists is months old and comes from older applications where attributes are improperly mapped or missing. On top of that, data is often handed over from raw systems and is incomplete and inconsistent.
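A quick staleness-and-completeness check is cheap to run before modeling starts. Below is a small sketch in pandas; the file name and the updated_at column are hypothetical stand-ins for whatever extract the data scientists actually receive:

```python
import pandas as pd

# Hypothetical extract handed to the data scientists; the file name and
# column names are illustrative.
df = pd.read_csv("extract.csv", parse_dates=["updated_at"])

# How stale is the data?
age_days = (pd.Timestamp.now() - df["updated_at"]).dt.days
print(f"median record age: {age_days.median():.0f} days")

# How incomplete is it? Flag attributes that are mostly unmapped or missing.
null_pct = df.isna().mean().sort_values(ascending=False)
print(null_pct[null_pct > 0.2])  # columns more than 20% empty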

Models are generated in silos: Data engineers, DevOps, and service engineering are not involved in model planning. They do not understand the model life cycle and production needs. Data scientists hide their models under the hood, and when a model is exposed to the production system it falls flat.

Data at scale not considered: Many models are recursive and iterative. When you train on a smaller dataset (< 1M rows), the model seems like a good candidate for production. When it is exposed to production-scale data, time and resource consumption increase exponentially. I think this is the largest cause of a no-go to production.
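One cheap guard is to time training at increasing data sizes and watch how the cost grows. A rough sketch, where the model and the synthetic data are placeholders for whatever the team is actually building:

```python
import time
import numpy as np
from sklearn.svm import SVC  # kernel SVMs famously scale super-linearly with rows

rng = np.random.default_rng(0)
for n in (1_000, 4_000, 16_000):
    X = rng.normal(size=(n, 20))
    y = (X[:, 0] > 0).astype(int)
    start = time.perf_counter()
    SVC().fit(X, y)
    print(f"{n:>6} rows -> {time.perf_counter() - start:.1f}s")

# If each 4x in rows costs far more than 4x in time, the model will not
# survive production-scale data without rework.
```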

Data Scientists not understanding the production environment: Creating a model in a silo with a couple of thousand data points is one thing; to have it in production, you need to understand integration points with other systems (applications), CI/CD, backup and HA, hardware topology, and most importantly the data pipeline and workflow. A model should retrain and refresh at the right time, on the right dataset, and deliver its output to the right system.
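A minimal sketch of what "retrain and refresh" can look like as a pipeline step. The model directory, the symlink-swap publishing convention, and the function names here are assumptions for illustration, not a prescribed setup:

```python
from datetime import datetime, timezone
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression

MODEL_DIR = Path("models")  # assumed location the serving system reads from

def retrain_and_refresh(X, y):
    """Retrain on the right dataset, then atomically publish for the app."""
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Version every artifact so a bad refresh can be rolled back.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    MODEL_DIR.mkdir(exist_ok=True)
    versioned = MODEL_DIR / f"model-{stamp}.joblib"
    joblib.dump(model, versioned)

    # The serving system always loads "current"; swapping a symlink refreshes
    # the model without redeploying the application.
    current = MODEL_DIR / "current.joblib"
    if current.is_symlink() or current.exists():
        current.unlink()
    current.symlink_to(versioned.name)
    return versioned
```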

Selection of code and technology: Data scientists are comfortable with R/MATLAB/Python, while most production systems use C/C++, Java, Go, or Microsoft technologies. A data scientist's language of choice has limitations in production.
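One common bridge is to put the Python model behind a small HTTP service so that C/C++, Java, or Go systems can call it over JSON instead of embedding Python. A sketch using Flask; the endpoint name, payload shape, and model path are assumptions:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/current.joblib")  # trained in Python/R-land

@app.route("/predict", methods=["POST"])
def predict():
    # Production systems written in Java/Go/C++ just POST JSON features,
    # e.g. {"features": [0.3, 1.2, -0.5, 0.0, 2.1]}.
    features = request.get_json()["features"]
    return jsonify({"prediction": int(model.predict([features])[0])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```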

Conclusion: The line between applications and models will blur in the coming days. Models will be developed like applications, and applications will be developed like models. Models will adopt CI/CD, at-scale operation, HA, integration, real-time interaction, etc. from applications, while applications will be retrained, refreshed, self-learning, and reinforcement-learning like models.

Feel free to share your thoughts.

About the author: Vivek is the creator of osDQ – the world's first open source data quality and preparation tool. https://sourceforge.net/projects/dataquality/

He has also open sourced an Apache Spark based data pipeline framework: https://sourceforge.net/projects/apache-spark-osdq/


Why BI projects fail – and the role of Data Architect

During my long stint as a business intelligence professional, I have seen many projects fail and, of course, tasted some success. Let's see what is common among the successful projects and the unsuccessful ones.

Organisation Structure: Unlike other IT engineering projects, business intelligence projects need a strong business interface – a make-or-break factor for a BI project. Business is divided into groups, or sliced by business functions (BUs) – but data is not. A typical BI project will cut across multiple business functions, which means working with two or more VPs and their organisations. If the BI project is sponsored by IT, it will hit a bottleneck with the business. Typical replies are: a) we already have this information in XLS, b) our processes are different, c) we cannot wait so long for data, d) this is done at a vendor site.

And the business is right. Every business function or unit is different – the granularity is different, the business focus is different, the workflow is different. IT-sponsored BI projects treat all business units the same, and hence the projects lose relevance for the business. I have seen a high rate of success with MIS-related BI projects, because IT is both the consumer and the sponsor of the project. Otherwise, IT does not know the business rules and processes that make a report relevant – those have to come from the business. So it is important that the business sponsors the project, at the level of business unit heads or VPs.

Enterprise Data Warehouse (EDW): A typical approach to a BI project is to create an enterprise data warehouse from which every BU (business unit) takes data for its own data mart. Business is complex, and putting all the rules in an enterprise warehouse carries a very high risk of failure. The enterprise data warehouse becomes too cumbersome to use, and business rules invariably change over time. The cost of changing ETL jobs is very high, and it takes a long time to propagate a change from source to staging area to target to reports to analytics. The business gets frustrated, figures out a way to get the data into an XLS sheet, and does its own analysis – the EDW fails.

A leaner, less complex data model is better. Report developers, ETL developers, and data integration engineers are not business analysts; they are bound to make mistakes when putting all the rules into an all-encompassing warehouse. A good number of business rules can be pushed to the report engine and the analytics engine. A focused warehouse is more successful than a multi-purpose or generic one.

Data Quality: In real business, dirty or incomplete data does come in. If it is fed into the warehouse as is, the reports are faulty – and unfortunately that happens most of the time. Technical people can understand data types, data structures, and raw data quality – null values, duplicate values, negative values, outliers, etc. But they do not understand the business implications, and they do not know what the right value should be. A typical ETL tool will discard those records, and the report will not be updated. Let's take an imaginary business rule: when an existing customer (from another business unit) walks in, you do not ask him to fill in the personal information form; you only take his customer ID. (The idea is that you will collect all the data from the other business unit and save the customer's time.) When the desktop operator feeds the data in, he or she enters only the customer ID. If your business processes are not real time (in most cases they are not), only the customer ID goes into the CRM system, the nightly ETL load will most probably ignore the record as dirty data, and your report will show one less count.
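Here is a small pandas sketch of that failure mode and a business-aware alternative: keep the incomplete record, flag it, and queue it for enrichment from the other business unit instead of silently dropping it. The DataFrame, column names, and flag are hypothetical:

```python
import pandas as pd

# Hypothetical walk-in records: existing customers supply only a customer ID.
walkins = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Asha Rao", None, None],       # missing for existing customers
    "email": ["asha@example.com", None, None],
})

# A naive ETL rule drops any row with missing attributes -- the "one less count" problem.
naive_load = walkins.dropna()

# A business-aware rule keeps the row, flags it, and queues it for enrichment
# from the other business unit's system instead of discarding it.
walkins["needs_enrichment"] = walkins["name"].isna() | walkins["email"].isna()
to_enrich = walkins[walkins["needs_enrichment"]]

print(f"naive load kept {len(naive_load)} of {len(walkins)} records")
print(f"{len(to_enrich)} records queued for enrichment instead of being dropped")
```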

Data analysts and business analysts have to sit together, profile their data, and look into these conditions using tools like osDQ – http://sourceforge.net/projects/dataquality/ – to get a good understanding of the data before moving on with the project.
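For readers without a profiling tool at hand, here is a rough pandas equivalent of the kind of per-column profile such tools produce (this is not osDQ's API, just an illustrative sketch):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: nulls, distinct values, and basic numeric ranges."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "null_pct": round(100 * s.isna().mean(), 1),
            "distinct": s.nunique(),
            "min": s.min() if pd.api.types.is_numeric_dtype(s) else None,
            "max": s.max() if pd.api.types.is_numeric_dtype(s) else None,
        })
    return pd.DataFrame(rows)

# Example: profile a CSV extract before the project starts.
# df = pd.read_csv("crm_extract.csv")
# print(profile(df))
```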

Role of Data Architect: The data architect is probably the first person to know whether the project is on track, but he has limited visibility. Most of the time, data architects belong to the IT group and have very limited say in the business units. Unfortunately, today we do not have a process or framework that tells a data architect how to talk to the business and present artifacts to it. TOGAF has tried to provide some framework, but it is very limited.

A good starting point is an IT domain architecture in which business units and high-level functionality are mapped. Let's take the example of a company that creates, tests, scores, and reports K-12 tests.

[Image: IT Domain Architecture]

Once the business domains are identified, the data architect should create a DFD (data flow diagram – like the image below) that shows which data moves across business domains and which data stays within a domain. The data flowing across business domains (or units) is the data most prone to error, as its values change across domains.

[Image: TCM Data Flow]

Once the DFD is created, an entity reference model can be created, as below, in which a steward can be identified.

[Image: Student Entity]

Once we have a formal steward in place from the business side, the success rate increases. In my next post I will write in detail about the roles and responsibilities of the data architect and the process for identifying stewards.

Good Luck !!