In my many consulting assignments, I have rarely seen an AI/ML model in production. With renewed interest in AI/ML (artificial intelligence and machine learning), and rightly so, enterprises are embracing smart applications powered by models, but their frustration is visible when the models cannot make it to production to work in tandem with the applications.
Applications and models have to work together, in real time, and feed each other. This means a model has to learn incrementally from smaller, near-real-time data sets and provide its output to applications that are already in production.
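As a rough illustration of what learning incrementally could look like, here is a minimal sketch in Python. The class `OnlineLinearModel`, its `partial_fit` method, and the learning rate are assumptions made for this example and are not taken from any particular project. It shows a small linear model that adjusts its weights from each small batch as it arrives, so it can keep answering the application without a full retrain.

```python
from typing import Iterable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (features, target)


class OnlineLinearModel:
    """Hypothetical linear model updated one small batch at a time (plain SGD)."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed value, tune for real data

    def predict(self, x: Sequence[float]) -> float:
        """Score one feature vector; this is what the application would call."""
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def partial_fit(self, batch: Iterable[Sample]) -> None:
        """Update the weights from a small batch without retraining from scratch."""
        for x, y in batch:
            error = self.predict(x) - y
            self.bias -= self.learning_rate * error
            for i, xi in enumerate(x):
                self.weights[i] -= self.learning_rate * error * xi


if __name__ == "__main__":
    model = OnlineLinearModel(n_features=1)
    # Synthetic stream following y = 2x + 1, delivered in batches of five rows.
    batch = [([i / 4], 2 * (i / 4) + 1) for i in range(5)]
    for _ in range(2000):
        model.partial_fit(batch)
    print(round(model.predict([0.5]), 3))
```

In a real deployment the batches would come from the production data pipeline, and the update step would need monitoring so that a bad batch does not silently degrade the model.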
Here is a summary of the issues I see:
Dirty and Stale Data: Typically, the data given to data scientists is months old and comes from older applications in which attributes are missing or not properly mapped. On top of that, the data is pulled from raw systems and is incomplete and inconsistent.
Models are generated in silos: Data engineers, DevOps, and service engineering are not involved in model planning. They do not understand the model life cycle or its production needs. Data scientists keep their models hidden under the hood, and when a model is finally exposed to the production system it falls flat.
Data at scale not considered: Many models are recursive and iterative. When you run or train one on a smaller data set (< 1M rows), it looks like a good candidate for production. When it is exposed to production-level data, its time and resource consumption can increase steeply, in some cases exponentially. I think this is the largest cause of a no-go for production.
Data Scientists not understanding production environment: Creating a model in a silo with a couple of thousand data points is one thing; running it in production is another. For production you need to understand the integration points with other systems (applications), CI/CD, backup and HA, hardware topology and, most importantly, the data pipeline and workflow. A model should retrain and refresh at the right time, on the right data set, and give its output to the right system.
Selection of code and technology: Data scientists are comfortable with R, MATLAB, or Python, while most systems in production use C/C++, Java, Go, or Microsoft technologies. The data scientist's language of choice has limitations in production.
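To make the first issue above concrete, here is a minimal sketch of a check that a pipeline could run before handing data to training. Everything named here is an assumption for illustration: the field names in `REQUIRED_FIELDS`, the seven-day freshness limit, and the 5% tolerance for missing values are hypothetical and would have to come from the real application.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

# Hypothetical schema and thresholds, chosen only for this example.
REQUIRED_FIELDS = ("customer_id", "amount", "updated_at")
MAX_AGE = timedelta(days=7)
MAX_MISSING_RATIO = 0.05


def check_training_data(
    rows: List[Dict[str, Any]], now: Optional[datetime] = None
) -> List[str]:
    """Return a list of problems; an empty list means the data passed."""
    if not rows:
        return ["dataset is empty"]
    now = now or datetime.now(timezone.utc)
    problems: List[str] = []

    # Dirty data: too many rows with a required attribute missing.
    for name in REQUIRED_FIELDS:
        missing = sum(1 for row in rows if row.get(name) is None)
        ratio = missing / len(rows)
        if ratio > MAX_MISSING_RATIO:
            problems.append(f"{name}: {ratio:.0%} of rows have no value")

    # Stale data: even the newest record is older than the allowed age.
    timestamps = [r["updated_at"] for r in rows if r.get("updated_at") is not None]
    if timestamps:
        age = now - max(timestamps)
        if age > MAX_AGE:
            problems.append(
                f"newest record is {age.days} days old (limit {MAX_AGE.days})"
            )
    return problems


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"customer_id": 1, "amount": 10.0, "updated_at": now - timedelta(days=90)},
        {"customer_id": 2, "amount": None, "updated_at": now - timedelta(days=95)},
    ]
    for problem in check_training_data(sample, now=now):
        print(problem)
```

A gate like this does not fix the data, but it turns stale or incomplete inputs into a visible failure early, before a model is trained on them.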
Conclusion: The line between applications and models will blur in the coming days. Models will be developed as applications, and applications will be developed as models. Models will borrow CI/CD, operation at scale, HA, integration, and real-time interaction from applications, while applications will be retrained and refreshed, and will become self-learning, for example through reinforcement learning, like models.
Feel free to share your thoughts.
About the author: Vivek is the creator of osDQ, the world’s first open source data quality and prep tool. https://sourceforge.net/projects/dataquality/
He has also open sourced an Apache Spark based data pipeline framework. https://sourceforge.net/projects/apache-spark-osdq/