Well, it sounds like an oxymoron: why would a data scientist need data science to help them? They would rather use data science to find insights and predict the future. Data science carries the connotation that it is to be used for discovery, insight, prediction and artificial intelligence.
But data science can also be used for data preparation, which helps the data scientist develop the right model for discovery, insight, prediction and artificial intelligence.
1.) Regression : When a DBA does not know what data scientists are looking for, they dump everything on the data scientist, sometimes even hundreds of attributes. Some attributes influence the model; some are only informational. IoT also dumps lots of machine data that is irrelevant to the model.
Data science can help the data scientist find the relevant attributes. They can run regression algorithms to find out which attributes have an effect on the outcome.
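As a rough illustration of that idea, the sketch below fits a one-variable least-squares regression of the outcome on each attribute and ranks the attributes by R². The attribute names and the synthetic data are made up for the example, and a per-attribute fit is only a first pass: it will miss attributes that matter only in combination with others.

```python
import random

def r_squared(x, y):
    """R^2 of a one-variable least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return (sxy * sxy) / (sxx * syy)

rng = random.Random(0)
n_rows = 500

# Hypothetical attributes: two drive the outcome, two are informational only.
data = {
    "temperature": [rng.gauss(0, 1) for _ in range(n_rows)],
    "pressure": [rng.gauss(0, 1) for _ in range(n_rows)],
    "device_id": [rng.gauss(0, 1) for _ in range(n_rows)],
    "firmware_build": [rng.gauss(0, 1) for _ in range(n_rows)],
}

# Synthetic outcome that depends only on temperature and pressure, plus noise.
outcome = [
    3.0 * t - 2.0 * p + rng.gauss(0, 0.5)
    for t, p in zip(data["temperature"], data["pressure"])
]

# Score every attribute and list the most explanatory ones first.
scores = {name: r_squared(column, outcome) for name, column in data.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name:15s} R^2 = {scores[name]:.3f}")
```

Attributes whose R² stays near zero are candidates to drop before modeling, subject to a check with someone who knows the data.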
2.) Missing Values : Missing values are ubiquitous. Depending on the model, they can be discarded, replaced with default values, or generated with more advanced statistics based on other attributes like time, location and customer behavior.
In some cases, missing values can be detrimental to the model. Data science helps the data scientist auto-generate realistic numbers that conform to the model.
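One simple way to generate a missing number from another attribute is to fill it with the mean of the rows that share the same group, for example the same location. The sketch below assumes rows are dictionaries and that a missing value is stored as None; the field names "store" and "sales" are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def impute_by_group(rows, value_key, group_key):
    """Fill missing values with the mean of the same group,
    falling back to the overall mean for groups with no observations."""
    observed = [r[value_key] for r in rows if r[value_key] is not None]
    overall = mean(observed)

    by_group = defaultdict(list)
    for r in rows:
        if r[value_key] is not None:
            by_group[r[group_key]].append(r[value_key])
    group_means = {g: mean(values) for g, values in by_group.items()}

    filled = []
    for r in rows:
        row = dict(r)  # copy so the input rows are left untouched
        if row[value_key] is None:
            row[value_key] = group_means.get(row[group_key], overall)
        filled.append(row)
    return filled

rows = [
    {"store": "A", "sales": 10.0},
    {"store": "A", "sales": 14.0},
    {"store": "A", "sales": None},   # filled with the mean of store A: 12.0
    {"store": "B", "sales": 30.0},
    {"store": "B", "sales": None},   # filled with the mean of store B: 30.0
    {"store": "C", "sales": None},   # no data for store C: overall mean 18.0
]

filled = impute_by_group(rows, "sales", "store")
for row in filled:
    print(row)
```

A group mean is the plainest choice; a regression on time, location or customer behavior could replace it where the data supports a richer estimate.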
3.) Clustering : Some popular clustering algorithms (like K-Means and nearest-neighbor methods) require the data scientist to provide initial parameters, such as the number of clusters, which the model then refines. If the initial number is way off, the model will likely fail.
Data scientists can use binning, basket analysis, outlier analysis and no-frills clustering algorithms to figure out tentative initial parameters for building a good model.
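As one possible use of binning, the sketch below drops one-dimensional values into fixed-width bins, treats each run of adjacent non-empty bins as a tentative cluster, and reports the mean of each run. The count of runs could serve as a starting number of clusters and the means as starting centroids. The bin width and the synthetic data are assumptions for the example, and the approach only works when the groups are separated by gaps wider than one bin.

```python
import math
import random
from collections import defaultdict

def seed_centers_from_bins(values, bin_width):
    """Group values into fixed-width bins and return the mean of each
    run of adjacent non-empty bins as a tentative cluster center."""
    bins = defaultdict(list)
    for v in values:
        bins[math.floor(v / bin_width)].append(v)

    centers = []
    run = []
    previous = None
    for index in sorted(bins):
        # An empty bin between two occupied bins closes the current run.
        if previous is not None and index != previous + 1:
            centers.append(sum(run) / len(run))
            run = []
        run.extend(bins[index])
        previous = index
    if run:
        centers.append(sum(run) / len(run))
    return centers

rng = random.Random(1)
# Synthetic data: three well-separated groups around 0, 10 and 20.
values = [rng.gauss(center, 0.5) for center in (0.0, 10.0, 20.0) for _ in range(100)]
rng.shuffle(values)

centers = seed_centers_from_bins(values, bin_width=2.0)
print("tentative number of clusters:", len(centers))
print("tentative centers:", [round(c, 2) for c in centers])
```

The result is only a starting point: the clustering algorithm still does the refinement, but it begins from values suggested by the data rather than from a guess.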
4.) Anomaly Detection : Some systems (like alerting and event correlation) need anomalies to perform their task. But finding anomalies is like finding a needle in a haystack.
Data scientists can use anomaly detection algorithms like one-class support vector machines, association rules and replicator neural networks to filter out the in-bound (one-class) data and keep the anomalies.
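The algorithms named above need a machine learning library, so the sketch below uses a much simpler stand-in to show the same filtering step: a modified z-score based on the median and the median absolute deviation (MAD). It is not a one-class SVM. The 0.6745 scaling constant and the 3.5 cutoff are the commonly quoted defaults for this score, and the sensor readings are made up.

```python
from statistics import median

def split_anomalies(values, threshold=3.5):
    """Split values into (inliers, anomalies) using the modified z-score,
    which measures distance from the median in units of the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined,
        # so nothing is flagged in this sketch.
        return list(values), []

    inliers, anomalies = [], []
    for v in values:
        score = 0.6745 * abs(v - med) / mad
        if score > threshold:
            anomalies.append(v)
        else:
            inliers.append(v)
    return inliers, anomalies

# Hypothetical sensor readings with two obvious outliers.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1, 19.7, 5.0]

inliers, anomalies = split_anomalies(readings)
print("in-bound data:", inliers)
print("anomalies:", anomalies)  # [35.0, 5.0]
```

The median and MAD are used instead of the mean and standard deviation because the outliers themselves would otherwise inflate the scale and hide each other.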
The examples above are only illustrations. There are many more ways data science can help in data preparation and business rule validation. Data science will be extended to data preparation as well.
Author : Vivek Singh is a data architect and is developing the world’s first open source data quality and data preparation tool http://sourceforge.net/projects/dataquality/