Artificial Intelligence: Coming to the Rescue of ITOps

Here is an article I have written on how AI can help DevOps:

According to the McKinsey Global Institute's 2018 report, artificial intelligence (AI) has the potential to create an annual value of $3.5 trillion to $5.8 trillion across different industry sectors. Today, AI in finance and IT alone accounts for about $100 billion; hence, it is becoming quite the game changer in the IT world.

With the onset of cloud adoption, the world of IT DevOps has changed dramatically. The focus of ITOps is shifting to an integrated, service-centric approach that maximizes the availability of business services. AI can help ITOps with early detection of outages, prediction of likely root causes, identification of systems and nodes susceptible to outages, reduction of average resolution time, and more. This article highlights a few use cases where AI can be integrated with ITOps, simplifying day-to-day operations and making remediation more robust.


You can read the complete article at


Sampling Using Apache Spark 2.1.1

There has been a debate in the big data community: "Do we need sampling in the big data world?" The argument against it is that big data platforms now have the storage and processing power to analyze the full dataset, so sampling is no longer required.

This argument holds true for data discovery, where analyzing the full dataset gives more confidence and captures every anomaly and pattern. However, sampling still saves a considerable amount of time in dimension reduction, correlation, and model generation. You have to go through hundreds of attributes (with all permutations and combinations) to find dependent, independent, and correlated variables; a representative dataset saves you hours with almost the same accuracy.

We have open-sourced data sampling code built on Apache Spark 2.1.1 at

Random Sampling: Apache Spark provides random sampling on Dataset&lt;Row&gt; out of the box; the user supplies the fraction of rows to sample.

Dataset<Row> org.apache.spark.sql.Dataset.sample(boolean withReplacement, double fraction)
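Since the Spark call above needs a running session to try out, the same Bernoulli-style row sampling can be sketched in plain Java. This is an illustrative stand-alone class (the name `RandomSampler` is mine, not the project's): each row is kept independently with probability `fraction`, which is how `sample(false, fraction)` behaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomSampler {
    // Keep each row independently with probability `fraction`,
    // mirroring Dataset.sample(false, fraction) without replacement.
    public static List<String> sample(List<String> rows, double fraction, long seed) {
        Random rng = new Random(seed);
        List<String> sampled = new ArrayList<>();
        for (String row : rows) {
            if (rng.nextDouble() < fraction) {
                sampled.add(row);
            }
        }
        return sampled;
    }
}
```

Note that, like Spark's one-pass sampler, the result size is only approximately `fraction * rows.size()`.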

Stratified Sampling: Dataset&lt;Row&gt; does not provide stratified sampling, so the dataset is converted into a JavaPairRDD keyed by the column to be stratified, and then sampleByKeyExact is used. It makes multiple passes over the data to achieve the exact per-key fraction.

Map<String, Double> fractionsMap = new HashMap<>();
for (Row uniqueKey : uniqueKeys) {
    fractionsMap.merge(uniqueKey.mkString(), fraction, (v1, v2) -> v1);
}

JavaPairRDD<String, Row> dataPairRDD = SparkHelper.dfToPairRDD(keyColumn, df);
JavaRDD<Row> sampledRDD = dataPairRDD.sampleByKeyExact(false, fractionsMap).values();
Dataset<Row> sampledDF = df.sqlContext().createDataFrame(sampledRDD, df.schema());

return sampledDF;
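To make the exact-fraction guarantee concrete without Spark, here is a minimal plain-Java sketch (the class `StratifiedSampler` is illustrative only, not the project's code): for each key it keeps exactly round(fraction * groupSize) rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class StratifiedSampler {
    // For each key, shuffle its rows with a fixed seed and keep exactly
    // round(fraction * groupSize) of them; keys with no configured
    // fraction contribute nothing. This is the per-stratum exactness
    // that sampleByKeyExact pays extra passes for.
    public static Map<String, List<String>> sampleExact(Map<String, List<String>> groups,
                                                        Map<String, Double> fractions,
                                                        long seed) {
        Random rng = new Random(seed);
        Map<String, List<String>> result = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : groups.entrySet()) {
            List<String> rows = new ArrayList<>(entry.getValue());
            Collections.shuffle(rows, rng);
            double fraction = fractions.getOrDefault(entry.getKey(), 0.0);
            int take = (int) Math.round(fraction * rows.size());
            result.put(entry.getKey(), new ArrayList<>(rows.subList(0, take)));
        }
        return result;
    }
}
```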

Keylist Sampling: This is like stratified sampling, but it makes only one pass over the data, so the per-key counts only approximately match the requested fractions.

JavaRDD<Row> sampledRDD = dataPairRDD.sampleByKey(false, fractionsMap).values();
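The one-pass trade-off can be sketched the same way. In this hypothetical plain-Java class, each keyed row is kept with its key's probability in a single pass, so per-key counts are only approximately fraction * groupSize, which is the difference between sampleByKey and sampleByKeyExact.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class KeylistSampler {
    // One-pass per-key Bernoulli sampling: each row (represented as
    // {key, value}) is kept with its key's configured probability.
    public static List<String[]> sample(List<String[]> keyedRows,
                                        Map<String, Double> fractions,
                                        long seed) {
        Random rng = new Random(seed);
        List<String[]> out = new ArrayList<>();
        for (String[] row : keyedRows) {
            double fraction = fractions.getOrDefault(row[0], 0.0);
            if (rng.nextDouble() < fraction) {
                out.add(row);
            }
        }
        return out;
    }
}
```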

Command Line Options:

 -c,--keyColumn <arg>          Key Column for stratified/keylist sampling

 -f,--fraction <arg>           Sample fraction size

 -fm,--fractionMapping <arg>   comma separated pairs of key,fraction size

 -h,--help                     show this help.

 -i,--input <arg>              Input Folder/File path

 -if,--inputFormat <arg>       input file format

 -o,--output <arg>             Output Folder path

 -of,--outFormat <arg>         output file format

 -t,--type <arg>               Sampling type  - ran


example: -f 0.2 -i ./testfile -if csv -o ./outputFile -of csv -t stratified -c key1 -fm ./keymapping

About Vivek Singh: data architect, open source evangelist, and chief contributor to the Open Source Data Quality project

Author of the fiction book “The Reverse Journey”

A.I. Singularity & Thoroughbred

Assume the Singularity has been reached. Now, humanoids are running the show.

a.) Humanoids bring humans along to enthuse their childoids. Childoids proudly bring their humans to the child park to show them off to other childoids.

b.) At the county fair, there is a "human race," very popular with childoids. The winning human is scanned and teleported across the globe.

c.) In city council meetings, FANGoids have the first right to dump their brains. After that, other humanoid species can dump or pick.

d.) FANGoids are the first order of humanoids. They can teleport anywhere and dump their brains anywhere. They are the ones who decide what to do with rogue humanoids. They are to humanoids what thoroughbreds are to horses.

e.) UKoids are the only humanoids who maintain the "chronology of evolution" from primitive human brains, which had only a couple of million billion neurons to support their intelligence. To respect their forefathers, UKoids spend a fraction of a second in the restroom, in silent mode, every morning.

f.) "Give me, Give me, Give me Whole Foods" is the most popular family game. Parent Amazonoids search the universe for organic giga-power battery factories, SpaceXoids search for planets that can produce a million-year recharge, and Baiduoids manufacture large-scale batteries instantly that go off just as instantly. Flipkartoids go to abandoned Teslaoid houses and bring back batteries.

g.) Googleoids' kids are very popular among childoids. They have information about every humanoid's part number, item ID, and coordinates in the universe.

h.) Some kind humanoids are protesting the inhumane treatment of humans with banners like "Human life matters." They are asking for a ban on human races at county fairs. Humans should not be treated like robots.

i.) Teslaoids are working on redesigning humanoids to consume less power so that their jobs cannot be automated, while FANGoids are working on what the next evolution of humanoids will be.

j.) When childoids self-develop more neurons, the father humanoid blockchains his model number, part number, and other IDs and transfers them to the childoids. He then proceeds to the "I was here" center along with humanoids of his age and gives his fingerprint, which flashes his timestamp across the universe. He enters the "I was here" park alone to be recycled, while his friend humanoids sing the chorus, "Take away, Take away, Take away my food too."


Data Science for Data Scientist

Well, it sounds like an oxymoron: why would a data scientist need data science to help him? He would rather use data science to find insights and predict the future. Data science has the connotation that it is to be used for discovery, insight, prediction, and artificial intelligence.

But data science can also be used for data preparation, which helps the data scientist develop the right model for discovery, insight, prediction, and artificial intelligence.

1.) Regression: When DBAs do not know what data scientists are looking for, they dump everything on them, sometimes even hundreds of attributes. Some attributes influence the model; some are only informational. IoT also dumps lots of machine data that is irrelevant to the model.

Data science can help the data scientist find the relevant attributes: regression algorithms can identify which attributes actually affect the outcome.
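As a minimal sketch of such screening (plain Java, illustrative only; a real regression study would go much further), Pearson correlation can rank attributes by the strength of their linear relationship with the outcome, so near-zero attributes become candidates to drop:

```java
public class AttributeScreener {
    // Pearson correlation between a candidate attribute x and the
    // outcome y; attributes whose |r| is near zero are candidates to drop.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;   // n * covariance
        double varX = sxx - sx * sx / n;  // n * variance of x
        double varY = syy - sy * sy / n;  // n * variance of y
        return cov / Math.sqrt(varX * varY);
    }
}
```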

2.) Missing Values: Missing values are ubiquitous. Depending on the model, they can be discarded, given default values, or filled in with advanced statistics that generate the missing numbers from other attributes like time, location, and customer behavior.

In some cases, missing values can be detrimental to the model. Data science can help the data scientist auto-generate realistic numbers that conform to the model.
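A minimal sketch of the default-value strategy, assuming numeric attributes with missing entries encoded as NaN (the class `MeanImputer` is a hypothetical name for illustration):

```java
public class MeanImputer {
    // Replace NaN entries with the mean of the observed values,
    // the simplest "default value" strategy for missing numbers.
    public static double[] impute(double[] values) {
        double sum = 0;
        int count = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double mean = count == 0 ? 0.0 : sum / count;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Double.isNaN(values[i]) ? mean : values[i];
        }
        return out;
    }
}
```

Smarter imputers conditioned on time, location, or customer behavior follow the same shape, only the fill-value computation changes.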

3.) Clustering: Some popular clustering algorithms (like k-means and nearest neighbor) require the data scientist to provide initial parameters, which the model then refines. If the initial numbers are way off, the model will fail.

Data scientists can use binning, basket analysis, outlier detection, and no-frills clustering algorithms to figure out tentative initial parameters and build a good model.
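A minimal sketch of the binning idea for one-dimensional data (illustrative class, not a library API): split the sorted values into k equal-frequency bins and use each bin's middle element as an initial centroid guess for k-means.

```java
import java.util.Arrays;

public class CentroidSeeder {
    // Equal-frequency binning for 1-D data: sort, split into k bins,
    // and take each bin's middle element as an initial centroid guess.
    public static double[] seed(double[] data, int k) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[] centroids = new double[k];
        for (int i = 0; i < k; i++) {
            int idx = (int) ((i + 0.5) * sorted.length / k);
            centroids[i] = sorted[idx];
        }
        return centroids;
    }
}
```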

4.) Anomaly Detection: Some systems (like alerting and event correlation) need anomalies to perform their task. But finding anomalies is like finding a needle in a haystack.

Data scientists can use anomaly detection algorithms like one-class support vector machines, association rules, and replicator neural networks to filter out the in-bound (one-class) data and isolate the anomalies.
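One-class SVMs and replicator neural networks need an ML library, so as a much simpler stand-in, the sketch below flags points more than a chosen number of standard deviations from the mean (plain z-score filtering; illustrative only, not one of the algorithms named above):

```java
import java.util.ArrayList;
import java.util.List;

public class AnomalyFilter {
    // Flag values more than `threshold` standard deviations from the
    // mean; a crude z-score stand-in for one-class methods.
    public static List<Double> anomalies(double[] values, double threshold) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;
        double squares = 0;
        for (double v : values) {
            squares += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(squares / values.length);
        List<Double> out = new ArrayList<>();
        for (double v : values) {
            if (std > 0 && Math.abs(v - mean) / std > threshold) {
                out.add(v);
            }
        }
        return out;
    }
}
```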

The examples above are only illustrative. There are many more ways data science can help with data preparation and business-rule validation; data science will be extended to data preparation as well.

Author: Vivek Singh is a data architect and is developing the world's first open source data quality and data preparation tool.