Artificial Intelligence: Coming to the Rescue of ITOps

Here is an article I have written on how AI can help DevOps:

According to McKinsey’s Global Institute Report of 2018, artificial intelligence (AI) has the potential to create an annual value of $3.5 trillion to $5.8 trillion across different industry sectors. Today, AI in finance and IT alone accounts for about $100 billion; hence, it is becoming quite the game changer in the IT world.

With the onset of cloud adoption, the world of IT DevOps has changed dramatically. The focus of ITOps is changing to an integrated, service-centric approach that maximizes business services availability. AI can help ITOps in early detection of outages, potential root cause prediction, finding systems and nodes that are susceptible to outages, average resolution time and more. This article highlights a few use cases where AI can be integrated with ITOps, simplifying day-to-day operations and making remediation more robust.


You can read the complete article at


Data Lake Vs EDW

I am often asked, “What is the difference between a data warehouse and a data lake?” Since I have worked on both, let me try to answer. Both are built with the business in mind, so variations in approach are common. I am listing the generic differences.

Storage : An EDW is commonly stored in an RDBMS, in either a star schema or a snowflake schema. The schema design is not fluid and is often called “early binding” or “schema on write”. A data lake is primarily stored on Hadoop, in the cloud (S3 etc.), or in HBase / Hive if the data is structured. However, what to use for storage depends entirely on the business. Typically, storing 1 TB in an RDBMS costs about 10-15K USD, while in a data lake it should be around 2-3K USD.

Purpose : An EDW is created for reporting, so it is optimized for concepts like cube / OLAP / hierarchical / roll up / drill down / aggregate awareness / fan out / dimensional navigation. A data lake works as storage for all departments of the enterprise. It is primarily for analytics, so data is kept in flat files, with sensitive data encrypted. Generally, users across the enterprise have access to it. It is the responsibility of downstream processes to optimize the data (data quality, data preparation, joining, dimension reduction etc.) for their needs.

Ingestion and Retrieval : Data is ingested in EDW by ETL jobs and mostly in batch mode. There is a staging area which works for cleaning, transformation and aggregation of data. Retrieval is through reporting tools or SQL.

In a data lake, data is ingested primarily by file transfer methods, both in real time and in batch mode. There is no concept of a staging area in a data lake, but the creator should take care that the data is usable and that the lake does not become a swamp. In my previous article, I have explained how to create a data lake.

Data retrieval can be through file download, RESTful APIs, or generic access through big data SQL if the data is structured. A data lake holds both structured and unstructured data.

Metadata: One of the most distinct differences between an EDW and a data lake is the extensive use of metadata in the data lake. Metadata is used in an EDW too, but there it is primarily for tracking ETL job status and resides in the same namespace as the EDW, which limits its uses.

A data lake’s metadata is the first place of landing for enterprise users who want to use the data lake. It has almost all the information, such as data type, data dictionary, time of load, probable values, and data quality or transformation, if any. From the metadata, users decide what to use and what is available to them. If it is not in the metadata, the data does not exist for the user. Often, the metadata is stored in a different namespace.

Business Proposition: While an EDW’s primary purpose is to generate time-bound reports for executives and power users, a data lake is a way to data democratization. A data lake also saves the humongous amount of time which an EDW needs to give someone access.

If you want to discuss further, feel free to contact me. On my blog site I have other articles related to data lakes which may help further.

Vivek Singh – data architect, open source evangelist , chief contributor of Open Source Data Quality project

Author of fiction book “The Reverse Journey”

Is ML/AI for big companies only ?

Recently, a friend of mine who owns an SME (Small and Mid-Size Enterprise) asked me, “Is ML/AI for big companies only? Can we (SMEs) also benefit from the Machine Learning and Artificial Intelligence wave?”

I asked, “Why do you think you will not benefit?”

“Most use cases talk about ‘terabytes of data, complex algorithms and superhuman data scientists’, which are beyond an SME’s reach. So I guess it is for big companies only,” he replied.

This may be the impression most SME owners carry, but it is far from the truth. In fact, it is the other way around. The impact of ML/AI will be more pronounced on SMEs, while on big companies it will average out.

Here are the reasons:

1.) Many ML/AI algorithms can build a usable model from as few as 50-100 data points. So it is not about terabytes; even KBs and MBs of data can be good enough as model input.

2.) A segment of 50-100 members is good enough to define segment behavior. So a customer base of 500+ is a good use case for segment analysis.

3.) Weather and local events have a larger effect on local businesses. Their impact is easier to track for SMEs; for large corporations it averages out.

4.) There are fewer data silos in an SME, so unlocking the data is easier. Data is well understood in the SME world, while in big corporations it is very complex and often riddled with politics.

5.) SME owners understand their business in totality, while executives of big corporations know only one part of the business. Better business understanding and well-defined metrics lead to better models.

6.) Better personalized service is possible for SMEs, where segment sizes are small, customers are local, and loyalty has a major impact on the business.

7.) Tools and the cloud have now removed the cost barrier for SMEs. They should be using AI/ML extensively to drive their business.


Apache spark based classification and prediction model

We have open sourced Apache Spark based random forest and multilayer perceptron algorithms which can be used for classification and prediction. It can be downloaded from

Use cases :

a.) You are working with a huge volume of data (a big data problem) that has a large set of feature columns. For smaller datasets, you can run Apache Spark in local mode.

b.) You want to run multiple multi-class models together to see which one gives the better result. Right now the random forest and multilayer perceptron algorithms are implemented, but the framework is there to take other algorithms too.

c.) No coding is required. Just change the config file and you are good to go, both for training and for classifying/predicting. All you need is Java 8.

d.) RESTful APIs are available to predict/classify along with the probability, so it is easy to integrate.

e.) You want to see the accuracy of multiple label columns for dimension reducibility.

Overview: This program can be used for both training and classification. You can train the model and then query it through a RESTful web service (Jetty and javaspark based), which exposes classification/prediction as a service.


1. Download the package “”

2. Unzip the pre-built distribution and follow the below details

3. Understand the folder structure of release upon unzipping

* spark-classifier_\<version>

* /lib: contains all dependent jar

* /conf: contains the configuration file; please review this file before running the program

* /model: the default model path, where models are saved (after training) and read (during the classification service). You should have write access to this folder

* /spark-classifier-\<version>.jar: the main driver jar

Configuration: Currently it supports Random Forest and Multilayer Perceptron classifiers. Please set the one you want under “conf/”

# Currently supported algorithm RANDOM_FOREST or MULTILEVEL_PERCEPTRON



It takes a comma (,) separated list of columns for features and labels. A * in the label setting means it will take all columns to predict. Feature columns are skipped if they also appear as a predict or label column.

classifier.featurecols=Number,Follow up

####list of labels to be predicted

#### '*' will process all the columns

classifier.labelcols=Root Cause

#classifier.labelcols=L1, L2, L3..
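To make the column rules above concrete, here is a minimal sketch of how such a comma-separated setting could be interpreted. It is not the project's actual code: the class `ColumnSpec` and its method names are hypothetical, and treating "skip" as "drop the label from its own feature list" is my reading of the rule, not something the config confirms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the spark-classifier distribution.
final class ColumnSpec {

    /** Splits a comma-separated property value into trimmed, non-empty column names. */
    static List<String> parseList(String value) {
        List<String> out = new ArrayList<>();
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    /** Expands '*' to every column of the dataset; otherwise returns the listed labels. */
    static List<String> resolveLabels(String labelSpec, List<String> allColumns) {
        if (labelSpec.trim().equals("*")) {
            return new ArrayList<>(allColumns);
        }
        return parseList(labelSpec);
    }

    /** Features used for one label: the configured features minus the label itself (assumed rule). */
    static List<String> featuresFor(String label, List<String> featureCols) {
        List<String> out = new ArrayList<>(featureCols);
        out.remove(label);
        return out;
    }

    public static void main(String[] args) {
        List<String> features = parseList("Number,Follow up,Root Cause");
        for (String label : resolveLabels("Root Cause", features)) {
            System.out.println(label + " <- " + featuresFor(label, features));
        }
    }
}
```

Column names may contain spaces (as in "Follow up"), which is why the sketch only trims around the commas.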

Train the model:

cmd > java -cp spark-classifier-<version>-SNAPSHOT.jar:lib/*:conf org.arrahtech.classifier.ClassifierTrainer

The input file name and output model location can be defined inside `conf/`. By default, the above command assumes that the `conf/` file is correctly set up.

Use the model to predict or classify

cmd > java -cp spark-classifier-<version>-SNAPSHOT.jar:lib/*:conf org.arrahtech.service.ClassifierService

It will start the default Jetty server, which accepts POST requests. After this you can POST to the RESTful API http://localhost:4567/classify/<algorithm_name>/<label_name> -d jsonfile

Where \<algorithm_name> can be “random_forest” or “multilevel_perceptron”, \<label_name> is the label column name (the column for which the model was trained) in your training dataset, and the JSON file holds the feature columns and values that are the input for prediction or classification.

cmd > curl -XPOST http://localhost:4567/classify/random_forest/LABEL1 -d '[{


       "FeatureField2":" FeatureField2VALUE",

       "FeatureField3":" FeatureField3VALUE"}]'
> Response JSON

{
       "classifiedLabel": "PredictedValue",

       "probability": "0.951814884316891"
}

Things to Remember

1.)   Presently it accepts only text files with a field separator

2.)   Null is replaced by NULLVALUE, as null cannot be used in the model

3.)   multilevel_perceptron does not give the probability of the predicted value. This feature is available in the latest Apache Spark version.

4.)   Currently label_name shouldn’t contain the hyphen ‘-’ character

5.)   If there is a space in the label column name, use ‘%20’ for the space.
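The last two points are easy to get wrong when building the request URL by hand, so here is a small sketch of a helper that applies them. The class `ClassifyUrl` and its `build` method are hypothetical names of mine; only the `/classify/<algorithm_name>/<label_name>` path shape, the hyphen restriction and the `%20` rule come from the notes above.

```java
// Hypothetical helper for building the classify endpoint URL; not part of the distribution.
final class ClassifyUrl {

    /**
     * Builds base + "/classify/" + algorithm + "/" + label,
     * rejecting hyphens in the label and replacing spaces with %20.
     */
    static String build(String baseUrl, String algorithm, String labelColumn) {
        if (labelColumn.contains("-")) {
            throw new IllegalArgumentException("label name must not contain '-': " + labelColumn);
        }
        String encodedLabel = labelColumn.replace(" ", "%20");
        return baseUrl + "/classify/" + algorithm + "/" + encodedLabel;
    }

    public static void main(String[] args) {
        // Label column "Root Cause" is taken from the sample configuration.
        System.out.println(build("http://localhost:4567", "random_forest", "Root Cause"));
    }
}
```

The resulting string can be passed to curl exactly as in the example request earlier.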

If you face any issue, feel free to contact us or raise a bug. We are developing an open source platform for the integrated data life cycle, with ingestion, DQ, profiling, analytics and prediction, all in one.


Sampling using apache Spark 2.1.1

There has been a debate in the big data community: “Do we need sampling in the big data world?” The argument is that big data platforms now have the storage and processing power to use the full dataset for analysis, so sampling is not required.

This argument holds true for data discovery, where doing analysis on the full dataset gives more confidence and every anomaly and pattern is captured. However, sampling still saves a considerable amount of time in dimension reduction, correlation and model generation. You have to go through hundreds of attributes (with all permutations and combinations) to find dependent, independent and correlated variables, and a representative dataset saves you hours with almost the same accuracy.

We have open sourced data sampling code using apache spark 2.1.1 at

Random Sampling : On a Dataset<Row>, random sampling is provided by Apache Spark; the user supplies the fraction needed for the sample.

Dataset<Row> org.apache.spark.sql.Dataset.sample(boolean withReplacement, double fraction)

Stratified Sampling : Dataset<Row> does not provide stratified sampling, so the dataset is converted into a JavaPairRDD keyed by the column which needs to be stratified, and then sampleByKeyExact is used. It makes multiple passes to hit the exact fraction.

// Give every distinct key the same sampling fraction (an existing entry is kept as-is)
for (Row uniqueKey : uniqueKeys) {
    fractionsMap.merge(uniqueKey.mkString(), fraction, (V1, V2) -> V1);
}

// Key each row by the stratification column, then sample exactly per key
JavaPairRDD<String, Row> dataPairRDD = SparkHelper.dfToPairRDD(keyColumn, df);
JavaRDD<Row> sampledRDD = dataPairRDD.sampleByKeyExact(false, fractionsMap).values();
Dataset<Row> sampledDF = df.sqlContext().createDataFrame(sampledRDD, df.schema());

return sampledDF;

Keylist Sampling : This is like stratified sampling, but it makes only one pass, so the fraction is met only approximately.

JavaRDD<Row> sampledRDD = dataPairRDD.sampleByKey(false, fractionsMap).values();
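To show the difference between the two per-key approaches without a Spark cluster, here is a plain Java sketch of the idea. It is an illustration only, not Spark's implementation: the class `SamplingSketch` and its method names are hypothetical, and rows are simplified to strings grouped by key in memory. The exact variant returns ceil(fraction * group size) rows for every key, while the one-pass variant keeps each row independently with probability `fraction`, so its per-key counts vary from run to run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// In-memory illustration of exact vs. one-pass per-key sampling; not Spark code.
final class SamplingSketch {

    /** Exact per-key sample: every key contributes ceil(fraction * groupSize) rows. */
    static Map<String, List<String>> exactByKey(Map<String, List<String>> rowsByKey,
                                                Map<String, Double> fractions, long seed) {
        Random rnd = new Random(seed);
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : rowsByKey.entrySet()) {
            double fraction = fractions.getOrDefault(e.getKey(), 0.0);
            List<String> shuffled = new ArrayList<>(e.getValue());
            Collections.shuffle(shuffled, rnd);
            int take = (int) Math.ceil(fraction * shuffled.size());
            out.put(e.getKey(), new ArrayList<>(shuffled.subList(0, take)));
        }
        return out;
    }

    /** One-pass per-key sample: each row is kept independently with probability fraction. */
    static Map<String, List<String>> onePassByKey(Map<String, List<String>> rowsByKey,
                                                  Map<String, Double> fractions, long seed) {
        Random rnd = new Random(seed);
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : rowsByKey.entrySet()) {
            double fraction = fractions.getOrDefault(e.getKey(), 0.0);
            List<String> kept = new ArrayList<>();
            for (String row : e.getValue()) {
                if (rnd.nextDouble() < fraction) {
                    kept.add(row);
                }
            }
            out.put(e.getKey(), kept);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("key1", Arrays.asList("r1", "r2", "r3", "r4", "r5"));
        rows.put("key2", Arrays.asList("r6", "r7", "r8", "r9"));
        Map<String, Double> fractions = new LinkedHashMap<>();
        fractions.put("key1", 0.5);
        fractions.put("key2", 0.5);
        System.out.println("exact:    " + exactByKey(rows, fractions, 42L));
        System.out.println("one pass: " + onePassByKey(rows, fractions, 42L));
    }
}
```

The trade-off is the one described above: the exact variant needs the whole group before it can decide, while the one-pass variant can stream through the data once.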

Command Line Option :

 -c,--keyColumn <arg>          Key Column for stratified/keylist sampling

 -f,--fraction <arg>           Sample fraction size

 -fm,--fractionMapping <arg>   comma separated pairs of key,fraction size

 -h,--help                     show this help.

 -i,--input <arg>              Input Folder/File path

 -if,--inputFormat <arg>       input file format

 -o,--output <arg>             Output Folder path

 -of,--outFormat <arg>         output file format

 -t,--type <arg>               Sampling type  - ran


example : "-f 0.2 -i ./testfile -if csv -o ./outputFile -of csv -t stratified -c key1 -fm ./keymapping"


A.I. Singularity & Thoroughbred

Assume Singularity has been reached. Now, Humanoids are running the show.

a.) Humanoids bring humans to enthuse their childoids. Childoids proudly bring their humans to child park to show off to other childoids.

b.) At the county fair, the “human race” is very popular with childoids. The winning human is scanned and teleported across the globe.

c.) In city council meetings, FANGoids have the first right to dump their brains. After that other humanoid species can dump or pick.

d.) FANGoids are the first order of humanoids. They can teleport anywhere and dump their brains anywhere. They are the ones who decide what to do with rogue humanoids. They are to humanoids what thoroughbreds are to horses.

e.) UKoids are the only humanoids who maintain the “chronology of evolution” from primitive human brains, which had only a couple of millions of billions of neurons to support their intelligence. To respect their forefathers, UKoids spend a fraction of a second in the restroom, in silent mode, every morning.

f.) “Give me, Give me, Give me Whole foods” is the most popular family game. Parent Amazonoids search for organic giga power battery factories in the universe, SpaceXoids search for planets which can produce a million-year recharge, and Baiduoids instantly manufacture large-scale batteries that go off instantly. Flipkartoids go to the Teslaoids' abandoned houses and bring back batteries.

g.) Googleoids' kids are very popular among childoids. They have information about every humanoid's part number and item id and their coordinates in the universe.

h.) Some kind humanoids are protesting the inhumane treatment of humans with banners like “Human life matters”. They are asking for a ban on the human race at county fairs. Humans should not be treated like robots.

i.) Teslaoids are working on how humanoids can be redesigned to consume less power and so that their jobs cannot be automated, while FANGoids are working on what will be the next evolution of humanoids.

j.) When childoids self-develop more neurons, the father humanoid will blockchain his model number, part number and other ids and transfer them to the childoids. He then proceeds to the “I was here” center along with humanoids of his age and gives his fingerprint, which flashes his timestamp in the universe. He enters the “I was here” park alone to recycle, while his friend humanoids sing the chorus “Take away, Take away, Take away my food too”.


Big Data QA – Why it is so difficult ?

Recently a Big Data QA profile was needed for a project which processes terabytes of machine-generated data every day. Below is the cryptic job description that was sent to the recruitment team:

“We are looking for a Big Data developer who can write code on a Hadoop cluster to validate structured and semi-structured data.

Skill sets – Apache Spark, Hive, Hadoop, data quality”

The recruitment team was amused.

Are you looking for a developer or a QA? Is it manual testing or automated testing? Where should we look for this kind of profile?

Big Data QA is an emerging area, and it is different from traditional product or enterprise QA. A Big Data QA is a big data developer who writes code to validate data at a volume beyond human scale. BDQ (Big Data QA) does not fit into the traditional QA profiles: automated vs manual; backend vs frontend; feature vs non-feature etc.

BDQ is required by data products to ensure that the processes which manage their data lifecycle are ingesting and emitting the right data. The volume is so huge that it is beyond a human to load the data into a spreadsheet and validate it row by row, column by column.

Depending on the business requirement, a BDQ may write very complex code to validate complex business rules. However, some day-to-day activities of a BDQ are:

  • Profile the data to validate datasets, before and after business rules are applied
  • Find data holes to ensure all expected data is arriving
  • Use statistical modeling to compare and contrast two or more datasets
  • Implement unified formatting and unified data types
  • Implement multiple types of joins on datasets, including fuzzy joins
  • Implement UDFs (user defined functions) for reusable validation logic
  • Find outlier data and decide how to manage it
  • Find null and empty data and decide how to manage it
  • Validate the naming conventions of datasets
  • Validate the file format (csv, avro, parquet etc.) of datasets
  • Monitor incoming and outgoing datasets and redistribute if a transfer fails
  • Validate the data models for analytics
  • Create sample data and make sure the sample is not skewed
  • Create training and test datasets
  • Encrypt and decrypt (anonymize) data
  • Implement and monitor compliance rules
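As a taste of what such validation code looks like, here is a minimal sketch of one item from the list: finding null and empty data. It runs in memory on plain Java arrays so that it stays self-contained; on a real cluster the same check would be written against Spark or Hive. The class `NullEmptyProfile` and the rule that a cell missing from a short row counts as null are my assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory profiler; a real BDQ job would run this logic on a cluster.
final class NullEmptyProfile {

    /** Counts, per column, the cells that are null, blank, or missing because the row is short. */
    static int[] countMissing(List<String[]> rows, int columnCount) {
        int[] missing = new int[columnCount];
        for (String[] row : rows) {
            for (int c = 0; c < columnCount; c++) {
                String cell = c < row.length ? row[c] : null;
                if (cell == null || cell.trim().isEmpty()) {
                    missing[c]++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"a", "", "1"},
                new String[] {null, "b", "2"},
                new String[] {"c", " "});
        System.out.println(Arrays.toString(countMissing(rows, 3)));
    }
}
```

The per-column counts can then be compared against a threshold to decide whether a dataset is fit to pass downstream.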

The above is by no means an exhaustive list; it is more of an indicative one. A BDQ is a prolific programmer who understands data well (like a business analyst) and codes for the data quality domain. This bold combination is what makes a BDQ so hard to find.

It is best to train data engineers to become BDQ by providing courses and exposures on data quality and data management.
