Is ML/AI for big companies only?

Recently, a friend of mine who owns an SME (small or mid-size enterprise) asked me: “Is ML/AI for big companies only? Can we SMEs also benefit from the machine learning and artificial intelligence wave?”

I asked: “Why do you think you will not benefit?”

“Most use cases talk about ‘terabytes of data, complex algorithms and superhuman data scientists’, which are beyond an SME’s reach. So I guess it is for big companies only,” he replied.

This may be the impression most SME owners carry, but it is far from the truth. In fact, it is the other way around: the impact of ML/AI will be more pronounced on SMEs, while on big companies it will average out.

Here are the reasons:

1.) Many ML/AI algorithms can build a usable model from as few as 50-100 data points. So it is not about terabytes; even KBs or MBs of data can be good enough as model input.

2.) A segment of 50-100 members is good enough to define segment behavior. So a customer base of 500+ is a good use case for segment analysis.

3.) Weather and local events have a larger effect on local businesses. It is easier to track their impact on an SME; for a large corporation it averages out.

4.) There are fewer data silos in an SME, so unlocking the data is easier. Data is well understood in the SME world, while in big corporations it is very complex and often riddled with politics.

5.) SME owners understand their business in totality, while executives of big corporations know only one part of the business. Better business understanding and well-defined metrics lead to better models.

6.) Better personalized service is possible for SMEs, where segments are small, customers are local and loyalty has a major impact on the business.

7.) Tools and the cloud have now removed the cost barrier for SMEs. They should be using AI/ML extensively to drive their business.

Vivek Singh – data architect, open source evangelist , chief contributor of Open Source Data Quality project

Author of fiction book “The Reverse Journey”


Apache Spark based classification and prediction model

We have open sourced Apache Spark based random forest and multilayer perceptron algorithms which can be used for classification and prediction. It can be downloaded from

Use cases:

a.) You have a huge volume of data (a big data problem) with a large set of feature columns. For smaller datasets, you can run Apache Spark in local mode.

b.) You want to run multiple multi-class models together to see which one gives the better result. Right now the random forest and multilayer perceptron algorithms are implemented, but the framework can take other algorithms too.

c.) No coding required. Just change the config file and you are good to go, both for training and for classifying/predicting. All you need is Java 8.

d.) RESTful APIs are available to predict/classify, and they return the probability too. Easy to integrate.

e.) You want to see the accuracy of multiple label columns for dimension reducibility.

Overview: This program can be used for both training and classification. You can train the model and then query it through a RESTful web service (Jetty and javaspark based) that exposes classification/prediction as a service.


1. Download the package “”

2. Unzip the pre-built distribution and follow the details below

3. Understand the folder structure of the release after unzipping

* spark-classifier_<version>

* /lib: contains all dependent jars

* /conf: contains the configuration file; please review it before running the program

* /model: the default model path, where models are saved (after training) and read (during the classification service). You should have write access to this folder

* /spark-classifier-<version>.jar: the main driver jar

Configuration: Currently it supports the Random Forest and Multilayer Perceptron classifiers. Please set the one you want under “conf/”

# Currently supported algorithm RANDOM_FOREST or MULTILEVEL_PERCEPTRON



It takes a comma (,) separated list of columns for features and labels. A * for the label means all columns will be predicted. A feature column is skipped if it is also listed as a predict or label column.

classifier.featurecols=Number,Follow up

####list of labels to be predicted

#### '*' will process all the columns

classifier.labelcols=Root Cause

#classifier.labelcols=L1, L2, L3..

Train the model:

cmd > java -cp spark-classifier-<version>-SNAPSHOT.jar:lib/*:conf org.arrahtech.classifier.ClassifierTrainer

The input file name and output model location can be defined inside `conf/`. By default, the above command assumes that the `conf/` file is set up correctly.

Use the model to predict or classify

cmd > java -cp spark-classifier-<version>-SNAPSHOT.jar:lib/*:conf org.arrahtech.service.ClassifierService

It will start the default Jetty server, which accepts POST requests. After this you can call the RESTful API: http://localhost:4567/classify/<algorithm_name>/<label_name> -d jsonfile

Here <algorithm_name> can be “random_forest” or “multilevel_perceptron”, and <label_name> is the label column name (the column for which the model was trained) in your training dataset. The JSON file holds the feature columns and values that are the input for prediction or classification.

cmd > curl -XPOST http://localhost:4567/classify/random_forest/LABEL1 -d '[{
       "FeatureField2":" FeatureField2VALUE",
       "FeatureField3":" FeatureField3VALUE"}]'

> Response JSON
{
       "classifiedLabel": "PredictedValue",
       "probability": "0.951814884316891"
}

Things to Remember

1.)   Presently it takes only a txt file with a field separator

2.)   Null is replaced by NULLVALUE, as null cannot be used in the model

3.)   multilevel_perceptron does not give the probability of the predicted value. This feature is available in the latest Apache Spark version.

4.)   Currently label_name shouldn’t contain the hyphen ‘-’ character

5.)   If there is a space in the label column name, use ‘%20’ for the space.
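
The endpoint pattern and the two label-name rules above can be captured in a small helper. This is only a sketch in plain Java and is not part of the released package: the class name `ClassifyUrl` and the method `build` are hypothetical, and the host and port are taken from the curl example.

```java
class ClassifyUrl {

    /**
     * Builds http://<host:port>/classify/<algorithm_name>/<label_name>.
     * Spaces in the label become %20 and labels containing '-' are rejected,
     * following the "Things to Remember" list.
     */
    static String build(String hostAndPort, String algorithm, String label) {
        if (label.contains("-")) {
            throw new IllegalArgumentException("label_name must not contain '-': " + label);
        }
        String encodedLabel = label.replace(" ", "%20");
        return "http://" + hostAndPort + "/classify/" + algorithm + "/" + encodedLabel;
    }

    public static void main(String[] args) {
        // prints http://localhost:4567/classify/random_forest/Root%20Cause
        System.out.println(build("localhost:4567", "random_forest", "Root Cause"));
    }
}
```

The resulting string can be passed to curl or to any HTTP client in place of a hand-typed URL.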

If you face any issue, feel free to contact us or raise a bug. We are developing an open source platform for the integrated data life cycle, with ingestion, DQ, profiling, analytics and prediction all in one.


Sampling using Apache Spark 2.1.1

There has been a debate in the big data community: “Do we need sampling in the big data world?” The argument is that big data platforms now have the storage and processing power to analyze the full dataset, so sampling is not required.

This argument holds true for data discovery, where analysis on the full dataset gives more confidence and captures every anomaly and pattern. However, sampling still saves a considerable amount of time in dimension reduction, correlation and model generation. You have to go through hundreds of attributes (with all permutations and combinations) to find dependent, independent and correlated variables, and a representative dataset saves you hours with almost the same accuracy.

We have open sourced data sampling code using Apache Spark 2.1.1 at

Random Sampling: on Dataset<Row>, random sampling is provided by Apache Spark; the user provides the fraction needed for sampling.

Dataset<Row> org.apache.spark.sql.Dataset.sample(boolean withReplacement, double fraction)

Stratified Sampling: Dataset<Row> does not provide exact stratified sampling, so the dataset is converted into a JavaPairRDD keyed on the column that needs to be stratified, and sampleByKeyExact is then used. It makes multiple passes over the data to hit the exact fraction.

// fractionsMap is a Map<String, Double>: one sampling fraction per distinct key value
for (Row uniqueKey : uniqueKeys) {
    fractionsMap.merge(uniqueKey.mkString(), fraction, (V1, V2) -> V1);
}

// key every row by the stratification column, then sample each key exactly
JavaPairRDD<String, Row> dataPairRDD = SparkHelper.dfToPairRDD(keyColumn, df);
JavaRDD<Row> sampledRDD = dataPairRDD.sampleByKeyExact(false, fractionsMap).values();
Dataset<Row> sampledDF = df.sqlContext().createDataFrame(sampledRDD, df.schema());

return sampledDF;

Keylist Sampling: This is like stratified sampling, but it makes only one pass over the data, so the fraction for each key is met only approximately.

JavaRDD<Row> sampledRDD = dataPairRDD.sampleByKey(false, fractionsMap).values();
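
The difference between the exact and the one-pass variant can be illustrated without a cluster. The sketch below is plain Java and is not Spark's implementation: `KeySamplingSketch`, `sampleExact` and `sampleOnePass` are hypothetical names, and rows are represented as strings grouped by key. The exact variant keeps ceil(fraction x group size) rows per key, while the one-pass variant keeps each row independently with the given probability, so its per-key counts vary from run to run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class KeySamplingSketch {

    /** Exact per-key sample: keeps ceil(fraction * groupSize) rows of every key. */
    static Map<String, List<String>> sampleExact(Map<String, List<String>> rowsByKey,
                                                 Map<String, Double> fractions, long seed) {
        Random rnd = new Random(seed);
        Map<String, List<String>> sampled = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> group : rowsByKey.entrySet()) {
            double fraction = fractions.getOrDefault(group.getKey(), 0.0);
            List<String> shuffled = new ArrayList<>(group.getValue());
            Collections.shuffle(shuffled, rnd);
            int keep = Math.min(shuffled.size(), (int) Math.ceil(fraction * shuffled.size()));
            sampled.put(group.getKey(), new ArrayList<>(shuffled.subList(0, keep)));
        }
        return sampled;
    }

    /** One-pass sample: every row is kept independently with probability `fraction`. */
    static Map<String, List<String>> sampleOnePass(Map<String, List<String>> rowsByKey,
                                                   Map<String, Double> fractions, long seed) {
        Random rnd = new Random(seed);
        Map<String, List<String>> sampled = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> group : rowsByKey.entrySet()) {
            double fraction = fractions.getOrDefault(group.getKey(), 0.0);
            List<String> kept = new ArrayList<>();
            for (String row : group.getValue()) {
                if (rnd.nextDouble() < fraction) {
                    kept.add(row);
                }
            }
            sampled.put(group.getKey(), kept);
        }
        return sampled;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("north", Arrays.asList("n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9", "n10"));
        rows.put("south", Arrays.asList("s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"));
        Map<String, Double> fractions = new LinkedHashMap<>();
        fractions.put("north", 0.5);
        fractions.put("south", 0.25);
        // exact always keeps 5 "north" rows and 2 "south" rows; one pass only does so on average
        System.out.println("exact:    " + sampleExact(rows, fractions, 7L));
        System.out.println("one pass: " + sampleOnePass(rows, fractions, 7L));
    }
}
```

The trade-off is the same one described above: the exact variant costs more work per key, the one-pass variant is cheaper but only approximates the requested fraction.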

Command Line Option :

 -c,--keyColumn <arg>          Key Column for stratified/keylist sampling

 -f,--fraction <arg>           Sample fraction size

 -fm,--fractionMapping <arg>   comma seperated pairs of key,fraction size

 -h,--help                     show this help.

 -i,--input <arg>              Input Folder/File path

 -if,--inputFormat <arg>       input file format

 -o,--output <arg>             Output Folder path

 -of,--outFormat <arg>         output file format

 -t,--type <arg>               Sampling type  - ran


example : ” -f 0.2 -i ./testfile -if csv -o ./outputFile -of csv -t stratified -c key1 -fm ./keymapping”


A.I. Singularity & Thoroughbred

Assume Singularity has been reached. Now, Humanoids are running the show.

a.) Humanoids bring humans to enthuse their childoids. Childoids proudly bring their humans to the child park to show off to other childoids.

b.) At the county fair, the “human race” is very popular with childoids. The winning human is scanned and teleported across the globe.

c.) In city council meetings, FANGoids have the first right to dump their brains. After that, other humanoid species can dump or pick.

d.) FANGoids are the first order of humanoids. They can teleport anywhere and dump their brains anywhere. They are the ones who decide what to do with rogue humanoids. They are the thoroughbreds of the horse species.

e.) UKoids are the only humanoids who maintain the “chronology of evolution” from primitive human brains, which had only a couple of millions of billions of neurons to support their intelligence. To respect their forefathers, UKoids spend a fraction of a second in the restroom, in silent mode, every morning.

f.) “Give me, Give me, Give me Whole foods” is the most popular family game. Parent Amazonoids search for an organic giga power battery factory in the universe, SpaceXoids search for planets which can produce a million-year recharge, and Baiduoids instantly manufacture large-scale batteries that go off instantly. Flipkartoids go to the Teslaoids’ abandoned houses and bring back batteries.

g.) Googleoids’ kids are very popular among childoids. They have information about every humanoid’s part number and item id and their coordinates in the universe.

h.) Some kind humanoids are protesting the inhumane treatment of humans with banners like “Human life matters”. They are asking for a ban on the human race at county fairs. Humans should not be treated like robots.

i.) Teslaoids are working on how humanoids can be redesigned to consume less power and keep their jobs from being automated, while FANGoids are working on what the next evolution of humanoids will be.

j.) When childoids self-develop more neurons, the father humanoid will blockchain his model number, part number and other ids and transfer them to the childoids. He then proceeds to the “I was here” center along with humanoids of his age and gives his fingerprint, which flashes his timestamp in the universe. He enters the “I was here” park alone to recycle, while his friend humanoids sing the chorus “Take away, Take away, Take away my food too”.


Big Data QA – Why is it so difficult?

Recently a Big Data QA profile was needed for a project which processes terabytes of machine-generated data every day. Below is the cryptic job description that was sent to the recruitment team:

“We are looking for a big data developer who can write code on a Hadoop cluster to validate structured and semi-structured data.

Skill sets: Apache Spark, Hive, Hadoop, data quality”

The recruitment team was amused.

Are you looking for a developer or a QA? Is it manual testing or automated testing? Where should we look for this kind of profile?

Big Data QA is an emerging area and it is different from traditional product or enterprise QA. A Big Data QA is a big data developer who writes code to validate data at a scale beyond human capacity. BDQ (Big Data QA) does not fit into traditional QA profiles: automated vs manual, backend vs frontend, feature vs non-feature, etc.

BDQ is required by data products to ensure that the processes which manage the data lifecycle are ingesting and emitting the right data. The volume is so huge that it is beyond human capacity to load the data into a spreadsheet and validate it row by row, column by column.

Depending on business requirements, a BDQ may write very complex code to validate complex business rules. However, the day-to-day activities of a BDQ typically include:

  • Profile the data to validate the dataset, before and after business rules are applied
  • Find data holes to ensure all expected data is arriving
  • Use statistical modeling to compare and contrast two or more datasets
  • Implement unified formatting and unified data types
  • Implement multiple types of joins on datasets, including fuzzy joins
  • Implement UDFs (user defined functions) for reusable validation logic
  • Find outlier data and decide how to manage it
  • Find null and empty data and decide how to manage it
  • Validate the naming conventions of datasets
  • Validate the file format (csv, avro, parquet etc.) of datasets
  • Monitor incoming and outgoing datasets and redistribute if delivery fails
  • Validate the data models for analytics
  • Create sample data and make sure the sample is not skewed
  • Create training and test datasets
  • Encrypt and decrypt (anonymize) data
  • Implement and monitor compliance rules
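
As a small illustration of one item on this list (finding null and empty data), here is a sketch in plain Java. It is an assumption-laden example rather than production BDQ code: `NullEmptyProfile` and `countMissing` are hypothetical names, and the rows are held in memory as string arrays, whereas a real job would run the same logic on a cluster.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NullEmptyProfile {

    /** Returns, for every column in the header, how many values are null, blank or absent. */
    static Map<String, Integer> countMissing(String[] header, List<String[]> rows) {
        Map<String, Integer> missing = new LinkedHashMap<>();
        for (String column : header) {
            missing.put(column, 0);
        }
        for (String[] row : rows) {
            for (int i = 0; i < header.length; i++) {
                // a row shorter than the header counts as missing for the trailing columns
                String value = i < row.length ? row[i] : null;
                if (value == null || value.trim().isEmpty()) {
                    missing.merge(header[i], 1, Integer::sum);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String[] header = {"id", "city"};
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "Pune"},
                new String[] {"2", ""},
                new String[] {null, "  "},
                new String[] {"4"});
        System.out.println(countMissing(header, rows)); // prints {id=1, city=3}
    }
}
```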

By no means is the above an exhaustive list; it is more of an indicative one. A BDQ is a prolific programmer who understands data well (like a business analyst) and codes for the data quality domain. This bold combination is what makes a BDQ so hard to find.

It is best to train data engineers to become BDQs by providing courses and exposure on data quality and data management.


Migrating Enterprise Data Warehouse to Big Data

I have had many informal chats with friends who manage traditional EDWs at large corporations and want to migrate to big data. The migration decision has many dimensions: technical, financial, disruption of the present setup, what I will get from the “to be” architecture, how I will support the new architecture, etc. Let me answer step by step:

1.) Should I move to big data: As big data comes out of its hype phase, reality is seeping in. It will not solve all your problems, and not all of your problems are big data problems.

  • If your data volume is less than 50 GB, you have around 15 attributes, and you already have a working data model and ETL, you are not going to gain a lot from migrating to big data.
  • Big data’s biggest benefit comes from storage space. On a typical RDBMS, 1 TB costs around 20K, while on a Hadoop cluster it would be around 3K. So if you have to store huge volumes of data, move to big data. Some companies also use a hybrid approach where they archive on the big data cluster and keep recent data in their EDW.
  • The second important benefit of big data is latency, or processing speed. A typical ETL job takes around 2-3 hours to process 1 GB of data on an MPP cluster, while a 20-node Hadoop cluster will take around 25 minutes to process that ETL.

If you are expecting a surge in actionable data volume (not just any data that you could store on a file system on cheap storage) and want to reduce the time taken by the data pipeline, then you should think of migrating to big data. Though most big data technology is open source, migration will be costly. Generally, it takes 12-18 months to migrate an EDW to big data. A typical cost may be:

  • Nodes for Production, Staging and Development – 100 node X 6K = 600K
  • 2 Hadoop Admins – 300K – 400K
  • 5 Hadoop Developers – 750K – 1000K
  • Product Manager / Project Manager / UAT / Validating reports – 500K
  • Vendor license / support / training – 300K
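
Adding up the line items above gives a rough total. The snippet below is only a worked sum in Java of the figures quoted in the list (in thousands); the class and method names are hypothetical.

```java
class MigrationCost {

    /** Sums {low, high} cost ranges, given in thousands, into one {low, high} range. */
    static int[] total(int[][] rangesInK) {
        int low = 0;
        int high = 0;
        for (int[] range : rangesInK) {
            low += range[0];
            high += range[1];
        }
        return new int[] {low, high};
    }

    public static void main(String[] args) {
        int[][] items = {
            {600, 600},   // nodes for production, staging and development: 100 x 6K
            {300, 400},   // 2 Hadoop admins
            {750, 1000},  // 5 Hadoop developers
            {500, 500},   // product/project manager, UAT, validating reports
            {300, 300}    // vendor license, support, training
        };
        int[] sum = total(items);
        System.out.println(sum[0] + "K - " + sum[1] + "K"); // prints 2450K - 2800K
    }
}
```

So the typical migration budget sketched here lands at roughly 2.45 to 2.8 million.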

As you can see, it is not cheap. But it will help you manage large data volumes and reduce data processing time. The data science team will also love large volumes of data at the lowest level of granularity; they will try to dig hidden gems out of it. Take your call accordingly.

You have decided to move your EDW to big data. Now how do you do it?

2.) How to move to big data:

I am a proponent of incremental change. That way business users are also happy, as they see added value in quarters, not years. And they will support the migration initiatives.

  • Break datamart silos: It may sound weird, but the first step of migration is to use data virtualization software (if you don’t already have it) and connect it to all data marts. Talk to business users and see what other attributes they may be interested in from other data marts. Teiid is a very popular open source data virtualization server that I have used.
  • Share metadata catalogue: Create a metadata repository. Bring in metadata from all data marts. Check for common attributes. Look into Personally Identifiable Information (PII). Ask business users and data scientists across domains to mark the attributes they will be interested in. Also look into the data lineage of common attributes to confirm whether they come from the same source or from different sources. Data quality rules should be implemented here. I have used osDQ for this and I am also a contributor to it.
  • Share virtualized EDW with business users: Business users will see the first benefit here, as they get attributes across domains that make their analysis better. Based on the attributes they are interested in, create a virtualized EDW for them.
  • Time for data lake: Now it is time to design your data lake on big data. The virtualized EDW and the source systems should give a fairly good idea of what is needed for the data lake. My previous article should help:
  • Tee off the data pipeline to the data lake: Don’t cut off the data pipeline to the EDW yet. Tee off one pipe and move it to the data lake on big data. Rewrite or migrate your ETL jobs to the big data cluster and move the processed data to a new compartment. We have used Apache Spark to write processing jobs on the big data cluster. You can use a new EDW on the big data cluster, or take the HDFS files out and put them into the existing EDW. You can use Apache Sqoop for that.
  • Validate old EDW and big data EDW: Let both streams run for a couple of months. Validate metadata side by side and data statistics side by side. I have used osDQ for this. If they match, cut off the data stream to the EDW, and your big data platform is now in production.
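
The side-by-side statistics check in the last step can be pictured as follows. This is a plain Java sketch and not how osDQ does it: `SideBySideCheck` and `statsMatch` are hypothetical names, and one numeric column from each system is assumed to fit in memory.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;

class SideBySideCheck {

    /** True when row count matches and min, max and mean agree within the tolerance. */
    static boolean statsMatch(List<Double> oldEdw, List<Double> newEdw, double tolerance) {
        DoubleSummaryStatistics a = oldEdw.stream().mapToDouble(Double::doubleValue).summaryStatistics();
        DoubleSummaryStatistics b = newEdw.stream().mapToDouble(Double::doubleValue).summaryStatistics();
        if (a.getCount() != b.getCount()) {
            return false;
        }
        if (a.getCount() == 0) {
            return true; // both columns are empty, nothing more to compare
        }
        return Math.abs(a.getMin() - b.getMin()) <= tolerance
                && Math.abs(a.getMax() - b.getMax()) <= tolerance
                && Math.abs(a.getAverage() - b.getAverage()) <= tolerance;
    }

    public static void main(String[] args) {
        List<Double> oldColumn = Arrays.asList(10.0, 20.0, 30.0);
        List<Double> newColumn = Arrays.asList(30.0, 10.0, 20.0);
        System.out.println(statsMatch(oldColumn, newColumn, 0.001)); // prints true
    }
}
```

Matching summary statistics do not prove the two streams are identical row by row, but a mismatch is a cheap early signal that the new pipeline has drifted.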

Sounds easy, but it is not 🙂 The devil is in the details. Feel free to contact me if you want to discuss in depth.



How to do Diff of Spark dataframe

Apache Spark dataframes have no method called diff or subtract (the built-in except() compares complete rows only). However, finding the diff of two dataframes is a common requirement, especially when data engineers have to find out what changed from the previous values (dataframe).

The requirement generally covers the following use cases:

a.) Find the diff (subtract) on complete dataframes

b.) Find the diff (subtract) on a primary key (single column)

c.) Find the diff (subtract) on composite keys (multiple columns)

Since the dataframe does not have a subtract method, here are the steps you need to follow:

i.) First convert the dataframe to an RDD, keeping the schema of the dataframe safe.

ii.) Create a paired RDD of key-value pairs for use cases b and c

iii.) Use the subtract method of the RDD and apply the schema to the resulting RDD

iv.) Get back your dataframe

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

	// The methods below live in a helper class that holds a SQLContext field named sqlContext.

	// find the diff between two data sets: rows that are in right but not in left
	public DataFrame findDiff ( DataFrame left, DataFrame right) {
		if (left == null || right == null ) {
			return null;
		}
		// both dataframes are expected to share the same schema
		StructType schema = left.schema();
		JavaRDD<Row> leftRDD = left.toJavaRDD();
		JavaRDD<Row> rightRDD = right.toJavaRDD();
		// diff which is there in right but not in left (deleted values)
		JavaRDD<Row> diffRDD = rightRDD.subtract(leftRDD);
		DataFrame newdf = sqlContext.createDataFrame(diffRDD, schema);
		return newdf;
	}

	// find the diff (right minus left) between two data sets using a key column on each side
	public DataFrame findDiff ( DataFrame left, String leftCol, DataFrame right,  String rightCol) {
		if (left == null || right == null ) {
			return null;
		}
		StructType schema = right.schema();
		JavaRDD<Row> leftRDD = left.toJavaRDD();
		JavaRDD<Row> rightRDD = right.toJavaRDD();
		String[] leftColName = left.columns();
		String[] rightColName = right.columns();
		// locate the key column index on each side; -1 means not found
		int leftI = -1; int rightI = -1;
		for (int i=0 ; i < leftColName.length; i++) {
			if (leftCol.equals(leftColName[i])) {
				leftI = i; break;
			}
		}
		for (int i=0 ; i < rightColName.length; i++) {
			if (rightCol.equals(rightColName[i])) {
				rightI = i; break;
			}
		}
		if (leftI < 0 || rightI < 0) {
			throw new IllegalArgumentException("Key column not found: " + leftCol + " / " + rightCol);
		}
		final int leftIf = leftI;
		final int rightIf = rightI;
		// Now create paired RDDs for subtract
		JavaPairRDD<String, Row> leftPair = leftRDD.mapToPair(new PairFunction<Row, String, Row>() {
			private static final long serialVersionUID = 1L;

			public Tuple2<String, Row> call(Row row) throws Exception {
				// String.valueOf keeps a null key from throwing a NullPointerException
				return new Tuple2<String, Row>(String.valueOf(row.get(leftIf)), row);
			}
		});
		JavaPairRDD<String, Row> rightPair = rightRDD.mapToPair(new PairFunction<Row, String, Row>() {
			private static final long serialVersionUID = 1L;

			public Tuple2<String, Row> call(Row row) throws Exception {
				return new Tuple2<String, Row>(String.valueOf(row.get(rightIf)), row);
			}
		});
		// diff which is there in right but not in left (deleted values)
		// apply schema of right
		JavaPairRDD<String, Row> diffRDD = rightPair.subtractByKey(leftPair);
		JavaRDD<Row> newdataframe= diffRDD.values();
		DataFrame newdf = sqlContext.createDataFrame(newdataframe, schema);
		return newdf;
	}
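
The key-based diff can also be pictured without Spark, which helps when reasoning about use case c (composite keys). The sketch below is plain Java with hypothetical names (`KeyDiffSketch`, `subtractByKey`, `compositeKey`); it mirrors the "right minus left by key" semantics used above and assumes the key values never contain the chosen separator character.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyDiffSketch {

    /** Joins several column values into one key, so multi-column keys reuse the single-key diff. */
    static String compositeKey(String... parts) {
        return String.join("\u0001", parts);
    }

    /** Returns the entries of right whose key does not occur in left (right minus left, by key). */
    static <K, V> Map<K, V> subtractByKey(Map<K, V> right, Map<K, V> left) {
        Map<K, V> diff = new LinkedHashMap<>(right);
        diff.keySet().removeAll(left.keySet());
        return diff;
    }

    public static void main(String[] args) {
        Map<String, String> previous = new LinkedHashMap<>();
        previous.put(compositeKey("IN", "101"), "row-1");
        previous.put(compositeKey("IN", "102"), "row-2");
        previous.put(compositeKey("US", "101"), "row-3");

        Map<String, String> current = new LinkedHashMap<>();
        current.put(compositeKey("IN", "101"), "row-1-updated");
        current.put(compositeKey("US", "101"), "row-3");

        // rows whose key exists in previous but no longer in current, i.e. deleted rows
        System.out.println(subtractByKey(previous, current).values()); // prints [row-2]
    }
}
```

In the Spark version the same idea applies: build the key from all the key columns in the pair function, then call subtractByKey as before.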