A.I. Singularity & Thoroughbred

Assume Singularity has been reached. Now, Humanoids are running the show.

a.) Humanoids bring humans to enthuse their childoids. Childoids proudly bring their humans to child park to show off to other childoids.

b.) In county fair, there is “human race” very popular with childoids. The winner human is scanned and teleported across globe.

c.) In city council meetings, FANGoids have first right to dump their brains. After that other humanoid species can dump or pick.

d.) FANGoids are first order of humanoids. They can teleport anywhere and dump their brains anywhere. They are the ones who decide what to do with rogue humanoids. They are thoroughbred of horse species.

e.) UKoids are only humanoids who maintains “chronology of evaluation” from primitive human brains which had only couple of millions of billions neurons to support their intelligence. To respect their forefathers, UKoids spends fraction of second in restroom, in silent mode, every morning.

f.) “Give me, Give me, Give me Whole foods” is most popular family game. Parent Amazonoids search for organic giga power battery factory in universe, SpaceXoids search for planets which can produce million year recharge, Baiduoids manufactures large scale battery instantly that goes off instantly. Flipkartoids go to Teslaoids abandoned houses and bring batteries.

g.) Googleoid’s kids are very popular among childoids. They have information about every humanoinds part-number and item id and their co-ordinate in the universe.

h.) Some kind humanoids are protesting the inhumane treatment to human with banners like “Human life matters”. They are asking for banning human race in county fairs. Human should not be treated like Robots.

i.) Teslaoids are working on how humanoids can be redesigned to consume less power and their jobs can not be automated while FANGoids are working on what will be next evaluation of humanoids.

j.) When childoids self develop more neurons, father humanoid will blockchain his model number, part number , other ids and transfer to childoids. He then proceeds to “I was here” center along with his-age humanoids, gives his finger print, which flashes his timestamp in universe. He enters the “I was here” park alone to recycle, while his friend humanoids sing the chorus “ Take away, Take away, Take away my food too”.

Vivek Singh – data architect, open source evangelist , chief contributor of Open Source Data Quality project http://sourceforge.net/projects/dataquality/

Author of fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/


Big Data QA – Why it is so difficult ?

Recently Big Data QA profile was needed for a project which processes terabyte of machine generated data everyday. Below is the cryptic job description that was sent to recruitment team —

” We are looking for Big data developer, who can write code on Hadoop cluster to validate structured and semi structured data.

Skill sets – apache spark, Hive, Hadoop, data quality ”

Recruitment team was amused.

Are you looking for a developer or QA ? Is it manual testing or automated testing ? Where we should look for these kind of profile ?

Big Data QA is an emerging area and it is different from traditional product or enterprise QA. Big data QA is big data developer who develops code for validating data which is beyond human scale. BDQ (Big data QA) does not fit into traditional QA profiles – automated vs manual ; backend vs frontend ; feature vs non feature etc.

BDQ is required by data products to ensure their processes which manage data lifecycle, are ingesting and emitting right data. The volumes is so huge; it is beyond human to load the data into spreadsheet and validate row by row ; column by column.

As per business requirement, BDQ may write very complex code to validate complex business rules. However some day to day activity of BDQ requires :

  • Profile the data to validate dataset — pre and post business rules
  • Find out data hole to ensure all data are coming
  • Statistical modeling to compare and contrast two or more datasets
  • Implement unified formatting and unified data types
  • Implement multiple type of joins on dataset including fuzzy joins
  • Implement UDF ( user defined functions) for reusable validation logic
  • Find outlier data and how to manage them
  • Find Null and Empty data and how to manage them
  • Validate the naming conventions of datasets
  • Validate the file format ( csv, avro, parquet etc) of datasets
  • Monitor incoming and outgoing dataset and redistribute if it fails
  • Validate the data models for analytics
  • Create sample data and make sure sample is not skewed
  • Create training and test dataset
  • Encrypt and DeCrypt (anonymize) data
  • Implement and monitor compliance rule

By no means, above is an exhaustive list but it is more of a indicative list. BDQ is a prolific programmer who understands data ( like business analyst) well and code for Data Quality domain. These bold combination makes BDQ so hard to find.

It is best to train data engineers to become BDQ by providing courses and exposures on data quality and data management.

Vivek Singh – data architect, open source evangelist , chief contributor of Open Source Data Quality project http://sourceforge.net/projects/dataquality/

Author of fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/

Migrating Enterprise DataWare House to Big data

I had many informal chats with friends, who manage traditional EDW at large corporations and want to migrate to big data. Migration decision has many dimensions – technical, financial, present status disruption, what I will get at “To be” architecture, how I will support new architecture etc etc. Let me answer step by step :-

1.) Should I move to big data : As big data is coming out of hype phase, the reality is seeping in. It will not solve all your problems and all your problems are not big data problems.

  • If your data volume is less than 50 GB, have around 15 attributes , already have data model and ETL working , you are not going to gain a lot from migrating to big data.
  • Big data’s biggest benefit comes from storage space. Typical RDBMS, 1TB cost will be around 20K while on hadoop cluster it would be 3K. So if you have to store huge data move to big data. Also, some companies are using hybrid approach where they archive on big data cluster and keep recent data in their EDW.
  • Second important benefit of big data is latency or processing speed. A typical ETL job takes around 2-3 hours to process 1 GB of data on MPP cluster. While a 20 node hadoop cluster will take around 25 minutes to process that ETL.

If you are expecting a surge in actionable data volume ( just not any data – that you can store on file system on cheap storage) and want to reduce the time for data pipeline, then you should think of migrating to big data. Though most of big data technology is open sourced, migration will be costly. Generally, it takes12-18 months to migrate EDW to big data. A typical cost may be:

  • Nodes for Production, Staging and Development – 100 node X 6K = 600K
  • 2 Hadoop Admins – 300K – 400K
  • 5 Hadoop Developers – 750K – 1000K
  • Product Manager / Project Manager / UAT / Validating reports – 500K
  • Vendor license / support / training – 300K

As you can see, it is not cheap. But it will help you manage large data volumes and reduce time for data processing. Also data science team will love large volumes of data ( lowest granular level) – they will try to dig out hidden gem from it. Take you call accordingly.

You have decided to move your EDW to big data. Now how to do it ?

2.) How to move to big data :

I am proponent of incremental change. That way business user also be happy as they see added value in quarters not years. And they will also support migration initiatives.

  • Break datamart Silos : It may sound weird, but first step of migration is, use a data virtualizing software ( if you don’t already have) and connect to all data marts. Talk to business users and see what other attributes they may be interested in, from other data marts. Teiid is very populate open source data virtualization server that I have used.
  • Share Metadata Catalogue : Create a metadata repository. Bring metadata from all data marts. Check for common attributes. Look into Personal Identifiable Informations ( PII). Ask business users and data scientist across domains to mark the attributes they will be interested in. Also look into data lineage or source for common attribute to confirm they come from same source or different sources. Data Quality rules should be implemented here. I have used osDQ for this and I am also contributor to it.
  • Share virtualized EDW to Business users : Business user will see first benefit here where he or she will see attributes across domain that will make his or her analysis better. Based on the attributes, he or she interested , create virtualized EDW for them.
  • Time for Data Lake : Now it is time to design your Datalake on big data. Virtualized EDW and source system should give fairly good idea on what is needed for data lake. My previous article should help — https://www.linkedin.com/pulse/how-create-data-lake-vivek-kumar-singh
  • Tee off data pipe line to Data Lake : Don’t cut off data pipeline to EDW, yet. Tee off one pipe and move it to Data Lake on big data. Rewrite or migrate you ETL jobs to big data cluster and move the processed data to new compartment. We have used Apache Spark to write processing jobs on big data cluster. You can use new EDW on big data cluster or take the HDFS files out and put into existing EDW. You can use apache Sqoop for it.
  • Validate old EDW and big data EDW : Let both stream run for couple of months. Validate metadata side by side and data statistics side by side. I have used osDQ for this. If they are matching then cut off data stream to EDW and now your big data in production.

Sounds easy but it is not 🙂 Devil is in details. Feel free to contact if you want to discuss in depth.

Vivek Singh – data architect, open source evangelist , chief contributor of Open Source Data Quality project http://sourceforge.net/projects/dataquality/

Author of fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/

Blog post at : https://viveksingh36.wordpress.com/

How to do Diff of Spark dataframe

Apache spark does not provide diff or subtract method for Dataframes. However, it is common requirement to do diff of dataframes – especially where data engineers have to find out what changes from previous values ( dataframe).

Requirements has generally following use cases:

a.) Find out diff (subtract) with complete dataframes

b.) Find out diff (subtract) with primary keys (Single column)

c.) Find out diff (subtract) with composite keys (Mupltiple columns)

Since dataframe does not have substract method here is the following step you need to do

i) First convert dataframe to RDD keeping the schema of dataframe safe.

ii) Create a pairedRDD for key value pair for step b and c

iii.) Use the substract method of RDD and apply the schema on RDD

iv.) Get back your dataframe

	// find the diff between two data sets A -B
	public DataFrame findDiff ( DataFrame left, DataFrame right) {
		if (left == null || right == null ) {
			return null;
		StructType schema = left.schema();
		JavaRDD<Row> leftRDD = left.toJavaRDD();
		JavaRDD<Row> rightRDD = right.toJavaRDD();
		// diff which is there in right but not in left deleted value
		JavaRDD<Row> diffRDD = rightRDD.subtract(leftRDD);
		DataFrame newdf = sqlContext.createDataFrame(diffRDD, schema);
		return newdf;
	// find the diff between two data sets A -B using colname
	public DataFrame findDiff ( DataFrame left, String leftCol, DataFrame right,  String rightCol) {
		if (left == null || right == null ) {
			return null;
		StructType schema = right.schema();
		JavaRDD<Row> leftRDD = left.toJavaRDD();
		JavaRDD<Row> rightRDD = right.toJavaRDD();
		String[] leftColName = left.columns();
		String[] rightColName = right.columns();
		int leftI=0; int rightI=0;
		for (int i=0 ; i < leftColName.length; i++)
			if (leftCol.equals(leftColName[i])) {
				leftI = i; break;
		for (int i=0 ; i < rightColName.length; i++)
			if (rightCol.equals(rightColName[i])) {
				rightI = i; break;
		final int leftIf = leftI;
		final int rightIf = rightI;
		// Now creare paired RDD for substract
		JavaPairRDD<String, Row> leftPair = leftRDD.mapToPair(new PairFunction<Row, String, Row>() {
			private static final long serialVersionUID = 1L;

					public Tuple2<String, Row> call(Row row) throws Exception {
		                return new Tuple2<String, Row>(row.get(leftIf).toString(), row);
		JavaPairRDD<String, Row> rightPair = rightRDD.mapToPair(new PairFunction<Row, String, Row>() {
			private static final long serialVersionUID = 1L;

			public Tuple2<String, Row> call(Row row) throws Exception {
                return new Tuple2<String, Row>(row.get(rightIf).toString(), row);
		// diff which is there in right but not in left deleted value
		// apply schema of right
		JavaPairRDD<String, Row> diffRDD = rightPair.subtractByKey(leftPair);
		JavaRDD<Row> newdataframe= diffRDD.values();
		DataFrame newdf = sqlContext.createDataFrame(newdataframe, schema);
		return newdf;

osDQ releases apache spark based data quality

World’s first open source data quality and data preparation project (osDQ – https://sourceforge.net/projects/dataquality/ ) releases apache spark based data quality and data preparation modules for big data.

Apache Spark based APIs can be downloaded from here : https://sourceforge.net/projects/apache-spark-osdq/

This beta release has following features :


  • ZScore

Functional input – mean and std dev

Return type : dataframe

  • ZeroScore (between 0 and 1)

Functional input – min and max

Return type : dataframe

  • RatioScore (num/denum)

Functional input – ratio number

Return type : dataframe

  • Subtraction Score (a –b)

Functional input – Subtraction number

Return type : dataframe


  • Replacement with key-value pairs

Functional input – hashtable and columns type

Return type : dataframe

  • Replacement Null with default value

Functional input – value

Return type : dataframe

  • Replacement using regression value (linear and multi-linear)

Functional input – No of iterations

Return type : dataframe


  • Removing Null Rows

Functional input – all or any

Return type : dataframe

  • Removing Duplicate Rows

Functional input – all or any

Return type : dataframe


Functional input – DataFrame

Return type: Hashtable<Colname,Hashtatable<Key,Value>>

HashKeys – “count”,”unique”,”nullcount”,”pattern”,”min”,”max”

Fuzzy Join and Replacement :

Function Input – two strings

Return type – cosine similarity ( between -1 to 1)

Summary : osDQ will enhance the project to provide more core APIs for data quality , data preparation and data science. It will save community time to write those functions for big data environment.



Apache Spark ML for Data Quality

Apache Spark is becoming de-facto standard for data processing. Spark platform is over-arching to all aspects of data lifecycle – Ingestion, Discovery, Preparation and Data Science with easy to use, developers friendly APIs.

Availability of large set of statistical and machine leaning based, scalable algorithm in Spark will bring a new perspective to data quality and validation where these algorithms will be used to automatic and machine based data anomaly detection and correction. Though these algorithms are not new, but bring them into data engineering and data architect domain is new. R, SAS, MATLAB etc were confined to data scientists and not popular with data engineering.

Algorithms like Principal Component Analysis, Support Vector Machine, Pairwise comparison, Regression , Edit Distance, K-Mean have to play critical role in automation of data quality and data correction rules. In the following code, I have used Spark Mlib linear regression model to replace null from column A based on regression values from column B ( linear regression model), and a set of columns (multilinear regression model)

This code snippet is for educational purpose only.

1.) Create a dataframe on which you want to apply the rules – inputBean is that object

2.) Create and Train Model – Linear

public LinearRegressionModel doLinearReg(DataFrameProperty inputBean, int numIterations) {

DataFrame df = inputBean.getDataFrame();
String labelCol = inputBean.getLabelCol(); // replace null from this column
String regCol = inputBean.getRegCol(); // use this column for regression
DataFrame newdf = df.select(labelCol, regCol);

JavaRDD<LabeledPoint> parseddata = newdf.javaRDD().map(new Function<Row, LabeledPoint>() {
private static final long serialVersionUID = 1L;
public LabeledPoint call(Row r) throws Exception {
Object labVObj = r.get(0);
Object regVarObj = r.get(1);
if (labVObj != null && regVarObj != null) {
double[] regv = new double[] { (Double) regVarObj };

Vector regV = new DenseVector(regv);
return new LabeledPoint((Double) labVObj, regV);
} else {

double[] regv = new double[] { 0.0 };
Vector regV = new DenseVector(regv);
return new LabeledPoint(0.0, regV);
} }

// Building the model
return LinearRegressionWithSGD.train(parseddata.rdd(), numIterations);

3.) Use model to replace Null Value

//LinearRegressionModel model = dqu.doLinearReg(inputBean,20);

//System.out.println(“\n Intercept: ” + model.intercept());

// System.out.println(“Weight :” + model.weights().toString());

public DataFrame replaceNull(DataFrameProperty inputBean, final double intercept, final double weight) {

DataFrame df = inputBean.getDataFrame();
String labelCol = inputBean.getLabelCol();
String regCol = inputBean.getRegCol();
String uniqCol = inputBean.getUniqColName();
SQLContext sqlContext = inputBean.getSqlContext();
DataFrame newdf = df.select(labelCol, regCol, uniqCol);

JavaRDD<Row> parseddata = newdf.toJavaRDD().map(new FunctionMap(intercept,weight));

// Generate the schema based on the string of schema
StructField[] fields = new StructField[3];
fields[0] = DataTypes.createStructField(labelCol, DataTypes.DoubleType, true);
fields[1] = DataTypes.createStructField(regCol, DataTypes.DoubleType, true);
fields[2] = DataTypes.createStructField(uniqCol, DataTypes.DoubleType, true);
StructType schema = DataTypes.createStructType(fields);
DataFrame newdf1 = sqlContext.createDataFrame(parseddata, schema);
df.join(newdf1, uniqCol); // After replace join to main dataframe based on unique column 

4.) Create and Train Model – Multi Linear

public LinearRegressionModel doMultiLinearReg(DataFrameProperty inputBean, int numIterations) {

DataFrame df = inputBean.getDataFrame();
String labelCol = inputBean.getLabelCol();
String [] inputCols = inputBean.getInputCols(); // multiple columns for regression
DataFrame newdf = df.select(labelCol, inputCols);

JavaRDD<LabeledPoint> parseddata = newdf.javaRDD().map(new Function<Row, LabeledPoint>() {
private static final long serialVersionUID = 1L;
public LabeledPoint call(Row r) throws Exception {
Object labVObj = r.get(0);
int colC = r.size();
double[] regv = new double[colC -1]; // -1 for first index

for (int i =1; i < colC; i++) {
Object regVarObj = r.get(i);
if (regVarObj != null)
regv[i-1] = (Double)regVarObj;
regv[i-1] = 0.0D; // Null replaced with 0.0
Vector regV = new DenseVector(regv);
if (labVObj != null ) {
return new LabeledPoint((Double) labVObj, regV);
} else {
return new LabeledPoint(0.0D, regV);

// Building the model
return LinearRegressionWithSGD.train(parseddata.rdd(), numIterations);

5.) Use for null replacement

DataFrame newdf = dqu.replaceNull(inputBean,model.intercept(),model.weights().toArray()[0]);

6.) Function Map class

public class FunctionMap  implements java.io.Serializable, Function<Row,Row> {

private static final long serialVersionUID = 1L;
double _intercept, _weight;
public FunctionMap(double intercept, double weight) {
_weight = weight;

public Row call(Row r) throws Exception {
Double regv = r.getDouble(1);
if (r.get(0) == null && regv != null) {
double newVal = _intercept + _weight * regv;
return RowFactory.create(newVal, regv, r.getDouble(2));
} else
return r;

Biggest Problem of Big Data – Entity Resolution

Big Data has gone past PoC phase. Different companies are at different stages of implementation. Data Ingestion, Data Storage ( Data Lake and EDW), Data Processing and Data Visualization processes have been quite mature and there are many open source and proprietary software to solve these problems.

One major hurdle Big Data faces today is – Entity Resolution ( defining a business entity form multitude of data sources). In EDW (Enterprise data warehouse) world, data were structured and sources were limited. Also keys of sources were pre and well defined . So RDBMS joins ( inner, outer, left outer, right outer, semi join etc) were enough to merge data from two different systems and tables.

In Big Data world, there is hardly any key or attribute that runs across the sources. Also keys of one system is useless as other systems are completely independent of each other. So business have to define their own logic for merging data from different sources which defines one entity. To make it worse, RDBMS kind of exact match joins should be replaced by fuzzy joins as referential integrity across systems can be ensured.

Following is a practical approach for resolving Entity Resolution:-

1.) Pre-define your entities and super set of  attributes ( coming from all data sources)

2.) Attributes may have multiple related values that should to mapped to same attributes. Plan your storage like this  ( Graph databases work fine for this relationship storage)

3.) Merge data from multiple sources using merge business logic to create a virtual entity with sizable attributes ( we used Apache spark for this )

Mapping attributes


  • Zip, County, State, Lat/long
  • Nearby locations (+/- area)  – high propensity area
  • IP location


  • Date/time stamp
  • Nearby time stamp (+/-) – Event happening before or after a period


  • String match (Fuzzy) – Name, Address, Cause
  • Cardinal Match  – events sharing same or similar key
  • IP Correlation (Primary IP, Secondary IP, IP from same ZIP code)
  • Other business logic related merging rules

You can make you merge model Machine Learning based so it will be leaning over time to do a relevant merge.

4.) Right Sizing Merges Columns

  • Remove transaction columns
  • Remove  database columns
  • Remove technical identifiable columns
  • Remove duplicate Columns

5.) Search this entity in your relational graph database to find ranks of similar entities more than threshold. If the outcome is less than threshold then make a new entry in your entity table with the attributes of the searching entity.

 6.) Take the highest ranked entity and mapped the missing attributes from highest ranked entities. This entity is you final entity for business

How to enhance entities with changing attributes:

Like any practical entity, values of attributes keep changing. You map the attributes values of searching entity to highest ranked entity and see the different of values. let’s say the IP values of an entity is matching to its secondary value over time. Then the secondary IP value becomes primary and primary becomes secondary. Or it is new IP then add one more relationship node with new values.

Ranking algorithm:

Business can assign different weight-age to must have, critical, important and good to have attributes and their matching threshold. This model also matches with secondary or related values so make it more accurate. Let’s say, address is a must match attributes. A customer other attributes are matching but address not matching so other models will reject it but in this model if matches with his or her office address or other address, it will boost the record that is right.