osDQ releases Apache Spark based data quality and data preparation modules

The world's first open source data quality and data preparation project (osDQ – https://sourceforge.net/projects/dataquality/) has released Apache Spark based data quality and data preparation modules for big data.

The Apache Spark based APIs can be downloaded here: https://sourceforge.net/projects/apache-spark-osdq/

This beta release has the following features:

Normalization:

  • ZScore (see the sketch after this feature list)

Functional input – mean and standard deviation

Return type – DataFrame

  • ZeroScore (scaled between 0 and 1)

Functional input – min and max

Return type – DataFrame

  • RatioScore (num/denom)

Functional input – ratio number

Return type – DataFrame

  • Subtraction Score (a – b)

Functional input – subtraction number

Return type – DataFrame

Replacement:

  • Replacement with key-value pairs

Functional input – hashtable and column types

Return type – DataFrame

  • Replacement of null with a default value

Functional input – default value

Return type – DataFrame

  • Replacement using regression values (linear and multi-linear)

Functional input – number of iterations

Return type – DataFrame

Remove:

  • Removing null rows

Functional input – all or any

Return type – DataFrame

  • Removing duplicate rows

Functional input – all or any

Return type – DataFrame

Profiling:

Functional input – DataFrame

Return type – Hashtable<ColName, Hashtable<Key, Value>>

Hash keys – "count", "unique", "nullcount", "pattern", "min", "max"

Fuzzy Join and Replacement:

Functional input – two strings

Return type – cosine similarity (between -1 and 1)
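
To illustrate the normalization API described above, here is a minimal sketch of a z-score transform on a Spark 1.x DataFrame. This is not the osDQ implementation; the class name ZScoreSketch and the column name in the usage comment are assumptions for the example.

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.col;

// Minimal z-score sketch: functional inputs are the pre-computed mean and standard
// deviation, and the result comes back as a DataFrame, mirroring the contract above.
public class ZScoreSketch {

    public static DataFrame zScore(DataFrame df, String column, double mean, double stdDev) {
        // (x - mean) / stdDev appended as a new column "<column>_zscore"
        return df.withColumn(column + "_zscore", col(column).minus(mean).divide(stdDev));
    }
}

// Usage (assuming df has a numeric column "amount"):
// DataFrame scored = ZScoreSketch.zScore(df, "amount", 100.0, 15.0);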

Summary: The osDQ project will continue to add core APIs for data quality, data preparation and data science, saving the community the effort of writing these functions for big data environments.

 

 


Apache Spark ML for Data Quality

Apache Spark is becoming the de facto standard for data processing. The Spark platform spans all aspects of the data lifecycle – ingestion, discovery, preparation and data science – with easy-to-use, developer-friendly APIs.

The availability of a large set of scalable, statistical and machine learning based algorithms in Spark brings a new perspective to data quality and validation: these algorithms can be used for automatic, machine-driven detection and correction of data anomalies. The algorithms themselves are not new, but bringing them into the data engineering and data architecture domain is. Tools like R, SAS and MATLAB were confined to data scientists and never became popular with data engineering teams.

Algorithms like Principal Component Analysis, Support Vector Machines, pairwise comparison, regression, edit distance and K-Means will play a critical role in automating data quality and data correction rules. In the following code, I have used the Spark MLlib linear regression model to replace nulls in column A based on regression against column B (linear regression model) or against a set of columns (multilinear regression model).

This code snippet is for educational purposes only.

1.) Create a DataFrame on which you want to apply the rules – inputBean is the object that wraps it, along with the column names used below.

2.) Create and Train Model – Linear

// Spark 1.x imports used by the snippets in this post
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.DenseVector;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public LinearRegressionModel doLinearReg(DataFrameProperty inputBean, int numIterations) {

    DataFrame df = inputBean.getDataFrame();
    String labelCol = inputBean.getLabelCol(); // replace nulls in this column
    String regCol = inputBean.getRegCol();     // use this column for the regression
    DataFrame newdf = df.select(labelCol, regCol);

    JavaRDD<LabeledPoint> parseddata = newdf.javaRDD().map(new Function<Row, LabeledPoint>() {
        private static final long serialVersionUID = 1L;

        public LabeledPoint call(Row r) throws Exception {
            Object labVObj = r.get(0);
            Object regVarObj = r.get(1);
            if (labVObj != null && regVarObj != null) {
                double[] regv = new double[] { (Double) regVarObj };
                Vector regV = new DenseVector(regv);
                return new LabeledPoint((Double) labVObj, regV);
            } else {
                // null label or regressor: fall back to 0.0 so training does not fail
                double[] regv = new double[] { 0.0 };
                Vector regV = new DenseVector(regv);
                return new LabeledPoint(0.0, regV);
            }
        }
    });

    // Build the model
    return LinearRegressionWithSGD.train(parseddata.rdd(), numIterations);
}

3.) Use the model to replace null values

// LinearRegressionModel model = dqu.doLinearReg(inputBean, 20);

// System.out.println("\n Intercept: " + model.intercept());

// System.out.println("Weight: " + model.weights().toString());

public DataFrame replaceNull(DataFrameProperty inputBean, final double intercept, final double weight) {

    DataFrame df = inputBean.getDataFrame();
    String labelCol = inputBean.getLabelCol();
    String regCol = inputBean.getRegCol();
    String uniqCol = inputBean.getUniqColName();
    SQLContext sqlContext = inputBean.getSqlContext();
    DataFrame newdf = df.select(labelCol, regCol, uniqCol);

    JavaRDD<Row> parseddata = newdf.toJavaRDD().map(new FunctionMap(intercept, weight));

    // Generate the schema for the replaced rows
    StructField[] fields = new StructField[3];
    fields[0] = DataTypes.createStructField(labelCol, DataTypes.DoubleType, true);
    fields[1] = DataTypes.createStructField(regCol, DataTypes.DoubleType, true);
    fields[2] = DataTypes.createStructField(uniqCol, DataTypes.DoubleType, true);
    StructType schema = DataTypes.createStructType(fields);
    DataFrame newdf1 = sqlContext.createDataFrame(parseddata, schema);

    // After the replacement, join back to the main DataFrame on the unique column
    return df.join(newdf1, uniqCol);
}

4.) Create and Train Model – Multi Linear

public LinearRegressionModel doMultiLinearReg(DataFrameProperty inputBean, int numIterations) {

DataFrame df = inputBean.getDataFrame();
String labelCol = inputBean.getLabelCol();
String [] inputCols = inputBean.getInputCols(); // multiple columns for regression
DataFrame newdf = df.select(labelCol, inputCols);

JavaRDD<LabeledPoint> parseddata = newdf.javaRDD().map(new Function<Row, LabeledPoint>() {
private static final long serialVersionUID = 1L;
public LabeledPoint call(Row r) throws Exception {
Object labVObj = r.get(0);
int colC = r.size();
double[] regv = new double[colC -1]; // -1 for first index

for (int i =1; i < colC; i++) {
Object regVarObj = r.get(i);
if (regVarObj != null)
regv[i-1] = (Double)regVarObj;
else
regv[i-1] = 0.0D; // Null replaced with 0.0
}
Vector regV = new DenseVector(regv);
if (labVObj != null ) {
return new LabeledPoint((Double) labVObj, regV);
} else {
return new LabeledPoint(0.0D, regV);
}}
});

// Building the model
return LinearRegressionWithSGD.train(parseddata.rdd(), numIterations);
}

5.) Use the model for null replacement

DataFrame newdf = dqu.replaceNull(inputBean,model.intercept(),model.weights().toArray()[0]);

6.) The FunctionMap class

public class FunctionMap implements java.io.Serializable, Function<Row, Row> {

    private static final long serialVersionUID = 1L;
    double _intercept, _weight;

    public FunctionMap(double intercept, double weight) {
        _intercept = intercept;
        _weight = weight;
    }

    public Row call(Row r) throws Exception {
        Object regVObj = r.get(1); // r.getDouble(1) would throw if this cell were null
        if (r.get(0) == null && regVObj != null) {
            double regv = (Double) regVObj;
            double newVal = _intercept + _weight * regv; // predicted value from the regression line
            return RowFactory.create(newVal, regv, r.getDouble(2));
        } else {
            return r;
        }
    }
}
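
Putting the snippets together, a possible end-to-end flow is sketched below. Note that the setters on DataFrameProperty are assumptions (only the getters appear above), so adjust them to the actual bean.

// Hypothetical end-to-end flow; the DataFrameProperty setters are assumed, not shown above.
DataFrameProperty inputBean = new DataFrameProperty();
inputBean.setDataFrame(df);          // DataFrame containing nulls in the label column
inputBean.setLabelCol("colA");       // column whose nulls will be replaced
inputBean.setRegCol("colB");         // column used for the regression
inputBean.setUniqColName("rowId");   // unique key used to join the result back
inputBean.setSqlContext(sqlContext);

LinearRegressionModel model = dqu.doLinearReg(inputBean, 20);      // step 2
DataFrame cleaned = dqu.replaceNull(inputBean,
        model.intercept(), model.weights().toArray()[0]);          // steps 3 and 5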

Biggest Problem of Big Data – Entity Resolution

Big Data has moved past the PoC phase, and different companies are at different stages of implementation. Data ingestion, data storage (data lake and EDW), data processing and data visualization have become quite mature, and there are many open source and proprietary products that solve these problems.

One major hurdle Big Data faces today is entity resolution – defining a business entity from a multitude of data sources. In the EDW (enterprise data warehouse) world, data was structured and sources were limited, and source keys were well defined in advance. So RDBMS joins (inner, outer, left outer, right outer, semi join, etc.) were enough to merge data from two different systems and tables.

In the Big Data world, there is hardly any key or attribute that runs across all the sources, and the keys of one system are useless elsewhere because the systems are completely independent of each other. So businesses have to define their own logic for merging data from different sources into one entity. To make it worse, RDBMS-style exact-match joins have to be replaced by fuzzy joins, because referential integrity across systems cannot be ensured.

The following is a practical approach to entity resolution:

1.) Pre-define your entities and the superset of attributes (coming from all data sources).

2.) An attribute may have multiple related values that should be mapped to the same attribute. Plan your storage accordingly (graph databases work well for storing these relationships).

3.) Merge data from multiple sources using business merge logic to create a virtual entity with a sizable set of attributes (we used Apache Spark for this).

Mapping attributes

Location

  • Zip, County, State, Lat/long
  • Nearby locations (+/- area)  – high propensity area
  • IP location

Time

  • Date/time stamp
  • Nearby time stamp (+/-) – Event happening before or after a period

Attributes

  • String match (fuzzy) – name, address, cause
  • Cardinal match – events sharing the same or a similar key
  • IP correlation (primary IP, secondary IP, IP from the same ZIP code)
  • Other business-logic-related merging rules

You can make your merge model machine-learning based so that it learns over time to do a more relevant merge.

4.) Right-size the merged columns

  • Remove transaction columns
  • Remove database columns
  • Remove technical identifier columns
  • Remove duplicate columns

5.) Search for this entity in your relationship graph database and rank the similar entities that score above a threshold. If no candidate exceeds the threshold, make a new entry in your entity table with the attributes of the searching entity.

6.) Take the highest-ranked entity and map the missing attributes from it. This is your final entity for the business.

How to enhance entities with changing attributes:

As with any real-world entity, attribute values keep changing. Map the attribute values of the searching entity to the highest-ranked entity and look at the differences. Let's say the IP value of an entity keeps matching its secondary value over time; then the secondary IP value becomes primary and the primary becomes secondary. If it is a new IP, add one more relationship node with the new value.

Ranking algorithm:

The business can assign different weightages to must-have, critical, important and good-to-have attributes, along with their matching thresholds. This model also matches against secondary or related values, which makes it more accurate. Let's say address is a must-match attribute: if a customer's other attributes match but the address does not, other models will reject the record, but this model will still boost the record if it matches his or her office address or another known address, which is the right outcome.
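
A rough sketch of such a weighted ranking is shown below. The attribute names, weights and the simple Levenshtein-based similarity are illustrative assumptions, not the production model; secondary values (e.g. an office address) can be scored the same way and the best score kept.

import java.util.Map;

// Illustrative weighted match score between a searching entity and a candidate.
// Attribute weights encode "must have" vs "good to have"; similarity is a
// normalized Levenshtein ratio.
public class EntityRankSketch {

    public static double levenshteinSimilarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / maxLen;
    }

    // weights example: {"address" -> 0.5, "name" -> 0.3, "ip" -> 0.2}
    public static double score(Map<String, String> searching, Map<String, String> candidate,
                               Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            String left = searching.get(w.getKey());
            String right = candidate.get(w.getKey());
            if (left != null && right != null)
                total += w.getValue() * levenshteinSimilarity(left, right);
        }
        return total; // compare against the business threshold; the highest score wins
    }
}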

Inbound and Outbound data movement from Data Lake

Description: Enterprise data sits in the enterprise data lake, and only internal employees have access to it for analysis and insight. This paper discusses a conceptual framework in which 3rd party data can be brought in for "fusion" with enterprise data to find new insights, and in which enterprise data can be made available to 3rd parties for "monetization".

Abstract: Businesses agree that data sharing is good for the ecosystem. On one hand it increases the top line by adding a new revenue channel through data monetization; on the other hand, hitherto unknown data brings new insight. The problem is that:

  • technology is not ready to ingest massive 3rd party data in an automated way, and
  • the data lake cannot emit data to 3rd parties.

There are action items around information mapping, data curation, security and compliance, and data movement from internal storage to public-facing storage that need to be addressed to enable the framework.

This paper discusses a conceptual framework (a process flow) that can be followed while creating inbound and outbound data flows on the data lake.

Business Process Flow for Data Movement:

  • Information Mapping
  • Data Curation
  • Entity Resolution
  • Security and Compliance

 

 

  • Information Mapping:

a.) Business Entity discovery

b.) Business Attributes discovery

c.) Sharable attributes

d.)  Joinable attributes

e.) Aggregation level

f.) Time to share

g.) Time to fetch

  • Data Curation (see the sketch after this process flow):

a.) Schema mapping

b.) Missing value replacement

c.) Dirty data drop

d.) Fuzzy Joining of data

e.) Inbound and outbound dataset creation

  • Entity Resolution:

a.) Finding the business entities across data sets

b.) Finding attributes which can affect behavior of entities

c.) Logic of correlation

d.) Logic for join

  • Security & Compliance:

a.) Deletion of personal identifiable data

b.) Masking of critical data

c.) Validating compliance
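
To make the Data Curation step above concrete, here is a minimal Spark 1.x sketch of items b) missing value replacement and c) dirty data drop. The column names and fill value are assumptions for illustration, not part of the framework.

import org.apache.spark.sql.DataFrame;

// Hedged sketch of two curation steps on an inbound 3rd-party DataFrame.
// Column names ("revenue", "quantity", "entityId", "zip") are illustrative only.
public class CurationSketch {

    public static DataFrame curate(DataFrame inbound) {
        // b) Replace missing numeric values with a business-approved default
        DataFrame filled = inbound.na().fill(0.0, new String[] { "revenue", "quantity" });
        // c) Drop rows where any key attribute is still null (dirty data)
        return filled.na().drop("any", new String[] { "entityId", "zip" });
    }
}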

Technical consideration:

a.) Downloadable or API based

b.) format of download

c.) choice of ETL tool

d.) choice of EAI tool

e.) Internal storage vs public-facing storage

f.) scalability

Business consideration:

a.) What to expose

b.) Whom to expose it to – registered users or the public

c.) What is the monetization policy (by download, by advertisement or by API usage)?

 

Conclusion: Every business is unique. This paper tries to lay out a conceptual guideline for fusing 3rd party data into your systems and for monetizing your data by providing it to 3rd parties. Companies are struggling to find a way to monetize their data and also to bring more insight into it.

A data sharing ecosystem will be a big boost to companies.

Why is Data Quality so difficult to solve?

Way back in 2006, when I started coding for the world's first open source data quality project (osDQ) http://sourceforge.net/projects/dataquality/ , data quality issues were prevalent. Years later, businesses have matured, computing power has increased many fold, storage has become cheaper and algorithms have improved. Still, data quality issues are as prevalent as ever, if not more so. That calls for a serious understanding of data quality issues – how they originate, how they propagate and, more importantly, how they can be solved:

1.) Technical solution: You will be completely off the mark if you try to solve data quality problems using brute computing force and advanced algorithms alone. Issues like fuzzy matching, record linking and golden data are best solved with technology, but like viruses, data quality issues mutate and keep coming back in different forms. You will only ever be in reactive mode and never be free of the viruses; as and when one appears, you will desperately look for a cure.

2.) Process-based solution: Setting up a data governance framework, enforcing data policies, modeling business entities, and having data stewards and an office of the chief data officer certainly help you reduce data quality issues. ISO certification for "data in motion" also helps an organisation to a large extent. Even then, the most optimistic data practitioner will not certify you free of data quality issues.

3.) Enterprise solution: You broke the "data silos", brought the data into the lake, did metadata categorization, created a semantic layer and defined an ontology – indeed a commendable job. Can you say we are now free of the data quality virus and that it is not going to come back?

All of the above approaches are right in their own way, and each solves a subset of data quality issues. But they are reactive and not standardized. Let's take a typical high-tech goods workflow – imaginary!

Designed in the USA, manufactured in China, curated and tested in India, assembled and packaged in the USA, sold in the UK. You can see the relevant data move across boundaries, languages, enterprises and governments. The company doing the testing in India has no influence on the data the chip produces (it probably does not even know who the manufacturer is) and cannot loop back to the manufacturer. A change in data format by the chip manufacturer will break all the quality testing. An enterprise can enforce processes within its own premises, but in a global world there are no takers.

Data quality problems are so difficult to solve because they are global, temporal, mutable, non-standard and span multiple agencies and countries.

The good news is that sincere steps are being taken in the right direction that will solve data quality issues in the long run.

Open data initiative: Governmental and semi-governmental departments are making their data publicly available. This will boost the adoption of standards and of technology-based solutions.

Cross-pollination of data: In the above example, let's assume the manufacturing company shares its data with the testing companies. It will help build a complete data footprint for the chips and will also reduce data glitches between the companies.

Data monetization: Once organizations start putting their data up for sale or for 3rd party consumption, the quality of internal and external data will improve. Metadata and datatypes will be publicly available and the data will pass through many eyes.

Is Data Preparation part of Data Quality?

As we all know, data quality has its lineage in CRM (Customer Relationship Management) systems doing address correction, and it later moved into MDM (Master Data Management). As business matures, Data Quality (DQ) is moving from being reactive to "data in motion".

Data Preparation, on the other hand, was traditionally part of the data mining process, done in batch mode and typically by a data scientist. Data preparation involves the steps taken to make data model-ready.

Now business is moving from an IT-driven world to a business-rule-driven world. What I mean is that data is now important only if it maps to business needs. Structural data profiling such as null, pattern and outlier checks has limited value; business is looking to DQ to validate complex business rules, and this is handled by the business's data stewards rather than by IT managers.

The data science world is also changing. Models have to be business driven rather than solutions to some theoretical mathematical problem, and data scientists are working with business users and data stewards to understand the business and the data. With this development, the boundary between data quality and data preparation is getting blurred.

1.) Data Fishing vs Entity Resolution: From the web, "Data dredging, sometimes referred to as data fishing, is a data mining practice in which large volumes of data are searched to find any possible relationships between data…"

In DQ, entity resolution is the process in which all the attributes from disparate sources are collected and an entity definition is created. As data management moves toward an integrated world, both processes will merge into one.

2.) Area of Mine vs Data Ownership: In data mining, 'area of mine' defines the data and records that data scientists are interested in modeling. In the DQ world, data ownership defines who owns the data. In a metadata-driven data world, both will be decided by a single metadata management system.

3.) Missing Values: Missing values are problematic for both data preparation and data quality. The only difference is that an inferred value is good enough for data mining, while data quality either looks for 100% accuracy (as in geo-encoding), replaces the value with a default, or discards the record as dirty data. Either way, both try to mitigate missing values. I expect business to define its missing data strategy and how missing values are to be interpreted.

4.) Noise Reduction vs Data Scrubbing: It is important to reduce noise to create a good model, and data mining has several techniques for doing so. Data quality has several ways to scrub and massage the data. Both sets of techniques are executed on data to make it compliant with the business.

Summary: As gaining insight increasingly becomes a business-function role, with IT acting only as a facilitator, I expect the data quality and data preparation processes to be merged into one, managed by data stewards.

Your thoughts!!

About the author: Vivek K Singh is a data architect and runs the world's first open source data quality and data preparation tool http://sourceforge.net/projects/dataquality/

Data Science for Data Scientist

Well, it sounds like an oxymoron – why would a data scientist need data science to help him? He would rather use data science to find insights and predict the future. Data science has the connotation that it is used for discovery, insight, prediction and artificial intelligence.

But data science can also be used for data preparation, which helps the data scientist develop the right model for discovery, insight, prediction and artificial intelligence.

1.) Regression: When DBAs do not know what data scientists are looking for, they dump everything on the data scientist – sometimes hundreds of attributes. Some attributes influence the model; some are only informational. IoT also dumps lots of machine data that is irrelevant to the model.

Data science can help the data scientist find the relevant attributes: regression algorithms can identify the attributes that have an effect on the outcome.

2.) Missing Values: Missing values are ubiquitous. Depending on the model, they can be discarded, given default values, or filled in with more advanced statistics that generate the missing numbers from other attributes such as time, location or customer behavior.

In some cases, missing values can be detrimental to the model. Data science can help the data scientist auto-generate realistic numbers that conform to the model.

3.) Clustering: Some popular clustering algorithms (like K-Means and nearest neighbour) require the data scientist to provide initial parameters to the model, which are then refined. If the initial numbers are way off, the model will fail.

Data scientists can use binning, basket analysis, outlier detection and no-frills clustering algorithms to figure out tentative initial parameters and build a good model, as in the sketch below.
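
For example, a quick MLlib K-Means pass over the prepared features can suggest tentative cluster centers before the real model is tuned. This is a minimal sketch; the feature RDD, k and the iteration count are assumptions.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Hedged sketch: estimate tentative cluster centers from a numeric feature RDD.
// "features" is assumed to be a JavaRDD<double[]> prepared elsewhere.
public class InitialCentersSketch {

    public static Vector[] tentativeCenters(JavaRDD<double[]> features, int k, int iterations) {
        JavaRDD<Vector> points = features.map(new Function<double[], Vector>() {
            private static final long serialVersionUID = 1L;
            public Vector call(double[] arr) {
                return Vectors.dense(arr);
            }
        });
        points.cache(); // K-Means makes multiple passes over the data
        KMeansModel model = KMeans.train(points.rdd(), k, iterations);
        return model.clusterCenters(); // use these as starting parameters for the refined model
    }
}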

4.) Anomaly Detection: Some systems (like alerting and event correlation) need anomalies to perform their task. But finding anomalies is like finding a needle in a haystack.

Data scientists can use anomaly detection algorithms such as one-class support vector machines, association rules and replicator neural networks to filter out the normal, in-bound class of data and isolate the anomalies.

The above examples are illustrative only. There are many ways data science can help with data preparation and business rule validation, and data science will extend into data preparation as well.

Author: Vivek Singh is a data architect developing the world's first open source data quality and data preparation tool http://sourceforge.net/projects/dataquality/