Inbound and Outbound Data Movement for the Data Lake

Description: Enterprise data sits in the Enterprise Data Lake, where only internal employees can access it for analysis and insight. This paper discusses a conceptual framework in which 3rd party data can be brought in for “fusion” with enterprise data to find new insights, and in which enterprise data can be made available to 3rd parties for “monetization”.

Abstract: Businesses agree that data sharing is good for the ecosystem. On one hand, monetizing data increases the top line by adding a new revenue channel; on the other hand, hitherto unknown data brings new insight. The problem is that

  • technology is not ready to infuse massive 3rd party data in an automated way, and
  • the Data Lake can’t emit data to 3rd parties.

There are action items around information mapping, data curation, security and compliance, and data movement from internal storage to public-facing storage, which need to be addressed to enable the framework.

This paper discusses a conceptual framework (a process flow) that can be followed when creating inbound and outbound data flows on a Data Lake.

Business Process Flow for Data Movement:

  • Information Mapping
  • Data Curation
  • Entity Resolution
  • Security and Compliance



  • Information Mapping:

a.) Business Entity discovery

b.) Business Attributes discovery

c.) Sharable attributes

d.)  Joinable attributes

e.) Aggregation level

f.) Time to share

g.) Time to fetch

  • Data Curation:

a.) Schema mapping

b.) Missing value replacement

c.) Dirty data drop

d.) Fuzzy Joining of data

e.) Inbound and outbound dataset creation

  • Entity Resolution:

a.) Finding the business entities across data sets

b.) Finding attributes which can affect behavior of entities

c.) Logic of correlation

d.) Logic for join

  • Security & Compliance:

a.) Deletion of personally identifiable data

b.) Masking of critical data

c.) Validating compliance
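As an illustration of the three Security & Compliance steps, here is a minimal Python sketch. The field names in `PII_FIELDS` and `MASKED_FIELDS`, the salt and the helper names are all hypothetical; a real policy would come from the compliance team, and a salted hash is only one possible masking technique, not a guarantee of compliance with any particular regulation.

```python
import hashlib

# Hypothetical policy: which fields are dropped and which are masked.
PII_FIELDS = {"name", "email", "phone"}
MASKED_FIELDS = {"account_id"}


def mask(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a truncated, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def prepare_outbound(record: dict) -> dict:
    """Delete PII fields, mask critical fields, pass everything else through."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            continue  # a.) deletion of personally identifiable data
        if key in MASKED_FIELDS:
            out[key] = mask(str(value))  # b.) masking of critical data
        else:
            out[key] = value
    return out


def is_compliant(record: dict) -> bool:
    """c.) validation: no PII field may survive in an outbound record."""
    return PII_FIELDS.isdisjoint(record)
```

The same masked value is produced for the same input, so a 3rd party can still join on `account_id` without seeing the real identifier.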

Technical considerations:

a.) Downloadable or API-based

b.) Format of download

c.) Choice of ETL tool

d.) Choice of EAI tool

e.) Internal storage vs. public-facing storage

f.) Scalability

Business considerations:

a.) What to expose

b.) To whom to expose it: registered users or the public

c.) What the monetization policy is (by download, by advertisement or by API usage)


Conclusion: Every business is unique. This paper tries to bring out a conceptual guideline on how to fuse 3rd party data into your system and how to monetize your data by providing it to 3rd parties. Companies are struggling to find a way to monetize their data and to bring more insight into it.

A data sharing ecosystem will be a big boost to companies.

Entity Resolution and Event Correlation – Datalake DQ

DQ (Data Quality) historically started with missing values and then moved into address correction and data enrichment (geo-encoding, standardization etc.). Data Quality tools have successfully solved these traditional data quality problems.

So far, DQ was single-source and single-domain. With the advent of the data lake, DQ has to adopt a new strategy. Event Correlation and Entity Resolution are going to be crucial for data lake validation, and DQ tools will have to provide these two must-have features for the data lake.

Entity Resolution: The data lake will hold data from multiple sources and domains, so it is critical to create the right entities from the data sets. The following are the prime components of Entity Resolution (ER):

a.) Fuzzy Join: Many joins are supported today (inner, outer, left outer, semi, equi etc.), but they match exact values. Dimensions from multiple sources, such as name or address, may not match exactly. A fuzzy join matches values that are similar but not identical, like John Smith and John Smithe.

b.) Algorithm for picking dimension values: The data lake will contain data from multiple CRMs and domains. While matching dimensional values, there will be conflicts over which one to pick. Let’s say SalesForce has one address, the sales mart has another, and the data you bought has a third. The entity should have one master address. The ER algorithm picks the right value based on timeliness, validity of source, most common occurrence etc.

c.) Entity Classification: Once the entity’s unique id and master dimensions are identified, the next step is classifying the entity using business rules. An entity may be outdated, inactive or of little relevance. Once an entity is classified and tagged, it can be used for further analysis or put in the historical data lake. An entity with a missing critical dimensional value will be dumped in the dirty data lake for further investigation.
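To make the first two ER components concrete, here is a minimal Python sketch. It assumes string similarity from the standard library’s `difflib.SequenceMatcher` as the fuzzy measure and a 0.85 threshold; both are illustrative choices, not a recommendation. The `pick_master` rule (most common value wins, ties broken by the latest update date) is one hypothetical way to combine “most common occurrence” and “timeliness”.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_join(left, right, key, threshold=0.85):
    """Pair each left row with its most similar right row, if similar enough."""
    pairs = []
    for l_row in left:
        best, best_score = None, threshold
        for r_row in right:
            score = similarity(l_row[key], r_row[key])
            if score >= best_score:
                best, best_score = r_row, score
        if best is not None:
            pairs.append((l_row, best))
    return pairs


def pick_master(values):
    """Pick one master value from (value, last_updated) pairs.

    last_updated is an ISO date string, so plain string comparison
    orders dates correctly. Most common value wins; ties go to the
    most recently updated value.
    """
    counts = Counter(value for value, _ in values)
    latest = {}
    for value, updated in values:
        latest[value] = max(latest.get(value, ""), updated)
    return max(counts, key=lambda v: (counts[v], latest[v]))
```

With these definitions, “John Smith” from one source pairs with “John Smithe” from another, while an unrelated name falls below the threshold and is left unmatched. The nested loop is quadratic, so a real data lake would need blocking or indexing before comparing rows.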

Event Correlation: Theoretically, an event is also an entity, but I am putting it under a different header because it is temporal in nature and the algorithms used for correlating events are different.

a.) Range Bound Correlation: Two correlated events will hardly ever occur at the same time. One event fires another event, which may lag in time, in place or in both. Along with the event identifier fields, a range bound dimension is used to correlate events. Business rules decide the width of the boundary.

b.) Event aggregation: An event can fire many sub events and super events. All these events have to be collapsed into one related event. Event Correlation (EC) algorithms map all these events to the related event and its cause, and bring them into a human-readable format.

c.) Noise reduction: An aggregated event may be a false event or noise. Business rules decide which events should be carried forward (assuming they have a strong correlation with the business) and which should be dropped. Events also go through business classification to rank their importance.
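Range bound correlation and aggregation can be sketched in a few lines of Python. The sketch assumes each event is a dict with a `device` identifier and a `ts` timestamp (both hypothetical field names) and that the range bound dimension is time, with a 30 second window standing in for the business rule that sets the width of the boundary.

```python
from datetime import datetime, timedelta


def correlate(events, window=timedelta(seconds=30)):
    """Group events per device when they fall within `window` of the
    first event of the current group for that device."""
    groups = []
    open_groups = {}  # device id -> the group currently being filled
    for event in sorted(events, key=lambda e: e["ts"]):
        group = open_groups.get(event["device"])
        if group is not None and event["ts"] - group[0]["ts"] <= window:
            group.append(event)  # lagging event joins its trigger
        else:
            group = [event]  # boundary exceeded: start a new group
            groups.append(group)
            open_groups[event["device"]] = group
    return groups
```

Each returned group is one aggregated event; the first element is the likely cause and the rest are the events that followed it within the boundary. A noise reduction step could then drop or rank groups using business rules.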

Conclusion: The data lake will bring new challenges to Data Quality, which will go through a transformation to solve the new problems. DQ will move from:

i) Single Source -> Data Lake

ii) Structure Analysis -> Mapping Entity

iii) Operational -> Analysis


Dew Computing and Data Compression

In my last post, Dew Computing and Smart Machine, I explained what “Dew Computing” is and how peer-to-peer communication will take Cloud Computing to Fog Computing and finally to Dew Computing.

Dew Computing will be network intensive: millions of IoT (Internet of Things) devices will talk to each other and share data. In this blog, I will discuss data compression, which will ease the network load and improve data quality for IoT devices.

Today, we have a multitude of compression algorithms that do a good job of saving space on disk and in in-memory calculation, which is precisely their objective. But we have few compression algorithms for compression “over the network”, which will be crucial for peer-to-peer or Dew Computing.

Nature of Data: IoT devices emit data at regular time intervals where, barring some measure values, all other dimensional data (like location, make, network info etc.) stays the same. Unfortunately, this static data is carried in the same emitted data frame, which chokes the network with duplicate values. Also, the granularity of the time intervals is milliseconds, seconds or minutes, so measure values are not expected to change abruptly. An out-of-range value will trigger data quality concerns.

Logic for Compression: If the static data is separated from the dynamic data and sent once, it will reduce network load tremendously. The compression algorithm will receive a snapshot (boilerplate) of the data frame at regular intervals, and in between only changed values will be conveyed over the network, continuously. The algorithm will know how to rebuild the full data frame (if needed) from the incremental values. An example is shown below.

[Figure: compression logic (Time Series Compression)]
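The snapshot-plus-delta idea can also be sketched in Python. This is a minimal illustration, assuming every data frame is a flat dict with the same keys; the message shapes (`snapshot` / `delta`) and the `snapshot_every` parameter are hypothetical names, not part of any existing protocol.

```python
def encode(frames, snapshot_every=100):
    """Send a full snapshot at regular intervals, otherwise only the
    fields whose values changed since the previous frame."""
    previous = None
    for i, frame in enumerate(frames):
        if previous is None or i % snapshot_every == 0:
            yield {"snapshot": dict(frame)}
        else:
            delta = {k: v for k, v in frame.items() if previous.get(k) != v}
            yield {"delta": delta}
        previous = frame


def decode(messages):
    """Rebuild the full data frames from snapshots and deltas."""
    current = {}
    for message in messages:
        if "snapshot" in message:
            current = dict(message["snapshot"])
        else:
            current.update(message["delta"])
        yield dict(current)
```

Static dimensions such as location or make travel only in the snapshot; a frame in which nothing changed costs an empty delta. The periodic snapshot lets a receiver that joined late or lost a message resynchronize.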

Data Quality Trigger: The algorithm can be configured to check whether dynamic values are range bound and within the accepted variation from the previous value. The compression algorithm anyway inspects the stream of values to create incremental data and has access to historical values. A data quality check here saves a round trip to the server to validate the data, which saves both latency and load.
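A minimal Python sketch of such a trigger follows. The function name, the problem labels and the temperature-style limits in the usage note are assumptions made for illustration; the real bounds would be configured per device and measure.

```python
def check_reading(value, previous, low, high, max_step):
    """Return the data quality problems found for one reading.

    low/high define the accepted range; max_step is the largest
    accepted change from the previous reading. An empty list means
    the reading is clean.
    """
    problems = []
    if not low <= value <= high:
        problems.append("out_of_range")
    if previous is not None and abs(value - previous) > max_step:
        problems.append("abrupt_change")
    return problems
```

For example, `check_reading(30.0, 21.0, -40, 85, 2.0)` flags an abrupt change even though 30.0 is inside the accepted range, and it does so on the device, before anything is sent over the network.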



Conclusion: An understanding among IoT device makers, and a consortium of them, will define the protocols for time series compression algorithms, which will hugely benefit all involved parties.

Vivek Singh is the chief contributor to the Open Source Data Quality and Data Preparation tool.

Why Is Data Quality So Difficult to Solve?

Way back in 2006, when I started coding for the world’s first open source data quality project (osDQ), data quality issues were prevalent. Years later, businesses have matured, computing power has increased manyfold, storage has become cheaper and algorithms have improved. Still, data quality issues are as prevalent, if not more so. That calls for a serious understanding of data quality issues: how they originate, how they propagate and, more importantly, how they can be solved:

1.) Technical solution: You will be completely off the mark if you try to solve data quality problems using brute computing force and advanced algorithms alone. Issues like fuzzy matching, record linking and golden data are best solved with technology, but like viruses, data quality issues mutate and keep coming back in different forms. You will always be in reactive mode and never free of the viruses. As and when one arrives, you will desperately look for a cure.

2.) Process based solution: Setting up a data governance framework, enforcing data policies, modeling business entities, and having stewards and an office of the chief data officer certainly help reduce data quality issues. Having ISO certification for “data in motion” also helps an organisation to a large extent. Even then, the most optimistic data practitioner will not certify you “free of data quality issues”.

3.) Enterprise solution: You broke the “data silos”, brought the data to the lake, did metadata categorization, created a semantic layer and defined an ontology: a commendable job indeed. Can you say that you are free of the data quality virus and that it is not going to come back?

All of the above approaches are right in their own way, and each solves a subset of data quality issues. But they are reactive and not standardized. Let’s take a typical (imaginary) high-tech goods workflow.

Designed in the USA, manufactured in China, curated and tested in India, assembled and packaged in the USA, sold in the UK. You can see the relevant data moving across boundaries, languages, enterprises and governments. A company doing the testing in India has no influence on the data the chip produces (it probably doesn’t even know who the manufacturer is) and can’t loop back to the manufacturer. A change in data format by the chip manufacturer will break all quality testing. An enterprise can enforce processes within its premises, but in a global world there are no takers.

Data quality problems are so difficult to solve because they are global, temporal, mutable and non-standard, and span multiple agencies and countries.

The good news is that sincere steps are being taken in the right direction, which will solve data quality issues in the long run.

Open Data Initiative: Governmental and semi-governmental departments are making their data publicly available. This will advance standards adoption and technology based solutions.

Cross-Pollination of data: In the above example, let’s assume the manufacturing company shares its data with the testing companies. This will help build the complete data footprint of the chips and will also reduce data glitches between companies.

Data Monetization: Once organizations start putting up their data for sale or 3rd party consumption, the quality of both internal and external data will improve. Metadata and data types will be publicly available, and the data will go through many eyes.

Next Generation BI Expectations

Let me start this topic by drawing a parallel from the search domain. The WWW has lots of information, and search is a way to get the information you are looking for. Similarly, a company has a multitude of information, stored in structured and unstructured form, and business intelligence tools extract the data for you. If you have followed the evolution of search: first, Yahoo search was very structured and gave information inside categories (metadata driven); then search engines allowed you to write natural sentences for search; and then Google optimized it by improving indexing and relevancy.

Business Intelligence companies are following the same pattern. Traditional BI tools are very structured: warehouse, cube, pivot. You can only look at data that is inside the mart, and can navigate only in very structured ways, like roll up, drill down, record linking and dimension navigation. The next generation of BI tools use big data technology to bring in large volumes of data and also provide a semantic layer to give a “Google search” like interface. Some companies call it a “smart machine”. Next generation BI tools will have:

1.) Elastic Search and Spark / Big data technology: Scalability, machine learning, fuzziness, connectors, statistical prediction and classification will be taken for granted. Open source components embedded inside the tools will make these features a commodity. They will no longer be a differentiator.

2.) Collaborative, informative and engaging reports: Today’s dull reports will become more collaborative. Think about looking at a sales report that also embeds a video of the CEO making a sales prediction, along with your competitors’ public information and relevant 3rd party information. A report will transform into an information portal that is more engaging and social.

3.) Metadata Consolidation: Focus will shift from data to metadata, because data processing will be taken care of by the platform. Data and metadata from different systems will come to the data lake, which will use namespaces to differentiate the data. Business expertise will go into making entity resolution automatic and data modeling dynamic.

4.) Interpreting business rules: In today’s systems we codify business rules, but they are not reusable by business intelligence systems. Today it is very cumbersome and time intensive to re-interpret business rules. Next generation BI tools will extract business rules from CRM and transaction systems and validate them against the data. Business rule models will be more comprehensive and will not live in silos.

5.) Right Information: Machine learning and artificial intelligence are certainly overrated. They will not solve your business problem, but they will find anomalies, outliers, abnormalities, clusters, good data, bad data etc. to help you decide better. They will not replace you but will help you.

6.) Reusing the existing data warehouse: A lot of money has already flowed into existing warehouses. New generation tools will provide a wrapper around the EDW to make it search friendly and integrate it with the data lake, using indexing, elastic search, multi-facet search etc.

7.) User experience: In today’s world dashboards are personalized, but there is not much freedom inside a dashboard. New BI tools will be responsive in the true sense, providing entity hopping, 360 degree views and changing dimension centricity on the fly. Dashboards will also be mapped to User stories to

8.) Trust of data: In spite of nice visualization, confidence in data is very low. BI tools are used to see the trend and the bigger picture, but the values are taken only as indicative, not for operational purposes. Data Governance and Data Quality will be a big push for next generation BI tools.

Disclaimer: Smart Machine is a term used by (a next generation BI tool) to describe their systems, which use advanced algorithms to deliver the features mentioned above.

About Author: Vivek Kumar Singh is a Business Intelligence professional and manages an open source data quality project at

Dew Computing and Smart Machines

Smart Machine and Smart Insight are terms coined by DataRPM, a cognitive and self-service Business Intelligence company. They call it a smart machine because the machine learns the behavior of the customer using artificial intelligence and machine learning algorithms. Yes, it is a smart machine: a smart machine on the cloud.

Cloud computing has clusters of machines which store and compute petabytes of data. But as the name suggests, all the data has to come to the cloud for computing, which has its own pros and cons. With big data technology and virtualization, the cloud was a natural choice.

Then comes Fog Computing, a term I guess was coined by Cisco, where calculation and computing are done at the router, end point or last mile level. The argument was that fog is a layer far below the cloud and far thinner than the cloud. So routers, end points and the last mile work as fog computers and primarily have smart algorithms for distribution and load balancing of data. Fog is homogeneous too. As wifi becomes ubiquitous, fog computing is a natural choice.

As IoT (Internet of Things) gains acceptance, billions of devices and sensors will be in the field talking to each other. They will need very real-time smartness, and I call it Dew Computing.

“Like dew, it will be at ground zero, condensed and all over the place. Dew Computing will make each machine a smart machine to solve problems at IoT scale.”

Characteristics of Dew Computing:

a.) Smart software with a size in KBs: These smart machines will work with very low RAM, in the range of 512 KB. So the software images will be in the low KBs so that they can be loaded into the RAM of these probes, devices or sensors.

b.) Compression logic optimized for time series: Today’s big data compression logic is optimized for full data compression. Smart machines will not have the storage to hold the complete data set. They will have compression logic optimized for time series, where only the delta is stored and the algorithm knows how to rebuild the full data, if required.

c.) Self Regulatory: Smart machines will have their thresholds preset or will have algorithms to adjust them. If a value reaches the threshold, the smart machine will raise an exception itself and give the reason for the error. This saves time, as otherwise a data analyst has to find the outlier and then a mechanic has to find the reason for the failure.

I think future smart machines will look like the above, spread over the field like dew. That is why I call it Dew Computing.

Your thoughts !!

Author: Vivek Singh is a contributor to the world’s first open source data quality and data preparation tool.

Is Data Preparation Part of Data Quality?

As we all know, the lineage of data quality comes from CRM (Customer Relationship Management) systems doing address correction, and it later moved into MDM (Master Data Management). As business matures, Data Quality (DQ) is moving from reactive to “data-in-motion”.

Data Preparation, on the other hand, was traditionally part of the Data Mining process, done in batch mode and probably by a data scientist. Data Preparation involves the steps taken to make data ready for modeling.

Now business is moving from an IT-driven world to a business-rule-driven world. What I mean is that data is now important only if it maps to business needs. Structural data profiling (nulls, patterns, outliers etc.) has limited value. Business is looking to DQ to validate complex business rules, and this is handled by the data stewards of the business rather than by IT managers.

The Data Science world is also changing. Models have to be business driven rather than solving some theoretical mathematical problem. Data scientists are working with business users and data stewards to understand the business and the data. With this new development, the boundary line between Data Quality and Data Preparation is getting blurred.

1.) Data Fishing vs Entity Resolution: From the web: “Data dredging, sometimes referred to as data fishing is a data mining practice in which large volumes of data are searched to find any possible relationships between data…”

In DQ, entity resolution is the process in which all the attributes from disparate sources are collected and an entity definition is created. As data management moves toward an integrated world, both processes will merge into one.

2.) Area of Mine vs Data Ownership: In Data Mining, the ‘Area of Mine’ defines the data / records that data scientists are interested in modeling. In the DQ world, data ownership defines who owns the data. In a metadata driven data world, both will be decided by a single metadata management system.

3.) Missing Values: Missing values are problematic for both data preparation and data quality. The only difference is that an inferred value is good enough for data mining, while data quality either looks for 100% accuracy (as in geo-encoding), replaces the value with a default or discards the record as dirty data. Whatever the method, both try to mitigate missing values. I expect businesses will define their own missing data strategy and how to interpret missing values.

4.) Noise reduction vs Data Scrubbing: It is important to reduce noise to create a good model, and data mining has several techniques to do so. Data Quality has several ways to scrub and massage the data. These techniques are executed on the data to make it compliant with the business.
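The three missing value strategies from point 3 can be put side by side in a small Python sketch. The function name, the strategy labels and the use of the mean as the inferred value are assumptions made for illustration; a business would plug in its own inference and defaults.

```python
from statistics import mean


def handle_missing(rows, field, strategy, default=None):
    """Return new rows with missing values of `field` handled.

    strategy is one of:
      "infer"   - fill with the mean of the present values
                  (data preparation; needs at least one present value)
      "default" - fill with a fixed default value (data quality)
      "discard" - drop the row as dirty data (data quality)
    """
    present = [r[field] for r in rows if r.get(field) is not None]
    if strategy == "discard":
        return [dict(r) for r in rows if r.get(field) is not None]
    if strategy == "infer":
        fill = mean(present)
    elif strategy == "default":
        fill = default
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [
        dict(r, **{field: fill}) if r.get(field) is None else dict(r)
        for r in rows
    ]
```

The input rows are never modified, so the same data set can be run through different strategies and the results compared before the business settles on one.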

Summary: As getting insight is increasingly becoming a business function while IT works only as a facilitator, I expect the data quality and data preparation processes will merge into one, managed by data stewards.

Your thoughts !!

About Author: Vivek K Singh is a data architect and runs the world’s first open source data quality and data preparation tool.