How to create a Data Lake

As the data lake goes mainstream, two obvious questions come up: a.) how do you build one, and b.) will it work?

The latter comes mostly from the sponsors of failed Business Intelligence projects, who still feel the pain. (I have written about that in an earlier blog post.)

Then there are further questions about time and resources. Do I need to throw away my EDW (enterprise data warehouse)? Is Hadoop (big data) a must for a data lake? What are the tools to create a data lake?

In this blog, I try to answer these questions from a practitioner's perspective.

The data lake is evolution, not disruption : As storage gets cheaper, data practitioners thought of co-locating raw data with processed data so that data engineers and data scientists have quick access to the raw data. This shift changes “Extraction” to “Ingestion”, where data is loaded “AS IS”. The shift also brings some changes to ETL tools.

But it does not require you to throw away your existing ETL jobs and EDW. If designed carefully, they can be reused in the data lake. Hadoop is not a must either. A data lake can be created on an RDBMS, a NoSQL store, HDFS or any file system. Hadoop/HDFS, being the cheapest, is the preferred choice, but if you have unlimited corporate licences for any of the others, you can use that for the data lake.

How to build it : Divide your data lake into 4 logical and physical spaces.


i) Raw Space : (It is a combination of the Extraction and Staging of a traditional warehouse, with state information)

a.) load data “as is”

b.) define folder structure for different sources

c.) Meta Information : Time of load, load volume, load duration, load status

d.) Useful for Auditing and Data Lineage

ii) Qualified Space : (It is a combination of the Transformation and Joining of a traditional warehouse, with a data dictionary)

a.) run data quality and Entity Resolution on “as is” data

b.) define folder structure for different partitions ( time based, region based, verticals based)

c.) Meta Information : Data Type, Expected values, manipulable

d.) Useful for Insight and Prediction

iii) Data warehouse Space :( You can reuse existing EDW here)

a.) load recent data into EDW

b.) define reporting and discovery parameters

c.) Meta Information : Data Dictionary, Granularity and Latency

d.) Useful for operational reports and discovery

iv) Collaborative Space : (In this space you expose your data to 3rd parties or fetch data from 3rd parties)

a.) find out the data to be exposed

b.) define security and data format

c.) Meta Information : Data Dictionary, Aggregation level, data format

d.) Useful for data Monetization and Data Fusion
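To make the Raw Space a little more concrete, here is a minimal sketch of an "as is" load that also records the meta information listed above. The folder layout (`raw/<source>/<load date>/`), the function name `ingest_raw` and the metadata keys are my own assumptions for illustration. They are not a standard, and a real lake on HDFS or an RDBMS would use its own client instead of a local file copy.

```python
import shutil
import time
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(source_file: Path, lake_root: Path, source_name: str) -> dict:
    """Copy a file into the raw space unchanged and return its load metadata."""
    loaded_at = datetime.now(timezone.utc)
    # Assumed layout: one folder per source, then one folder per load date.
    target_dir = lake_root / "raw" / source_name / loaded_at.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name

    started = time.perf_counter()
    status = "success"
    try:
        # Load "as is": no parsing, no cleanup, no transformation.
        shutil.copyfile(source_file, target)
    except OSError:
        status = "failed"
    duration = time.perf_counter() - started

    # Meta information kept next to the data for auditing and lineage.
    return {
        "source": source_name,
        "target": str(target),
        "time_of_load": loaded_at.isoformat(),
        "load_volume_bytes": target.stat().st_size if status == "success" else 0,
        "load_seconds": round(duration, 6),
        "load_status": status,
    }
```

In practice the returned dictionary would be appended to an audit table or log, so that every file in the raw space can be traced back to a load.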

Conclusion : The above are general guidelines for creating a data lake. As every business is different, every data lake is different. The choice of tools and storage will vary according to your requirements. Though storage is cheap, processing huge amounts of data needs lots of CPU and RAM, which is serious money. So be careful with the volume of data. The usual advice is to bring everything into the data lake, but it is better to start small and see the value before bringing in everything.


Inbound and Outbound data movement from Data Lake

Description: Enterprise data sits in the Enterprise Data Lake, and only internal employees have access to it for analysis and insight. This paper discusses a conceptual framework in which 3rd party data can be brought in for “fusion” with enterprise data to find new insights, and enterprise data can be made available to 3rd parties for “monetization”.

Abstract: Businesses agree that data sharing is good for the ecosystem. On one hand, it increases the top line by adding a new revenue channel through data monetization; on the other hand, hitherto unknown data brings new insight. The problem is,

  • technology is not ready to infuse massive 3rd party data in an automated way, and
  • the Data Lake can’t emit data to 3rd parties.

There are action items around information mapping, data curation, security and compliance, and data movement from internal storage to public facing storage, which need to be addressed to enable the framework.

This paper discusses a conceptual framework (a process flow) which can be followed while creating an inbound and outbound data flow framework on a Data Lake.

Business Process Flow for Data Movement:

  • Information Mapping
  • Data Curation
  • Entity Resolution
  • Security and Compliance



  • Information Mapping :

a.) Business Entity discovery

b.) Business Attributes discovery

c.) Sharable attributes

d.)  Joinable attributes

e.) Aggregation level

f.) Time to share

g.) Time to fetch

  • Data Curation:

a.) Schema mapping

b.) Missing value replacement

c.) Dirty data drop

d.) Fuzzy Joining of data

e.) inbound and outbound dataset creation

  • Entity Resolution :

a.) Finding the business entities across data sets

b.) Finding attributes which can affect behavior of entities

c.) Logic of correlation

d.) Logic for join

  • Security & Compliance:

a.) Deletion of personally identifiable data

b.) Masking of critical data

c.) Validating compliance
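As an illustration of the Data Curation and Security & Compliance steps, here is a minimal sketch that builds an outbound data set from inbound records. All field names, the schema map, the default values and the choice of a truncated SHA-256 hash for masking are my assumptions. A plain unsalted hash is only a stand-in here, and real compliance rules may require stronger anonymization.

```python
import hashlib
from typing import Optional

# Hypothetical mapping from a 3rd party schema to the internal schema.
SCHEMA_MAP = {"cust_name": "name", "zip": "postal_code", "amt": "amount",
              "ssn": "ssn", "email": "email"}
DEFAULTS = {"postal_code": "UNKNOWN"}  # missing value replacement
REQUIRED = {"name", "amount"}          # records missing these count as dirty
PII_DROP = {"ssn"}                     # personally identifiable fields to delete
MASK = {"email"}                       # critical fields to mask before sharing


def curate(record: dict) -> Optional[dict]:
    """Map the schema, replace missing values, and drop dirty records."""
    mapped = {SCHEMA_MAP[k]: v for k, v in record.items() if k in SCHEMA_MAP}
    for field, default in DEFAULTS.items():
        if mapped.get(field) in (None, ""):
            mapped[field] = default
    if any(mapped.get(field) in (None, "") for field in REQUIRED):
        return None  # dirty data drop
    return mapped


def secure(record: dict) -> dict:
    """Delete PII fields and mask critical fields with a one-way hash."""
    safe = {k: v for k, v in record.items() if k not in PII_DROP}
    for field in MASK:
        if field in safe:
            digest = hashlib.sha256(str(safe[field]).encode()).hexdigest()
            safe[field] = digest[:12]
    return safe


def build_outbound(records: list) -> list:
    """Curate every record, then secure the ones that survive curation."""
    curated = (curate(r) for r in records)
    return [secure(r) for r in curated if r is not None]
```

Keeping curation and security as separate functions means the same curation step can feed the internal (inbound) data set, while only the outbound data set goes through `secure`.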

Technical consideration:

a.) downloadable or API based,

b.) format of download

c.) choice of ETL tool

d.) choice of EAI tool

e.) internal storage Vs public facing storage

f.) scalability

Business consideration:

a.) What to expose

b.) Whom to expose it to: registered users or the public

c.) What is the monetization policy (by download, by advertisement or by API usage)?


Conclusion: Every business is unique. This paper tries to lay out a conceptual guideline on how to fuse 3rd party data into your system and monetize your data by providing it to 3rd parties. Companies are struggling to find a way to monetize data and to bring more insight into their data.

A data sharing ecosystem will be a big boost to companies.

Entity Resolution and Event Correlation – Datalake DQ

DQ (Data Quality) historically started with missing values and then moved into address correction and data enrichment (geo encoding, standardization etc.). Data quality tools have successfully solved traditional data quality problems like the ones discussed above.

So far, DQ was single source and single domain. With the advent of the data lake, DQ has to adapt to a new strategy. Event Correlation and Entity Resolution are going to be crucial for data lake validation, and DQ tools have to provide these 2 must-have features for the data lake.

Entity Resolution : A data lake will hold data from multiple sources and domains. It will be critical to create the right entities from the data sets. The following will be the prime components of Entity Resolution (ER):

a.) Fuzzy Join : We have many joins (inner, outer, left outer, semi, equi etc.) supported today, but they match exact values. Dimensions from multiple sources may not match exactly (like name or address). A fuzzy join will match values which are similar but not identical, like John Smith and John Smithe.

b.) Algorithm for picking dimension values : A data lake will contain data from multiple CRMs and domains. While matching dimensional values, there will be conflicts over which one to pick. Let's say Salesforce has one address, the sales mart has another, and the data you bought has yet another. The entity should have one master address. The ER algorithm will pick the right value based on timeliness, validity of source, most common occurrence etc.

c.) Entity Classification: Once the entity unique id and master dimensions are identified, the next step involves classifying the entity using business rules. These entities may be outdated, inactive or have little relevance. Once an entity is classified and tagged, it can be used for further analysis or put in a historical data lake. An entity with a missing critical dimensional value will be dumped in a dirty data lake for further investigation.
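The first two components above can be sketched in a few lines. This is only an illustration under my own assumptions: similarity is measured with the standard library's `difflib.SequenceMatcher`, the 0.9 threshold is arbitrary, and the master value is picked by most common occurrence with the most recent update as a tie-breaker. A production ER tool would use blocking, better string metrics and source-level trust scores.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_join(left, right, field, threshold=0.9):
    """Pair each left record with its most similar right record, if similar enough."""
    pairs = []
    for l_rec in left:
        best = max(right, key=lambda r_rec: similarity(l_rec[field], r_rec[field]),
                   default=None)
        if best is not None and similarity(l_rec[field], best[field]) >= threshold:
            pairs.append((l_rec, best))
    return pairs


def pick_master(candidates):
    """Pick one master value from (value, updated_on) pairs.

    The most common value wins; ties go to the most recently updated value.
    updated_on is an ISO date string, so plain string comparison orders it.
    """
    counts = Counter(value for value, _ in candidates)
    latest = {}
    for value, updated_on in candidates:
        latest[value] = max(latest.get(value, ""), updated_on)
    return max(counts, key=lambda v: (counts[v], latest[v]))
```

With these helpers, "John Smith" from one source and "John Smithe" from another end up in the same pair, and the conflicting addresses of the matched records can then be passed to `pick_master` to settle on one master address.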

Event Correlation: Theoretically, an event is also an entity, but I am putting it under a different header because it is temporal in nature and the algorithms used for correlating events would be different.

a.) Range Bound Correlation : Two correlated events will hardly ever occur at the same time. One event will fire another event, which may lag in time or place or both. Along with event identifier fields, a range bound dimension will be used to correlate events. Business rules will decide the width of the boundary.

b.) Event aggregation : An event can fire many sub events and super events. All these events have to be collapsed into one related event. Event Correlation (EC) algorithms will map all these events to the related event and its cause, and bring them into a human readable format.

c.) Noise reduction: An aggregated event may be a false event or noise. Business rules will decide which events should be carried forward (assuming they have a strong correlation with the business) and which should be dropped. Events will also go through business classification to rank their importance.
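The three steps above can be sketched as a small pipeline. The `Event` fields, the 30 second window, the "earliest event is the cause" rule and the severity cut-off are all assumptions I made for illustration; in a real system these would come from business rules.

```python
from dataclasses import dataclass


@dataclass
class Event:
    device_id: str    # event identifier field
    name: str
    timestamp: float  # seconds since an arbitrary epoch
    severity: int


def correlate(events, window_seconds=30.0):
    """Group events of the same device whose gaps stay within the window."""
    groups = []
    for event in sorted(events, key=lambda e: (e.device_id, e.timestamp)):
        last = groups[-1][-1] if groups else None
        if (last is not None and last.device_id == event.device_id
                and event.timestamp - last.timestamp <= window_seconds):
            groups[-1].append(event)  # inside the range bound: same incident
        else:
            groups.append([event])    # outside the range bound: new incident
    return groups


def aggregate(group):
    """Collapse a group into one related event; the earliest is taken as the cause."""
    cause = group[0]
    return {
        "device_id": cause.device_id,
        "cause": cause.name,
        "effects": [e.name for e in group[1:]],
        "severity": max(e.severity for e in group),
    }


def reduce_noise(summaries, min_severity=2):
    """Drop aggregated events that the (assumed) business rule treats as noise."""
    return [s for s in summaries if s["severity"] >= min_severity]
```

A typical call chains the three steps: `reduce_noise([aggregate(g) for g in correlate(events)])`. Correlation on place instead of time would follow the same pattern with a distance in place of the timestamp difference.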

Conclusion: The data lake will bring new challenges to Data Quality, which will go through a transformation to solve new problems. DQ will move from :

i) Single Source -> Data Lake

ii) Structure Analysis -> Mapping Entity

iii) Operational -> Analysis


Dew Computing and Data Compression

In my last post, Dew Computing and Smart Machine, I explained what “Dew Computing” is and how peer-to-peer communication will take Cloud Computing to Fog Computing and finally to Dew Computing.

Dew Computing will be network intensive: millions of IoT (Internet of Things) devices will talk to each other and share data. In this blog, I will discuss data compression that will ease the network load and improve data quality for IoT devices.

Today, we have a multitude of compression algorithms which do a good job of saving space on disk and in in-memory calculation, and that is precisely their objective. But we have few compression algorithms that compress “over the network”, which will be crucial for peer-to-peer or Dew Computing.

Nature of Data : IoT devices will emit data at regular time intervals where, barring some measure values, all the dimensional data (like location, make, network info etc.) will stay the same. Unfortunately, this static data will also be in every emitted data frame, which will choke the network with duplicate values. Also, the granularity of the time intervals will be milliseconds, seconds or minutes, so measure values are not expected to change abruptly. An out of range value will trigger data quality concerns.

Logic for Compression: If the static data is separated from the dynamic data and sent once, it will reduce network load tremendously. The compression algorithm will receive a snapshot (boilerplate) of the data frame at regular intervals, and only changed values will be conveyed over the network continuously. The algorithm will know how to rebuild the full data frame (if needed) from the incremental values. An example is below.

[Figure: Time Series Compression (compression logic)]
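The snapshot-plus-delta idea can also be sketched in code. This is a minimal illustration, not a proposed protocol: the message shape (`type` and `data` keys) and the `snapshot_every` parameter are my assumptions, and the sketch assumes every data frame carries the same set of fields.

```python
def encode(frames, snapshot_every=100):
    """Yield a full snapshot every N frames and only the changed fields in between."""
    previous = None
    for index, frame in enumerate(frames):
        if previous is None or index % snapshot_every == 0:
            # Static and dynamic data travel together only in the snapshot.
            yield {"type": "snapshot", "data": dict(frame)}
        else:
            delta = {k: v for k, v in frame.items() if previous.get(k) != v}
            yield {"type": "delta", "data": delta}
        previous = frame


def decode(messages):
    """Rebuild the full data frames from snapshots and incremental values."""
    current = {}
    for message in messages:
        if message["type"] == "snapshot":
            current = dict(message["data"])
        else:
            current.update(message["data"])
        yield dict(current)
```

For a sensor that repeats its location, make and network info in every frame, each delta message carries only the measure values that actually changed, and an unchanged reading produces an empty delta.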

Data Quality Trigger : The algorithm can be configured to check whether dynamic values are range bound and within the accepted variance from the previous value. The compression algorithm anyway checks the stream of values in each time frame to create incremental data, and it has access to historical values. A data quality check here will save a round trip to the server to validate the data, which saves both latency and load.
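A minimal sketch of such a trigger is below. The limits (`low`, `high`) and the maximum allowed jump between consecutive values (`max_step`) are assumed to be configured per measure; the flag names are mine.

```python
def check_reading(value, previous, low, high, max_step):
    """Return the data quality flags raised by one measure value."""
    flags = []
    if not (low <= value <= high):
        flags.append("out_of_range")
    if previous is not None and abs(value - previous) > max_step:
        flags.append("abrupt_change")
    return flags


def scan(values, low, high, max_step):
    """Check a stream on the device; yield (index, value, flags) for suspicious values."""
    previous = None
    for index, value in enumerate(values):
        flags = check_reading(value, previous, low, high, max_step)
        if flags:
            yield index, value, flags
        previous = value
```

Because the check only needs the previous value, it can run in the same loop that computes the deltas, so no extra pass over the data is required.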



Conclusion : An understanding among IoT device makers, ideally through a consortium, will define the protocols for time series compression algorithms, which will hugely benefit all involved parties.

Vivek Singh is the chief contributor to the Open Source Data Quality and Data Preparation tool.

Why is Data Quality so difficult to solve?

Way back in 2006, when I started coding for the world’s first open source data quality project (osDQ), data quality issues were prevalent. Years later, businesses have matured, computing power has increased manyfold, storage has become cheaper and algorithms have improved. Still, data quality issues are as prevalent, if not more. That calls for a serious understanding of data quality issues: how they originate, how they propagate and, more importantly, how they can be solved:

1.) Technical Solution: You will be completely off the mark if you try to solve data quality problems using brute computing force and advanced algorithms alone. Issues like fuzzy matches, record linking and golden data are best solved using technology, but like viruses, data quality issues mutate and keep coming back in different forms. You will only ever be in reactive mode and never be free of the virus. As and when a new one comes, you will desperately look for a cure.

2.) Process based solution: Setting up a data governance framework, enforcing data policies, modeling business entities, and having stewards and an office of the chief data officer certainly help you reduce data quality issues. Having ISO certification for “data in motion” also helps an organisation to a large extent. Even then, the most optimistic data practitioner will not certify you “free of data quality issues”.

3.) Enterprise solution : You broke the “data silos”, brought the data to the lake, did metadata categorization, created a semantic layer and defined an ontology. A commendable job indeed. But can you say that you are free from the data quality virus and that it is not going to come back?

All of the above approaches are right in their own way, and each solves a subset of data quality issues. But they are reactive and not standardized. Let’s take a typical (imaginary) high tech goods workflow.

Designed in the USA, manufactured in China, curated and tested in India, assembled and packaged in the USA, sold in the UK. You can see the relevant data move across boundaries, languages, enterprises and governments. A company which does the testing in India has no influence on the data the chip produces (it probably does not even know who the manufacturer is), and it can’t loop back to the manufacturer. A change in data format by the chip manufacturer will break all quality testing. An enterprise can enforce processes within its premises, but in the global world there are no takers.

Data Quality problems are so difficult to solve because they are global, temporal, mutable and non-standard, and they span multiple agencies and countries.

The good news is that sincere steps are being taken in the right direction, which will solve data quality issues in the long run.

Open data Initiative : Governmental and semi-governmental departments are making their data publicly available. It will enhance standards adoption and technology based solutions.

Cross-Pollination of data : In the above example, let’s assume the manufacturing company shares its data with the testing companies. It will help build the full data footprint of the chips and will also decrease data glitches between companies.

Data Monetization: Once organizations start putting up their data for sale or 3rd party consumption, the quality of internal and external data will improve. Metadata and data types will be publicly available, and the data will go through many eyes.

Next Generation BI expectation

Let me start this topic by drawing a parallel from the search domain. The WWW has lots of information, and search is a way to get the information you are looking for. Similarly, a company has a multitude of information, stored in structured and unstructured form, and business intelligence tools extract the data for you. If you have followed the evolution of search: first, Yahoo search was very structured and gave information inside categories (metadata driven); then later search engines allowed you to write natural sentences for search; and then Google optimized it by improving indexing and relevancy.

Business Intelligence companies are following the same pattern. Traditional BI tools are very structured: warehouse, cube, pivot. You can only look at data that is inside the mart, and can navigate it only in a very structured way, like roll up, drill down, record linking and dimension navigation. The next generation of BI tools use big data technology to bring in large volumes of data and also provide a semantic layer to give a “Google search” like interface. Some companies call it a “smart machine”. Next generation BI tools will have:

1.) Elastic Search and Spark / Big data technology: Scalability, machine learning, fuzziness, connectors, statistical prediction and classification will be taken for granted. Open source components embedded inside the tools will make these features a commodity. They will no longer be differentiators.

2.) Collaborative, informative and engaging reports : Today’s dull reports will become more collaborative. Think about looking at a sales report where the report also embeds a video of the CEO making a sales prediction, and you also get your competitors' public information and relevant 3rd party information. A report will transform into an information portal which will be more engaging and social.

3.) Metadata Consolidation : Focus will shift from data to metadata because data processing will be taken care of by the platform. Data and metadata from different systems will come to the data lake, which will use namespaces to tell the data apart. Business expertise will go into making entity resolution automatic and data modeling dynamic.

4.) Interpreting business rules : In today’s systems, we codify business rules, but the code is not reusable by business intelligence systems. Today it is very cumbersome and time intensive to re-interpret business rules. Next generation BI tools will extract business rules from CRM and transaction systems and validate the business rules against data. Business rule models will be more comprehensive and will not live in silos.

5.) Right Information : Certainly, machine learning and artificial intelligence are overrated. They will not solve your business problem, but they will find anomalies, outliers, abnormalities, clusters, good data, bad data etc. to help you decide better. They will not replace you but will help you.

6.) Reusing the existing data warehouse : A lot of money has already flowed into existing warehouses. New generation tools will provide a wrapper around the EDW to make it search friendly and integrate it with the data lake, using indexing, elastic search, multi-facet search etc.

7.) User experience : In today’s world dashboards are personalized, but there is not much freedom inside a dashboard. New BI tools will be responsive in the true sense, providing entity hopping, 360 degree views and changing dimension centricity on the fly. Dashboards will also be mapped to user stories.

8.) Trust of data : In spite of nice visualization, confidence in data is very low. BI tools are used to see the trend and the bigger picture, but the value of the data is taken as indicative only, not for operational purposes. Data governance and Data Quality will be a big push for next generation BI tools.

Disclaimer : Smart Machine is a term used by a next generation BI tool to describe its systems, which use advanced algorithms to provide the above mentioned features.

About Author : Vivek Kumar Singh is a Business Intelligence professional and manages an open source data quality project.

Dew Computing and Smart Machines

Smart Machine and Smart Insight are terms coined by DataRPM, a cognitive and self-service Business Intelligence company. They call it a smart machine because the machine learns the behavior of the customer using artificial intelligence and machine learning algorithms. Yes, it is a smart machine: a smart machine on the cloud.

Cloud computing has clusters of machines which store and compute petabytes of data. But as the name suggests, all the data has to come to the cloud for computing, which has its own pros and cons. With big data technology and virtualization, the cloud was the natural choice.

Then comes Fog Computing, a term I guess was coined by Cisco, where calculation and computing are done at the router, end point or last mile level. The argument was that fog is a layer far below the cloud and far thinner than the cloud. So routers, end points and the last mile work as fog computers and primarily have smart algorithms for distribution and load balancing of data. Fog is homogeneous too. As wifi is becoming ubiquitous, fog computing is a natural choice.

As IoT (Internet of Things) gains acceptance, billions of devices and sensors will be in the field talking to each other. They will need very real time smartness, and I call it Dew Computing.

“Like dew, it will be at ground zero, condensed and all over the place. Dew Computing will make each machine a smart machine to solve problems at IoT scale.”

Characteristics of Dew Computing :

a.) Smart software with a size in KBs : These smart machines will work with very low RAM, in the range of 512 KB. So the images of the software will be in the low KBs so that they can be loaded into the RAM of these probes, devices or sensors.

b.) Compression logic optimized for time series : Today’s big data compression logic is optimized for full data compression. Smart machines will not have the storage to hold a complete data set. They will have compression logic optimized for time series, where only the delta is stored and the algorithm knows how to rebuild the full data, if required.

c.) Self Regulatory : Smart machines will have their thresholds preset or will have algorithms to adjust them. If a value reaches a threshold, the smart machine will raise an exception itself and give the reason for the error. It will save time, as otherwise a data analyst has to find the outlier and then a mechanic has to find the reason for the failure.

I think future smart machines will look like the above: like dew spread over a field. That is Dew Computing.

Your thoughts !!

Author : Vivek Singh is a contributor to the world’s first open source data quality tool and data preparation tool.