How to create Data Lake

As Datalake is getting mainstream, obvious questions are – a.) how to make it & b.) will it work ?

The latter came mostly from the failed Business Intelligence project sponsors, who still feel the pain. ( Here is my blog about it – )

Then they are further question about time and resources. Do I need to throw away my EDW ( enterprise data warehouse) ? Is Hadoop (big data) must for Datalake ? What are the tools to create DataLake ? ?

In this blog, I am trying to answers these questions from a practitioner perspective.

DataLake is evolution not disruption : As storage is getting cheaper, data practitioners thought of co-locating raw data along with processed data so that data engineers and data scientists  have quick access to raw data. This shift changes “Extraction” to “Ingestion” where data is loaded “AS IS”. This shift also bring some changes to ETL tools  – see here ( )

But it does not require to throw away your existing ETL jobs and EDW. If designed carefully they can be reused in datalake. Also Hadoop is not must. Datalake can be created on RDBMS, noSQL, HDFS or any fileSystem. Hadoop/HDFS being cheapest, is the preferred choice but if you have unlimited corporate licences for any of the above, you can use it for dataLake.

How to build it : Divide you datalake into 4 logical and physical spaces.


i) Raw Space : ( It is combination of Extraction and Staging of traditional warehouse with state information)

a.) load data “as is”

b.) define folder structure for different sources

c.) Meta Information : Time of load, load volume, load time, load status

d.) Useful for Auditing and Data Lineage

ii) Qualified Space :( It is combination of Transformation and Joining of traditional warehouse with data dictionary)

a.) run data quality and Entity Resolution on “as is” data

b.) define folder structure for different partitions ( time based, region based, verticals based)

c.) Meta Information : Data Type, Expected values, manipulable,

d.) Useful for Insight and Prediction

iii) Data warehouse Space :( You can reuse existing EDW here)

a.) load recent data into EDW

b.) define reporting and discovery parameters

c.) Meta Information : Data Dictionary, Granularity and Latency

d.) Useful for operational reports and discovery

iv) Collaborative Space :( In this space you want to expose you data to 3rd party or fetch data from 3rd party )

a.) find out the data to be exposed

b.) define security and data format

c.) Meta Information : Data Dictionary, Aggregation level, data format

d.) Useful for data Monetization and Data Fusion

Conclusion : Above are the general guideline for creation of datalake. As every business is different, each datalake is different. Choice of tools, storage will vary according to your requirements. Though storage is cheap, processing of huge amount data will need lots of CPU and RAM – which is serious money. So be careful with volume of data. Though generally it is told; bring everything into dataLake – start small and see value before bringing everything.

Entity Resolution and Event Correlation – Datalake DQ

DQ ( Data Quality) historically started with missing values and then moved into address correction and data enrichment ( Geo Encoding, Standardization etc.) Data Quality tools have been successfully solved the traditional data quality problems; like the ones discussed above.

So far, DQ was single source and single domain. With the advent of data lake, DQ has to adopt to new strategy. Event Correlation and Entity Resolution are going to be crucial for data lake validation. DQ tools have to provide these 2 must features for data lake.

Entity Resolution : Data lake will hold data from multiple sources and domains. It would be critical to create right entities from the data set. Following will be prime components of Entity Resolution (ER)

a.) Fuzzy Join : we have so many joins ( inner, outer, left outer, semi , equi etc ) supported today but they match exact values. Dimension from multiple sources may not have exact match  ( like name or address). Fuzzy join will match values which are similar but may not match exactly – like John Smith and John Smithe

b.) Algorithm for picking dimension values : Datalake will contain data from multiple CRMs, domains. While matching dimensional values, there will conflict which one to pick – let say SalesForce has different address, Sales mart has different address, the data you bought have different address. The entity should have one master address. ER algorithm will pick the right value based on timeliness, validity of source, most common occurrence etc.

c.) Entity Classification: Once the Entity Unique id and master dimensions are identifies, next step involves classifying the entity using business rules. These entity may be outdated, inactive or have little relevance. Once entity is classified and tagged, it can be used for further analysis or can be put in historical datalake. An entity with missing critical dimensional value will be dumped in dirty datalake for further investigation.

Event Correlation:  Theoretically, event is also an entity but I am putting it different header because it is temporal in nature and the algorithms used for correlation events would be different.

a.) Range Bound Correlation : Hardly two correlated event will occur at same time. One event will fire another event which may lag in time or place or in both. Along with event identifier fields, range bound dimension will be used to correlate events. Business rules will decide the  width of boundary.

b.) Event aggregation : An event can fire many sub events and super events. All these events has to suppressed into one related event. Event Correlation (EC) algorithms will map all these events into related event and cause and bring into human readable format.

c.) Noise reduction:  Aggregated event may be a false event or noise. Business rules will decide will event should be carried forward ( assuming they have strong correlation with business )and which should be dropped. Events will also go through business classification to rank their importance.

Conclusion:  Datalake will bring new challenges to Data Quality which will go through transformation to solve new problems. DQ will move from :

i)Single Source   –> DataLake

ii)Structure Analysis –> Mapping Entity

iii) Operational —> Analysis

Next Generation BI expectation

Let me start this topic by drawing a parallel from search domain – WWW has lots of information and search is a way to get the information  you are looking for. Similarly, a company has multitude of informations, stored in structured and unstructured form, and business intelligence tools are extracting the data for you.If you have followed the search evolution – First Yahoo search was very structured; it used to give information inside categories ( Metadata driven ), then search engines like allowed you write natural sentences for search and then google optimized it when indexing and improving relevancy.

Business Intelligence companies are following the same pattern. Traditional BI tools are very structured – warehouse, cube, pivot. You can only look data that is inside the mart, and can navigate in very structured way – like roll up, drill down, record linking, dimension navigation. Next generation of BI tools are using big data technology to bring into large volume of data and also providing semantic layer to give a “google search” like interface. some companies call it “smart machine”. Next   generation BI tools will have :-

1.) Elastic Search and Spark / Big data technology: Scalability, Machine Learning, Fuzziness, Connectors, Statistical prediction, Classification will be for granted. Open sources embedded inside tool will make these features, commodity. They will be no more differentiator.

2.) Collaborative, Informative and engaging report : Today’s dull reports will become more collaborative.Think about looking a sales report, where report also embed a video where CEO making sales prediction, you also get your competitor public information, relevant 3rd party information. A report will transform into information portal which will be more engaging and social.

3.) Metadata Consolidation : Focus will shift to metadata from data because data processing will be taken care by platform. Data and metadata from different systems will come to data lake, which using namespace will decide and differentiate data. Business expertise will go into, making entity resolution automatic and data modeling dynamic.

4.) Interpreting business rules : In today’s system, we codify business rules but is not reusable for business intelligence systems. Today it a very cumbersome and time intensive to re-interpret business rules. Next generation BI tools, will extract business rules from CRM, transaction system and validate business rules against data. Business rules models will be more comprehensive and will not live in silos.

5.) Right Information : Certainly machine learning and artificial intelligence is overrated. They will not solve your business problem but certainly they will find out anomalies, outlier, abnormality, cluster, good data, bad data etc, to make you decide better. They will not replace you but will help you.

6.) Reusing existing Data warehouse : Lot of money has already flown into existing warehouse. New generation tools will provide wrapper around EDW to make it search friendly and integrate with datalake – using  indexing, elastic search, multi-facet search etc.

7.) User experience : In today’s world dashboard are personalized, but there is not much of freedom inside dashboard. New BI tools will be responsive in true sense, where entity hopping, 360 degree views, changing dimension centricity on the fly will be provided. Dashboards will also be mapped to User stories to

8.) Trust of data : In spite  of nice visualization, confidence in data is very low. BI tools are getting used to see the trend and bigger picture, but the value of data is taken only as indicative not for operation purpose. Data governance an Data Quality would a big push for next generation BI tools.

Disclaimer : Smart Machine is a term used by ( a next generation BI tool) to describe their systems which uses advance algorithms to do above mentioned features.

About Author : Vivek Kumar Singh is Business Intelligence professional and manages open source data quality project at

Data Lake:Boon or Bane for Data Scientist

One of leading arguments to support data lake is, it will help data scientists because a.) it will keep data in its most granular form and b.) it will source data from multiple systems so that data scientist can cross reference data for better insights.

Both arguments are valid but they are double-edge swords. In a bid to over achieve, data lake may turn into a cesspool, if not done properly. Let look into what data scientists need and what is provided by data lake.

1.) Where to drink from – Eventually all systems will dump data into data lake. As there will be no pre-defined schema, data would be dump at a place, time and format, that is convenient to system. Also, storage of data lake will be primary nosql, hadoop or some large distributed system, from getting data will be not easy.

On the other side, Data Scientists are conversant with tools like SAS, R, Matlab, Rapid Miner, Mahout etc and use scripting languages to analyse the data. They will find it They will find it difficult to coordinate with multitude of dumping systems to find out how, what, when , where and why they dump data.

Metadata repository, governance framework and data profiling and data preparation tools like osDQ should help Data Scientist to find out data, they are interested in.

2.) Is it healthy drink – Once the data is located, next logical question would be, is is Valid data ?

Since data lake does not enforce point to point validation like ETL and data warehouse, validating data would be a tricky job. By nature, data lake is all inclusive and would not differentiate between dirty and clean data, aged and recent data, secure and insecure data, complete or incomplete data. It will the job of data scientists to filter the data that is relevant for their models.

So Data scientists would have to work with data architects and use data preparation tool to do that. If there is lack in coordination, not only modelling will take long time , most probably of low confidence.

3.) Too much to drink from – In one of my post “WHY BI PROJECTS – AND THE ROLE OF DATA ARCHITECT” , I mentioned some Business Intelligence projects fails, they become over ambitious, and try to define all business rules in ETL layer. Same ambition can damage data scientists also. Too much data around will lead to complex models and over fitting of models. A complex model does not always result into better insight.

Moderation will be required.

Conclusion: Fluid storage and data lake is the future. As this technology matures, utility tools will be available to data scientist to do their job efficiently and accurately.

Data Lake and E.T.L – Big Data Plugins

A lot has been written about structure and constituents of Data Lake and what Data lake should contain. Equal volume of artifacts are available on shortcomings of Data Lake and what makes it toxic. In this blog, I am not discussing positives and negatives of Data Lake but How Data Lake initiative will change the ETL ( Extraction, Transformation and Load) workflows and may be the acronym ETL itself.

1.)  There is no T (transformation) between E (extraction) and L (load) : Data Lake stores data in most granular form and sometime in raw form itself. There is minimum transformation required in-between. There is combination of Extraction and Loading which is called Ingestion. Data in its rawest and most granular form, ingested into Lake.

So, the Data Lake architects are looking for tools like Storm, Flume, Chukwa, Sqoop, Message Queues, Zoo keeper, kafka etc which can do massive write operations with minimum transformations in-between. Traditional ETL tools have a strong transformation layer which is not suited for ingestion job.

2.) There are different Transformations : Traditional ETL tools’ transformations are primary focused on Data warehouse transformations like Slow Changing Dimensions, Year/Month till Date, Aggregate Awareness, Maps etc , Data Lake will focus on following transformations.

i) Latency Normalization : Multitude of Sources will dump data into lake which will have different latency – some real time, some delayed by minutes , some by hours and others even by days. One of the important transformations, needed by Data Lake would be “Normalize the Latency factor” so that records of different latency can be collaborated.

ii) Format and Semantic Transformer : Records from different sources will have different formats and nomenclature. Internationalization and Localization will add more complexity to format and nomenclature. Data lake will need strong transformations for formatting and semantic layer mapping.

iii) Structural Analysis : Data is dumped into Lake, without much consideration. Data might not be well formed and validated. Null, Outlier, Noise, DeDup, and Some Cleaning is required to make data usable. Data Quality will be integral part of transformation rather than an afterthought.

iv) Entity Resolution and Record Linking : Records, dumped into Data Lake are not atomic. A lot more focus will be on advance entity resolution and record linking transformations. Simple Map type of transformation will not sufficient. Unstructured data will need text based mining and ontology to resolve and define an Entity. Record linking will not be based on simple Primary Key and Foreign Key relationship, rather business rules will decide how to link records.

v) Meta Data Tagging: Data lake is fluid storage. Unlike pre-built schema definition of Data mart or Data warehouse, Data Lake will be a black for Information Retrieves. Meta data tagging is required to tell what is available and where it is available. Meta data tagging will make or break Data Lake initiatives.

3.) Workflow is different : Extraction , Transformation , Load (E.T.L) will be changing to Ingestion, Quality, Munging , Publishing (I.Q.M.P). New set of tools will focus to do more processing on cluster rather in-memory or inside tool’s container.

 4.) Technology is different : I will write more about it in next blog as it a big area of interest for Data Architects. Data marts and warehouses where hosted mostly on RDBMS while fluid design of Data lake makes it more suited for Big Data technologies like HDFS, Cassandra, Hbase etc. Also lot of unstructured and semi structured data  which will hosted which will be indexed and searchable. A search tool (like Lucene, Elastic Search etc) will become a must for Data Lake initiatives.

Summary: As Big Data and Data lake is getting good traction, ETL tools will change. They might provide plugins, where the bundle and enrich transformations which are suited for Data Lake and optimized for  big data processing.

About Blogger: Vivek Singh is an open source evangelist, chief commiter of Open Source Data Quality (osDQ) and Data Architect.

Disruptive ETL – Next Frontier of ETL

Both proprietary and open source ETL tools have been there for decades and have been doing fine. Some of data integration projects were successful while other failed. More than tools, I would blame data architects and company culture for failure – as some of them have been very secretive about data while other did not want to share. ETL tools were doing what they were supposed to do – Extraction from data sources, Transformation into staging area and Load into warehouse schema

Now the million dollar question would be – what next ? Technology is changing, storage is cheap, massive parallel processing is possible, reporting is becoming schema less and search based – so what will be future of ETL ? To answer this question, first we need to analyse the ETL features today as most of so called innovations are either extension of new features or fulfill new requirements. Contemporary ETLs tools are focused on :

1.) Extraction from many data source in batch mode – i.e full load, incremental load, redistributed load etc.
2.) Very heavy on Transformation – i.e Year till date (YTD), Slow changing dimensions, mapping, aggregates etc.
3.) Load into warehouse – i.e star schema, fact tables, dimension tables, aggregate table etc.

So what to going to change in next couple of year. Let proceed in reverse order to start with downstream consumers

Load into warehouse : With the advent of data lake concept as against datawarehouse , Hadoop and nosql as storage as against to RDBMS, and schema-less reporting against cubes and dimensional modelling ,this is certainly going to change. Data architects certainly will not want to silos their data into pre-built schema and want to give more flexibility to end users.
Data scientists do not like aggregation because granularity and lots of information is lost. They hate taking feed from data marts or data warehouse.

Coming days, ETL will focus loading data into datalake kind of system which is metadata and tag driven and will focus less on pre-built schema and aggregation load.

Very heavy on Transformation : This is bread and butter of contemporary ETL tools. They do all kind of transformations but going forward probably all they is not needed. Lot more transformation and formatting will be done by reporting layer and reporting layer will also be built to process massive and big data, hence the aggregation transformation will be redundant.

Coming days, ETL tools will be focusing more on data quality and data munging.

Extraction from many data source in batch mode: I do not see many changes there as data sources keep adding and we need to extract data from there. ETL tools will add new adapters to take real time feeds and data stream. There are tools which already have build adapters and working on it.

I am sure ETL tool will reinvent themselves and adapt to new changes.

ETL will become EQMP Extraction, Quality, Munging and Publishing

Vivek Singh is data architect, Evangelist and main contributor of osDQ – Open Source Data Quality and Profiling