Data Lake and E.T.L – Big Data Plugins

A lot has been written about the structure and constituents of a Data Lake and what it should contain. An equal volume of material is available on the shortcomings of the Data Lake and what makes it toxic. In this blog, I am not discussing the positives and negatives of the Data Lake, but how the Data Lake initiative will change ETL (Extraction, Transformation and Load) workflows, and maybe the acronym ETL itself.

1.) There is no T (transformation) between E (extraction) and L (load): A Data Lake stores data in its most granular form, and sometimes in raw form itself. Minimal transformation is required in between. Extraction and Loading combine into a single step called Ingestion: data in its rawest and most granular form is ingested into the Lake.

So Data Lake architects are looking for tools like Storm, Flume, Chukwa, Sqoop, message queues, ZooKeeper, Kafka etc., which can do massive write operations with minimal transformation in between. Traditional ETL tools have a strong transformation layer, which is not suited to the ingestion job.
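
To make the ingestion idea concrete, here is a minimal Python sketch that is not tied to any of the tools named above. It appends each record to the lake exactly as received, wrapped only in a small envelope that notes where and when it arrived. The function name `ingest`, the envelope fields and the one-file-per-source layout are assumptions made for illustration.

```python
import json
import time
from pathlib import Path


def ingest(records, source, lake_dir):
    """Append raw records to a per-source JSON-lines file in the lake.

    The payload is stored untouched; only an envelope (source name and
    ingestion time) is added. No parsing, casting or filtering is done.
    """
    target = Path(lake_dir) / f"{source}.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with target.open("a", encoding="utf-8") as out:
        for raw in records:
            envelope = {
                "source": source,
                "ingested_at": time.time(),
                "payload": raw,  # kept exactly as received
            }
            out.write(json.dumps(envelope) + "\n")
            count += 1
    return target, count
```

Calling `ingest(rows, "crm", "/tmp/lake")` would append the rows to `/tmp/lake/crm.jsonl`. Any cleaning or reshaping is deliberately left to later stages.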

2.) There are different Transformations: Traditional ETL tools’ transformations are primarily focused on Data warehouse transformations like Slowly Changing Dimensions, Year/Month to Date, Aggregate Awareness, Maps etc. The Data Lake will focus on the following transformations.

i) Latency Normalization: A multitude of sources will dump data into the lake with different latencies: some in real time, some delayed by minutes, some by hours and others even by days. One of the important transformations needed by the Data Lake would be to “normalize the latency factor” so that records of different latency can be combined.

ii) Format and Semantic Transformer: Records from different sources will have different formats and nomenclature. Internationalization and Localization will add more complexity to both. The Data Lake will need strong transformations for formatting and semantic layer mapping.

iii) Structural Analysis: Data is dumped into the Lake without much consideration. It might not be well formed or validated. Null handling, outlier and noise detection, de-duplication and some cleaning are required to make the data usable. Data Quality will be an integral part of transformation rather than an afterthought.

iv) Entity Resolution and Record Linking: Records dumped into the Data Lake are not atomic. A lot more focus will be on advanced entity resolution and record linking transformations. A simple Map type of transformation will not be sufficient. Unstructured data will need text-based mining and ontologies to resolve and define an Entity. Record linking will not be based on simple Primary Key and Foreign Key relationships; rather, business rules will decide how to link records.

v) Meta Data Tagging: A Data Lake is fluid storage. Unlike the pre-built schema definition of a Data mart or Data warehouse, the Data Lake will be a black box for information retrieval. Metadata tagging is required to tell what is available and where it is available. Metadata tagging will make or break Data Lake initiatives.
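
Two of the transformations listed above, latency normalization and rule-based record linking, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are dicts, `event_time` is a timezone-aware `datetime`, and the linking rule (same postcode plus a similar name) is a hypothetical business rule, not one prescribed here. All function and field names are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from difflib import SequenceMatcher


def bucket_by_event_time(records, window=timedelta(hours=1)):
    """Group records by the window their event_time falls in,
    regardless of how late each source delivered them."""
    size = int(window.total_seconds())
    buckets = defaultdict(list)
    for rec in records:
        ts = int(rec["event_time"].timestamp())
        start = datetime.fromtimestamp(ts - ts % size, tz=timezone.utc)
        buckets[start].append(rec)
    return dict(buckets)


def complete_windows(buckets, now, window, max_lateness):
    """Keep only windows old enough that the slowest source
    (assumed to lag by at most max_lateness) should have reported."""
    return {
        start: recs
        for start, recs in buckets.items()
        if start + window + max_lateness <= now
    }


def _norm(text):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def same_customer(a, b, threshold=0.85):
    """Hypothetical linking rule: identical postcode and similar name.
    No primary key or foreign key is involved."""
    if _norm(a["postcode"]).replace(" ", "") != _norm(b["postcode"]).replace(" ", ""):
        return False
    ratio = SequenceMatcher(None, _norm(a["name"]), _norm(b["name"])).ratio()
    return ratio >= threshold
```

With a one-hour window and a one-hour `max_lateness`, a record that happened at 10:05 but arrived in a later batch still lands in the 10:00 bucket, and that bucket is released for processing only after 12:00. The similarity threshold of 0.85 is arbitrary and would need tuning against real data.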

3.) Workflow is different: Extraction, Transformation, Load (E.T.L) will change to Ingestion, Quality, Munging, Publishing (I.Q.M.P). A new set of tools will focus on doing more processing on the cluster rather than in memory or inside the tool’s container.
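
The order of the I.Q.M.P stages can be sketched as a chain of plain Python functions. This is a toy that runs in memory purely to show the sequence; a real tool would push each stage down to the cluster. The stage bodies, the `field_map` argument and the tag list are assumptions chosen for illustration, and records are assumed to be flat dicts.

```python
def ingest(sources):
    """Ingestion: take records from every source as-is, noting their origin."""
    return [
        {"source": name, "payload": dict(rec)}
        for name, recs in sources.items()
        for rec in recs
    ]


def quality(records):
    """Quality: drop records with missing values and exact duplicates."""
    seen, clean = set(), []
    for rec in records:
        payload = rec["payload"]
        if any(v is None or v == "" for v in payload.values()):
            continue
        key = (rec["source"], tuple(sorted(payload.items())))
        if key in seen:
            continue
        seen.add(key)
        clean.append(rec)
    return clean


def munge(records, field_map):
    """Munging: map source-specific field names onto common ones."""
    out = []
    for rec in records:
        row = {field_map.get(k, k): v for k, v in rec["payload"].items()}
        row["source"] = rec["source"]
        out.append(row)
    return out


def publish(records, tags):
    """Publishing: expose the result together with descriptive tags."""
    return {"tags": sorted(tags), "count": len(records), "records": records}


def run_iqmp(sources, field_map, tags):
    """Run the four stages in I.Q.M.P order."""
    return publish(munge(quality(ingest(sources)), field_map), tags)
```

Note that quality checks run before munging here, following the order of the acronym, so they operate on source-specific field names.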

4.) Technology is different: I will write more about this in the next blog, as it is a big area of interest for Data Architects. Data marts and warehouses were hosted mostly on RDBMS, while the fluid design of the Data Lake makes it better suited to Big Data technologies like HDFS, Cassandra, HBase etc. A lot of unstructured and semi-structured data will also be hosted, and it will need to be indexed and searchable. A search tool (like Lucene, Elasticsearch etc.) will become a must for Data Lake initiatives.

Summary: As Big Data and the Data Lake are gaining good traction, ETL tools will change. They might provide plugins that bundle and enrich transformations suited to the Data Lake and optimized for big data processing.

About Blogger: Vivek Singh is an open source evangelist, chief committer of Open Source Data Quality (osDQ) and Data Architect.


Disruptive ETL – Next Frontier of ETL

Both proprietary and open source ETL tools have been around for decades and have been doing fine. Some data integration projects were successful while others failed. More than the tools, I would blame data architects and company culture for the failures, as some of them were very secretive about data while others did not want to share. The ETL tools were doing what they were supposed to do: Extraction from data sources, Transformation in a staging area and Load into a warehouse schema.

Now the million dollar question is: what next? Technology is changing, storage is cheap, massively parallel processing is possible, and reporting is becoming schema-less and search based. So what will be the future of ETL? To answer this question, we first need to analyse the features of ETL today, as most so-called innovations either extend existing features or fulfill new requirements. Contemporary ETL tools are focused on:

1.) Extraction from many data sources in batch mode, e.g. full load, incremental load, redistributed load etc.
2.) Very heavy on Transformation, e.g. Year to Date (YTD), Slowly Changing Dimensions, mapping, aggregates etc.
3.) Load into warehouse, e.g. star schema, fact tables, dimension tables, aggregate tables etc.

So what is going to change in the next couple of years? Let us proceed in reverse order, starting with the downstream consumers.

Load into warehouse: With the advent of the data lake concept as against the data warehouse, Hadoop and NoSQL storage as against RDBMS, and schema-less reporting as against cubes and dimensional modelling, this is certainly going to change. Data architects will not want to silo their data into pre-built schemas and will want to give more flexibility to end users.
Data scientists do not like aggregation, because granularity and a lot of information are lost. They hate taking feeds from data marts or data warehouses.

In the coming days, ETL will focus on loading data into data-lake-like systems that are metadata and tag driven, and will focus less on pre-built schemas and aggregation loads.

Very heavy on Transformation: This is the bread and butter of contemporary ETL tools. They do all kinds of transformations, but going forward probably not all of them will be needed. A lot more transformation and formatting will be done by the reporting layer, which will also be built to process massive data; hence the aggregation transformation will be redundant.

In the coming days, ETL tools will focus more on data quality and data munging.

Extraction from many data sources in batch mode: I do not see many changes here, as data sources keep being added and we will still need to extract data from them. ETL tools will add new adapters to take real-time feeds and data streams. Some tools have already built such adapters and continue to work on them.

I am sure ETL tools will reinvent themselves and adapt to the new changes.

ETL will become EQMP: Extraction, Quality, Munging and Publishing.

Vivek Singh is a data architect, evangelist and main contributor of osDQ – Open Source Data Quality and Profiling.