Data Lake and E.T.L – Big Data Plugins

A lot has been written about the structure and constituents of a Data Lake and what a Data Lake should contain. An equal volume of material is available on the shortcomings of the Data Lake and what makes it toxic. In this blog, I am not discussing the positives and negatives of the Data Lake, but how the Data Lake initiative will change ETL (Extraction, Transformation and Load) workflows, and maybe the acronym ETL itself.

1.) There is no T (transformation) between E (extraction) and L (load): A Data Lake stores data in its most granular form, and sometimes in raw form itself, so minimal transformation is required in between. The combination of Extraction and Loading is called Ingestion: data in its rawest, most granular form is ingested into the Lake.

So, Data Lake architects are looking at tools like Storm, Flume, Chukwa, Sqoop, message queues, ZooKeeper, Kafka, etc., which can do massive write operations with minimal transformation in between. Traditional ETL tools have a strong transformation layer, which is not suited for the ingestion job.
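
To make the point concrete, here is a minimal ingestion sketch in Python using the kafka-python client: raw records are pushed to the lake's landing topic exactly as extracted, with no transformation step in between. The broker address and the topic name (raw_events) are hypothetical.

```python
# Minimal ingestion sketch: extract raw records and load them into a
# Kafka landing topic unchanged; no T between E and L.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",            # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ingest(record):
    """Push the record into the lake's landing topic as-is."""
    producer.send("raw_events", value=record)      # hypothetical topic

ingest({"source": "crm", "payload": "raw row, untouched"})
producer.flush()
```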

2.) There are different Transformations: Traditional ETL tools' transformations are primarily focused on Data Warehouse needs like Slowly Changing Dimensions, Year/Month-to-Date, Aggregate Awareness, Maps, etc. A Data Lake will focus on the following transformations.

i) Latency Normalization: A multitude of sources will dump data into the lake with different latencies – some real time, some delayed by minutes, some by hours, and others even by days. One of the important transformations needed by a Data Lake would be to “normalize the latency factor” so that records of different latencies can be correlated.
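
A sketch of what such a normalization might look like: each record is stamped with both its source event time and its lake arrival time, so downstream jobs can correlate feeds on event time regardless of how late each feed arrives. The field names (event_ts, arrival_ts, latency_s) are illustrative.

```python
from datetime import datetime, timezone

def normalize_latency(record, event_ts):
    """Stamp a record with event time, arrival time, and observed lag."""
    arrival_ts = datetime.now(timezone.utc)
    record["event_ts"] = event_ts.isoformat()      # when it happened at source
    record["arrival_ts"] = arrival_ts.isoformat()  # when the lake received it
    record["latency_s"] = (arrival_ts - event_ts).total_seconds()
    return record
```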

ii) Format and Semantic Transformer: Records from different sources will have different formats and nomenclature, and internationalization and localization will add more complexity to both. The Data Lake will need strong transformations for formatting and semantic-layer mapping.
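
One plausible shape for such a transformer, sketched below: a per-source field map rewrites heterogeneous records into one canonical vocabulary. The source names and field mappings are made up for illustration.

```python
# Per-source field maps translate each feed's nomenclature into the
# lake's canonical semantic layer. Unmapped fields pass through as-is.
FIELD_MAPS = {
    "crm":     {"cust_nm": "customer_name", "dob": "birth_date"},
    "billing": {"CustomerName": "customer_name", "DateOfBirth": "birth_date"},
}

def to_canonical(source, record):
    mapping = FIELD_MAPS[source]
    return {mapping.get(key, key): value for key, value in record.items()}

to_canonical("crm", {"cust_nm": "Ada", "dob": "1990-01-01"})
# -> {'customer_name': 'Ada', 'birth_date': '1990-01-01'}
```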

iii) Structural Analysis: Data is dumped into the Lake without much consideration, so it may not be well formed or validated. Null handling, outlier detection, noise removal, de-duplication, and some general cleaning are required to make the data usable. Data Quality will be an integral part of transformation rather than an afterthought.
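
A toy version of that quality pass, assuming records keyed on an id field with a numeric amount field (both illustrative): drop null keys, de-duplicate, and flag out-of-range values.

```python
def clean(records):
    """Null check, de-dup, and crude outlier flagging in one pass."""
    seen, usable = set(), []
    for record in records:
        key = record.get("id")
        if key is None or key in seen:   # drop null keys and duplicates
            continue
        seen.add(key)
        amount = record.get("amount", 0)
        record["outlier"] = not (0 <= amount < 1_000_000)  # crude range rule
        usable.append(record)
    return usable
```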

iv) Entity Resolution and Record Linking: Records dumped into the Data Lake are not atomic, so much more focus will fall on advanced entity-resolution and record-linking transformations. A simple map-type transformation will not be sufficient. Unstructured data will need text-based mining and ontologies to resolve and define an entity, and record linking will not be based on simple primary-key/foreign-key relationships; rather, business rules will decide how to link records.
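
As a sketch of what a business-rule link might look like, the function below scores candidate pairs on fuzzy name similarity plus an exact birth-date match instead of a PK/FK join. difflib is in the Python standard library; the 0.85 cutoff and the field names are illustrative assumptions.

```python
from difflib import SequenceMatcher

def same_entity(a, b):
    """Business rule: fuzzy name match AND exact birth date."""
    name_score = SequenceMatcher(
        None, a["customer_name"].lower(), b["customer_name"].lower()
    ).ratio()
    return name_score > 0.85 and a["birth_date"] == b["birth_date"]

same_entity(
    {"customer_name": "Jon Smith",  "birth_date": "1980-05-01"},
    {"customer_name": "John Smith", "birth_date": "1980-05-01"},
)  # -> True
```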

v) Metadata Tagging: A Data Lake is fluid storage. Unlike the pre-built schema definition of a Data Mart or Data Warehouse, the Data Lake will be a black box for information retrievers. Metadata tagging is required to tell what is available and where it is available; it will make or break Data Lake initiatives.
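
A minimal sketch of tagging at ingest time: every dataset landed in the lake gets a catalog entry recording what it is and where it lives. The catalog is an in-memory dict here purely for illustration; a real lake would persist it in a metadata store.

```python
catalog = {}

def tag(path, source, fmt, tags):
    """Record what landed where, so retrievers can find it later."""
    catalog[path] = {"source": source, "format": fmt, "tags": tags}

tag("/lake/raw/crm/2015-06-01", "crm", "json", ["customer", "pii", "daily"])
```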

3.) Workflow is different: Extraction, Transformation, Load (E.T.L) will change to Ingestion, Quality, Munging, Publishing (I.Q.M.P). The new set of tools will focus on doing more processing on the cluster rather than in memory or inside the tool's container.
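
As a shape, the new workflow might compose like the sketch below; the stage bodies are placeholders, and the point is simply the I.Q.M.P order.

```python
def ingest(raw):    return list(raw)                     # I: land data as-is
def quality(rows):  return [r for r in rows if r]        # Q: drop bad rows
def munge(rows):    return [{"value": r} for r in rows]  # M: reshape and link
def publish(rows):  return rows                          # P: expose to consumers

def iqmp(raw):
    return publish(munge(quality(ingest(raw))))

iqmp(["a", None, "b"])   # -> [{'value': 'a'}, {'value': 'b'}]
```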

4.) Technology is different: I will write more about this in the next blog, as it is a big area of interest for Data Architects. Data Marts and Warehouses were hosted mostly on RDBMS, while the fluid design of the Data Lake makes it better suited to Big Data technologies like HDFS, Cassandra, HBase, etc. A lot of unstructured and semi-structured data will also be hosted, and it will need to be indexed and searchable. A search tool (like Lucene, Elasticsearch, etc.) will become a must for Data Lake initiatives.
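
For instance, making lake metadata searchable could look like the sketch below, using the Elasticsearch Python client against a node at localhost:9200. The index name and fields are illustrative, and exact call signatures vary across client versions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical node

# Index a catalog entry so information retrievers can search the lake.
es.index(index="lake-catalog", document={
    "path": "/lake/raw/crm/2015-06-01",
    "source": "crm",
    "tags": ["customer", "pii", "daily"],
})

# Find every dataset tagged "customer".
hits = es.search(index="lake-catalog", query={"match": {"tags": "customer"}})
```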

Summary: As Big Data and the Data Lake gain traction, ETL tools will change. They might provide plugins that bundle and enrich transformations suited to the Data Lake and optimized for big data processing.

About Blogger: Vivek Singh is an open source evangelist, chief committer of Open Source Data Quality (osDQ), and a Data Architect.
