Disruptive ETL – Next Frontier of ETL

Both proprietary and open source ETL tools have been there for decades and have been doing fine. Some of data integration projects were successful while other failed. More than tools, I would blame data architects and company culture for failure – as some of them have been very secretive about data while other did not want to share. ETL tools were doing what they were supposed to do – Extraction from data sources, Transformation into staging area and Load into warehouse schema

Now the million dollar question would be – what next ? Technology is changing, storage is cheap, massive parallel processing is possible, reporting is becoming schema less and search based – so what will be future of ETL ? To answer this question, first we need to analyse the ETL features today as most of so called innovations are either extension of new features or fulfill new requirements. Contemporary ETLs tools are focused on :

1.) Extraction from many data source in batch mode – i.e full load, incremental load, redistributed load etc.
2.) Very heavy on Transformation – i.e Year till date (YTD), Slow changing dimensions, mapping, aggregates etc.
3.) Load into warehouse – i.e star schema, fact tables, dimension tables, aggregate table etc.

So what to going to change in next couple of year. Let proceed in reverse order to start with downstream consumers

Load into warehouse : With the advent of data lake concept as against datawarehouse , Hadoop and nosql as storage as against to RDBMS, and schema-less reporting against cubes and dimensional modelling ,this is certainly going to change. Data architects certainly will not want to silos their data into pre-built schema and want to give more flexibility to end users.
Data scientists do not like aggregation because granularity and lots of information is lost. They hate taking feed from data marts or data warehouse.

Coming days, ETL will focus loading data into datalake kind of system which is metadata and tag driven and will focus less on pre-built schema and aggregation load.

Very heavy on Transformation : This is bread and butter of contemporary ETL tools. They do all kind of transformations but going forward probably all they is not needed. Lot more transformation and formatting will be done by reporting layer and reporting layer will also be built to process massive and big data, hence the aggregation transformation will be redundant.

Coming days, ETL tools will be focusing more on data quality and data munging.

Extraction from many data source in batch mode: I do not see many changes there as data sources keep adding and we need to extract data from there. ETL tools will add new adapters to take real time feeds and data stream. There are tools which already have build adapters and working on it.

I am sure ETL tool will reinvent themselves and adapt to new changes.

ETL will become EQMP Extraction, Quality, Munging and Publishing

Vivek Singh is data architect, Evangelist and main contributor of osDQ – Open Source Data Quality and Profiling


One thought on “Disruptive ETL – Next Frontier of ETL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s