As data lakes go mainstream, the obvious questions are – a.) how do you build one & b.) will it work ?
The latter comes mostly from sponsors of failed Business Intelligence projects, who still feel the pain. ( Here is my blog about it – https://viveksingh36.wordpress.com/2014/11/12/why-bi-projects-fail-and-the-role-of-data-architect/ )
Then there are further questions about time and resources. Do I need to throw away my EDW ( enterprise data warehouse) ? Is Hadoop (big data) a must for a data lake ? What are the tools to create a data lake ?
In this blog, I try to answer these questions from a practitioner's perspective.
A data lake is evolution, not disruption : As storage gets cheaper, data practitioners thought of co-locating raw data with processed data so that data engineers and data scientists have quick access to the raw data. This shift changes “Extraction” to “Ingestion”, where data is loaded “as is”. It also brings some changes to ETL tools – see here ( https://viveksingh36.wordpress.com/2015/01/08/data-lake-and-e-t-l-big-data-plugins/ )
But it does not require throwing away your existing ETL jobs and EDW; if designed carefully, they can be reused in the data lake. Hadoop is not a must either. A data lake can be created on an RDBMS, NoSQL, HDFS or any file system. Hadoop/HDFS, being the cheapest, is the preferred choice, but if you have unlimited corporate licences for any of the above, you can use that for your data lake.
How to build it : Divide your data lake into 4 logical and physical spaces.
i) Raw Space : ( a combination of the Extraction and Staging areas of a traditional warehouse, plus state information)
a.) load data “as is”
b.) define folder structure for different sources
c.) Meta Information : Time of load, load volume, load time, load status
d.) Useful for Auditing and Data Lineage
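The raw-space steps above can be sketched in Python. This is a minimal illustration, not a prescribed implementation: the folder layout, the `ingest_as_is` function and the sidecar `.meta.json` convention are all assumptions for the example.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical raw-space layout: <root>/raw/<source>/<load-date>/<file>
RAW_ROOT = Path("/tmp/datalake/raw")

def ingest_as_is(source_name: str, src_file: Path) -> dict:
    """Copy a source file into the raw space unchanged and record load metadata."""
    load_time = datetime.now(timezone.utc)
    target_dir = RAW_ROOT / source_name / load_time.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    started = datetime.now(timezone.utc)
    shutil.copy2(src_file, target_dir / src_file.name)  # no transformation: "as is"
    finished = datetime.now(timezone.utc)

    meta = {
        "source": source_name,
        "file": src_file.name,
        "time_of_load": load_time.isoformat(),
        "load_volume_bytes": src_file.stat().st_size,
        "load_seconds": (finished - started).total_seconds(),
        "load_status": "SUCCESS",
    }
    # A sidecar metadata file supports auditing and data lineage later on
    (target_dir / f"{src_file.name}.meta.json").write_text(json.dumps(meta, indent=2))
    return meta
```

In a real pipeline the metadata would typically land in a catalog or metastore rather than sidecar files, but the principle is the same: keep the bytes untouched and capture the load facts alongside them.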
ii) Qualified Space : ( a combination of the Transformation and Joining steps of a traditional warehouse, plus a data dictionary)
a.) run data quality and Entity Resolution on “as is” data
b.) define folder structure for different partitions ( time based, region based, verticals based)
c.) Meta Information : data type, expected values, whether a field may be manipulated
d.) Useful for Insight and Prediction
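A toy sketch of the qualified-space steps: a data-quality check against an assumed data dictionary, and a deliberately naive entity resolution that merges records sharing a normalised email. Real entity resolution is far more involved (fuzzy matching, survivorship rules); the column names here are invented for the example.

```python
import re

# Assumed data dictionary for the example: expected columns of a customer feed
EXPECTED_COLUMNS = {"customer_id", "name", "email"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def quality_check(row: dict) -> list:
    """Return a list of data-quality violations for one "as is" row."""
    issues = []
    missing = EXPECTED_COLUMNS - row.keys()
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    if "email" in row and not EMAIL_RE.match(row.get("email", "")):
        issues.append("invalid email")
    return issues

def resolve_entities(rows: list) -> dict:
    """Naive entity resolution: merge rows that share a normalised email key."""
    entities = {}
    for row in rows:
        key = row.get("email", "").strip().lower()
        entities.setdefault(key, {}).update(row)  # later loads win on conflicts
    return entities
```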
iii) Data Warehouse Space : ( you can reuse your existing EDW here)
a.) load recent data into EDW
b.) define reporting and discovery parameters
c.) Meta Information : Data Dictionary, Granularity and Latency
d.) Useful for operational reports and discovery
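"Load recent data into the EDW" can be sketched as a recency filter in front of the warehouse insert. Here `sqlite3` stands in for whatever warehouse you already license, and the `fact_sales` table, column names and 30-day window are assumptions for the example.

```python
import sqlite3
from datetime import date, timedelta

def load_recent(conn: sqlite3.Connection, rows: list, days: int = 30) -> int:
    """Insert only rows whose event_date falls within the recency window.

    Keeps the EDW lean for operational reporting; older history stays
    in the qualified space of the data lake.
    """
    cutoff = (date.today() - timedelta(days=days)).isoformat()
    recent = [r for r in rows if r["event_date"] >= cutoff]
    conn.executemany(
        "INSERT INTO fact_sales (event_date, amount) VALUES (:event_date, :amount)",
        recent,
    )
    conn.commit()
    return len(recent)
```

The granularity and latency parameters from the meta information decide how wide this window should be and how often the load runs.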
iv) Collaborative Space : ( in this space you expose your data to third parties, or fetch data from them )
a.) identify the data to be exposed
b.) define security and data format
c.) Meta Information : Data Dictionary, Aggregation level, data format
d.) Useful for data Monetization and Data Fusion
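One way to picture the collaborative space: aggregate to a coarse grain and strip everything not cleared for sharing before it leaves your boundary. The field names and the CSV format below are assumptions; the point is that security (what may leave) and data format (how it leaves) are decided here, not by the consumer.

```python
import csv
import io
from collections import defaultdict

def export_for_partner(rows: list) -> str:
    """Aggregate sales by region and emit partner-safe CSV (no customer PII)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += float(row["amount"])  # customer_id never leaves
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["region", "total_amount"])
    writer.writeheader()
    for region, total in sorted(totals.items()):
        writer.writerow({"region": region, "total_amount": total})
    return buf.getvalue()
```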
Conclusion : The above are general guidelines for creating a data lake. As every business is different, every data lake is different; the choice of tools and storage will vary with your requirements. Though storage is cheap, processing huge amounts of data needs lots of CPU and RAM – which is serious money. So be careful with the volume of data. Though the general advice is to bring everything into the data lake, start small and see value before bringing everything in.