I am often asked, "What is the difference between a data warehouse and a data lake?" Since I have worked on both, let me try to answer. Both are built with the business in mind, so variation in approach is common; I am listing generic differences here.
Storage: An EDW is commonly stored in an RDBMS, in either a star schema or a snowflake schema. The schema design is not fluid and is often called "early binding" or "schema on write". A data lake is primarily stored on Hadoop or cloud storage (S3 etc.), or in HBase/Hive if the data is structured. However, it depends entirely on the business what to use for storage. Typically, storing 1 TB in an RDBMS costs around 10-15K USD, while in a data lake it should be around 2-3K USD.
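The "schema on write" versus "schema on read" contrast can be sketched in a few lines. This is a toy illustration, not a real warehouse or lake: the warehouse side uses an in-memory SQLite table, and the lake side uses raw JSON strings standing in for files.

```python
import json
import sqlite3

# Schema on write (EDW style): the table structure must exist before loading,
# and rows that do not fit the declared schema are rejected at load time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product TEXT NOT NULL,
        amount  REAL NOT NULL
    )
""")
conn.execute("INSERT INTO fact_sales VALUES (1, 'widget', 9.99)")

# Schema on read (data lake style): raw records land as-is; structure is
# applied only when a consumer reads them, so a record with a missing field
# is accepted at write time and handled at read time.
raw_records = [
    '{"sale_id": 1, "product": "widget", "amount": 9.99}',
    '{"sale_id": 2, "product": "gadget"}',  # missing "amount" is fine to store
]
parsed = [json.loads(line) for line in raw_records]
amount_total = sum(r.get("amount", 0.0) for r in parsed)
```

The trade-off follows directly: the warehouse guarantees shape up front at the cost of rigidity, while the lake defers that cost to each consumer.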
Purpose: An EDW is created for reporting, so it is optimized for concepts like cubes/OLAP, hierarchies, roll-up, drill-down, aggregate awareness, fan-out, and dimensional navigation. A data lake works as storage for all departments of an enterprise. It is primarily for analytics, so data is kept in flat files, with sensitive data encrypted. Generally, the whole enterprise has access to it. It is the responsibility of downstream processes to optimize the data (data quality, data preparation, joining, dimension reduction etc.) for their needs.
Ingestion and Retrieval: Data is ingested into an EDW by ETL jobs, mostly in batch mode. A staging area handles cleaning, transformation and aggregation of the data. Retrieval is through reporting tools or SQL.
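The staging step described above can be sketched as a minimal batch ETL pass. The data and field names here are invented for illustration: malformed rows are filtered out in staging, and clean rows are aggregated before loading.

```python
# Extract: raw rows as they arrive from a source system (toy data).
raw = [
    {"region": " East ", "sales": "100"},
    {"region": "EAST",   "sales": "50"},
    {"region": "west",   "sales": "oops"},  # malformed, rejected in staging
]

# Transform in staging: normalize keys and types, quarantine bad rows.
staging = []
for row in raw:
    try:
        staging.append({"region": row["region"].strip().lower(),
                        "sales": float(row["sales"])})
    except ValueError:
        continue  # in a real job this row would go to an error table

# Load: aggregate cleaned rows into the warehouse fact table.
warehouse = {}
for row in staging:
    warehouse[row["region"]] = warehouse.get(row["region"], 0.0) + row["sales"]
# warehouse -> {"east": 150.0}
```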
In a data lake, data is ingested primarily by file transfer methods, both in real time and in batch mode. There is no concept of a staging area in a data lake, but the creator should take care to make sure the data is usable and the lake does not become a swamp. In my previous article, I have explained how to create a data lake.
Data retrieval can be through file download, RESTful APIs or generic big data SQL access if the data is structured. A data lake holds both structured and unstructured data.
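File-based ingestion and retrieval usually rely on a path convention rather than a schema. The sketch below is a hypothetical layout (zone/source/partition folder names are illustrative, written to a temporary directory): a file lands under a dated partition, and a consumer retrieves it by listing that partition and applying schema on read.

```python
import json
import tempfile
from pathlib import Path

# Ingestion: a raw file lands under a conventional zone/source/partition path.
lake = Path(tempfile.mkdtemp())
partition = lake / "raw" / "orders" / "load_date=2017-01-01"
partition.mkdir(parents=True)
(partition / "part-0000.json").write_text(
    json.dumps({"order_id": 1, "amount": 25.0}))

# Retrieval: consumers discover files by path convention, then parse them
# with whatever schema suits their own use case.
files = sorted(partition.glob("*.json"))
records = [json.loads(f.read_text()) for f in files]
```

On S3 the same idea holds, with bucket prefixes playing the role of folders.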
Metadata: One of the most distinct differences between an EDW and a data lake is the extensive use of metadata in the data lake. Metadata is used in an EDW too, but primarily for tracking ETL job status, and it resides in the same namespace as the EDW, which limits its uses.
The data lake's metadata is the first landing place for enterprise users who want to use the lake. It holds almost all the information: data types, the data dictionary, time of load, probable values, and any data quality rules or transformations applied. From the metadata, a user decides what is available and what to use; if it is not in the metadata, the data does not exist for the user. Often, the metadata is stored in a different namespace.
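A metadata catalog entry of this kind can be sketched as a simple record. The field names below are illustrative, not from any specific catalog product; the point is that consumers query the catalog first, and discovery happens through metadata rather than by scanning the lake.

```python
# A minimal, hypothetical catalog: one entry per registered dataset.
catalog = {
    "orders_raw": {
        "location": "lake/raw/orders/",
        "format": "json",
        "loaded_at": "2017-01-01T02:00:00Z",
        "columns": {"order_id": "int", "amount": "float"},
        "data_quality": "amount >= 0 checked on load",
        "sensitive": False,
    }
}

def find_datasets(catalog, column):
    """Discover which registered datasets expose a given column."""
    return [name for name, meta in catalog.items()
            if column in meta["columns"]]

hits = find_datasets(catalog, "amount")
```

If a dataset was loaded but never registered here, `find_datasets` will never surface it, which is exactly the "if it is not in metadata, it does not exist" behavior described above.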
Business Proposition: While an EDW's primary purpose is to generate time-bound reports for executives and power users, a data lake is a path to data democratization. A data lake also saves the humongous amount of time an EDW needs to grant someone access.
If you want to discuss further, feel free to contact me. On my blog site I have other articles related to data lakes which may help: https://viveksingh36.wordpress.com/
Vivek Singh – data architect, open source evangelist, chief contributor of the Open Source Data Quality project http://sourceforge.net/projects/dataquality/
Author of fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/