One of leading arguments to support data lake is, it will help data scientists because a.) it will keep data in its most granular form and b.) it will source data from multiple systems so that data scientist can cross reference data for better insights.
Both arguments are valid but they are double-edge swords. In a bid to over achieve, data lake may turn into a cesspool, if not done properly. Let look into what data scientists need and what is provided by data lake.
1.) Where to drink from – Eventually all systems will dump data into data lake. As there will be no pre-defined schema, data would be dump at a place, time and format, that is convenient to system. Also, storage of data lake will be primary nosql, hadoop or some large distributed system, from getting data will be not easy.
On the other side, Data Scientists are conversant with tools like SAS, R, Matlab, Rapid Miner, Mahout etc and use scripting languages to analyse the data. They will find it They will find it difficult to coordinate with multitude of dumping systems to find out how, what, when , where and why they dump data.
Metadata repository, governance framework and data profiling and data preparation tools like osDQ should help Data Scientist to find out data, they are interested in.
2.) Is it healthy drink – Once the data is located, next logical question would be, is is Valid data ?
Since data lake does not enforce point to point validation like ETL and data warehouse, validating data would be a tricky job. By nature, data lake is all inclusive and would not differentiate between dirty and clean data, aged and recent data, secure and insecure data, complete or incomplete data. It will the job of data scientists to filter the data that is relevant for their models.
So Data scientists would have to work with data architects and use data preparation tool to do that. If there is lack in coordination, not only modelling will take long time , most probably of low confidence.
3.) Too much to drink from – In one of my post “WHY BI PROJECTS – AND THE ROLE OF DATA ARCHITECT” , I mentioned some Business Intelligence projects fails, they become over ambitious, and try to define all business rules in ETL layer. Same ambition can damage data scientists also. Too much data around will lead to complex models and over fitting of models. A complex model does not always result into better insight.
Moderation will be required.
Conclusion: Fluid storage and data lake is the future. As this technology matures, utility tools will be available to data scientist to do their job efficiently and accurately.