Migrating an Enterprise Data Warehouse to Big Data

I have had many informal chats with friends who manage traditional EDWs at large corporations and want to migrate to big data. The migration decision has many dimensions – technical, financial, disruption to current operations, what the "to be" architecture will deliver, how the new architecture will be supported, and so on. Let me answer step by step:

1.) Should I move to big data: As big data comes out of its hype phase, reality is setting in. It will not solve all your problems, and not all your problems are big data problems.

  • If your data volume is less than 50 GB, you have around 15 attributes, and your data model and ETL are already working, you are not going to gain a lot from migrating to big data.
  • Big data’s biggest benefit is storage cost. On a typical RDBMS, 1 TB costs around $20K, while on a Hadoop cluster it costs around $3K. So if you have to store huge volumes of data, move to big data. Some companies also use a hybrid approach, archiving older data on the big data cluster while keeping recent data in their EDW.
  • The second important benefit of big data is latency, or processing speed. A typical ETL job takes around 2-3 hours to process 1 GB of data on an MPP cluster, while a 20-node Hadoop cluster will take around 25 minutes to run the same ETL (a minimal sketch of such a job follows this list).
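
To make that concrete, here is a minimal PySpark sketch of such an ETL job. Everything in it – the HDFS paths, column names, and aggregation rule – is a hypothetical placeholder, not a benchmark of the numbers above.

```python
# A minimal PySpark ETL sketch; paths, columns, and rules are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("edw-etl-sketch").getOrCreate()

# Extract: read raw delimited data from the landing zone (hypothetical path).
raw = spark.read.option("header", "true").csv("hdfs:///landing/sales/")

# Transform: cleanse and aggregate -- stand-ins for real business rules.
clean = (raw
         .filter(F.col("amount").isNotNull())
         .withColumn("amount", F.col("amount").cast("double")))
daily = clean.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

# Load: write the result as Parquet into the warehouse zone (hypothetical path).
daily.write.mode("overwrite").parquet("hdfs:///warehouse/daily_sales/")

spark.stop()
```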

If you are expecting a surge in actionable data volume (not just any data that you can park on cheap file-system storage) and want to shorten your data pipeline, then you should think about migrating to big data. Though most big data technology is open source, migration will be costly. Generally, it takes 12-18 months to migrate an EDW to big data. A typical cost breakdown may be:

  • Nodes for Production, Staging and Development – 100 nodes x $6K = $600K
  • 2 Hadoop Admins – $300K-$400K
  • 5 Hadoop Developers – $750K-$1,000K
  • Product Manager / Project Manager / UAT / Validating reports – $500K
  • Vendor license / support / training – $300K
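
Summing those line items gives a rough first-phase budget; here is the arithmetic as a quick Python sketch, with the figures taken straight from the list above:

```python
# Rough budget arithmetic for the line items above (all figures in USD;
# ranges are (low, high) tuples taken from the list).
line_items = {
    "nodes (100 x $6K)":               (600_000, 600_000),
    "2 Hadoop admins":                 (300_000, 400_000),
    "5 Hadoop developers":             (750_000, 1_000_000),
    "PM / UAT / report validation":    (500_000, 500_000),
    "vendor license/support/training": (300_000, 300_000),
}

low = sum(lo for lo, _ in line_items.values())
high = sum(hi for _, hi in line_items.values())
print(f"Estimated first-phase cost: ${low:,} - ${high:,}")
# Estimated first-phase cost: $2,450,000 - $2,800,000
```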

As you can see, it is not cheap. But it will help you manage large data volumes and reduce data processing time. The data science team will also love having large volumes of data at the lowest granular level – they will try to dig hidden gems out of it. Take your call accordingly.

You have decided to move your EDW to big data. Now, how do you do it?

2.) How to move to big data:

I am a proponent of incremental change. That way, business users will be happy too, as they see value added in quarters, not years, and they will support the migration initiative.

  • Break data mart silos: It may sound weird, but the first step of the migration is to use data virtualization software (if you don’t already have one) and connect it to all your data marts. Talk to business users and see what other attributes from other data marts they may be interested in. Teiid is a very popular open source data virtualization server that I have used (see the query sketch after this list).
  • Share a metadata catalogue: Create a metadata repository and bring in metadata from all data marts. Check for common attributes. Look into Personally Identifiable Information (PII). Ask business users and data scientists across domains to mark the attributes they are interested in. Also look into data lineage for each common attribute to confirm whether the copies come from the same source or different sources. Data quality rules should be implemented here. I have used osDQ for this, and I am also a contributor to it (a catalogue-entry sketch follows this list).
  • Share the virtualized EDW with business users: Business users see the first benefit here, as they get attributes from across domains that make their analysis better. Based on the attributes they are interested in, create a virtualized EDW for them.
  • Time for a data lake: Now it is time to design your data lake on big data. The virtualized EDW and the source systems should give you a fairly good idea of what the data lake needs. My previous article should help — https://www.linkedin.com/pulse/how-create-data-lake-vivek-kumar-singh
  • Tee off the data pipeline to the data lake: Don’t cut off the data pipeline to the EDW yet. Tee off one pipe and route it to the data lake on big data. Rewrite or migrate your ETL jobs to the big data cluster and move the processed data to the new compartment. We have used Apache Spark to write processing jobs on the big data cluster. You can build the new EDW on the big data cluster, or take the HDFS files out and load them into the existing EDW; Apache Sqoop works well for that hop (a tee-off sketch follows this list).
  • Validate the old EDW against the big data EDW: Let both streams run for a couple of months. Validate the metadata side by side and the data statistics side by side; I have used osDQ for this (a validation sketch follows this list). If they match, cut off the data stream to the old EDW – your big data platform is now in production.
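
For the data virtualization step, here is a minimal sketch of querying Teiid from Python. It assumes Teiid’s PostgreSQL-compatible ODBC transport is enabled on its default port (35432) and that a virtual database already federates two data marts; the VDB name, credentials, and table names are all hypothetical.

```python
# Querying a Teiid virtual database from Python via psycopg2, assuming
# Teiid's PostgreSQL-compatible ODBC transport (default port 35432).
# The VDB name, credentials, and table names are hypothetical.
import psycopg2

conn = psycopg2.connect(host="teiid-host", port=35432,
                        dbname="edw_vdb", user="user", password="secret")
cur = conn.cursor()

# One federated query spanning two data marts that Teiid presents as tables.
cur.execute("""
    SELECT c.customer_id, c.segment, o.total_amount
    FROM   crm_mart.customers c
    JOIN   sales_mart.orders  o ON o.customer_id = c.customer_id
""")
for row in cur.fetchmany(5):
    print(row)

cur.close()
conn.close()
```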
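
For the metadata catalogue, here is a bare-bones sketch of what an entry might capture, plus a simple lineage conflict check. The field names and sample rows are illustrative, not osDQ’s actual model.

```python
# A bare-bones metadata catalogue entry, sketched as a Python dataclass.
# Field names are illustrative; a real repository will have a richer model.
from dataclasses import dataclass, field

@dataclass
class AttributeMetadata:
    name: str                 # attribute name as seen by business users
    data_mart: str            # which silo it lives in today
    source_system: str        # lineage: where the value originates
    is_pii: bool = False      # flag Personally Identifiable Information
    interested_roles: list = field(default_factory=list)

catalogue = [
    AttributeMetadata("customer_email", "crm_mart", "CRM", is_pii=True,
                      interested_roles=["marketing", "data_science"]),
    AttributeMetadata("order_amount", "sales_mart", "POS",
                      interested_roles=["finance", "data_science"]),
]

# Lineage check: flag attributes that share a name but not a source system.
by_name = {}
for attr in catalogue:
    by_name.setdefault(attr.name, set()).add(attr.source_system)
conflicts = {n: s for n, s in by_name.items() if len(s) > 1}
print("Attributes with conflicting sources:", conflicts)
```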
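
For the tee-off step, here is a sketch of one branched pipeline in PySpark. The paths, table names, and JDBC URL are hypothetical, and the JDBC write is shown as an alternative to the Sqoop export mentioned above.

```python
# Tee-off sketch: one branch of the pipeline lands in the data lake while
# the old EDW keeps running. Paths, table names, and the JDBC URL are
# hypothetical; Sqoop is the alternative for the HDFS-to-EDW hop.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tee-off-to-lake").getOrCreate()

# Read the teed-off raw feed from the landing zone.
raw = spark.read.parquet("hdfs:///landing/orders_feed/")

# Reimplemented ETL logic (stand-in transformation).
processed = raw.dropDuplicates(["order_id"]).filter("status = 'COMPLETE'")

# Branch 1: persist to the new data lake compartment.
processed.write.mode("append").parquet("hdfs:///datalake/curated/orders/")

# Branch 2 (optional): push the same result back into the existing EDW
# over JDBC, instead of exporting HDFS files with Sqoop.
processed.write.mode("append").jdbc(
    url="jdbc:postgresql://edw-host:5432/edw",
    table="curated_orders",
    properties={"user": "etl", "password": "secret"})

spark.stop()
```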
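
Finally, for the side-by-side validation, here is a sketch that compares row counts, basic column statistics, and schemas between the old EDW and the new data lake table; again, the connection details and names are placeholders.

```python
# Side-by-side validation sketch: compare row counts and basic column
# statistics between the old EDW table and the new data lake table.
# The JDBC URL, paths, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("edw-vs-lake-validation").getOrCreate()

old = spark.read.jdbc(url="jdbc:postgresql://edw-host:5432/edw",
                      table="daily_sales",
                      properties={"user": "qa", "password": "secret"})
new = spark.read.parquet("hdfs:///datalake/curated/daily_sales/")

# 1. Row counts should match (or differ within an agreed tolerance).
print("row counts:", old.count(), "vs", new.count())

# 2. Column-level statistics (count/mean/stddev/min/max) side by side.
old.describe("total_amount").show()
new.describe("total_amount").show()

# 3. Schema drift check.
print("schema match:", old.schema == new.schema)

spark.stop()
```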

Sounds easy, but it is not 🙂 The devil is in the details. Feel free to contact me if you want to discuss this in depth.

Vivek Singh – data architect, open source evangelist, chief contributor to the Open Source Data Quality project: http://sourceforge.net/projects/dataquality/

Author of the fiction book “The Reverse Journey”: http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/

Blog: https://viveksingh36.wordpress.com/