Migrating Enterprise DataWare House to Big data

I had many informal chats with friends, who manage traditional EDW at large corporations and want to migrate to big data. Migration decision has many dimensions – technical, financial, present status disruption, what I will get at “To be” architecture, how I will support new architecture etc etc. Let me answer step by step :-

1.) Should I move to big data : As big data is coming out of hype phase, the reality is seeping in. It will not solve all your problems and all your problems are not big data problems.

  • If your data volume is less than 50 GB, have around 15 attributes , already have data model and ETL working , you are not going to gain a lot from migrating to big data.
  • Big data’s biggest benefit comes from storage space. Typical RDBMS, 1TB cost will be around 20K while on hadoop cluster it would be 3K. So if you have to store huge data move to big data. Also, some companies are using hybrid approach where they archive on big data cluster and keep recent data in their EDW.
  • Second important benefit of big data is latency or processing speed. A typical ETL job takes around 2-3 hours to process 1 GB of data on MPP cluster. While a 20 node hadoop cluster will take around 25 minutes to process that ETL.

If you are expecting a surge in actionable data volume ( just not any data – that you can store on file system on cheap storage) and want to reduce the time for data pipeline, then you should think of migrating to big data. Though most of big data technology is open sourced, migration will be costly. Generally, it takes12-18 months to migrate EDW to big data. A typical cost may be:

  • Nodes for Production, Staging and Development – 100 node X 6K = 600K
  • 2 Hadoop Admins – 300K – 400K
  • 5 Hadoop Developers – 750K – 1000K
  • Product Manager / Project Manager / UAT / Validating reports – 500K
  • Vendor license / support / training – 300K

As you can see, it is not cheap. But it will help you manage large data volumes and reduce time for data processing. Also data science team will love large volumes of data ( lowest granular level) – they will try to dig out hidden gem from it. Take you call accordingly.

You have decided to move your EDW to big data. Now how to do it ?

2.) How to move to big data :

I am proponent of incremental change. That way business user also be happy as they see added value in quarters not years. And they will also support migration initiatives.

  • Break datamart Silos : It may sound weird, but first step of migration is, use a data virtualizing software ( if you don’t already have) and connect to all data marts. Talk to business users and see what other attributes they may be interested in, from other data marts. Teiid is very populate open source data virtualization server that I have used.
  • Share Metadata Catalogue : Create a metadata repository. Bring metadata from all data marts. Check for common attributes. Look into Personal Identifiable Informations ( PII). Ask business users and data scientist across domains to mark the attributes they will be interested in. Also look into data lineage or source for common attribute to confirm they come from same source or different sources. Data Quality rules should be implemented here. I have used osDQ for this and I am also contributor to it.
  • Share virtualized EDW to Business users : Business user will see first benefit here where he or she will see attributes across domain that will make his or her analysis better. Based on the attributes, he or she interested , create virtualized EDW for them.
  • Time for Data Lake : Now it is time to design your Datalake on big data. Virtualized EDW and source system should give fairly good idea on what is needed for data lake. My previous article should help — https://www.linkedin.com/pulse/how-create-data-lake-vivek-kumar-singh
  • Tee off data pipe line to Data Lake : Don’t cut off data pipeline to EDW, yet. Tee off one pipe and move it to Data Lake on big data. Rewrite or migrate you ETL jobs to big data cluster and move the processed data to new compartment. We have used Apache Spark to write processing jobs on big data cluster. You can use new EDW on big data cluster or take the HDFS files out and put into existing EDW. You can use apache Sqoop for it.
  • Validate old EDW and big data EDW : Let both stream run for couple of months. Validate metadata side by side and data statistics side by side. I have used osDQ for this. If they are matching then cut off data stream to EDW and now your big data in production.

Sounds easy but it is not 🙂 Devil is in details. Feel free to contact if you want to discuss in depth.

Vivek Singh – data architect, open source evangelist , chief contributor of Open Source Data Quality project http://sourceforge.net/projects/dataquality/

Author of fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/

Blog post at : https://viveksingh36.wordpress.com/


Biggest Problem of Big Data – Entity Resolution

Big Data has gone past PoC phase. Different companies are at different stages of implementation. Data Ingestion, Data Storage ( Data Lake and EDW), Data Processing and Data Visualization processes have been quite mature and there are many open source and proprietary software to solve these problems.

One major hurdle Big Data faces today is – Entity Resolution ( defining a business entity form multitude of data sources). In EDW (Enterprise data warehouse) world, data were structured and sources were limited. Also keys of sources were pre and well defined . So RDBMS joins ( inner, outer, left outer, right outer, semi join etc) were enough to merge data from two different systems and tables.

In Big Data world, there is hardly any key or attribute that runs across the sources. Also keys of one system is useless as other systems are completely independent of each other. So business have to define their own logic for merging data from different sources which defines one entity. To make it worse, RDBMS kind of exact match joins should be replaced by fuzzy joins as referential integrity across systems can be ensured.

Following is a practical approach for resolving Entity Resolution:-

1.) Pre-define your entities and super set of  attributes ( coming from all data sources)

2.) Attributes may have multiple related values that should to mapped to same attributes. Plan your storage like this  ( Graph databases work fine for this relationship storage)

3.) Merge data from multiple sources using merge business logic to create a virtual entity with sizable attributes ( we used Apache spark for this )

Mapping attributes


  • Zip, County, State, Lat/long
  • Nearby locations (+/- area)  – high propensity area
  • IP location


  • Date/time stamp
  • Nearby time stamp (+/-) – Event happening before or after a period


  • String match (Fuzzy) – Name, Address, Cause
  • Cardinal Match  – events sharing same or similar key
  • IP Correlation (Primary IP, Secondary IP, IP from same ZIP code)
  • Other business logic related merging rules

You can make you merge model Machine Learning based so it will be leaning over time to do a relevant merge.

4.) Right Sizing Merges Columns

  • Remove transaction columns
  • Remove  database columns
  • Remove technical identifiable columns
  • Remove duplicate Columns

5.) Search this entity in your relational graph database to find ranks of similar entities more than threshold. If the outcome is less than threshold then make a new entry in your entity table with the attributes of the searching entity.

 6.) Take the highest ranked entity and mapped the missing attributes from highest ranked entities. This entity is you final entity for business

How to enhance entities with changing attributes:

Like any practical entity, values of attributes keep changing. You map the attributes values of searching entity to highest ranked entity and see the different of values. let’s say the IP values of an entity is matching to its secondary value over time. Then the secondary IP value becomes primary and primary becomes secondary. Or it is new IP then add one more relationship node with new values.

Ranking algorithm:

Business can assign different weight-age to must have, critical, important and good to have attributes and their matching threshold. This model also matches with secondary or related values so make it more accurate. Let’s say, address is a must match attributes. A customer other attributes are matching but address not matching so other models will reject it but in this model if matches with his or her office address or other address, it will boost the record that is right.

Big Data: Demystified

Big Data, besides being a challenge is also a huge opportunity in terms of generating insights from new and varied types of data enabling businesses to become more agile than before. A simple, graphic and contemporary definition of Big Data would be ‘all the machine generated data, which gets populated rapidly alongside all the types of data that have complexities rather than size or volume’. Examples of Big Data would be pertinent to industries such as e-Commerce, Telecom, Social Media and BFSI with the types of data being dealt with including call logs, Web logs, Internet text/documents/search indexing, sensor networks, RFID, social network data etc.

A typical example would be a Web commerce enterprise, which would need to gain near-real-time insights into its customers’ behavioral patterns and trends in order to influence their marketing campaigns, delivery model, pricing as well as its products or service offerings. Big Data can and, more often than not, it does include a variety of ‘unusual’ data types. In that sense, it can be either structured or unstructured. The former would include high volume transaction data such as call data records in telcos, retail transaction data and pharmaceutical drug test data. Unstructured data is more difficult to process and this includes semi-structured data such as XML and HTML besides unstructured data like text, image, rich media, Web logs etc. Challenges here include the capture, storage, search, sharing, analysis and visualization of data sets. Though Size (Volume) is key to the primary definition (typically more than a couple of Terabytes) of Big Data, the other critical dimensions are:

Latency (Velocity): For time-sensitive processes such as detecting fraud, Big Data must be used as it streams into your enterprise in order to maximize its value.

Variety: Big Data is any type of data; it includes structured as well as unstructured data in the form of text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

Complexity (Multi-Dimensionality): Big often refers to complexity rather than volume. Big Data can be very small and not all large datasets are huge.

The term Big Data is now applied more broadly to cover platforms having faster, cheaper and distributed processing power, clustered computing, lower cost of storage with in-built fault tolerance and network using commodity (cost-effective) hardware. Hadoop is one such platform. It is supplemented by an ecosystem of Apache projects, such as Pig, Hive, Hbase and Zookeeper that extend the power of Hadoop.

One of our clients that generates about 2 TB/day of networking data wanted to conduct a cost-analysis of this data. It also wanted to store the data for 18 months. The data was time-series and it would fit into the definition of Big Data. It was not streaming data and, therefore, we recommended GridFTP (an open source solution) as the data transport layer and suggested that the client set up a 50 node (2 cpu -quad core, 8 GB RAM and 1 TB disk space) Apache Hadoop grid for this purpose. Data was compressed using bzip compression logic. Since it was structured data, we used Pig Latin to do the data quality and transformation. The client also wanted an SQL like interface for doing the analysis. The output of Pig was fed into Hive and Hive tables were created where the client was able to run SQL like statements. With the help of the Unix crontab utility, jobs were scheduled and new data was appended to Hive Tables after the completion of each job. The client is also looking at pre-defined, canned as well as ad hoc reporting. We at GrayMatter are well placed in the Big Data space having been a firm advocate and believer in open source systems and technologies from our inception when it was little known and not as mature as it is today.

Proven recommendations for Big Data projects

Unstructured Search Solr (MR version of Lucene)
Structured search (Key-Val pair) Cassandra(Not Hadoop supported), Hbase
Document Search MangoDB, NoSQL, Solr
Transformation, Data Quality Pig , Java MR code, Hadoop Streaming
SQL like analysis Hive, HBase
Stream input Scribe, Flume
Student Entiry

Why BI projects fail – and the role of Data Architect

During my long stint as business intelligence professional, I have seen many projects fail and of course, smelled some success. Let’s see what is common among successful projects and the unsuccessful ones.

Organisation Structure: Unlike other IT  engineering projects, Business Intelligence projects need strong business interface – a make or break for BI project. Business is divided into groups or sliced by business functions (BU) – but data is not. A typical BI project will run into multiple Business Functions – which means working with two or more VPs and their organisations. If BI project is sponsored by IT it will hit bottleneck with Business. A typical reply will be – a.) we already have this information in XLS b.) Our processes are different , c.) we can not wait so long for data d.) this is done at vendor site

And business is right. Every business functions or unit is different – Granularity is different, business focus is different, workflow is different. IT sponsored BI projects treats all business units same and hence BI projects lose relevance for business. I have seen high rate of success with  MIS related BI projects – because IT is both consumer and sponsor of project. Otherwise IT does not business rules and processes to make a report relevant – that has to come from Business. So it is important business sponsors it and business unit heads or VPs sponsor it.

Enterprise Data warehouse (EDW) : A typical approach to a BI project is, create an enterprise data warehouse and all BU ( business Unit) will take data from there and have their  data mart. Business is complex and putting all rules in Enterprise warehouse has a very high degree of failure. Enterprise data warehouse become too cumbersome to use and invariably business rule will change overtime. Cost of changing ETL jobs are very high and it takes long time to propagate the change from source, to staging area, to target , to reports, to analytic. Business get frustrated and figures out a way to get data in xls sheet and then does it own analysis –  EDW fails.

A leaner and less complex data model is better. Report Developer, ETL developer or Data Integration are not business analyst. They are bound to do mistakes in putting all the rules in all encompassing warehouse. A good amount of business rules can be pushed to report engine and analytic engine. A focused warehouse is more successful and a multi-purpose or generic warehouse.

Data Quality:  In real business, dirty data or incomplete data do come in. If they are feed into warehouse as is, report is faulty and unfortunately it happens most of time. Technical people can understand data type, data structure and raw data quality – like null values, duplicate values, negative values, outliner etc. But they do not understand business implication of that and do not what should be right value. A typically ETL tool will discard those records and report will not be updated. Let’s take an imaginary business rule – where if an existing ( from other business unit) customer walks-in you do not ask him fill-in personal information form and only take customer id. ( The idea is you will collect all the data from other business unit and save time  of customer). When desktop operation feed the data in – he or she feeds only with customer id. If you business processes are not real time ( in most cases they are not ) only customer id goes into CRM system and most probably nightly ETL load will ignore data as dirty data and your report will show one less count.

Data Analyst and Business Analyst have to sit together and profile their data and look into conditions using tools like osDQ – http://sourceforge.net/projects/dataquality/  to get a good understanding of data , before moving onto project.

Role of Data Architect:  Probably he is the first person to know if the project is on track but he has limited visibility. Most of time, data architects belong to IT group and has very limited saying in Business Unit. Unfortunately, today we do not have a process or framework which tell how data architect should talk / show artifacts to business. TOGAF has tried to give some framework, but it very limited.

A good start can be starting from IT Domain architecture where business unit and high level functionality is mapped. Let take an example of company which creates, tests, scores and report K-12 tests.

IT Domain Architecture    Once the business domain is identifies, data architect should create DFD ( data flow document – like image below ) which says which data moves across business domains and which is the data lying within a domain. The data is flowing across business domains ( or units) are the once which are more prone to error – as it changes values across domain.TCMDataFlow

Once the DFD is created , Entity reference model can be created as below where steward can be identified.

Student Entiry

Student Entiry

Once we have formal steward in place from business side, the success rate increases . In my next post I will write in detail about roles and responsibilities of data architect and the process to find out steward.

Good Luck !!

Disruptive ETL – Next Frontier of ETL

Both proprietary and open source ETL tools have been there for decades and have been doing fine. Some of data integration projects were successful while other failed. More than tools, I would blame data architects and company culture for failure – as some of them have been very secretive about data while other did not want to share. ETL tools were doing what they were supposed to do – Extraction from data sources, Transformation into staging area and Load into warehouse schema

Now the million dollar question would be – what next ? Technology is changing, storage is cheap, massive parallel processing is possible, reporting is becoming schema less and search based – so what will be future of ETL ? To answer this question, first we need to analyse the ETL features today as most of so called innovations are either extension of new features or fulfill new requirements. Contemporary ETLs tools are focused on :

1.) Extraction from many data source in batch mode – i.e full load, incremental load, redistributed load etc.
2.) Very heavy on Transformation – i.e Year till date (YTD), Slow changing dimensions, mapping, aggregates etc.
3.) Load into warehouse – i.e star schema, fact tables, dimension tables, aggregate table etc.

So what to going to change in next couple of year. Let proceed in reverse order to start with downstream consumers

Load into warehouse : With the advent of data lake concept as against datawarehouse , Hadoop and nosql as storage as against to RDBMS, and schema-less reporting against cubes and dimensional modelling ,this is certainly going to change. Data architects certainly will not want to silos their data into pre-built schema and want to give more flexibility to end users.
Data scientists do not like aggregation because granularity and lots of information is lost. They hate taking feed from data marts or data warehouse.

Coming days, ETL will focus loading data into datalake kind of system which is metadata and tag driven and will focus less on pre-built schema and aggregation load.

Very heavy on Transformation : This is bread and butter of contemporary ETL tools. They do all kind of transformations but going forward probably all they is not needed. Lot more transformation and formatting will be done by reporting layer and reporting layer will also be built to process massive and big data, hence the aggregation transformation will be redundant.

Coming days, ETL tools will be focusing more on data quality and data munging.

Extraction from many data source in batch mode: I do not see many changes there as data sources keep adding and we need to extract data from there. ETL tools will add new adapters to take real time feeds and data stream. There are tools which already have build adapters and working on it.

I am sure ETL tool will reinvent themselves and adapt to new changes.

ETL will become EQMP Extraction, Quality, Munging and Publishing

Vivek Singh is data architect, Evangelist and main contributor of osDQ – Open Source Data Quality and Profiling