Migrating Enterprise DataWare House to Big data

I had many informal chats with friends, who manage traditional EDW at large corporations and want to migrate to big data. Migration decision has many dimensions – technical, financial, present status disruption, what I will get at “To be” architecture, how I will support new architecture etc etc. Let me answer step by step :-

1.) Should I move to big data : As big data is coming out of hype phase, the reality is seeping in. It will not solve all your problems and all your problems are not big data problems.

  • If your data volume is less than 50 GB, have around 15 attributes , already have data model and ETL working , you are not going to gain a lot from migrating to big data.
  • Big data’s biggest benefit comes from storage space. Typical RDBMS, 1TB cost will be around 20K while on hadoop cluster it would be 3K. So if you have to store huge data move to big data. Also, some companies are using hybrid approach where they archive on big data cluster and keep recent data in their EDW.
  • Second important benefit of big data is latency or processing speed. A typical ETL job takes around 2-3 hours to process 1 GB of data on MPP cluster. While a 20 node hadoop cluster will take around 25 minutes to process that ETL.

If you are expecting a surge in actionable data volume ( just not any data – that you can store on file system on cheap storage) and want to reduce the time for data pipeline, then you should think of migrating to big data. Though most of big data technology is open sourced, migration will be costly. Generally, it takes12-18 months to migrate EDW to big data. A typical cost may be:

  • Nodes for Production, Staging and Development – 100 node X 6K = 600K
  • 2 Hadoop Admins – 300K – 400K
  • 5 Hadoop Developers – 750K – 1000K
  • Product Manager / Project Manager / UAT / Validating reports – 500K
  • Vendor license / support / training – 300K

As you can see, it is not cheap. But it will help you manage large data volumes and reduce time for data processing. Also data science team will love large volumes of data ( lowest granular level) – they will try to dig out hidden gem from it. Take you call accordingly.

You have decided to move your EDW to big data. Now how to do it ?

2.) How to move to big data :

I am proponent of incremental change. That way business user also be happy as they see added value in quarters not years. And they will also support migration initiatives.

  • Break datamart Silos : It may sound weird, but first step of migration is, use a data virtualizing software ( if you don’t already have) and connect to all data marts. Talk to business users and see what other attributes they may be interested in, from other data marts. Teiid is very populate open source data virtualization server that I have used.
  • Share Metadata Catalogue : Create a metadata repository. Bring metadata from all data marts. Check for common attributes. Look into Personal Identifiable Informations ( PII). Ask business users and data scientist across domains to mark the attributes they will be interested in. Also look into data lineage or source for common attribute to confirm they come from same source or different sources. Data Quality rules should be implemented here. I have used osDQ for this and I am also contributor to it.
  • Share virtualized EDW to Business users : Business user will see first benefit here where he or she will see attributes across domain that will make his or her analysis better. Based on the attributes, he or she interested , create virtualized EDW for them.
  • Time for Data Lake : Now it is time to design your Datalake on big data. Virtualized EDW and source system should give fairly good idea on what is needed for data lake. My previous article should help — https://www.linkedin.com/pulse/how-create-data-lake-vivek-kumar-singh
  • Tee off data pipe line to Data Lake : Don’t cut off data pipeline to EDW, yet. Tee off one pipe and move it to Data Lake on big data. Rewrite or migrate you ETL jobs to big data cluster and move the processed data to new compartment. We have used Apache Spark to write processing jobs on big data cluster. You can use new EDW on big data cluster or take the HDFS files out and put into existing EDW. You can use apache Sqoop for it.
  • Validate old EDW and big data EDW : Let both stream run for couple of months. Validate metadata side by side and data statistics side by side. I have used osDQ for this. If they are matching then cut off data stream to EDW and now your big data in production.

Sounds easy but it is not 🙂 Devil is in details. Feel free to contact if you want to discuss in depth.

Vivek Singh – data architect, open source evangelist , chief contributor of Open Source Data Quality project http://sourceforge.net/projects/dataquality/

Author of fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/

Blog post at : https://viveksingh36.wordpress.com/


Is Data Preparation part of Data Quality ?

As we all know, the lineage of data quality comes from CRM (Customer Relationship Management) system doing address correction and later moved into MDM (Master Data Management). As business matures, Data Quality (DQ) moves from reactive to “data-in-motion”.

On the other hand, Data Preparation, was traditionally part of Data Mining process, which was done in batch mode and probably by data scientist. Data Preparation involves steps that are taken in order to make data model ready.

Now business is moving from IT driven world to business-rule driven world. What I mean to say is, now data is important only if it maps to business needs. Structure data profile like null, pattern, outlier etc has limited value. Business is looking at DQ to validate complex business rules. It is handled by data steward of business rather than IT managers.

Data Science world is also changing. Model has to be business driven rather than solving some theoretical mathematical problem. Data scientists are working with business users and data stewards to understand business and data. With this new development the boundary line between Data Quality and Data Preparation is getting blurred.

1.) Data Fishing Vs Entity Resolution : From the web “Data dredging, sometimes referred to as data fishing is a data mining practice in which large volumes of data are searched to find any possible relationships between data…”

In DQ, Entity resolution is the process where all the attributes from disparate source are collected and an entity definition is created. As data management moving toward integrated world, both processes will be merged into one.

2.) Area of Mine Vs Data Ownership : In Data Mining, ‘Area of Mine’ defines data / record data scientists are interested for to model. In DQ world, data ownership defines who own the data. In a metadata driven data world, it will be decided by single metadata management system.

3.) Missing Values : Missing values are problematic for both – data preparation and data quality. Only difference is, an inferred value is good enough for data mining while data quality either looks for 100% accuracy ( like geo-encoding), replace with default value or discard the data as dirty data. Whatever the ways may be, both try to mitigate missing values. I am expecting business will define their missing data strategy and how to interpret them.

4.) Noise reduction Vs Data Scrubbing : It is important to reduce noise to create a good model. Data mining has several techniques to reduce noise. Data Quality has several ways to scrub and massage the data. These techniques are executed on data to make it compliance to business.

Summary: As getting insight is increasing becoming business function role while IT works as only facilitator, I am expecting data quality and data preparation processes will be merged into one, that will be managed by data stewards.

Your thoughts !!

About Author : Vivek K Singh is data architect and runs worlds first open source data quality and data preparation tool http://sourceforge.net/projects/dataquality/

Data Lake:Boon or Bane for Data Scientist

One of leading arguments to support data lake is, it will help data scientists because a.) it will keep data in its most granular form and b.) it will source data from multiple systems so that data scientist can cross reference data for better insights.

Both arguments are valid but they are double-edge swords. In a bid to over achieve, data lake may turn into a cesspool, if not done properly. Let look into what data scientists need and what is provided by data lake.

1.) Where to drink from – Eventually all systems will dump data into data lake. As there will be no pre-defined schema, data would be dump at a place, time and format, that is convenient to system. Also, storage of data lake will be primary nosql, hadoop or some large distributed system, from getting data will be not easy.

On the other side, Data Scientists are conversant with tools like SAS, R, Matlab, Rapid Miner, Mahout etc and use scripting languages to analyse the data. They will find it They will find it difficult to coordinate with multitude of dumping systems to find out how, what, when , where and why they dump data.

Metadata repository, governance framework and data profiling and data preparation tools like osDQ should help Data Scientist to find out data, they are interested in.

2.) Is it healthy drink – Once the data is located, next logical question would be, is is Valid data ?

Since data lake does not enforce point to point validation like ETL and data warehouse, validating data would be a tricky job. By nature, data lake is all inclusive and would not differentiate between dirty and clean data, aged and recent data, secure and insecure data, complete or incomplete data. It will the job of data scientists to filter the data that is relevant for their models.

So Data scientists would have to work with data architects and use data preparation tool to do that. If there is lack in coordination, not only modelling will take long time , most probably of low confidence.

3.) Too much to drink from – In one of my post “WHY BI PROJECTS – AND THE ROLE OF DATA ARCHITECT” , I mentioned some Business Intelligence projects fails, they become over ambitious, and try to define all business rules in ETL layer. Same ambition can damage data scientists also. Too much data around will lead to complex models and over fitting of models. A complex model does not always result into better insight.

Moderation will be required.

Conclusion: Fluid storage and data lake is the future. As this technology matures, utility tools will be available to data scientist to do their job efficiently and accurately.

Education Engagement Analysis

Vivek K. Singh and Amish Choudhary
Open Source Evangelist

Data Quality is a proven technology for Data enrichment and record linking. We are extending Data Quality and Analytics techniques to delve into student’s demographic and teacher’s engagement data for Indian education system. In this paper, we are proposing a process of collecting education data and do data quality using open source data quality tool (osDQ) also called “Aggregate Profiler”: http://sourceforge.net/projects/dataquality/. Record linking techniques are used to define correlation for engagement and demographic analysis.

In the paper “Integrated Data Quality (DQ) along Data Life cycle[3]”, author proposed a model for Data life cycle consisting of eight Phases. In this paper we applied the data life cycle phases to Indian education system using Open Source Data Quality and Profiling tool. The tool provides capability for data cleaning and data analysis to get correlation of data. While the process allows deriving various inferences based on data, we are using the process to correlate the impact on student grades of the commute distance and schedule load of teachers.

Figure 1 Data Management Framework Process

2. Background
During our visit to various schools and parent interviews we felt the need to analyze the student’s demographic data and teacher’s engagement data to see the correlation on students grades of the time spend during the commute and teacher’s calendar. Indian education system is one of the most cumbersome due to sheer volume of data, non-uniform standard for collecting and storing the data. Hence we are proposing using Data Management Framework to cleanse and analyze the data. The tool provide
• Big data support – HIVE thrift server support
• Format Creation, Format Matching ( Phone, Date, String and Number), Format standardization
• Fuzzy Logic based similarity check, Cardinality check between tables and files
• Export and import from XML, XLS or CSV format, PDF export
• File Analysis, Regex search, Standardization, DB search
• Complete DB Scan, SQL interface, Data Dictionary, Schema Comparison
• Statistical Analysis, Reporting ( dimension and measure based)
• Pattern Matching , DeDuplication, Case matching, Basket Analysis, Distribution Chart
• Data generation and Data masking features
• Meta Data Information, Reverse engineering of Data Model
• Timeliness analysis , String length analysis
• Mysql, Oracle,Postgres,MS Access, Db2, SQL Server certified

3. Methodology
Data management framework defines eight phases as shown as Rim in Figure 1 Data Management Framework Process.

3.1 Data Definition Phase

We started by profiling the data. Figure 2 shows the profile of data like Record count, Pattern information and Distribution chart using Aggregate profiler tool. We defined the data nomenclature and entity identification by doing semantic validation using tools Fuzzy logic based similarity check, standardization and Case formatting feature. The following entities were identified: Students, Teachers, Grade, Teacher’s calendar, Time, Student home pin code, School’s pin code.

Figure 2 Data Profile

3.2 Data Capture Phase
During data capture phase we used Aggregate Profiler tool’s Mysql JDBC connectors to get Student data. Teacher calendar data was imported from spreadsheet using Import file option using First row column name option. From the captured data we identified relevant Entity Attributes to be captured and discarded. Validation of data sources was done for authenticity of data.

3.3 Data Validation and Formatting Phase
Most of data cleaning was done during this phase. We decided to use the tool feature to provide database comparison. Also we defined the range rules to find outliners like too large distance or students in hostels. Timeliness of data was used to weed out old and stale data like teacher’s calendar.

Figure 3 Validation and Formatting Mechanisms

3.4 Data Movement and Sharing Phase
During this phase each department was treated as separate domains. Data that was shared across department boundary like student identification was also identified. Similarly reference data attributes that can change was identifed during this phase and migration plan were put in place. For e.g. student can change residence during or between the sessions.
Record linking was done using fuzzy logic based similarity check (Figure 3) we massaged demographic data to provide uniform data – like locations “Electronic City” and “ECity” where mapped to same location/ Also Name Correlation was done using same logic – Carol and Carole were mapped to same teacher.

3.5 Data Modeling and Storage Phase
After cleaning the data and applying various filtering, validation and formatting rules, relevant data pertaining to student and school demographics, student grade and teachers schedule calendar was stored in relevant demograhic tables in Mysql database.

Student Entiry

Student Entiry

Figure 4 Student Entity Data

3.6 Data Archival Phase
We did not archive any information

3.7 Data Retrieval Phase
During this phase we provided Data access layer using Aggregate profiler’s JDBC connectors Figure 5. Tools provide connector to several RDBMS and Hadoop Hive. We have stored out data in mysql so we used mysql connector. If dataset is very large (Volume, Variety and Velocity) Hive can be used. RDBMS was faster than big data.
Figure 5 Data Connectors

3.8 Data Presentation Phase
During this phase we did data analysis to determine correlation between the data. Figure 6 shows the bar chart of Average grade of student as a function of commuting distance

Figure 6 Average Student Grade vs. Distance

4. Results
Applying the data management framework we found a strong correlation between commuting distance and student grades Figure 6. However there was a weak correlation between teacher’s load and student grades Figure 7.

Figure 7 Average Grades vs. Teaching Hours
5. Conclusion
The strong correlation between commuting distance and student grade can be explained due to student fatigue factor and amount of time Indian parents get to teach the students at home.
However in Average Grade vs. Teaching hours spend by teachers a lot of other factors like subject taught and skill level of teachers also comes into play.
The data management framework and Aggregate Profiler Data Quality tool allows the data quality to be applied across multiple domains for cleaning, analysing and presenting data across different domains. It can be used by education institutions to do meaningful analysis of data by focusing on subset of data and identify area of improvents.

[1] Singh, Vivek. K. 2012. Integrated Data Quality along Data Life Cycle. SiliconIndia (http://bi.siliconindia.com/bi-expert/Integrated-Data-Quality-DQ-along-Data-Life-Cycle-Part-1-eid-423.html)
[2] Gibson, Neal and Talburt, John 2012. Hive: Crowdsourcing education data. In Journal of Computing Sciences in Colleges Volume 25 Issue 5 (May 2010), 72-78.