Big Data QA – Why is it so difficult?

Recently, a Big Data QA profile was needed for a project that processes terabytes of machine-generated data every day. Below is the cryptic job description that was sent to the recruitment team —

“We are looking for a Big Data developer who can write code on a Hadoop cluster to validate structured and semi-structured data.

Skill sets – Apache Spark, Hive, Hadoop, data quality”

The recruitment team was amused.

Are you looking for a developer or a QA? Is it manual testing or automated testing? Where should we look for this kind of profile?

Big Data QA is an emerging area, and it is different from traditional product or enterprise QA. A Big Data QA is a big data developer who writes code to validate data at a scale beyond human capacity. A BDQ (Big Data QA) does not fit into the traditional QA profiles – automated vs. manual; backend vs. frontend; feature vs. non-feature, etc.

Data products need BDQs to ensure that the processes managing their data lifecycle ingest and emit the right data. The volume is so huge that no human can load the data into a spreadsheet and validate it row by row, column by column.

Depending on business requirements, a BDQ may write very complex code to validate complex business rules. The day-to-day activities of a BDQ include:

  • Profile the data to validate datasets — pre and post business rules
  • Find data holes to ensure all the data is arriving
  • Statistical modeling to compare and contrast two or more datasets
  • Implement unified formatting and unified data types
  • Implement multiple types of joins on datasets, including fuzzy joins
  • Implement UDFs (user-defined functions) for reusable validation logic
  • Find outlier data and decide how to manage it
  • Find null and empty data and decide how to manage it
  • Validate the naming conventions of datasets
  • Validate the file format (CSV, Avro, Parquet, etc.) of datasets
  • Monitor incoming and outgoing datasets and redistribute on failure
  • Validate the data models for analytics
  • Create sample data and make sure the sample is not skewed
  • Create training and test datasets
  • Encrypt and decrypt (anonymize) data
  • Implement and monitor compliance rules
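Several of the items above (profiling, null counts, outlier detection) can be sketched in a few lines. The helper below is an illustrative, plain-Python sketch – the function and column names are my own, and a real BDQ would run the equivalent on Spark or Hive rather than on in-memory lists:

```python
from statistics import mean, stdev

def profile_column(rows, col):
    """Profile one column of a dataset: null/empty count, distinct count,
    and values more than 3 standard deviations from the mean (outliers)."""
    values = [r.get(col) for r in rows]
    nulls = sum(1 for v in values if v in (None, ""))
    numeric = [v for v in values if isinstance(v, (int, float))]
    outliers = []
    if len(numeric) > 2:
        mu, sigma = mean(numeric), stdev(numeric)
        if sigma > 0:
            outliers = [v for v in numeric if abs(v - mu) > 3 * sigma]
    return {"nulls": nulls, "distinct": len(set(values)), "outliers": outliers}
```

Running the same profile before and after a business rule is applied gives the "pre and post" comparison from the first bullet.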

The above is by no means an exhaustive list; it is indicative. A BDQ is a prolific programmer who understands data well (like a business analyst) and codes for the data quality domain. This bold combination makes BDQs so hard to find.

It is best to train data engineers to become BDQs by providing courses and exposure to data quality and data management.

Vivek Singh – data architect, open source evangelist, chief contributor of the Open Source Data Quality project

Author of fiction book “The Reverse Journey”

How to create a Data Lake

As the data lake goes mainstream, the obvious questions are – a.) how to build it and b.) will it work?

The latter comes mostly from sponsors of failed Business Intelligence projects, who still feel the pain. ( Here is my blog about it – )

Then there are further questions about time and resources. Do I need to throw away my EDW (enterprise data warehouse)? Is Hadoop (big data) a must for a data lake? What are the tools for creating a data lake?

In this blog, I try to answer these questions from a practitioner's perspective.

The data lake is evolution, not disruption: As storage gets cheaper, data practitioners thought of co-locating raw data with processed data so that data engineers and data scientists have quick access to the raw data. This shift changes “Extraction” to “Ingestion”, where data is loaded “as is”. It also brings some changes to ETL tools – see here ( )

But it does not require throwing away your existing ETL jobs and EDW. If designed carefully, they can be reused in the data lake. Hadoop is not a must either. A data lake can be created on an RDBMS, NoSQL, HDFS, or any file system. Hadoop/HDFS, being cheapest, is the preferred choice, but if you have unlimited corporate licences for any of the above, you can use it for your data lake.

How to build it: Divide your data lake into four logical and physical spaces.


i) Raw Space: (a combination of the Extraction and Staging of a traditional warehouse, with state information)

a.) Load data “as is”

b.) Define the folder structure for different sources

c.) Meta Information: time of load, load volume, load duration, load status

d.) Useful for auditing and data lineage

ii) Qualified Space: (a combination of the Transformation and Joining of a traditional warehouse, with a data dictionary)

a.) Run data quality and entity resolution on the “as is” data

b.) Define the folder structure for different partitions (time-based, region-based, vertical-based)

c.) Meta Information: data type, expected values, whether manipulable

d.) Useful for insight and prediction

iii) Data Warehouse Space: (you can reuse your existing EDW here)

a.) Load recent data into the EDW

b.) Define reporting and discovery parameters

c.) Meta Information: data dictionary, granularity and latency

d.) Useful for operational reports and discovery

iv) Collaborative Space: (in this space you expose your data to third parties or fetch data from third parties)

a.) Find the data to be exposed

b.) Define security and data formats

c.) Meta Information: data dictionary, aggregation level, data format

d.) Useful for data monetization and data fusion
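As a rough illustration of the four spaces, here is how the folder convention and the raw-space load metadata (step i.c) might look. Every path and field name below is an assumption for the sketch, not a prescription:

```python
import time

# The four logical/physical spaces of the lake.
SPACES = ["raw", "qualified", "warehouse", "collaborative"]

def ingestion_record(source, relative_path, volume_bytes, status="OK"):
    """Raw-space metadata for one load: time of load, volume, and status,
    kept for auditing and data lineage."""
    return {
        "source": source,
        "path": f"/datalake/raw/{source}/{relative_path}",
        "load_time_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "volume_bytes": volume_bytes,
        "status": status,
    }
```

A record like this, written on every load, is what makes the auditing and lineage of step i.d possible later.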

Conclusion: The above are general guidelines for creating a data lake. As every business is different, every data lake is different. The choice of tools and storage will vary according to your requirements. Though storage is cheap, processing a huge amount of data needs lots of CPU and RAM – which is serious money. So be careful with the volume of data. Though it is generally said to bring everything into the data lake – start small and see value before bringing everything in.

Data Lake: Boon or Bane for the Data Scientist

One of the leading arguments in support of the data lake is that it helps data scientists because a.) it keeps data in its most granular form and b.) it sources data from multiple systems so that data scientists can cross-reference data for better insights.

Both arguments are valid, but they are double-edged swords. In a bid to over-achieve, a data lake may turn into a cesspool if not done properly. Let's look into what data scientists need and what the data lake provides.

1.) Where to drink from – Eventually, all systems will dump data into the data lake. As there is no pre-defined schema, data will be dumped at a place, at a time, and in a format that is convenient to the source system. Also, the storage behind the data lake will primarily be NoSQL, Hadoop, or some other large distributed system, from which getting data will not be easy.

On the other side, data scientists are conversant with tools like SAS, R, Matlab, RapidMiner, Mahout, etc., and use scripting languages to analyse the data. They will find it difficult to coordinate with the multitude of dumping systems to find out how, what, when, where, and why they dump data.

A metadata repository, a governance framework, and data profiling and preparation tools like osDQ should help data scientists find the data they are interested in.

2.) Is it a healthy drink – Once the data is located, the next logical question is: is it valid data?

Since the data lake does not enforce point-to-point validation the way ETL and the data warehouse do, validating data is tricky. By nature, the data lake is all-inclusive and does not differentiate between dirty and clean data, aged and recent data, secure and insecure data, complete and incomplete data. It is the job of data scientists to filter out the data that is relevant to their models.

So data scientists will have to work with data architects and use data preparation tools to do that. If coordination is lacking, not only will modelling take a long time, it will most probably be of low confidence.

3.) Too much to drink from – In one of my posts, “WHY BI PROJECTS – AND THE ROLE OF DATA ARCHITECT”, I mentioned that some Business Intelligence projects fail because they become over-ambitious and try to define all business rules in the ETL layer. The same ambition can damage data scientists too. Too much data around leads to complex models and over-fitting. A complex model does not always result in better insight.

Moderation will be required.

Conclusion: Fluid storage and the data lake are the future. As the technology matures, utility tools will become available for data scientists to do their jobs efficiently and accurately.

Data Lake and E.T.L – Big Data Plugins

A lot has been written about the structure and constituents of a Data Lake and what a Data Lake should contain. An equal volume of artifacts is available on the shortcomings of the Data Lake and what makes it toxic. In this blog, I am not discussing the positives and negatives of the Data Lake, but how a Data Lake initiative will change ETL (Extraction, Transformation and Load) workflows – and maybe the acronym ETL itself.

1.) There is no T (transformation) between E (extraction) and L (load): A Data Lake stores data in its most granular form, and sometimes in raw form itself. Minimal transformation is required in between. The combination of Extraction and Loading is called Ingestion. Data, in its rawest and most granular form, is ingested into the lake.

So Data Lake architects are looking for tools like Storm, Flume, Chukwa, Sqoop, message queues, ZooKeeper, Kafka, etc., which can do massive write operations with minimal transformation in between. Traditional ETL tools have a strong transformation layer, which is not suited to the ingestion job.

2.) The transformations are different: Traditional ETL tools' transformations are primarily data warehouse transformations like Slowly Changing Dimensions, year/month-to-date, aggregate awareness, maps, etc. A Data Lake will focus on the following transformations.

i) Latency Normalization: A multitude of sources will dump data into the lake with different latencies – some real-time, some delayed by minutes, some by hours, and others even by days. One of the important transformations needed by a Data Lake is to “normalize the latency factor” so that records of different latencies can be correlated.
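One simple way to normalize latency is to map every record's timestamp to the start of a common time window, so that records arriving minutes or hours apart still share a join key. A sketch, assuming zone-aware timestamps and an hour-wide window by default (both assumptions are mine):

```python
from datetime import datetime, timezone

def latency_bucket(ts, window_minutes=60):
    """Map an event timestamp to the start of its window, giving records
    of different latencies a common key for joining."""
    epoch = int(ts.timestamp())
    start = epoch - epoch % (window_minutes * 60)
    return datetime.fromtimestamp(start, tz=timezone.utc)
```

Two records about the same event, one arriving live and one hours late, then land in the same bucket and can be joined.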

ii) Format and Semantic Transformation: Records from different sources will have different formats and nomenclature. Internationalization and localization add more complexity to both. A Data Lake will need strong transformations for formatting and semantic-layer mapping.

iii) Structural Analysis: Data is dumped into the lake without much consideration. It might not be well formed or validated. Null handling, outlier and noise detection, de-duplication, and some cleaning are required to make the data usable. Data quality will be an integral part of transformation rather than an afterthought.

iv) Entity Resolution and Record Linking: Records dumped into the Data Lake are not atomic. A lot more focus will be on advanced entity resolution and record-linking transformations. Simple map-type transformations will not be sufficient. Unstructured data will need text mining and ontologies to resolve and define an entity. Record linking will not be based on simple primary-key/foreign-key relationships; rather, business rules will decide how to link records.
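A minimal sketch of rule-based record linking: instead of joining on an exact primary/foreign key, pair records whose name fields are similar above a threshold. Here `difflib` stands in for whatever matching engine a real pipeline would use, and the field name and threshold are illustrative:

```python
from difflib import SequenceMatcher

def link_records(left, right, key, threshold=0.85):
    """Fuzzy record linking: pair records whose key fields are similar
    above a threshold, rather than requiring exact equality."""
    links = []
    for l in left:
        for r in right:
            score = SequenceMatcher(None, l[key].lower(), r[key].lower()).ratio()
            if score >= threshold:
                links.append((l[key], r[key], round(score, 2)))
    return links
```

The O(n·m) comparison is fine for a sketch; at lake scale, blocking or indexing would be needed before scoring.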

v) Metadata Tagging: The Data Lake is fluid storage. Unlike the pre-built schema definition of a data mart or data warehouse, a Data Lake will be a black box for information retrieval. Metadata tagging is required to tell what is available and where it is available. Metadata tagging will make or break Data Lake initiatives.

3.) The workflow is different: Extraction, Transformation, Load (E.T.L.) will change to Ingestion, Quality, Munging, Publishing (I.Q.M.P.). The new set of tools will focus on doing more processing on the cluster rather than in memory or inside the tool's container.

4.) The technology is different: I will write more about this in my next blog, as it is a big area of interest for Data Architects. Data marts and warehouses were hosted mostly on RDBMSs, while the fluid design of the Data Lake makes it better suited to Big Data technologies like HDFS, Cassandra, HBase, etc. Also, a lot of unstructured and semi-structured data will be hosted, which will need to be indexed and searchable. A search tool (like Lucene, Elasticsearch, etc.) will become a must for Data Lake initiatives.

Summary: As Big Data and the Data Lake gain traction, ETL tools will change. They might provide plugins that bundle and enrich transformations suited to the Data Lake and optimized for big data processing.

About the Blogger: Vivek Singh is an open source evangelist, chief committer of Open Source Data Quality (osDQ), and a Data Architect.

Integrated data quality

Integrated Data Quality (DQ) along Data Life Cycle

It is a cliché in the Business Intelligence domain – garbage in, garbage out. Everyone understands the need for and importance of data quality (DQ), but unfortunately, today DQ seems to instill very little confidence in data and the reports built on that data. The primary reason for this lack of confidence is that DQ is done in silos and does not validate the bigger business rules. Today, DQ is mostly structural analysis (null validation, range validation, format validation, de-duplication, geo-encoding, etc.) done during the data storage phase (when the data is stored in an RDBMS). This paper proposes a data life-cycle model, much as TOGAF (The Open Group Architecture Framework) proposes an architectural life-cycle, and shows how DQ should be integrated along the multiple phases of the data life-cycle.

As requirement management is critical to the enterprise life-cycle, data quality is the center of the data life-cycle. Inconsistent data in any phase of the data life-cycle is bad for business. Like TOGAF, this paper also proposes the input artifacts, output artifacts, and steps to be taken within each data life-cycle phase. As TOGAF tries to standardize the enterprise architecture process, this paper proposes a model to standardize the data life-cycle.


1.     Introduction

DQ has long been a pain point for business. It is not uncommon for an organization to spend time and resources building operational and historical reports, only to find out that the underlying data is not correct. Then they spend more time and resources figuring out what went wrong and why the data was not correct.

Invariably, the findings point to a new business rule, or a change in a business rule, which DQ did not capture. As business is dynamic, business rules are bound to change. DQ implementations that validate business rules are reactive rather than proactive, because DQ is done in silos and DQ engineers do not understand the data life-cycle.

Secondly, in today's world, almost all data is shared. There are multiple touch points where data can become corrupt. In some cases, data owners and stewards change along the data life cycle, which makes it much more difficult to manage transient data. So integrated DQ is needed, which lives along the data life cycle and makes it definitive.

Organizations operate in multiple time zones and various geographies. They have multiple vendors and deal with multiple currencies. Localization, globalization, and changes in the format of incoming data can also make DQ validations fail.

Organizations have been suffering from low confidence in their data. Though there are tools available in the market, like DataFlux from SAS, FirstLogic, the Open Source Data Profiler, and others, they are not integrated along the data life-cycle. They clean data in parts, and a big part of the data never comes into the tools' purview.

2.     Background

DQ started with address correction and removing duplicate customer records from Master Data or Reference Data. It saved organizations money by not sending duplicate emails, promotional offers, or returned postal mail. The world has changed now. Today data is exploding; business rules are changing and getting more complex; and data is being shared across domains and organizations (like social and Web 2.0 data). Now DQ has to become ubiquitous. It is required in all domains and all verticals.

Today, if a business rule changes, someone has to tell the DQ engineers to write new validations. If data comes from a new source, the DQ engineers need to be told about its format and business rules. Structural DQ alone is no good; it will not increase confidence in the data.

Let's take the example of 'netflow' data, which records incoming and outgoing bytes from interface devices. The netflow data collected from networking machines is structurally good (no nulls, no bad data).

But if the interface-machine or machine-property mapping has changed in OpsDB (the reference database), the netflow data is no good for reporting or analysis.

So structural DQ alone is not enough. Independently, both datasets (netflow and OpsDB) are good, but together they are incompatible.

Let's take another example. Once data is archived, its format changes because of storage optimization. So if you run DQ on archived data, or on the raw retrieved data (before processing it back into the older format), DQ will fail.

As mentioned earlier, in today's world data is shareable and has many touch points. Sometimes, during data processing, for ease of use or to integrate data with other systems, developers change the data format. Dirty data is injected into the good data.

Any integration point, routine, function or procedure, third-party software, or user can potentially corrupt the data.

So, to have a successful DQ implementation, one should understand the data life-cycle and the kind of quality validation that needs to be done in each phase.

3.     Proposed Data Life Cycle Model

Like TOGAF, this paper tries to propose a model for the data life-cycle. Each life-cycle phase defines its input artifacts (documents required for entering the phase), its output artifacts (documents produced in the phase – the exit criteria), the steps (processes that take the input and create the output) to be implemented during the phase, and the DQ checks to be validated inside the phase.

This model is agnostic of process, system, technology, and domain. The circles on the rim are the phases of the data life-cycle, and the circle in the center is the data quality process, which is driven through all phases of the life-cycle. The phases are:

1.) Data Definition Phase.  

2.) Data Capture Phase.

3.) Data Validation & Formatting Phase.

4.) Data Movement & Sharing Phase.

5.) Data Modeling & Storage Phase.

6.) Data Archival Phase.

7.) Data Retrieval Phase.

8.) Data Presentation Phase.


See the pictorial presentation below

Figure 1.0 – Data Life-Cycle Model


3.1. Data Definition Phase

Input Artifacts: The input artifacts to this phase would be Business requirement document, Data source contacts, Data retention policy and Class diagram.

Output Artifacts: Canonical data reference model (CDRM), Entity identification

Steps to be performed: Data owner and data steward identification, Mapping out business entities, Naming convention definition

Data Quality Implementation: Semantic validation (calling one entity by one name, one acronym for one set of words), Validating naming conventions.

Example: In the Infrastructure Engineering team, the words 'node', 'host', and 'machine' are used interchangeably. Some data sources capture this information in a host table while others keep it in a node table.

Semantic DQ validation will iron out these discrepancies and make the data less confusing.
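The "one entity, one name" rule can be sketched as a small synonym map. The node/host/machine terms come from the example above; the function name and the choice of 'host' as canonical are mine:

```python
# Synonym map for the 'one entity, one name' rule.
SYNONYMS = {"node": "host", "machine": "host", "host": "host"}

def canonical_entity(name):
    """Semantic validation: resolve a term to its canonical entity name,
    flagging unknown terms instead of passing them through silently."""
    key = name.strip().lower()
    if key not in SYNONYMS:
        raise ValueError(f"no canonical mapping for entity name: {name!r}")
    return SYNONYMS[key]
```

Raising on unknown terms is deliberate: an unmapped name is exactly the discrepancy this phase is meant to surface.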

3.2. Data Capture Phase

Input Artifacts: Business architecture (BA), Business process diagram (BPD), Use cases, Class diagram

Output Artifacts: Mapping of entity attributes, List of attributes to be captured and discarded, List of temporal and persistent attributes, Listing of origin of entities, attributes and transactions.

Steps to be performed: Mapping of class to entities, Mapping of actors to entities, Mapping of use cases to transaction, Define attributes of transaction and entities

Data Quality Implementation: Validate the entities (do not take old, stale, or discarded data), validate the state of temporal and persistent data (do not bring data that is in an invalid state – e.g., do not bring host/machine entities that have been 'retired' or made 'obsolete'), validate the data lineage (is the input source authentic or not – do not take data from a local copy or cache)

Example: Some projects in the Infrastructure Engineering team keep a local copy of relevant entities from OpsDB. They sync every day. But if for some reason there are frequent updates in OpsDB, these projects will be out of sync and their data-processing results will be prone to error.

3.3. Data Validation & Formatting Phase

Input Artifacts: Business rules information, Master contract (SLA information), and Sequence diagram

Output Artifacts: Listing of valid state of transactions, Input format, Output format, Data Type, Attribute Ranges.

Steps to be performed: Mapping of business rules to entity attributes, finding triggers for value changes, defining rules for new vs. modified data, defining rules for discarding data

Data Quality Implementation: Format analysis, Boundary analysis, Range analysis, Data type validation, Validation of modified data business rule (if modified data comes from different source or in different format or at different time)

Example: With a new set of IP addresses coming in, the old IP validation will start failing. The new set of IP addresses defines its own format and range. A new set of DQ rules should be written to validate the new IP addresses.
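A format/range check of this kind can be written once and re-parameterized when the address rules change. A sketch using Python's standard `ipaddress` module; treating IPv6 as the "new set of IP addresses" is my illustration, not a claim about the original system:

```python
import ipaddress

def valid_ip(value, allow_v6=True):
    """Format and range validation for an IP address field.
    Toggling allow_v6 models a rule change like the one described above."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return False
    return allow_v6 or addr.version == 4
```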

3.4. Data Movement and Sharing Phase

Input Artifacts: Business domain definition, Business department definition, Security policy, Data policy, Transactional failure points

Output Artifacts: Domain (Boundary) of Data, Holes in boundaries, List of shared attributes, Listing of slow changing attributes, Data migration plan, Data flow diagram(DFD), Data security implementation guide.

Steps to be performed: Finding of entities and their attributes shared across business domains, Find transactions running across domains, Merging and Aggregation of attributes, Migrating entities into central repository.

Data Quality Implementation: Validate slow changing attributes (SCD) and their type, Validate the attributes that are required for joining two or more entities, Validate Data integration business rules like inner join, outer join, cardinality, is-a, has-a or other relational rules, Validate security rules like data masking for attributes grouped for unique customer identification, credit card number sharing etc

Example: Typically, for any provisioning data, the 'Ticket_Id' data type is modeled as numeric. It helps with storage and indexing. But the Siebel ticket ID is alphanumeric. A join of any numeric Ticket_Id to the Siebel Ticket_Id will be an error.

3.5. Data Modeling and Storage Phase

Input Artifacts: Conceptual design, Transactional details, Migration plan, Sequence diagram

Output Artifacts: Logical design, Physical storage design, Data dictionary, ER diagram, Security implementation guide

Steps to be performed: Map Entity-Relationship, Find shared storage design (shared DB, shared schema, multi-tenancy etc), Technical storage design (table space, columnar storage etc), User level & row level security definition

Data Quality Implementation: Validating structural data quality such as null validation, de-duplication, date and time formats, geo-encoding, primary key–foreign key rules, and data massaging. Validate data aggregation business rules like sum, average, count, min, and max. Validate conditional business rules like: if the vendor is 'x', the model value should be 'model000, model101, model111 or model123'.

Example: For Gomez Bing latency reporting, we implement a set of structural data quality rules – null validation, range validation, and de-duplication for primary key identification. Since the input data comes from a third party (Gomez) and is structured, any structural DQ rejection signals a serious business problem. We reject the complete set of data if any data quality validation fails and notify the input data owner.
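The reject-the-whole-batch policy described above might be sketched like this; the key, field names, and ranges are invented for illustration:

```python
def validate_batch(rows, key, required, ranges):
    """Null, range, and duplicate-key checks over a batch. Any failure
    rejects the whole batch (empty accepted list) so the data owner
    can be notified, mirroring the policy described above."""
    errors = []
    seen = set()
    for i, row in enumerate(rows):
        if row[key] in seen:
            errors.append(f"row {i}: duplicate {key}={row[key]}")
        seen.add(row[key])
        for col in required:
            if row.get(col) in (None, ""):
                errors.append(f"row {i}: null {col}")
        for col, (lo, hi) in ranges.items():
            v = row.get(col)
            if v is not None and not (lo <= v <= hi):
                errors.append(f"row {i}: {col}={v} out of range")
    return ([], errors) if errors else (rows, [])
```

The all-or-nothing return makes the notify-the-owner step trivial: a non-empty error list means the batch never enters the warehouse.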

3.6. Data Archival Phase

Input Artifacts: Archival policy, Retrieval policy, SLA information, Third Party contract information

Output Artifacts: Archival retrieval process, Raw data format, Business ready data format

Steps to be performed: Tagging of data by date, machine and property, Implementation of rules and security of data retrieval, Defining storage format.

Data Quality Implementation: Validation of checksum for complete data retrieval, Tag validation for getting the right set of data, Input and Output format validation.

Example: We take frequent dumps of the MicroStrategy Reporting server's repository for archival. To check the DQ of archival data, we restore the archived data at frequent intervals as a process. Only after successful retrieval is the archival data tagged.
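The checksum validation listed above is straightforward with a standard digest. SHA-256 is my choice for the sketch; the original system may have used a different algorithm:

```python
import hashlib

def checksum(payload: bytes) -> str:
    """Digest recorded when the data is archived."""
    return hashlib.sha256(payload).hexdigest()

def verify_retrieval(payload: bytes, recorded: str) -> bool:
    """Tag archived data only after the retrieved bytes match the
    digest recorded at archival time."""
    return checksum(payload) == recorded
```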

3.7. Data Retrieval Phase

Input Artifacts: Data storage information, entities and attributes that need to be retrieved, business rule definitions, mapping of business data to raw data.

Output Artifacts: Data access layer definition, security definition, Result set definition

Steps to be performed: Implementation of persistent layer, Implementation of metadata definition layer

Data Quality Implementation: Data compatibility validation (matching data along the same dimension – do not add dollar values to rupee values even if both are sales metrics), validation of data caching and refreshing rules (when to bring new data and when to serve from cache), Access Control List (ACL) validation

Example: When we fetch system data with date and time information, we get data from different time zones. We convert all of it to universal time (UTC) before doing any meaningful time processing.
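The convert-to-UTC step can be sketched with the standard library. Rejecting naive timestamps (rather than guessing a zone) is my own defensive choice, not something the original text specifies:

```python
from datetime import datetime, timezone, timedelta

def to_utc(ts: datetime) -> datetime:
    """Convert a zone-aware timestamp to UTC before any time processing.
    Naive timestamps are rejected rather than silently assumed local."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: time zone unknown")
    return ts.astimezone(timezone.utc)
```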

3.8. Data Presentation Phase

Input Artifacts: Reporting requirement, Metric definition, Slice and dice information, Report loading frequency

Output Artifacts: Reports, Valid data set

Steps to be performed: Warehouse implementation, Data mart implementation

Data Quality Implementation: Internationalization and localization, standardization, and date, time, and number formatting

Example: Different locales have different week start dates, week end dates, and week-to-month mappings; i.e., some projects consider Sunday the week start date while others consider Monday. Some projects consider 5 days per week while others consider 7. Non-standardization of calendar dates may affect metrics that calculate an 'average'.

So we implemented a global calendar for AMD warehouse reports for standardization. All time-related dimensions come from this standard calendar.
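A global calendar boils down to one agreed answer for questions like "when does the week start". A sketch of that single rule; the parameter name is illustrative:

```python
from datetime import date, timedelta

def week_start(d: date, start_on_monday: bool = True) -> date:
    """Return the standardized first day of the week containing d, so
    weekly 'average' metrics agree across projects."""
    offset = d.weekday() if start_on_monday else (d.weekday() + 1) % 7
    return d - timedelta(days=offset)
```

With every report calling the same function, the Sunday-vs-Monday disagreement from the example above disappears.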

4.     Case Study

In the Y! Infrastructure Engineering team, we implement DQ in bits and pieces on an as-needed basis, but DQ as a process is missing. DQ should not be an afterthought; it should be integrated in the planning phase itself. Projects do not define or publish their data life cycle, so understanding data needs becomes difficult for data engineers.

In my previous organization, we faced a similar problem. Communicating with multiple teams and multiple stakeholders was time-consuming. Each was working on a part of the data, or on transient data, but the integrated approach was missing. Then we embarked upon a conceptual project to define the semantic layer and the DQ needs. The project created a collaborative, interactive platform (twiki-based) where data entities were created and their life cycles were described, along with DQ requirements, in plain English. The enterprise data architect was the owner of the twiki.

It was a simple twiki where the data architect had broadly followed the above data life-cycle model. He defined the top-level data entities that would be required during the project, and their life cycles. He also defined the possible touch points (based on the Data Flow Diagram) and the business rules for each phase. Data owners and DQ for each touch point were also listed. Data formats, volumetric information, and borderlines (where data changes) were also mentioned. Each data life-cycle phase was described along with its input artifacts, output artifacts, steps to be performed, and data quality implementation. Multiple projects had to consult this collaborative twiki, and they had to follow the data life cycle. Any issue or non-compliance was discussed with the enterprise data architect.

5.     Result

It was a successful exercise. It helped the team tremendously in reducing the time required to talk to the business/customer to understand DQ needs. Any new or changed business rule was captured in the twiki beforehand, and the required changes were made before the data became dirty. It shortened the turnaround cycle and also increased confidence in the data.

The data audit team found around a 90% improvement in the quality of data. Some minor issues were found, which were more human than process oriented. This was accepted as a best practice. All new projects had to define their data life cycle and broadcast it on the collaborative twiki. Any changes in business rules or data quality would also be captured there.

A data modeling team was formed, which looked into all issues and non-compliance. It was the approving body for any change in the twiki.

If we follow the data life cycle and treat DQ as a process or guideline, it will help us tremendously.

6.     Related work

It is not very difficult to define the structure of the data life-cycle in a machine-readable format like XML or JSON (see Appendix A – a sample host entity). Though it is not in the scope of this paper, we tried to define the data life cycle through a Data Reference Model (DRM) and XML. We were able to capture most of the DQ information and validations discussed in this paper.

If we create a common, updatable repository of the related data entities, this model can be extended to a data governance framework where business and technology can define, redefine, share, and implement business rules collaboratively, in a proactive mode, rather than waiting for the business to tell engineering what to do.

7.     Conclusions

This paper proposes a model that will help an organization build its DQ holistically, grounded in an understanding of the data life-cycle. It gives DQ engineering a bigger perspective on the data and on what is expected of DQ.

If we follow the data life cycle, understanding of the data increases, and that leads to better data quality management and confidence in the data. If the process is optimized, DQ becomes a proactive process rather than a reactive one. If we can specify the future needs of the business, or the possible forms and formats of data, DQ engineers can proactively implement the DQ validations and alerts that will save organizations from reactive development.

It is possible that some projects have unique DQ requirements, or that they may not follow the above eight phases of the data life cycle. This model is an attempt to provide a framework, a best-practice guideline, rather than a stringent, must-follow process.




1.) The Open Group Architecture Framework (TOGAF) [Module 3 – Introduction to the ADM]

2.) Open Source Data Quality [Aggregate Profiler 5.0]

3.) The DAMA Dictionary of Data Management, 1st Edition, USA, 2008

4.) Glossary for Business Intelligence and Data Warehousing


Appendix A

Machine readable definition of Data Entity

<?xml version=”1.0″ encoding=”UTF-8″ ?>

<!–  Purpose – This XML will work as template for creating XML to represent entities. Root element will be name of entity. Default namespace will be the domain in which entity will fall. Use third party reference domain elements wherever possible using the namespace.


<SEOHostEntity xmlns=”; xmlns:xsd=”;


xsi:schemaLocation=” EntityMetaDataTemplate.xsd”

xmlns:acl=”; xmlns:dns=””&gt;

<!–  Classification element will have metadata that will classify the entity representation. It will not have information related to entity instances.


<Classification VariationDependent=”False”>

<Author>Vivek Singh</Author>


<Approver>Infrastructural team</Approver>


<Domain>SEO/Data Architecture</Domain>



</ Classification >

<!-- The Storage element will define where entity instances are physically stored. Storage Type can be File System, RDBMS, or Mainframe. The Identifier can be a filename or an RDBMS name. If the entity is stored in a database, it is good practice to add the field name as an attribute element, to establish reverse traceability. -->

<Storage VariationDependent="True">
Data – Asset or Liability ? What Data policy can do

Well, you have heard "data is an asset" so many times that you have started believing it. If data were an asset, the richest companies would be those with the most data. But you know that is not the reality. In fact, some companies which produce huge amounts of data (from sensors, hardware, RFID, retail, supply chain) are struggling to stay profitable. I know what you are thinking – the Facebooks, the LinkedIns. I will cover those in the following paragraphs. Let's focus on 'only data' for some time.

Data per se is NOT an asset. In fact, the cost of storing all data is prohibitive; it may eat into your profit. It is a liability. The following image is an indication of how much a company will need to spend to store and retrieve data (only indicative).

Close to 60 million USD !!!

And this does not include the licensing cost of multiple BI software packages. To prove data is not a liability, companies have to show a tangible benefit of more than 60 million USD from the data alone (insights, predictions and what not, using only data). Very few companies will be able to show that benefit.
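The arithmetic behind a figure like this can be sketched as a back-of-envelope cost model. Every unit cost and volume below is an illustrative assumption, not a number from the original chart:

```python
# Illustrative back-of-envelope storage cost model.
# All volumes and unit costs below are assumptions for illustration only.

def storage_cost_usd(tb_per_day: float, years: int,
                     cost_per_tb_year: float) -> float:
    """Cumulative cost of keeping every day's data for `years` years.

    Data kept in year 1 is paid for in every subsequent year as well,
    so the total is a simple arithmetic series over yearly snapshots.
    """
    total = 0.0
    for year in range(1, years + 1):
        tb_retained = tb_per_day * 365 * year   # everything kept so far
        total += tb_retained * cost_per_tb_year
    return total

# Example: 10 TB/day, 10 years, 300 USD per TB-year (storage + retrieval + ops)
print(round(storage_cost_usd(10.0, 10, 300.0)))   # → 60225000
```

Under these assumed numbers the bill lands near the "close to 60 million USD" mark, which is the point: the cost grows with both volume and retention period.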

Apparently, web 2.0 kinds of companies do make a profit. You may be aware that for them, too, the biggest cost is data centers. They are able to engage people on their sites for longer periods and in turn make more advertising money. There is also a business model where people need their service – it is not data alone.

Let's take the example of two fictitious car companies which wanted to know how much of which car model will sell next season – so they can manage their supply chains accordingly.

Company A collected its sales data for 10 years, collected every bit of manufacturing data, bought census bureau data for demography, and bought income-related data to cover all variability and seasonality; it hired data analysts and data scientists, took 6 months to build a model, and predicted the outcome.

Company B created a simple website, asked all its 5000 sales representatives to predict, and aggregated the data to predict the outcome.

Any guess which will be more accurate?!
Readers can argue about which model will be more accurate, but one thing is clear: Company B will save a lot of money on analysis. I am not belittling the effort of data architects and data scientists (I am one of them 🙂 ). It is important to understand the cost of data. Data is a liability (cost) per se; to make it an asset, you have to define a monetization model which covers the cost of the data. Unfortunately, many Business Intelligence projects don't do this analysis and fall flat.
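Company B's approach can be sketched as a simple wisdom-of-the-crowd aggregation. The forecasts below are simulated purely for illustration:

```python
import random
import statistics

# Sketch of Company B's approach: aggregate many independent sales-rep
# forecasts instead of building a heavyweight model. All numbers here
# are simulated for illustration.

random.seed(42)
true_demand = 12000                       # units of a given car model
# Each rep's guess is noisy but roughly unbiased.
rep_forecasts = [true_demand + random.gauss(0, 2000) for _ in range(5000)]

crowd_estimate = statistics.mean(rep_forecasts)
print(round(crowd_estimate))              # close to 12000: the noise averages out
```

The individual guesses are poor (standard deviation of 2000 units), but averaging 5000 of them shrinks the error dramatically – which is why the cheap aggregation can compete with the expensive model.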

A "Data Policy" is important for any company to understand its data and hence the cost associated with it. I am posting a generic data policy below, but companies can tweak it as per their business models. Companies should know what is 'actionable' data, what is 'good to have' data and what is 'noise'.

Data Architecture Principles and Guidelines

Definitions:

Data: Distinct pieces of information, usually formatted in a special way. All software is divided into two general categories: data and programs. Programs are collections of instructions for manipulating data.

Data Owner: The entity that can authorize or deny access to certain data, and is responsible for its accuracy, integrity, and timeliness.

Data Steward: A person or organization delegated the responsibility for managing a specific set of data resources.

Domain: Defining a domain serves to:
• Establish a conceptual basis and bounds for more detailed Domain Analysis
• Determine whether planned development and evolution of the domain is viable relative to the organization's business objectives
• Establish criteria by which management and engineers can judge whether a proposed system is properly within the domain

Data Lifecycle: The phases through which data moves in the organization – how the organization collects, stores, processes and disseminates its key data.


i) Data is an asset that has value to the enterprise and is managed accordingly.
Data is a valuable corporate resource; it has real, measurable value.
The purpose of data is to aid decision-making. Accurate, timely data is critical to accurate, timely decisions.
Data is the foundation of corporate decision-making, so we must also carefully manage data to ensure that we know where it is, can rely upon its accuracy, and can obtain it when and where we need it.
The implication is that there is an education task to ensure that all organizations within the enterprise understand the relationship between the value of data, the sharing of data, and accessibility to data.
Owners must have the authority and means to manage the data for which they are accountable.

ii) Data owners are responsible for data integrity and distribution.
Data owners must be accountable for the effective and efficient management of data. The accuracy, currency and security of data are management concerns, best handled by data owners.
* The data owner can be a system, but the steward has to be a physical person.
Companies need to develop security procedures and standards which are consistent across the infrastructure, and need to establish procedures for data sharing. The data owner has to look into the audit trail periodically.
The data owner will make decisions about the hand-shake policy with data consumers (downstream systems), depending upon the nature (security, format, usage, control etc.) of the data.

iii) Domain-level data is commonly defined and accessible across domains.
Standards for common categories of data collected by a domain facilitate information exchange and minimize duplicate information and information systems. Domain-level data definition is important to all domains, and as such it needs to be available, accessible, consistent, and accurate. Common definitions reduce duplication, mismatching, misuse and misinterpretation of data, promote inter-domain cooperation, and facilitate data sharing. Standards for collecting and recording common data definitions can reduce acquisition costs and improve opportunities for maximum use of domain information.
Data which is classified as domain-level must be made available by the data owners across the infrastructure, taking into account appropriate security concerns. Data owners should have access to the common definition framework.

iv) Data quality and stewardship should be defined within the domain.
Data, products and information should be of sufficient quality to meet the requirements of the business and to support sound decision-making. People who take observations or produce data and information are stewards of these data, not owners. These data must be collected, produced, documented, transmitted and maintained with the accuracy, timeliness and reliability needed to meet the needs of all users.
Quality rules and the steward of business data should be defined. There might be a tool (framework) to look into data quality issues. It will be the steward's responsibility to look into the quality aspect of the data. You can look at my open source tool, Aggregate Profiler.

v) Classify data elements into the right security class and domain model.
Data must be categorized for easy management and better understanding. Categorizing data will help us find the right owner and steward.
We have to define different security levels and domain partitioning. The ontology will come from the business architecture.

vi) Data should have guaranteed integrity across the lifecycle.
If not, it will induce inconsistency into systems. Different information will flow into the system, interacting with different parts of the data lifecycle.
Data can not be changed except by its owner. Change management for data should be defined.

vii) Ensure metadata is in place.
Metadata enables search, storage, and consistency.
Data should be stored with metadata, and versions should be maintained. The repository should be capable of metadata search.
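The idea of storing data with its metadata, keeping versions, and supporting metadata search can be sketched as a toy repository. This is a dict-based illustration, not a real metadata store; all names are assumptions:

```python
# Toy metadata repository: every dataset is stored with a version
# history and a metadata record, and the repository supports a
# simple metadata search. Illustrative sketch only.

repository = {}

def store(name: str, data, **metadata):
    """Store a new version of `name` along with its metadata."""
    entry = repository.setdefault(name, {"versions": [], "metadata": {}})
    entry["versions"].append(data)                       # version history kept
    entry["metadata"].update(metadata, version=len(entry["versions"]))

def search(**criteria):
    """Metadata search: return dataset names matching all criteria."""
    return [name for name, e in repository.items()
            if all(e["metadata"].get(k) == v for k, v in criteria.items())]

store("student_grades", [("S1", "A")], owner="registrar", domain="education")
store("student_grades", [("S1", "A"), ("S2", "B")], owner="registrar", domain="education")
print(search(domain="education"))                        # ['student_grades']
print(repository["student_grades"]["metadata"]["version"])   # 2
```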

viii) Data architecture is requirements-driven.
It is essential that providers and users of data and products play an active role in defining the constantly evolving requirements that drive the development and evolution of data management systems. Every customer has different needs, and they do not have data in the same format.
Data management should be extendable and flexible. Data structures should be extendable and compatible with their usages. Users should define the usages of data, but the data owner should make an attempt to understand those usages.

ix) Data access should go through a common abstraction layer.
Access to data modification abilities (create, edit, delete) should be controlled by the business data access rules in the application or in another abstraction layer. The abstraction layer is under control and revision-checked; direct access is more free-form and can lead to inconsistent result sets.
A common data abstraction layer should be developed. Input will be taken from the application architecture to develop it.
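A minimal sketch of such an abstraction layer, with assumed roles and rules, might look like this:

```python
# Sketch of a common data-access abstraction layer: all create/edit/delete
# operations pass through one class that enforces the business access
# rules, instead of callers touching the store directly. The roles and
# rules are illustrative assumptions.

class DataAccessLayer:
    WRITE_ROLES = {"owner", "steward"}          # who may modify data

    def __init__(self):
        self._store = {}

    def read(self, key):                        # reads are open
        return self._store.get(key)

    def write(self, key, value, role):          # writes are rule-checked
        if role not in self.WRITE_ROLES:
            raise PermissionError(f"role '{role}' may not modify data")
        self._store[key] = value

dal = DataAccessLayer()
dal.write("pin_code", "560100", role="owner")   # allowed
print(dal.read("pin_code"))                     # 560100
```

Because the rules live in one revision-controlled place, changing them does not require hunting down every caller that touches the store.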

x) The data lifecycle should be captured.
The data lifecycle gives guidance for the storage, cataloguing and retrieval of historical data. Sometimes there are contractual obligations to store historical data and retrieve it.
The domain owner and business owner have to define the data life cycle.

1) When communicating with external parties, adopt XML as a standard – and adopt, as much as possible, one XML dialect only
2) Split data capture from data retrieval
3) Only store data together when it needs to be managed together (i.e. when you need to see it together, you can also do that in the application)
4) Store as much metadata as reasonably possible with the data
5) Use international standards, practices and frameworks (like SIF, QTI) wherever possible and relevant
6) Data should be captured once and validated at, or closest to, the source
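The capture-once, validate-at-source guideline can be sketched as follows; the field names and validation rules are illustrative assumptions:

```python
import re

# Sketch of "capture once, validate at the source": a record is checked
# at the point of entry, so no downstream system has to re-validate it.
# The field names and rules here are illustrative assumptions.

RULES = {
    "student_id": lambda v: bool(re.fullmatch(r"S\d{6}", v)),
    "pin_code":   lambda v: bool(re.fullmatch(r"\d{6}", v)),   # Indian PIN code
    "grade":      lambda v: v in {"A", "B", "C", "D", "E", "F"},
}

def capture(record: dict) -> dict:
    """Validate a record against RULES before it enters storage."""
    bad = [field for field in RULES if not RULES[field](record.get(field, ""))]
    if bad:
        raise ValueError(f"rejected at source, bad fields: {bad}")
    return record

capture({"student_id": "S123456", "pin_code": "560100", "grade": "B"})  # passes
```

Rejecting a bad record at the boundary is far cheaper than discovering it later, after it has propagated into reports and downstream systems.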

Summary: A comprehensive data policy will classify data based on its value to the business, save on the storage and retrieval of data, and help in navigating the bureaucratic labyrinth of data access and data quality (a major reason for delays and cost spikes in Business Intelligence projects) – and hence reduce the cost of data.

Education Engagement Analysis

Vivek K. Singh and Amish Choudhary
Open Source Evangelists

Data Quality is a proven technology for data enrichment and record linking. We are extending Data Quality and analytics techniques to delve into students' demographic and teachers' engagement data for the Indian education system. In this paper, we propose a process for collecting education data and performing data quality checks using the open source data quality tool (osDQ), also called "Aggregate Profiler". Record-linking techniques are used to define correlations for engagement and demographic analysis.

In the paper "Integrated Data Quality (DQ) along the Data Life Cycle" [1], the author proposed a model for the data life cycle consisting of eight phases. In this paper we apply those data life cycle phases to the Indian education system using the Open Source Data Quality and Profiling tool. The tool provides capabilities for data cleaning and data analysis to find correlations in the data. While the process allows various inferences to be derived from the data, we use it to correlate student grades with commute distance and with the schedule load of teachers.

Figure 1 Data Management Framework Process

2. Background
During our visits to various schools and our parent interviews, we felt the need to analyze students' demographic data and teachers' engagement data to see the correlation of student grades with the time spent commuting and with teachers' calendars. The Indian education system is one of the most cumbersome due to the sheer volume of data and the non-uniform standards for collecting and storing it. Hence we propose using the Data Management Framework to cleanse and analyze the data. The tool provides:
• Big data support – Hive Thrift server support
• Format creation, format matching (phone, date, string and number), format standardization
• Fuzzy-logic-based similarity check, cardinality check between tables and files
• Export and import from XML, XLS or CSV format, PDF export
• File analysis, regex search, standardization, DB search
• Complete DB scan, SQL interface, data dictionary, schema comparison
• Statistical analysis, reporting (dimension and measure based)
• Pattern matching, de-duplication, case matching, basket analysis, distribution chart
• Data generation and data masking features
• Metadata information, reverse engineering of data models
• Timeliness analysis, string length analysis
• MySQL, Oracle, Postgres, MS Access, DB2, SQL Server certified

3. Methodology
The data management framework defines eight phases, shown as the rim in Figure 1 (Data Management Framework Process).
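The eight phases named in sections 3.1–3.8 can be written down as a simple enumeration. This is a sketch for reference; the tool itself does not expose such an enum:

```python
from enum import Enum

# The eight data-lifecycle phases of the framework (sections 3.1-3.8).
class DataLifecyclePhase(Enum):
    DEFINITION = 1
    CAPTURE = 2
    VALIDATION_AND_FORMATTING = 3
    MOVEMENT_AND_SHARING = 4
    MODELING_AND_STORAGE = 5
    ARCHIVAL = 6
    RETRIEVAL = 7
    PRESENTATION = 8

for phase in DataLifecyclePhase:
    print(phase.value, phase.name.replace("_", " ").title())
```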

3.1 Data Definition Phase

We started by profiling the data. Figure 2 shows the profile of the data – record count, pattern information and distribution chart – produced with the Aggregate Profiler tool. We defined the data nomenclature and entity identification by doing semantic validation, using the tool's fuzzy-logic-based similarity check, standardization, and case-formatting features. The following entities were identified: Students, Teachers, Grade, Teacher's calendar, Time, Student home PIN code, School's PIN code.
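In spirit, the profiling step (record count, pattern information, distribution) can be sketched in a few lines. This is an illustration with made-up values, not the tool's output or API:

```python
from collections import Counter
import re

# Minimal illustration of data profiling: record count, pattern
# discovery, and a value distribution, similar in spirit to what the
# profiler screens show. The sample values are made up.

pin_codes = ["560100", "560 100", "5601O0", "560100", "110001"]

record_count = len(pin_codes)
# Abstract each value into a pattern: every digit -> 9, every letter -> A
patterns = Counter(re.sub(r"[A-Za-z]", "A",
                          re.sub(r"\d", "9", v)) for v in pin_codes)
distribution = Counter(pin_codes)

print("records:", record_count)
print("patterns:", dict(patterns))   # exposes the '560 100' and '5601O0' anomalies
print("distribution:", dict(distribution))
```

The pattern view is what makes profiling effective: the rogue space in "560 100" and the letter O typed for a zero in "5601O0" both stand out immediately as minority patterns.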

Figure 2 Data Profile

3.2 Data Capture Phase
During the data capture phase we used the Aggregate Profiler tool's MySQL JDBC connector to get the student data. Teacher calendar data was imported from a spreadsheet using the Import file option with the "First row column name" setting. From the captured data we identified the relevant entity attributes to be captured, and discarded the rest. Validation of the data sources was done to ensure the authenticity of the data.

3.3 Data Validation and Formatting Phase
Most of the data cleaning was done during this phase. We used the tool's database comparison feature. We also defined range rules to find outliers, such as implausibly large distances or students in hostels. Timeliness analysis was used to weed out old and stale data, such as outdated teachers' calendars.
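A range rule of the kind described can be sketched as follows; the thresholds and sample distances are assumed for illustration:

```python
# Sketch of a range rule for outlier detection: commute distances
# outside an expected band are flagged (e.g. a hostel student recorded
# as 0 km, or a data-entry error of 400 km). The thresholds and sample
# values are illustrative assumptions.

def range_rule(values, low, high):
    """Split values into (in_range, outliers) against a [low, high] band."""
    in_range = [v for v in values if low <= v <= high]
    outliers = [v for v in values if not (low <= v <= high)]
    return in_range, outliers

distances_km = [2.5, 4.0, 0.0, 7.2, 400.0, 12.5]
ok, flagged = range_rule(distances_km, 0.5, 50.0)
print(flagged)   # [0.0, 400.0]
```

The flagged records are not deleted automatically; they are routed for review, since a 0 km commute may be a legitimate hostel student rather than an error.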

Figure 3 Validation and Formatting Mechanisms

3.4 Data Movement and Sharing Phase
During this phase each department was treated as a separate domain. Data shared across department boundaries, such as student identification, was identified. Similarly, reference data attributes that can change were identified during this phase, and migration plans were put in place – for example, a student can change residence during or between sessions.
Record linking was done using the fuzzy-logic-based similarity check (Figure 3). We massaged the demographic data to make it uniform – for example, the locations "Electronic City" and "ECity" were mapped to the same location. Name correlation was done using the same logic – Carol and Carole were mapped to the same teacher.
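The fuzzy mapping can be illustrated with Python's difflib standing in for the tool's similarity check; the similarity threshold is an assumption chosen for this example:

```python
from difflib import SequenceMatcher

# Illustration of fuzzy record linking: a value whose similarity ratio
# to a canonical value exceeds a threshold is mapped onto that value.
# difflib stands in for the tool's fuzzy-logic check, and the threshold
# is an illustrative assumption.

def canonicalize(value, canonical_values, threshold=0.5):
    """Map `value` to its closest canonical form, if similar enough."""
    best = max(canonical_values,
               key=lambda c: SequenceMatcher(None, value.lower(), c.lower()).ratio())
    score = SequenceMatcher(None, value.lower(), best.lower()).ratio()
    return best if score >= threshold else value

print(canonicalize("ECity", ["Electronic City"]))   # Electronic City
print(canonicalize("Carole", ["Carol"]))            # Carol
```

In practice the threshold has to be tuned per field: names tolerate a high threshold, while heavily abbreviated location strings (like "ECity") need a looser one.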

3.5 Data Modeling and Storage Phase
After cleaning the data and applying the various filtering, validation and formatting rules, the relevant data pertaining to student and school demographics, student grades and teachers' schedule calendars was stored in the relevant demographic tables in a MySQL database.


Figure 4 Student Entity Data

3.6 Data Archival Phase
We did not archive any information.

3.7 Data Retrieval Phase
During this phase we provided a data access layer using Aggregate Profiler's JDBC connectors (Figure 5). The tool provides connectors to several RDBMSs and to Hadoop Hive. We stored our data in MySQL, so we used the MySQL connector. If a dataset is very large (in volume, variety and velocity), Hive can be used; for our dataset, the RDBMS was faster than big data storage.
Figure 5 Data Connectors

3.8 Data Presentation Phase
During this phase we did data analysis to determine correlations in the data. Figure 6 shows a bar chart of the average grade of students as a function of commuting distance.

Figure 6 Average Student Grade vs. Distance

4. Results
Applying the data management framework, we found a strong correlation between commuting distance and student grades (Figure 6). However, there was only a weak correlation between teachers' load and student grades (Figure 7).

Figure 7 Average Grades vs. Teaching Hours
5. Conclusion
The strong correlation between commuting distance and student grades can be explained by student fatigue and by the amount of time Indian parents get to teach their children at home.
However, for average grade vs. teaching hours spent by teachers, many other factors, such as the subject taught and the skill level of the teachers, also come into play.
The data management framework and the Aggregate Profiler data quality tool allow data quality to be applied across multiple domains for cleaning, analysing and presenting data. They can be used by education institutions to do meaningful analysis by focusing on subsets of data and identifying areas of improvement.

[1] Singh, Vivek K. 2012. Integrated Data Quality along the Data Life Cycle. SiliconIndia.
[2] Gibson, Neal and Talburt, John. 2010. Hive: Crowdsourcing education data. Journal of Computing Sciences in Colleges, Volume 25, Issue 5 (May 2010), 72-78.