Integrated Data Quality (DQ) along the Data Life Cycle

"Garbage in, garbage out" is a cliché in the Business Intelligence domain. Everyone understands the need for and importance of data quality (DQ), yet today DQ instills very little confidence in data and in the reports built on that data. The primary reason for this lack of confidence is that DQ is done in silos and does not validate the bigger business rules. Today, DQ is mostly structural analysis (null validation, range validation, format validation, de-duplication, geo-encoding, etc.) performed during the data storage phase, when the data lands in an RDBMS. This paper proposes a data life-cycle model, much as TOGAF (The Open Group Architecture Framework) proposes an architectural life-cycle, and describes how DQ should be integrated into every phase of the data life-cycle.

Just as requirements management is central to the enterprise life-cycle, data quality is the center of the data life-cycle; inconsistent data in any phase is bad for business. Like TOGAF, this paper also proposes the input artifacts, output artifacts, and steps to be taken within each data life-cycle phase. As TOGAF standardizes the enterprise architecture process, this paper proposes a model to standardize the data life-cycle.

 


1.     Introduction

DQ has long been a pain point for business. It is not uncommon for an organization to spend time and resources building operational and historical reports, only to find out the underlying data is not correct, and then to spend more time and resources figuring out what went wrong and why.

Invariably, the findings show that there was a new or changed business rule that DQ did not capture. Because business is dynamic, business rules are bound to change. DQ that validates business rules is reactive rather than proactive, because DQ is done in silos and DQ engineers do not understand the data life-cycle.

Secondly, in today's world almost all data is shared. There are multiple touch points where data can become corrupt. In some cases, data owners and stewards change along the data life cycle, which makes transient data much more difficult to manage. An integrated DQ is therefore needed, one that lives along the data life cycle and makes the data definitive.

Organizations operate across multiple time zones and geographies. They have multiple vendors and deal with multiple currencies. Localization, globalization, and changes in the format of incoming data can also make DQ validations fail.

Organizations have been suffering from low confidence in their data. Though tools are available in the market, such as DataFlux from SAS, FirstLogic, the Open Source Data Profiler, and others, they are not integrated along the data life-cycle. They clean data in parts, and a large portion of the data never comes into the tools' purview.

2.     Background

DQ started with address correction and removing duplicate customer records from Master Data or Reference Data. It saved organizations money by avoiding duplicate emails, promotional offers, and returned postal mail. The world has changed now. Today data is exploding; business rules are changing and getting more complex; and data (such as social and Web 2.0 data) is being shared across domains and organizations. DQ now has to become ubiquitous: it is required in all domains and all verticals.

Today, if a business rule changes, someone has to tell the DQ engineers to write new validations. If data comes from a new source, DQ engineers need to be told about its format and business rules. Structural DQ alone is of no good; it will not increase confidence in the data.

Let's take the example of 'netflow' data, which collects incoming and outgoing bytes from interface devices. The netflow data collected from networking machines is structurally good (no nulls, no bad data).

But if the interface-machine or machine-property mapping has changed in OpsDB (the reference database), the netflow data is no good for reporting or analysis.

So structural DQ alone is not enough. Independently, both data sets (netflow and OpsDB) are good, but together they are incompatible.
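To make this concrete, below is a minimal sketch of such a cross-source check in Python. The record layout, the reference mapping loaded as a dictionary, and the field names (interface, machine) are all illustrative assumptions, not the actual netflow or OpsDB schema.

# Sketch: cross-source DQ check between netflow records and an OpsDB
# reference mapping. Field names and values are illustrative assumptions.

netflow_records = [
    {"interface": "eth0.sw1", "machine": "host-101", "bytes_in": 5120, "bytes_out": 2048},
    {"interface": "eth1.sw2", "machine": "host-202", "bytes_in": 900, "bytes_out": 300},
]

# Current interface-to-machine mapping pulled from the reference database (OpsDB).
opsdb_interface_map = {
    "eth0.sw1": "host-101",
    "eth1.sw2": "host-777",  # mapping changed in OpsDB, so the netflow row is stale
}

def validate_against_reference(records, reference):
    """Accept only records whose mapping still agrees with the reference data."""
    good, bad = [], []
    for rec in records:
        if reference.get(rec["interface"]) == rec["machine"]:
            good.append(rec)
        else:
            bad.append(rec)
    return good, bad

good, bad = validate_against_reference(netflow_records, opsdb_interface_map)
print(f"{len(good)} consistent record(s), {len(bad)} rejected for stale mapping")

Structurally every row passes, yet one row is rejected because the reference mapping moved underneath it; this is exactly the kind of validation that purely structural DQ misses.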

Let's take another example. Once data is archived, its format changes because of storage optimization. So if you run DQ on archived data, or on raw retrieved data before it is processed back into its older format, the DQ checks will fail.

As mentioned earlier, in today's world data is shared and has many touch points. Sometimes, during data processing, developers change the data format for ease of use or to integrate with other systems, and dirty data is injected into the good data.

Any integration point, routine, function or procedure, third-party software, or user can potentially corrupt the data.

So, for a successful DQ implementation, one should understand the data life-cycle and what kind of quality validation needs to be done in each of its phases.

3.     Proposed Data Life Cycle Model

Like TOGAF, this paper proposes a model for the data life-cycle. Each life-cycle phase defines its input artifacts (documents required for entering the phase), its output artifacts (documents produced in the phase, its exit criteria), the steps (processes that take the inputs and create the outputs) to be implemented during the phase, and the DQ checks to be validated inside the phase.

This model is agnostic of process, system, technology, or domain. In Figure 1.0, the circles on the rim are the phases of the data life-cycle and the circle in the center is the data quality process, which runs through all phases of the life-cycle. The phases are:

1.) Data Definition Phase.  

2.) Data Capture Phase.

3.) Data Validation & Formatting Phase.

4.) Data Movement & Sharing Phase.

5.) Data Modeling & Storage Phase.

6.) Data Archival Phase.

7.) Data Retrieval Phase.

8.) Data Presentation Phase.

 

See the pictorial representation below.

Figure 1.0 – Data Life-Cycle Model


3.1. Data Definition Phase

Input Artifacts: Business requirements document, Data source contacts, Data retention policy, Class diagram

Output Artifacts: Canonical data reference model (CDRM), Entity identification

Steps to be performed: Data owner and data steward identification, Mapping out business entities, Naming convention definition

Data Quality Implementation: Semantic validation (calling one entity by one name, one acronym for one set of words), Validating naming conventions.

Example: In the Infrastructure Engineering team, the words 'node', 'host', and 'machine' are used interchangeably. Some data sources capture this information in a host table while others keep it in a node table.

Semantic DQ validation irons out these discrepancies and makes the data less confusing.
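As a small illustration, semantic validation can be driven by an alias map that resolves every synonym to one canonical entity name defined in the CDRM. The alias table below is a hypothetical sketch, not the team's actual dictionary.

# Sketch: semantic DQ validation that resolves aliases ('node', 'machine')
# to one canonical entity name ('host'). The alias map is an assumption.

CANONICAL_ALIASES = {
    "node": "host",
    "machine": "host",
    "machines": "host",
    "host": "host",
}

def canonical_entity(name):
    """Resolve an entity alias to its canonical name, or fail loudly if unknown."""
    key = name.strip().lower()
    if key not in CANONICAL_ALIASES:
        raise ValueError(f"Unknown entity name {name!r}; add it to the CDRM alias map")
    return CANONICAL_ALIASES[key]

# Two sources naming the same entity differently both resolve to 'host'.
print(canonical_entity("Node"), canonical_entity("machines"))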

3.2. Data Capture Phase

Input Artifacts: Business architecture (BA), Business process diagram (BPD), Use cases, Class diagram

Output Artifacts: Mapping of entity attributes, List of attributes to be captured and discarded, List of temporal and persistent attributes, Listing of origin of entities, attributes and transactions.

Steps to be performed: Mapping of class to entities, Mapping of actors to entities, Mapping of use cases to transaction, Define attributes of transaction and entities

Data Quality Implementation: Validate the entities (do not take old, stale, or discarded data), Validate the state of temporal and persistent data (do not bring in data that is in an invalid state, e.g. host/machine entities that have been 'retired' or marked 'obsolete'), Validate the data lineage (is the input source authentic? do not take data from a local copy or cache)

Example: Some projects in the Infrastructure Engineering team keep a local copy of relevant entities from OpsDB and sync it every day. But if OpsDB is updated more frequently for some reason, these projects fall out of sync and their data processing results become error-prone.
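A minimal sketch of these capture-phase checks is shown below. The entity fields (status, last_synced) and the one-day sync tolerance are assumptions made for illustration.

# Sketch: capture-phase DQ checks on entity state and data freshness.
from datetime import datetime, timedelta

INVALID_STATES = {"retired", "obsolete"}       # do not capture these entities
MAX_SYNC_LAG = timedelta(days=1)               # assumed freshness tolerance

def accept_entity(entity, now=None):
    """Reject entities in an invalid state or served from a stale local copy."""
    now = now or datetime.utcnow()
    if entity["status"].lower() in INVALID_STATES:
        return False, "invalid state"
    if now - entity["last_synced"] > MAX_SYNC_LAG:
        return False, "stale local copy; re-sync from OpsDB"
    return True, "ok"

host = {
    "name": "host-101",
    "status": "active",
    "last_synced": datetime.utcnow() - timedelta(days=3),
}
print(accept_entity(host))   # -> (False, 'stale local copy; re-sync from OpsDB')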

3.3. Data Validation & Formatting Phase

Input Artifacts: Business rules information, Master contract (SLA information), and Sequence diagram

Output Artifacts: Listing of valid state of transactions, Input format, Output format, Data Type, Attribute Ranges.

Steps to be performed: Mapping of business rules to Entity-Attribute, Find triggers for value changes, Define rules for new vs. modified data, Define rules for discarding data

Data Quality Implementation: Format analysis, Boundary analysis, Range analysis, Data type validation, Validation of modified-data business rules (when modified data comes from a different source, in a different format, or at a different time)

Example: When a new set of IP addresses starts arriving, the old IP validation will start failing. The new set of IP addresses defines its own format and range, so a new set of DQ rules should be written to validate them.
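The sketch below shows how such format and range rules can be kept data-driven, so that a new address family or range is added as configuration rather than new code. It uses the Python standard library's ipaddress module; the allowed networks are illustrative assumptions.

# Sketch: format and range validation for IP addresses.
import ipaddress

ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),      # assumed legacy IPv4 range
    ipaddress.ip_network("2001:db8::/32"),   # assumed newly introduced IPv6 range
]

def validate_ip(value):
    """Check both format (parsable address) and range (allowed network)."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return False, "bad format"
    if not any(addr in net for net in ALLOWED_NETWORKS):
        return False, "out of allowed range"
    return True, "ok"

for ip in ["10.1.2.3", "2001:db8::1", "192.168.5.5", "not-an-ip"]:
    print(ip, validate_ip(ip))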

3.4. Data Movement and Sharing Phase

Input Artifacts: Business domain definition, Business department definition, Security policy, Data policy, Transactional failure points

Output Artifacts: Domain (Boundary) of Data, Holes in boundaries, List of shared attributes, Listing of slow changing attributes, Data migration plan, Data flow diagram(DFD), Data security implementation guide.

Steps to be performed: Finding of entities and their attributes shared across business domains, Find transactions running across domains, Merging and Aggregation of attributes, Migrating entities into central repository.

Data Quality Implementation: Validate slowly changing attributes (SCD) and their type, Validate the attributes required for joining two or more entities, Validate data integration business rules such as inner join, outer join, cardinality, is-a, has-a, and other relational rules, Validate security rules such as data masking for attribute groups that uniquely identify a customer, credit card number sharing, etc.

Example: Typically, for provisioning data, the 'Ticket_Id' column is modeled as numeric, which helps with storage and indexing. But Siebel ticket IDs are alphanumeric, so joining a numeric Ticket_Id to a Siebel Ticket_Id will result in an error.
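A movement-phase DQ check can catch this before the join runs, by profiling the key columns on both sides. The column values below are hypothetical; the point is only to compare the inferred key types.

# Sketch: check that join keys on both sides are type-compatible before joining.
provisioning_ids = ["100234", "100872", "101005"]   # modeled as numeric
siebel_ids = ["1-ABC23", "1-XYZ99", "100872"]       # alphanumeric Siebel tickets

def key_profile(values):
    """Classify a key column as 'numeric' or 'alphanumeric'."""
    return "numeric" if all(v.isdigit() for v in values) else "alphanumeric"

left, right = key_profile(provisioning_ids), key_profile(siebel_ids)
if left != right:
    print(f"Join key mismatch: Ticket_Id is {left} on one side and {right} on the other; "
          "cast both sides to string before joining")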

3.5. Data Modeling and Storage Phase

Input Artifacts: Conceptual design, Transactional details, Migration plan, Sequence diagram

Output Artifacts: Logical design, Physical storage design, Data dictionary, ER diagram, Security implementation guide

Steps to be performed: Map Entity-Relationship, Find shared storage design (shared DB, shared schema, multi-tenancy etc), Technical storage design (table space, columnar storage etc), User level & row level security definition

Data Quality Implementation: Validate structural data quality such as null validation, de-duplication, date and time format, geo-encoding, primary key-foreign key rules, and data massaging. Validate data aggregation business rules such as sum, average, count, min, and max. Validate conditional business rules, e.g. if vendor is 'x', the model value should be 'model000', 'model101', 'model111', or 'model123'.

Example: For Gomez Bing latency reporting, we implement a set of structural data quality rules, such as null validation, range validation, and de-duplication for primary key identification. Since the input data comes from a third party (Gomez) and is structured, any structural DQ rejection indicates a serious business problem. We reject the complete data set if any data quality validation fails and notify the input data owner.
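The sketch below illustrates that policy: null, range, and de-duplication checks plus one conditional vendor/model rule, with the whole batch rejected if anything fails. Field names, thresholds, and the vendor rule are assumptions for illustration, not the actual Gomez feed schema.

# Sketch: storage-phase structural DQ with an all-or-nothing batch policy.
ALLOWED_MODELS = {"x": {"model000", "model101", "model111", "model123"}}

def validate_batch(rows):
    errors, seen_keys = [], set()
    for i, row in enumerate(rows):
        if row.get("latency_ms") is None:                    # null validation
            errors.append(f"row {i}: latency_ms is null")
        elif not 0 <= row["latency_ms"] <= 60000:            # range validation
            errors.append(f"row {i}: latency_ms out of range")
        if row["check_id"] in seen_keys:                     # de-duplication on primary key
            errors.append(f"row {i}: duplicate primary key {row['check_id']}")
        seen_keys.add(row["check_id"])
        allowed = ALLOWED_MODELS.get(row.get("vendor"))
        if allowed and row.get("model") not in allowed:      # conditional business rule
            errors.append(f"row {i}: model {row.get('model')} invalid for vendor")
    return errors

batch = [
    {"check_id": 1, "latency_ms": 120, "vendor": "x", "model": "model101"},
    {"check_id": 1, "latency_ms": None, "vendor": "x", "model": "model999"},
]
problems = validate_batch(batch)
print("REJECT batch and notify data owner:" if problems else "ACCEPT batch", problems)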

3.6. Data Archival Phase

Input Artifacts: Archival policy, Retrieval policy, SLA information, Third Party contract information

Output Artifacts: Archival retrieval process, Raw data format, Business ready data format

Steps to be performed: Tagging of data by date, machine and property, Implementation of rules and security of data retrieval, Defining storage format.

Data Quality Implementation: Validation of checksum for complete data retrieval, Tag validation for getting the right set of data, Input and Output format validation.

Example: We take frequent dumps of the MicroStrategy reporting server's repository for archival. To check the DQ of archived data, we restore the archived data at regular intervals as a process. Only after a successful restore is the archived data tagged.
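A checksum-based version of this validation might look like the sketch below: a digest is recorded when the dump is archived, verified on restore, and the archive is tagged only after verification succeeds. The file layout and manifest format are assumptions.

# Sketch: archival-phase DQ via checksum verification before tagging.
import hashlib
import json
import pathlib

def sha256_of(path):
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def archive(dump_path, manifest_path):
    """Record the dump's checksum alongside the archive."""
    manifest = {"file": str(dump_path), "sha256": sha256_of(dump_path)}
    pathlib.Path(manifest_path).write_text(json.dumps(manifest))

def verify_and_tag(dump_path, manifest_path, tag="verified"):
    """Re-hash the restored dump; tag the archive only if the checksum matches."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    if sha256_of(dump_path) != manifest["sha256"]:
        raise ValueError("Checksum mismatch: archive is incomplete or corrupt")
    manifest["tag"] = tag
    pathlib.Path(manifest_path).write_text(json.dumps(manifest))
    return manifest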

3.7. Data Retrieval Phase

Input Artifacts: Data storage information, Entities and attributes that need to be retrieved, Business rule definition, Mapping of business data to raw data

Output Artifacts: Data access layer definition, security definition, Result set definition

Steps to be performed: Implementation of persistent layer, Implementation of metadata definition layer

Data Quality Implementation: Data compatibility validation (match data along the same dimension; do not add dollar values to rupee values even if both belong to a sales metric), Validation of data caching and refreshing rules (when to fetch new data and when to serve from cache), Access Control List (ACL) validation

Example: When we fetch system data containing date and time information, we get data from different time zones. We convert all of it to UTC before doing any meaningful time processing.
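A minimal version of that normalization, using the standard library's zoneinfo module (Python 3.9+), is sketched below; the source time zones and timestamp format are assumptions.

# Sketch: retrieval-phase normalization of timestamps to UTC before processing.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(local_timestamp, source_zone):
    """Attach the source time zone to a naive timestamp and convert it to UTC."""
    naive = datetime.strptime(local_timestamp, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=ZoneInfo(source_zone)).astimezone(timezone.utc)

rows = [
    ("2011-03-01 09:30:00", "America/Los_Angeles"),
    ("2011-03-01 22:45:00", "Asia/Kolkata"),
]
for ts, zone in rows:
    print(ts, zone, "->", to_utc(ts, zone).isoformat())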

3.8. Data Presentation Phase

Input Artifacts: Reporting requirement, Metric definition, Slice and dice information, Report loading frequency

Output Artifacts: Reports, Valid data set

Steps to be performed: Warehouse implementation, Data mart implementation

Data Quality Implementation: Internationalization and localization, Standardization, Date, time, and number formatting

Example: Different locales have different week start dates, week end dates, and week-to-month mappings; some projects consider Sunday the start of the week while others consider Monday, and some projects use a 5-day week while others use a 7-day week. A non-standardized calendar can distort metrics that calculate averages.

So we implemented a global calendar for the AMD warehouse reports for standardization. All time-related dimensions come from this standard calendar.
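The sketch below shows the idea of such a global calendar dimension: one generator fixes the week start and week length, and every report joins to it so that 'weekly average' means the same thing everywhere. The Monday start and 7-day week are illustrative policy assumptions, not the AMD calendar itself.

# Sketch: a standardized calendar dimension for presentation-phase reports.
from datetime import date, timedelta

def build_calendar(start, end, week_start=0):   # 0 = Monday (assumed policy)
    """Yield one row per day with a standardized week and month mapping."""
    day = start
    while day <= end:
        days_from_week_start = (day.weekday() - week_start) % 7
        week_of = day - timedelta(days=days_from_week_start)
        yield {
            "date": day,
            "week_start": week_of,
            "iso_week": day.isocalendar()[1],
            "month": day.month,
            "year": day.year,
        }
        day += timedelta(days=1)

for row in build_calendar(date(2011, 1, 1), date(2011, 1, 4)):
    print(row)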

4.     Case Study

In the Y! Infrastructure Engineering team, we implement DQ in bits and pieces, on an as-needed basis, but DQ as a process is missing. DQ should not be an afterthought; it should be integrated in the planning phase itself. Projects do not define or publish their data life cycle, so understanding data needs becomes difficult for data engineers.

In my previous organization, we faced a similar problem. Communicating with multiple teams and multiple stakeholders was time-consuming. Each was working on part of the data or on transient data, but the integrated approach was missing. We then embarked on a conceptual project to define the semantic layer and the DQ needs. The project created a collaborative, interactive platform (TWiki based) where data entities were created and their life cycle was described, along with the DQ requirements, in plain English. The enterprise data architect was the owner of the TWiki.

It was a simple TWiki in which the data architect broadly followed the above data life-cycle model. He defined the top-level data entities that would be required during the project and their life cycles. He also defined the possible touch points (based on the Data Flow Diagram) and the business rules for each phase. The data owners and DQ checks for each touch point were listed, as were data formats, volumetric information, and border lines (where data changes). Each data life-cycle phase was described along with its Input Artifacts, Output Artifacts, Steps to be performed, and Data Quality Implementation. Multiple projects had to consult this collaborative TWiki and follow the data life cycle. Any issue or non-compliance was discussed with the enterprise data architect.

5.     Result

It was a successful exercise. It helped the team tremendously by reducing the time required to talk to the business/customer to understand DQ needs. Any new or changed business rules were captured in the TWiki beforehand, and the required changes were made before the data became dirty. It shortened the turnaround cycle and also increased confidence in the data.

The data audit team found around a 90% improvement in the quality of data. Some minor issues were found, which were more human than process oriented. The approach was accepted as a best practice: all new projects had to define their data life cycle and broadcast it on the collaborative TWiki, and any changes in business rules or data quality were also captured there.

A data modeling team was formed to look into all issues and non-compliance. It was the approving body for any change to the TWiki.

Following the data life cycle and DQ as a process or guideline will help us tremendously.

6.     Related work

It is not very difficult to define the structure of the data life-cycle in a machine-readable format such as XML or JSON (see Appendix A for a sample host entity). Though it is outside the scope of this paper, we tried to define the data life cycle through a Data Reference Model (DRM) and XML, and were able to capture most of the DQ information and validations discussed in this paper.

If we create a common, updatable repository of the related data entities, this model can be extended into a data governance framework in which business and technology can define, redefine, share, and implement business rules collaboratively and proactively, rather than waiting for the business to tell engineering what to do.

7.     Conclusions

This paper proposes a model that helps an organization build its DQ holistically, grounded in an understanding of the data life-cycle. It gives DQ engineers a bigger perspective on the data and on what is expected from DQ.

If we follow the data life cycle, understanding of the data increases, which leads to better data quality management and greater confidence in the data. If the process is optimized, DQ becomes proactive rather than reactive. If future business needs, or the possible forms and formats of the data, can be stated in advance, DQ engineers will proactively implement the DQ validations and alerts, saving organizations from reactive development.

Some projects may have unique DQ requirements, or they may not follow all eight phases of the data life cycle described above. This model is an attempt to provide a framework, a best-practice guideline, rather than a stringent, must-follow process.

 

References

 

1.) http://www.togaf.org – The Open Group Architecture Framework [Model 3 – Introduction to ADM]

2.) http://sourceforge.net/projects/dataquality/ – Open Source Data Quality [Aggregate Profiler 5.0]

3.) The DAMA Dictionary of Data Management, 1st Edition, USA, 2008, www.dama.org/i4a/pages/index.cfm?pageid=3345

4.) http://www.sdgcomputing.com/glossary.htm – Glossary for Business Intelligence and Data Warehousing

 

Appendix A

Machine-readable definition of a Data Entity

<?xml version="1.0" encoding="UTF-8"?>

<!-- Purpose: this XML works as a template for creating XML representations of entities.
     The root element is the name of the entity. The default namespace is the domain in
     which the entity falls. Use third-party reference domain elements wherever possible,
     via the namespace. -->

<SEOHostEntity xmlns="http://seo.corp.yahoo.com/ea/host"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://seo.corp.yahoo.com/ea/xsd EntityMetaDataTemplate.xsd"
               xmlns:acl="http://seo.corp.yahoo.com/ea/acl"
               xmlns:dns="http://seo.corp.yahoo.com/ea/dns">

  <!-- The Classification element holds metadata that classifies the entity representation.
       It does not hold information related to entity instances. -->

  <Classification VariationDependent="False">
    <Author>Vivek Singh</Author>
    <Version>1.0</Version>
    <Approver>Infrastructural team</Approver>
    <Owner>SEO</Owner>
    <Domain>SEO/Data Architecture</Domain>
    <Application>Generic</Application>
    <EntityName>Host</EntityName>
  </Classification>

  <!-- The Storage element defines where entity instances are physically stored.
       StorageType can be File System, RDBMS, or MainFrame. Identifier can be a filename
       or an RDBMS name. If the entity is stored in a database, it is good practice to
       add the field name as an attribute element to establish reverse traceability. -->

  <Storage VariationDependent="True">
    <StorageInstances>Multiple</StorageInstances>

    <StorageID>0</StorageID>
    <StorageForInstance>Generic</StorageForInstance>
    <StorageType>RDBMS</StorageType>
    <Identifier>OpsDB</Identifier>

    <StorageID>1</StorageID>
    <StorageForInstance>Generic</StorageForInstance>
    <StorageType>RDBMS</StorageType>
    <Identifier>networkDB</Identifier>

    <StorageID>2</StorageID>
    <StorageForInstance>Generic</StorageForInstance>
    <StorageType>RDBMS</StorageType>
    <Identifier>roleDB</Identifier>
  </Storage>

</SEOHostEntity>
