You have heard “data is an asset” so many times that you have started believing it. If data were an asset, the richest companies would be those with the most data. But you know that is not the reality. In fact, some companies that produce huge volumes of data (from sensors, hardware, RFID, retail, supply chain) are struggling to stay profitable. I know what you are thinking: the Facebooks, the LinkedIns. I will cover those in the following paragraphs. Let’s focus on ‘only data’ for some time.
Data per se is NOT an asset. In fact, the cost of storing all data can be prohibitive. It may eat into your profit. It is a liability. The following image indicates how much a company will need to spend to store and retrieve data (indicative only).
Close to 60 million USD!
And this does not include the licensing cost of multiple BI software packages. To prove that data is not a liability, companies have to show a tangible benefit of more than 60 million USD from the data alone (insights, predictions and what not, using only data). Very few companies will be able to show that benefit.
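The break-even argument above can be written down as a small calculation. This is only a sketch: the 60 million USD figure is the indicative cost from the text, while the benefit and licensing numbers and the function names are hypothetical, chosen for illustration.

```python
def net_value_of_data(benefit_usd: float, data_cost_usd: float,
                      licensing_cost_usd: float = 0.0) -> float:
    """Benefit attributed to data alone, minus everything spent on that data."""
    return benefit_usd - (data_cost_usd + licensing_cost_usd)


def is_liability(benefit_usd: float, data_cost_usd: float,
                 licensing_cost_usd: float = 0.0) -> bool:
    """Data stays a liability until its benefit exceeds its total cost."""
    return net_value_of_data(benefit_usd, data_cost_usd, licensing_cost_usd) <= 0


# Indicative storage and retrieval cost quoted in the text.
DATA_COST_USD = 60_000_000

# Hypothetical benefit and licensing figures, for illustration only.
print(is_liability(benefit_usd=45_000_000, data_cost_usd=DATA_COST_USD))
print(is_liability(benefit_usd=75_000_000, data_cost_usd=DATA_COST_USD,
                   licensing_cost_usd=5_000_000))
```

Adding the licensing cost to the same function keeps the hurdle honest: the benefit has to clear storage, retrieval and tooling together.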
Admittedly, Web 2.0 companies do make a profit. You may be aware that for them, too, the biggest cost is data centers. They are able to keep people engaged on their sites for longer periods and in turn make more advertising money. There is also a business model in which people need their service; it is not data alone.
Let’s take the example of two fictitious car companies that wanted to know how many cars of each model would sell next season, so that they could manage their supply chains accordingly.
Company A collected its sales data for 10 years, collected every bit of manufacturing data, bought census bureau data for demographics and income-related data to cover all variability and seasonality, hired data analysts and data scientists, took 6 months to build a model, and predicted the outcome.
Company B created a simple website, asked all of its 5000 sales representatives to predict, and then aggregated their answers to predict the outcome.
Any guess which will be more accurate?
Readers can argue about which model will be more accurate, but one thing is clear: company B will save a lot of money on the analysis. I am not belittling the effort of data architects and data scientists (I am one of them 🙂). It is important to understand the cost of data. Data per se is a liability (a cost), and to make it an asset you have to define a monetization model that covers the cost of the data. Unfortunately, many Business Intelligence projects skip this analysis and fall flat.
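Company B's approach is cheap partly because the aggregation step is tiny. Here is a minimal sketch of it. The text does not say how the answers are aggregated, so taking the median per car model is my assumption, and the record layout (rep id, model, predicted units) and sample numbers are hypothetical.

```python
from collections import defaultdict
from statistics import median


def aggregate_forecasts(submissions):
    """Combine per-representative predictions into one forecast per car model.

    submissions: iterable of (rep_id, model, predicted_units) tuples.
    The median is used so that a few wild guesses do not dominate.
    """
    by_model = defaultdict(list)
    for _rep_id, model, units in submissions:
        by_model[model].append(units)
    return {model: median(values) for model, values in by_model.items()}


# Hypothetical submissions from three sales representatives.
submissions = [
    ("rep-1", "sedan", 120), ("rep-2", "sedan", 100), ("rep-3", "sedan", 400),
    ("rep-1", "suv", 80), ("rep-2", "suv", 90),
]
forecast = aggregate_forecasts(submissions)
print(forecast)
```

A mean or a weighted average (for example, weighting each representative by past accuracy) would be an equally valid choice; the point is that the whole pipeline fits in a few lines.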
A “Data Policy” is important for any company to understand its data and hence the cost associated with it. I am posting a generic data policy, but companies can tweak it to fit their business models. Companies should know what is ‘actionable’ data, what is ‘good to have’ data and what is ‘noise’.
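To make the three classes concrete, here is a rough sketch of a dataset inventory that is classified and then costed per class. The classification rule (actionable if the data feeds at least one business decision, good to have if users merely ask for it, noise otherwise), the field names and the figures are all assumptions for illustration; each company will define its own rule.

```python
from dataclasses import dataclass
from enum import Enum


class DataClass(Enum):
    ACTIONABLE = "actionable"
    GOOD_TO_HAVE = "good to have"
    NOISE = "noise"


@dataclass
class Dataset:
    name: str
    yearly_cost_usd: float
    decisions_supported: int   # business decisions that actually use this data
    requested_by_users: bool   # someone asked for it, but no decision uses it yet


def classify(ds: Dataset) -> DataClass:
    """Hypothetical rule: the value to the business decides the class."""
    if ds.decisions_supported > 0:
        return DataClass.ACTIONABLE
    if ds.requested_by_users:
        return DataClass.GOOD_TO_HAVE
    return DataClass.NOISE


def cost_by_class(datasets):
    """Total yearly cost per class, to show where the storage money goes."""
    totals = {c: 0.0 for c in DataClass}
    for ds in datasets:
        totals[classify(ds)] += ds.yearly_cost_usd
    return totals


# Hypothetical inventory.
inventory = [
    Dataset("dealer_sales", 200_000, decisions_supported=3, requested_by_users=True),
    Dataset("web_clickstream", 900_000, decisions_supported=0, requested_by_users=True),
    Dataset("debug_sensor_dump", 1_500_000, decisions_supported=0, requested_by_users=False),
]
totals = cost_by_class(inventory)
for data_class, cost in totals.items():
    print(f"{data_class.value}: {cost:,.0f} USD per year")
```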
Data Architecture Principles and Guidelines
Data: Distinct pieces of information usually formatted in a special way. All software is divided into two general categories: data and programs. Programs are collections of instructions for manipulating data.
Data Owner: An entity that can authorize or deny access to certain data, and is responsible for its accuracy, integrity, and timeliness.
Data Steward: A person or organization delegated the responsibility for managing a specific set of data resources.
• Establish a conceptual basis and bounds for more detailed Domain Analysis
• Determine whether planned development and evolution of the domain is viable relative to the organization’s business objectives
• Establish criteria by which management and engineers can judge whether a proposed system is properly within the domain
The Data Life Cycle is the set of phases through which data moves in the organization. The phases cover how the organization collects, stores, processes and disseminates its key data.
i) Data is an asset that has value to the enterprise and is managed accordingly.
Data is a valuable corporate resource; it has real, measurable value.
The purpose of data is to aid decision-making. Accurate, timely data is critical to accurate, timely decisions.
Data is the foundation of corporate decision-making, so we must also carefully manage data to ensure that we know where it is, can rely upon its accuracy, and can obtain it when and where we need it.
The implication is that there is an education task to ensure that all organizations within the enterprise understand the relationship between the value of data, the sharing of data, and the accessibility of data.
Owners must have the authority and means to manage the data for which they are accountable.
ii) Data owners are responsible for data integrity and distribution.
Data owners must be accountable for the effective and efficient management of data. The accuracy, concurrency and security of data are management concerns, best handled by data owners.
* A data owner can be a system, but a steward has to be a physical person.
Companies need to develop security procedures and standards that are consistent across the infrastructure. Companies need to establish procedures for data sharing. The data owner has to review the audit trail periodically.
The data owner will decide on the hand-shake policy with data consumers (downstream systems). The owner will make that decision depending on the nature of the data (security, format, usage, control, etc.).
iii) Domain-level data is commonly defined and accessible across Domains.
Standards for common categories of data collected by domain facilitate information exchange and minimize duplicate information or information systems. Domain-level data definition is important to all Domains and as such needs to be available, accessible, consistent, and accurate. Common definition reduces duplication, mismatching, misuse and misinterpretation of data, promotes inter-domain cooperation, and facilitates data sharing. Standards for collecting and recording common data definitions can reduce acquisition costs and improve opportunities for maximum use of Domain information.
Data that is classified as Domain-level must be made available by the data owners across the infrastructure, taking into account appropriate security concerns. Data owners should have access to a common definition framework.
iv) Data Quality & Stewardship should be defined within Domain.
Data, products and information should be of quality sufficient to meet the requirements of business and to support sound decision making. People who take observations or produce data and information are stewards of these data, not owners. These data must be collected, produced, documented, transmitted and maintained with the accuracy, timeliness and reliability needed to meet the needs of all users.
Quality rules and a steward should be defined for business data. There might be a tool (framework) to look into data quality issues. It will be the steward’s responsibility to look into the quality aspects of the data. You can look at my open source tool at http://sourceforge.net/projects/dataquality/
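As an illustration of what “quality rules” can look like, here is a minimal sketch in which each rule is a named check that a steward can run against a record. This is not the API of the tool linked above; the rule names and record fields are hypothetical.

```python
from typing import Callable, Dict, List

Rule = Callable[[dict], bool]

# Hypothetical rules a steward might define for a sales record.
QUALITY_RULES: Dict[str, Rule] = {
    "customer_id is present": lambda r: bool(r.get("customer_id")),
    "amount is a non-negative number": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ),
}


def failed_rules(record: dict, rules: Dict[str, Rule] = QUALITY_RULES) -> List[str]:
    """Return the names of all rules the record violates (empty list = clean)."""
    return [name for name, rule in rules.items() if not rule(record)]


good = {"customer_id": "C-001", "amount": 250.0}
bad = {"customer_id": "", "amount": -5}

print(failed_rules(good))
print(failed_rules(bad))
```

Keeping the rules as data, rather than burying them in application code, gives the steward one visible place to review and change them.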
v) Classify data elements into the right security class and domain model.
Data must be categorized for easy management and better understanding. Categorization of data will help us find the right owner and steward.
We have to define different security levels and domain partitioning. The ontology will come from the business architecture.
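A small sketch of such a classification: each data element is catalogued with its domain, security class, owner and steward, so the right people can be found from the catalog. The three security levels, the domains and all names below are hypothetical; a real catalog would take them from the business architecture.

```python
from dataclasses import dataclass
from enum import IntEnum


class SecurityClass(IntEnum):
    """Hypothetical security levels; a higher value is more restricted."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3


@dataclass(frozen=True)
class DataElement:
    name: str
    domain: str
    security_class: SecurityClass
    owner: str     # may be a system
    steward: str   # a person


CATALOG = {
    e.name: e
    for e in [
        DataElement("list_price", "sales", SecurityClass.PUBLIC,
                    owner="pricing-system", steward="A. Rao"),
        DataElement("customer_income", "sales", SecurityClass.CONFIDENTIAL,
                    owner="crm-system", steward="B. Chen"),
    ]
}


def can_read(clearance: SecurityClass, element_name: str) -> bool:
    """A reader needs a clearance at least as high as the element's class."""
    return clearance >= CATALOG[element_name].security_class


def elements_in_domain(domain: str):
    """Names of all catalogued elements that belong to one domain."""
    return sorted(e.name for e in CATALOG.values() if e.domain == domain)


print(can_read(SecurityClass.INTERNAL, "list_price"))
print(can_read(SecurityClass.INTERNAL, "customer_income"))
print(CATALOG["customer_income"].steward)
```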
vi) Data should have guaranteed integrity across the Lifecycle.
Without it, inconsistency creeps into systems: different information will flow into a system as it interacts with different parts of the data lifecycle.
Nobody except the owner can change data. Change management for data should be defined.
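One way to picture the owner-only rule is a record that rejects updates from anyone but its owner and logs every accepted change. This is a sketch under assumptions: the class name, the in-memory change log and the actor names are hypothetical, and a real system would enforce this in the database or access layer.

```python
from datetime import datetime, timezone


class OwnedRecord:
    """A value that only its owner may change; every change is logged."""

    def __init__(self, owner: str, value):
        self.owner = owner
        self.value = value
        self.change_log = []  # entries: (timestamp, actor, old_value, new_value)

    def update(self, actor: str, new_value) -> None:
        if actor != self.owner:
            raise PermissionError(f"{actor} is not the owner of this data")
        self.change_log.append(
            (datetime.now(timezone.utc), actor, self.value, new_value)
        )
        self.value = new_value


record = OwnedRecord(owner="pricing-system", value=100)
record.update("pricing-system", 110)   # accepted and logged

try:
    record.update("reporting-tool", 0)  # a downstream consumer tries to write
    rejected = None
except PermissionError as exc:
    rejected = str(exc)

print(record.value, len(record.change_log), rejected)
```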
vii) Ensure metadata is in place.
Metadata supports search, storage and consistency.
Data should be stored with its metadata, and versions should be maintained. The repository should be capable of metadata search.
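The three requirements (metadata stored with the data, versions kept, metadata searchable) can be sketched with an in-memory repository. The class, its method names and the sample metadata keys are hypothetical; a production repository would sit on a database or a catalog product.

```python
class MetadataRepository:
    """Stores every version of a data item together with its metadata."""

    def __init__(self):
        self._versions = {}  # key -> list of (data, metadata), oldest first

    def put(self, key, data, metadata):
        """Store a new version and return its 1-based version number."""
        versions = self._versions.setdefault(key, [])
        versions.append((data, dict(metadata)))
        return len(versions)

    def get(self, key, version=None):
        """Return (data, metadata) for a version; the latest if none is given."""
        versions = self._versions[key]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{key} has no version {version}")
        return versions[version - 1]

    def search(self, **criteria):
        """Keys whose latest metadata matches every given key=value pair."""
        return sorted(
            key for key, versions in self._versions.items()
            if all(versions[-1][1].get(k) == v for k, v in criteria.items())
        )


repo = MetadataRepository()
repo.put("sales_2023", [1, 2, 3], {"domain": "sales", "format": "csv"})
latest = repo.put("sales_2023", [1, 2, 3, 4], {"domain": "sales", "format": "parquet"})
repo.put("hr_roster", [], {"domain": "hr", "format": "csv"})

print(latest)
print(repo.search(domain="sales"))
print(repo.get("sales_2023", version=1)[1]["format"])
```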
viii) Data Architecture is Requirements-driven.
It is essential that providers and users of data and products play an active role in defining the constantly evolving requirements that drive the development and evolution of data management systems. Every customer has different needs, and customers do not all hold data in the same format.
Data management should be extensible and flexible. Data structures should be extensible and compatible with how the data is used. Users should define how they use the data, but the data owner should also make an effort to understand those usages.
ix) Data Access should be from common abstraction layer.
The ability to modify data (create, edit, delete) should be controlled by the business data access rules in the application or another abstraction layer. The abstraction layer is controlled and revision checked; direct access is more free-form and can lead to inconsistent result sets.
A common data abstraction layer should be developed. Input from the application architecture will be used to develop it.
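A rough sketch of the idea: callers never touch the underlying store, and every create, edit and delete passes through the same business rules. The class name, the customer example and the two rules shown are hypothetical, and a plain dictionary stands in for the real database.

```python
class CustomerDataAccess:
    """Hypothetical abstraction layer: all reads and writes go through here."""

    def __init__(self):
        self._rows = {}  # stands in for the real database table

    @staticmethod
    def _clean_name(name: str) -> str:
        name = name.strip()
        if not name:
            raise ValueError("name must not be empty")
        return name

    def create(self, customer_id: str, name: str) -> None:
        if customer_id in self._rows:
            raise ValueError(f"customer {customer_id} already exists")
        self._rows[customer_id] = {"name": self._clean_name(name)}

    def edit(self, customer_id: str, name: str) -> None:
        if customer_id not in self._rows:
            raise KeyError(customer_id)
        self._rows[customer_id]["name"] = self._clean_name(name)

    def delete(self, customer_id: str) -> None:
        del self._rows[customer_id]

    def read(self, customer_id: str) -> dict:
        # Return a copy so callers cannot modify the store behind our back.
        return dict(self._rows[customer_id])


customers = CustomerDataAccess()
customers.create("C-001", "  Acme Motors ")
customers.edit("C-001", "Acme Motors Ltd")

try:
    customers.create("C-001", "Duplicate")
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

print(customers.read("C-001"), duplicate_error)
```

Because every consumer sees the same cleaned and rule-checked rows, result sets stay consistent regardless of which application asks.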
x) Data Lifecycle should be captured.
The data lifecycle gives guidance for the storage, cataloguing and retrieval of historical data. Sometimes there is a contractual obligation to store historical data and retrieve it.
The domain owner and business owner have to define the data life cycle.
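Once the owners have defined the lifecycle, it can be expressed as a retention policy that storage jobs consult. The sketch below assumes a simple three-stage lifecycle (keep online, archive, purge); the stage names, dataset name and retention periods are hypothetical and would come from the business or from contracts.

```python
from datetime import date

# Hypothetical policy, in days since the data was created.
RETENTION_POLICY = {
    "sales_orders": {"active_days": 365, "archive_days": 7 * 365},
}


def lifecycle_action(dataset: str, created: date, today: date) -> str:
    """Decide what to do with a piece of data based on its age."""
    policy = RETENTION_POLICY[dataset]
    age_days = (today - created).days
    if age_days <= policy["active_days"]:
        return "keep online"
    if age_days <= policy["archive_days"]:
        return "archive"
    return "purge"


created = date(2020, 1, 1)
print(lifecycle_action("sales_orders", created, date(2020, 6, 1)))
print(lifecycle_action("sales_orders", created, date(2023, 1, 1)))
print(lifecycle_action("sales_orders", created, date(2030, 1, 1)))
```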
1) When communicating with external parties, adopt XML as a standard and, as far as possible, use only one XML dialect.
2) Split data capture from data retrieval
3) Only store data together when it needs to be managed together (if you only need to see it together, you can do that in the application). Store as much metadata as reasonably possible with the data.
4) Use international standards, practices and frameworks (like SIF, QTI) wherever possible and relevant.
5) Data should be captured once and validated at the source or closest to source
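Guideline 5 can be sketched as a validation step at the capture boundary, so that bad records are rejected before they are ever stored. The record fields (model, units_sold), the rules and the in-memory list standing in for storage are hypothetical.

```python
def validate_sale(raw: dict) -> dict:
    """Validate and normalise one sales record at the point of capture."""
    errors = []

    model = str(raw.get("model", "")).strip()
    if not model:
        errors.append("model is required")

    units = None
    try:
        units = int(raw.get("units_sold"))
    except (TypeError, ValueError):
        errors.append("units_sold must be an integer")
    else:
        if units < 0:
            errors.append("units_sold must not be negative")

    if errors:
        raise ValueError("; ".join(errors))
    return {"model": model, "units_sold": units}


captured = []  # stands in for the real capture store


def capture(raw: dict) -> None:
    """Only validated records reach storage; each is captured exactly once."""
    captured.append(validate_sale(raw))


capture({"model": " sedan ", "units_sold": "12"})

try:
    capture({"model": "", "units_sold": "many"})
    rejection = None
except ValueError as exc:
    rejection = str(exc)

print(captured, rejection)
```

Downstream systems can then trust what they read, instead of each one re-validating (and re-interpreting) the same data.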
Summary: A comprehensive data policy classifies data by its value to the business, saves on storage and retrieval, and helps in navigating the bureaucratic labyrinth around data access and data quality (a major reason for delays and cost spikes in Business Intelligence projects). All of this reduces the cost of data.