Data Science for Data Scientist

Well, it sounds like oxymoron – why a data scientist will need data science to help him ; he rather will use data science to help find insights and predicts the future. Data Science has a connotation that it is to be used discovery, insight, prediction and artificial intelligence.

But data science can also be used to do data preparation which will help data scientist to develop right model for discovery, insight, prediction and artificial intelligence.

1.) Regression : When DBA does not know what data scientists are looking for they dump everything on data scientist. Sometimes even hundreds of attributes. Some attributes have influence on model; some are only informational. IoT also dumps lots of machine data which are irrelevant for model.

Data science can help data scientist in finding the relevant attributes. They can run regression algorithms to find out attributes which have effect on outcome.

2.) Missing Values : Missing values are ubiquitous. Depending on model, it can be discarded, have default values or can have advance statistics to generate the missing numbers based on other attributes like time, location, customer behavior.

In some cases, missing values can be detrimental to model. Data science will help data scientist to auto-generate realistic numbers , conforming to model.

3.) Clustering : Some popular clustering algorithms (like K-Mean, nearest neighborhood )require data scientist to provided initial parameters to model, to refine it further. So if the initial number is way off, model will fail.

Data scientists can use binning, basket analysis, outlier, no frill clustering algorithms to figure out tentative initial parameters to build a good model.

4.) Anomaly Detection : Some systems ( like alerting and event correlation) needs anomalies to perform their task. But finding anomalies is like finding needle in haystack.

Data Scientist can use anomalies detection algorithms like support vector machine ( one class), association rule, replicator neural network to filter out in-bound or one class data and get anomalies.

Above examples are only for sake of example. There are many ways data science can help in data preparation and business rule validation. Data science will be extended to data preparation also.

Author : Vivek Singh is data architect and developing world’s fist open source data quality and data preparation tool


Big Data: Demystified

Big Data, besides being a challenge is also a huge opportunity in terms of generating insights from new and varied types of data enabling businesses to become more agile than before. A simple, graphic and contemporary definition of Big Data would be ‘all the machine generated data, which gets populated rapidly alongside all the types of data that have complexities rather than size or volume’. Examples of Big Data would be pertinent to industries such as e-Commerce, Telecom, Social Media and BFSI with the types of data being dealt with including call logs, Web logs, Internet text/documents/search indexing, sensor networks, RFID, social network data etc.

A typical example would be a Web commerce enterprise, which would need to gain near-real-time insights into its customers’ behavioral patterns and trends in order to influence their marketing campaigns, delivery model, pricing as well as its products or service offerings. Big Data can and, more often than not, it does include a variety of ‘unusual’ data types. In that sense, it can be either structured or unstructured. The former would include high volume transaction data such as call data records in telcos, retail transaction data and pharmaceutical drug test data. Unstructured data is more difficult to process and this includes semi-structured data such as XML and HTML besides unstructured data like text, image, rich media, Web logs etc. Challenges here include the capture, storage, search, sharing, analysis and visualization of data sets. Though Size (Volume) is key to the primary definition (typically more than a couple of Terabytes) of Big Data, the other critical dimensions are:

Latency (Velocity): For time-sensitive processes such as detecting fraud, Big Data must be used as it streams into your enterprise in order to maximize its value.

Variety: Big Data is any type of data; it includes structured as well as unstructured data in the form of text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

Complexity (Multi-Dimensionality): Big often refers to complexity rather than volume. Big Data can be very small and not all large datasets are huge.

The term Big Data is now applied more broadly to cover platforms having faster, cheaper and distributed processing power, clustered computing, lower cost of storage with in-built fault tolerance and network using commodity (cost-effective) hardware. Hadoop is one such platform. It is supplemented by an ecosystem of Apache projects, such as Pig, Hive, Hbase and Zookeeper that extend the power of Hadoop.

One of our clients that generates about 2 TB/day of networking data wanted to conduct a cost-analysis of this data. It also wanted to store the data for 18 months. The data was time-series and it would fit into the definition of Big Data. It was not streaming data and, therefore, we recommended GridFTP (an open source solution) as the data transport layer and suggested that the client set up a 50 node (2 cpu -quad core, 8 GB RAM and 1 TB disk space) Apache Hadoop grid for this purpose. Data was compressed using bzip compression logic. Since it was structured data, we used Pig Latin to do the data quality and transformation. The client also wanted an SQL like interface for doing the analysis. The output of Pig was fed into Hive and Hive tables were created where the client was able to run SQL like statements. With the help of the Unix crontab utility, jobs were scheduled and new data was appended to Hive Tables after the completion of each job. The client is also looking at pre-defined, canned as well as ad hoc reporting. We at GrayMatter are well placed in the Big Data space having been a firm advocate and believer in open source systems and technologies from our inception when it was little known and not as mature as it is today.

Proven recommendations for Big Data projects

Unstructured Search Solr (MR version of Lucene)
Structured search (Key-Val pair) Cassandra(Not Hadoop supported), Hbase
Document Search MangoDB, NoSQL, Solr
Transformation, Data Quality Pig , Java MR code, Hadoop Streaming
SQL like analysis Hive, HBase
Stream input Scribe, Flume
Student Entiry

Why BI projects fail – and the role of Data Architect

During my long stint as business intelligence professional, I have seen many projects fail and of course, smelled some success. Let’s see what is common among successful projects and the unsuccessful ones.

Organisation Structure: Unlike other IT  engineering projects, Business Intelligence projects need strong business interface – a make or break for BI project. Business is divided into groups or sliced by business functions (BU) – but data is not. A typical BI project will run into multiple Business Functions – which means working with two or more VPs and their organisations. If BI project is sponsored by IT it will hit bottleneck with Business. A typical reply will be – a.) we already have this information in XLS b.) Our processes are different , c.) we can not wait so long for data d.) this is done at vendor site

And business is right. Every business functions or unit is different – Granularity is different, business focus is different, workflow is different. IT sponsored BI projects treats all business units same and hence BI projects lose relevance for business. I have seen high rate of success with  MIS related BI projects – because IT is both consumer and sponsor of project. Otherwise IT does not business rules and processes to make a report relevant – that has to come from Business. So it is important business sponsors it and business unit heads or VPs sponsor it.

Enterprise Data warehouse (EDW) : A typical approach to a BI project is, create an enterprise data warehouse and all BU ( business Unit) will take data from there and have their  data mart. Business is complex and putting all rules in Enterprise warehouse has a very high degree of failure. Enterprise data warehouse become too cumbersome to use and invariably business rule will change overtime. Cost of changing ETL jobs are very high and it takes long time to propagate the change from source, to staging area, to target , to reports, to analytic. Business get frustrated and figures out a way to get data in xls sheet and then does it own analysis –  EDW fails.

A leaner and less complex data model is better. Report Developer, ETL developer or Data Integration are not business analyst. They are bound to do mistakes in putting all the rules in all encompassing warehouse. A good amount of business rules can be pushed to report engine and analytic engine. A focused warehouse is more successful and a multi-purpose or generic warehouse.

Data Quality:  In real business, dirty data or incomplete data do come in. If they are feed into warehouse as is, report is faulty and unfortunately it happens most of time. Technical people can understand data type, data structure and raw data quality – like null values, duplicate values, negative values, outliner etc. But they do not understand business implication of that and do not what should be right value. A typically ETL tool will discard those records and report will not be updated. Let’s take an imaginary business rule – where if an existing ( from other business unit) customer walks-in you do not ask him fill-in personal information form and only take customer id. ( The idea is you will collect all the data from other business unit and save time  of customer). When desktop operation feed the data in – he or she feeds only with customer id. If you business processes are not real time ( in most cases they are not ) only customer id goes into CRM system and most probably nightly ETL load will ignore data as dirty data and your report will show one less count.

Data Analyst and Business Analyst have to sit together and profile their data and look into conditions using tools like osDQ –  to get a good understanding of data , before moving onto project.

Role of Data Architect:  Probably he is the first person to know if the project is on track but he has limited visibility. Most of time, data architects belong to IT group and has very limited saying in Business Unit. Unfortunately, today we do not have a process or framework which tell how data architect should talk / show artifacts to business. TOGAF has tried to give some framework, but it very limited.

A good start can be starting from IT Domain architecture where business unit and high level functionality is mapped. Let take an example of company which creates, tests, scores and report K-12 tests.

IT Domain Architecture    Once the business domain is identifies, data architect should create DFD ( data flow document – like image below ) which says which data moves across business domains and which is the data lying within a domain. The data is flowing across business domains ( or units) are the once which are more prone to error – as it changes values across domain.TCMDataFlow

Once the DFD is created , Entity reference model can be created as below where steward can be identified.

Student Entiry

Student Entiry

Once we have formal steward in place from business side, the success rate increases . In my next post I will write in detail about roles and responsibilities of data architect and the process to find out steward.

Good Luck !!