Inbound and Outbound Data Movement from a Data Lake

Description: Enterprise data sits in an Enterprise Data Lake, and only internal employees can access it for analysis and insight. This paper discusses a conceptual framework in which 3rd party data can be brought in for “fusion” with enterprise data to find new insights, and in which enterprise data can be made available to 3rd parties for “monetization”.

Abstract: Businesses agree that data sharing is good for the ecosystem. On one hand, monetizing data increases the top line by adding a new revenue channel; on the other hand, hitherto unknown data brings new insight. The problem is that

  • technology is not ready to infuse massive 3rd party data in an automated way, and
  • the Data Lake can’t emit data to 3rd parties.

There are action items around information mapping, data curation, security and compliance, and data movement from internal storage to public-facing storage, all of which need to be addressed to enable the framework.

This paper discusses a conceptual framework (a process flow) that can be followed when creating inbound and outbound data flows on a Data Lake.

Business Process Flow for Data Movement:

  • Information Mapping
  • Data Curation
  • Entity Resolution
  • Security and Compliance

  • Information Mapping:

a.) Business Entity discovery

b.) Business Attributes discovery

c.) Sharable attributes

d.) Joinable attributes

e.) Aggregation level

f.) Time to share

g.) Time to fetch
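
As a rough illustration, the outcome of information mapping could be captured as one record per business entity. The sketch below is only one possible shape: the class name InformationMap, its field names, and the schedule values are assumptions made for this example, not part of any product or standard.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InformationMap:
    """One mapping record per business entity (all names are illustrative)."""
    entity: str                      # a.) business entity, e.g. "customer"
    attributes: list[str]            # b.) attributes discovered for the entity
    sharable: set[str]               # c.) attributes cleared for sharing
    joinable: set[str]               # d.) attributes usable as join keys
    aggregation_level: str = "row"   # e.) e.g. "row", "daily", "region"
    share_schedule: str = "daily"    # f.) when outbound data is published
    fetch_schedule: str = "daily"    # g.) when inbound data is pulled

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the map is consistent."""
        known = set(self.attributes)
        problems = []
        for name in sorted(self.sharable - known):
            problems.append(f"sharable attribute not discovered: {name}")
        for name in sorted(self.joinable - known):
            problems.append(f"joinable attribute not discovered: {name}")
        return problems

    def outbound_columns(self) -> list[str]:
        """Columns that may leave the lake, in discovery order."""
        return [a for a in self.attributes if a in self.sharable]
```

Keeping the map as plain data means the same record can drive both the outbound export (which columns to publish) and the inbound fetch (which columns to join on).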

  • Data Curation:

a.) Schema mapping

b.) Missing value replacement

c.) Dirty data drop

d.) Fuzzy Joining of data

e.) Inbound and outbound dataset creation
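
The curation steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the column names in SCHEMA_MAP, the default values, the required columns, and the 0.8 similarity cutoff are all hypothetical, and the fuzzy join uses difflib's similarity ratio as a stand-in for whatever matching logic a real pipeline would choose.

```python
import difflib

# Hypothetical mapping from third-party column names to enterprise column names.
SCHEMA_MAP = {"cust_name": "customer_name", "postcode": "zip"}
DEFAULTS = {"zip": "UNKNOWN"}      # values used for missing value replacement
REQUIRED = ("customer_name",)      # rows lacking these are treated as dirty


def curate(rows):
    """Apply schema mapping, drop dirty rows, and fill missing values."""
    curated = []
    for row in rows:
        mapped = {SCHEMA_MAP.get(k, k): v for k, v in row.items()}
        if any(not mapped.get(col) for col in REQUIRED):
            continue  # dirty data drop
        for col, default in DEFAULTS.items():
            if not mapped.get(col):
                mapped[col] = default
        curated.append(mapped)
    return curated


def fuzzy_join(left, right, key, cutoff=0.8):
    """Join rows whose `key` values are similar enough (case-insensitive)."""
    index = {r[key].lower(): r for r in right}
    joined = []
    for row in left:
        match = difflib.get_close_matches(
            row[key].lower(), list(index), n=1, cutoff=cutoff
        )
        if match:
            # Values from `row` win when both sides carry the same column.
            joined.append({**index[match[0]], **row})
    return joined
```

For example, a third-party row named "Acme Corp." would join to an internal row named "ACME Corp" at this cutoff, while rows with no sufficiently similar counterpart are left out of the inbound dataset.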

  • Entity Resolution:

a.) Finding the business entities across data sets

b.) Finding attributes which can affect behavior of entities

c.) Logic of correlation

d.) Logic for join
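
To make the idea concrete, the sketch below resolves entities across two data sets by normalizing a shared key, then measures how strongly a third-party attribute moves with an internal one using the Pearson correlation coefficient. The function names and the choice of Pearson correlation are assumptions for illustration; a real framework might use a different matching or correlation logic.

```python
import math


def normalize(name):
    """Reduce a name to a comparable key: lowercase, alphanumerics only."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def resolve(internal, external, key):
    """Pair up records from both data sets that refer to the same entity."""
    lookup = {normalize(r[key]): r for r in external}
    pairs = []
    for rec in internal:
        other = lookup.get(normalize(rec[key]))
        if other is not None:
            pairs.append((rec, other))
    return pairs


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A typical use would be to call resolve on, say, internal store sales and third-party footfall data, then pass the paired values to pearson to decide whether the external attribute is worth fusing.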

  • Security & Compliance:

a.) Deletion of personally identifiable data

b.) Masking of critical data

c.) Validating compliance
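
A minimal sketch of these three steps follows. The field lists, the salted SHA-256 masking, and the email pattern are assumptions chosen for illustration; actual rules would come from the applicable regulations and the company's own data classification, and a salted hash of a low-entropy identifier is not a strong guarantee on its own.

```python
import hashlib
import re

PII_FIELDS = {"name", "email", "phone"}   # hypothetical: deleted outright
MASKED_FIELDS = {"account_id"}            # hypothetical: replaced by a salted hash
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def sanitize(record, salt):
    """Delete PII fields and mask critical ones before data leaves the lake."""
    clean = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            continue  # a.) deletion of personally identifiable data
        if key in MASKED_FIELDS:
            # b.) masking: keep a stable token so records can still be joined
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            clean[key] = digest[:16]
        else:
            clean[key] = value
    return clean


def violations(record):
    """c.) Return compliance problems found in a record about to be shared."""
    found = [f"PII field present: {k}" for k in sorted(PII_FIELDS.intersection(record))]
    for key, value in record.items():
        if isinstance(value, str) and EMAIL_PATTERN.search(value):
            found.append(f"email-like value in field: {key}")
    return found
```

Running violations on the output of sanitize acts as a second check: it catches personal data that slipped into free-text fields, which field-level deletion alone would miss.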

Technical considerations:

a.) Downloadable or API-based

b.) Format of download

c.) Choice of ETL tool

d.) Choice of EAI tool

e.) Internal storage vs. public-facing storage

f.) Scalability

Business considerations:

a.) What to expose

b.) To whom to expose: registered users or the public

c.) What the monetization policy is: by download, by advertisement, or by API usage

Conclusion: Every business is unique. This paper offers a conceptual guideline for fusing 3rd party data into your own systems and for monetizing your data by providing it to 3rd parties. Companies are struggling to find ways to monetize their data and to draw more insight from it.

A data sharing ecosystem would be a big boost to these companies.
