Biggest Problem of Big Data – Entity Resolution

Big Data has gone past PoC phase. Different companies are at different stages of implementation. Data Ingestion, Data Storage ( Data Lake and EDW), Data Processing and Data Visualization processes have been quite mature and there are many open source and proprietary software to solve these problems.

One major hurdle Big Data faces today is – Entity Resolution ( defining a business entity form multitude of data sources). In EDW (Enterprise data warehouse) world, data were structured and sources were limited. Also keys of sources were pre and well defined . So RDBMS joins ( inner, outer, left outer, right outer, semi join etc) were enough to merge data from two different systems and tables.

In Big Data world, there is hardly any key or attribute that runs across the sources. Also keys of one system is useless as other systems are completely independent of each other. So business have to define their own logic for merging data from different sources which defines one entity. To make it worse, RDBMS kind of exact match joins should be replaced by fuzzy joins as referential integrity across systems can be ensured.

Following is a practical approach for resolving Entity Resolution:-

1.) Pre-define your entities and super set of  attributes ( coming from all data sources)

2.) Attributes may have multiple related values that should to mapped to same attributes. Plan your storage like this  ( Graph databases work fine for this relationship storage)

3.) Merge data from multiple sources using merge business logic to create a virtual entity with sizable attributes ( we used Apache spark for this )

Mapping attributes


  • Zip, County, State, Lat/long
  • Nearby locations (+/- area)  – high propensity area
  • IP location


  • Date/time stamp
  • Nearby time stamp (+/-) – Event happening before or after a period


  • String match (Fuzzy) – Name, Address, Cause
  • Cardinal Match  – events sharing same or similar key
  • IP Correlation (Primary IP, Secondary IP, IP from same ZIP code)
  • Other business logic related merging rules

You can make you merge model Machine Learning based so it will be leaning over time to do a relevant merge.

4.) Right Sizing Merges Columns

  • Remove transaction columns
  • Remove  database columns
  • Remove technical identifiable columns
  • Remove duplicate Columns

5.) Search this entity in your relational graph database to find ranks of similar entities more than threshold. If the outcome is less than threshold then make a new entry in your entity table with the attributes of the searching entity.

 6.) Take the highest ranked entity and mapped the missing attributes from highest ranked entities. This entity is you final entity for business

How to enhance entities with changing attributes:

Like any practical entity, values of attributes keep changing. You map the attributes values of searching entity to highest ranked entity and see the different of values. let’s say the IP values of an entity is matching to its secondary value over time. Then the secondary IP value becomes primary and primary becomes secondary. Or it is new IP then add one more relationship node with new values.

Ranking algorithm:

Business can assign different weight-age to must have, critical, important and good to have attributes and their matching threshold. This model also matches with secondary or related values so make it more accurate. Let’s say, address is a must match attributes. A customer other attributes are matching but address not matching so other models will reject it but in this model if matches with his or her office address or other address, it will boost the record that is right.

Entity Resolution and Event Correlation – Datalake DQ

DQ ( Data Quality) historically started with missing values and then moved into address correction and data enrichment ( Geo Encoding, Standardization etc.) Data Quality tools have been successfully solved the traditional data quality problems; like the ones discussed above.

So far, DQ was single source and single domain. With the advent of data lake, DQ has to adopt to new strategy. Event Correlation and Entity Resolution are going to be crucial for data lake validation. DQ tools have to provide these 2 must features for data lake.

Entity Resolution : Data lake will hold data from multiple sources and domains. It would be critical to create right entities from the data set. Following will be prime components of Entity Resolution (ER)

a.) Fuzzy Join : we have so many joins ( inner, outer, left outer, semi , equi etc ) supported today but they match exact values. Dimension from multiple sources may not have exact match  ( like name or address). Fuzzy join will match values which are similar but may not match exactly – like John Smith and John Smithe

b.) Algorithm for picking dimension values : Datalake will contain data from multiple CRMs, domains. While matching dimensional values, there will conflict which one to pick – let say SalesForce has different address, Sales mart has different address, the data you bought have different address. The entity should have one master address. ER algorithm will pick the right value based on timeliness, validity of source, most common occurrence etc.

c.) Entity Classification: Once the Entity Unique id and master dimensions are identifies, next step involves classifying the entity using business rules. These entity may be outdated, inactive or have little relevance. Once entity is classified and tagged, it can be used for further analysis or can be put in historical datalake. An entity with missing critical dimensional value will be dumped in dirty datalake for further investigation.

Event Correlation:  Theoretically, event is also an entity but I am putting it different header because it is temporal in nature and the algorithms used for correlation events would be different.

a.) Range Bound Correlation : Hardly two correlated event will occur at same time. One event will fire another event which may lag in time or place or in both. Along with event identifier fields, range bound dimension will be used to correlate events. Business rules will decide the  width of boundary.

b.) Event aggregation : An event can fire many sub events and super events. All these events has to suppressed into one related event. Event Correlation (EC) algorithms will map all these events into related event and cause and bring into human readable format.

c.) Noise reduction:  Aggregated event may be a false event or noise. Business rules will decide will event should be carried forward ( assuming they have strong correlation with business )and which should be dropped. Events will also go through business classification to rank their importance.

Conclusion:  Datalake will bring new challenges to Data Quality which will go through transformation to solve new problems. DQ will move from :

i)Single Source   –> DataLake

ii)Structure Analysis –> Mapping Entity

iii) Operational —> Analysis