osDQ releases apache spark based data quality

World’s first open source data quality and data preparation project (osDQ – https://sourceforge.net/projects/dataquality/ ) releases apache spark based data quality and data preparation modules for big data.

Apache Spark based APIs can be downloaded from here : https://sourceforge.net/projects/apache-spark-osdq/

This beta release has following features :

Normalization:

  • ZScore

Functional input – mean and std dev

Return type : dataframe

  • ZeroScore (between 0 and 1)

Functional input – min and max

Return type : dataframe

  • RatioScore (num/denum)

Functional input – ratio number

Return type : dataframe

  • Subtraction Score (a –b)

Functional input – Subtraction number

Return type : dataframe

Replacement:

  • Replacement with key-value pairs

Functional input – hashtable and columns type

Return type : dataframe

  • Replacement Null with default value

Functional input – value

Return type : dataframe

  • Replacement using regression value (linear and multi-linear)

Functional input – No of iterations

Return type : dataframe

Remove:

  • Removing Null Rows

Functional input – all or any

Return type : dataframe

  • Removing Duplicate Rows

Functional input – all or any

Return type : dataframe

Profiling:

Functional input – DataFrame

Return type: Hashtable<Colname,Hashtatable<Key,Value>>

HashKeys – “count”,”unique”,”nullcount”,”pattern”,”min”,”max”

Fuzzy Join and Replacement :

Function Input – two strings

Return type – cosine similarity ( between -1 to 1)

Summary : osDQ will enhance the project to provide more core APIs for data quality , data preparation and data science. It will save community time to write those functions for big data environment.

 

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s