Recently, a Big Data QA profile was needed for a project that processes terabytes of machine-generated data every day. Below is the cryptic job description that was sent to the recruitment team:
“We are looking for a Big Data developer who can write code on a Hadoop cluster to validate structured and semi-structured data.
Skill sets: Apache Spark, Hive, Hadoop, data quality”
The recruitment team was amused.
Are you looking for a developer or a QA? Is it manual testing or automated testing? Where should we look for this kind of profile?
Big Data QA is an emerging area, and it is different from traditional product or enterprise QA. A Big Data QA (BDQ) is a big data developer who writes code to validate data whose volume is beyond human scale. A BDQ does not fit into the traditional QA categories: automated vs. manual, backend vs. frontend, feature vs. non-feature, and so on.
Data products need a BDQ to ensure that the processes managing the data lifecycle are ingesting and emitting the right data. The volume is so huge that it is beyond human capacity to load the data into a spreadsheet and validate it row by row and column by column.
Depending on business requirements, a BDQ may write very complex code to validate complex business rules. However, the day-to-day activities of a BDQ typically include:
- Profile the data to validate datasets before and after business rules are applied
- Find data holes to ensure all expected data is arriving
- Use statistical modeling to compare and contrast two or more datasets
- Implement unified formatting and unified data types
- Implement multiple types of joins on datasets, including fuzzy joins
- Implement UDFs (user-defined functions) for reusable validation logic
- Find outlier data and decide how to manage it
- Find null and empty data and decide how to manage it
- Validate the naming conventions of datasets
- Validate the file format (CSV, Avro, Parquet, etc.) of datasets
- Monitor incoming and outgoing datasets and redistribute them on failure
- Validate the data models for analytics
- Create sample data and make sure the sample is not skewed
- Create training and test datasets
- Encrypt and decrypt (anonymize) data
- Implement and monitor compliance rules
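To make a few of these activities concrete, here is a minimal sketch in plain Python of three checks from the list: profiling a column for null and empty values, finding data holes in a daily feed, and flagging outliers. It is only an illustration. The function names, the record layout (`device`, `day`, `reading`), and the sample rows are all invented for this example, and the outlier check uses the common interquartile range rule, which is one option among many. On a real Hadoop cluster the same logic would more likely be written as Spark or Hive jobs rather than run over in-memory lists.

```python
from datetime import date, timedelta
from statistics import quantiles


def is_blank(value):
    """Treat None and whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile_column(rows, column):
    """Count total, null, empty, and distinct non-missing values in a column."""
    values = [row.get(column) for row in rows]
    return {
        "total": len(values),
        "null": sum(1 for v in values if v is None),
        "empty": sum(1 for v in values if v is not None and is_blank(v)),
        "distinct": len({v for v in values if not is_blank(v)}),
    }


def find_date_holes(rows, column):
    """Return the calendar days missing between the first and last date seen."""
    seen = {row[column] for row in rows if row.get(column) is not None}
    if not seen:
        return []
    first, last = min(seen), max(seen)
    span = (last - first).days
    all_days = (first + timedelta(days=i) for i in range(span + 1))
    return [day for day in all_days if day not in seen]


def find_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the common IQR rule)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Invented sample records, for illustration only.
sample_rows = [
    {"device": "a", "day": date(2024, 1, 1), "reading": 10},
    {"device": "", "day": date(2024, 1, 2), "reading": 12},
    {"device": None, "day": date(2024, 1, 4), "reading": 11},
]

print(profile_column(sample_rows, "device"))
print(find_date_holes(sample_rows, "day"))
print(find_outliers([10, 12, 11, 13, 12, 11, 500]))
```

Each check returns plain data (a dictionary or a list), so the results can be compared before and after a business rule runs, or logged as part of monitoring.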
The list above is by no means exhaustive; it is only indicative. A BDQ is a prolific programmer who understands data well (like a business analyst) and writes code for the data quality domain. This bold combination is what makes a BDQ so hard to find.
It is best to train data engineers to become BDQs by providing courses on, and exposure to, data quality and data management.
Vivek Singh: data architect, open source evangelist, chief contributor of the Open Source Data Quality project http://sourceforge.net/projects/dataquality/
Author of the fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/