There has been a debate in the big data community: do we need sampling in the big data world? The argument against it is that big data platforms now have the storage and processing power to analyze the full dataset, so sampling is no longer required.
This argument holds true for data discovery, where analyzing the full dataset gives more confidence and captures every anomaly and pattern. However, sampling still saves a considerable amount of time in dimension reduction, correlation analysis, and model generation. You have to go through hundreds of attributes (with all their permutations and combinations) to find dependent, independent, and correlated variables, and a representative dataset can save you hours with almost the same accuracy.
We have open-sourced data sampling code built on Apache Spark 2.1.1 at https://sourceforge.net/projects/apache-spark-osdq/
Random Sampling: On a Dataset<Row>, random sampling is provided by Apache Spark itself; the user supplies the fraction of rows to sample.
Dataset<Row> org.apache.spark.sql.Dataset.sample(boolean withReplacement, double fraction)
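To make the meaning of the fraction concrete, here is a minimal plain-Java sketch of sampling without replacement in which every row is kept independently with probability equal to the fraction. The sample size therefore lands near fraction * rowCount but is not guaranteed to match it exactly. The names RandomSampleSketch and bernoulliSample are hypothetical, and this is an illustration of the idea rather than Spark's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class RandomSampleSketch {

    // Keeps each row independently with probability `fraction` (no replacement).
    // Hypothetical helper for illustration; not part of Spark.
    static <T> List<T> bernoulliSample(List<T> rows, double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        Random random = new Random(seed);
        List<T> sample = new ArrayList<>();
        for (T row : rows) {
            // nextDouble() is in [0, 1), so 0.0 keeps nothing and 1.0 keeps everything.
            if (random.nextDouble() < fraction) {
                sample.add(row);
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        List<Integer> sample = bernoulliSample(rows, 0.2, 42L);
        // The size is close to 200 but depends on the seed.
        System.out.println("sampled " + sample.size() + " of " + rows.size());
    }
}
```

If an exact sample size matters, this style of sampling is not enough on its own, which is the motivation for the exact per-key variant described next.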
Stratified Sampling: Dataset<Row> does not provide exact stratified sampling (df.stat().sampleBy is only approximate), so the dataset is converted into a JavaPairRDD keyed by the column to be stratified, and sampleByKeyExact is then applied. It makes multiple passes over the data to match the exact fraction for each key.
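// Use the default fraction for every key that does not already have one.
for (Row uniqueKey : uniqueKeys) {
    fractionsMap.merge(uniqueKey.mkString(), fraction, (v1, v2) -> v1);
}
// Key each row by the stratification column, then sample every key exactly.
JavaPairRDD<String, Row> dataPairRDD = SparkHelper.dfToPairRDD(keyColumn, df);
JavaRDD<Row> sampledRDD = dataPairRDD.sampleByKeyExact(false, fractionsMap).values();
// Convert the sampled rows back into a Dataset<Row> with the original schema.
Dataset<Row> sampledDF = df.sqlContext().createDataFrame(sampledRDD, df.schema());
return sampledDF;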
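As a rough illustration of what "exact" means here, the following plain-Java sketch groups rows by key and keeps exactly ceil(fraction * groupSize) rows per key, which is the per-stratum size that Spark documents for sampleByKeyExact. It runs on an in-memory list and is not how Spark does it on distributed data. The names ExactStratifiedSketch and sampleExact are hypothetical, and skipping keys that have no fraction is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class ExactStratifiedSketch {

    // Rows are {key, value} string arrays. Returns exactly
    // ceil(fraction * groupSize) rows for each key that has a fraction.
    // Assumption of this sketch: keys without a fraction are skipped.
    static List<String[]> sampleExact(List<String[]> rows,
                                      Map<String, Double> fractions,
                                      long seed) {
        // First pass: group the rows by key (the strata).
        Map<String, List<String[]>> strata = new LinkedHashMap<>();
        for (String[] row : rows) {
            strata.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Second pass: shuffle each stratum and keep a fixed number of rows.
        Random random = new Random(seed);
        List<String[]> sample = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> entry : strata.entrySet()) {
            Double fraction = fractions.get(entry.getKey());
            if (fraction == null) {
                continue;
            }
            if (fraction < 0.0 || fraction > 1.0) {
                throw new IllegalArgumentException("fraction must be in [0, 1]");
            }
            List<String[]> group = new ArrayList<>(entry.getValue());
            Collections.shuffle(group, random);
            int take = (int) Math.ceil(fraction * group.size());
            sample.addAll(group.subList(0, take));
        }
        return sample;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            rows.add(new String[] {"vivek", "1"});
        }
        for (int i = 0; i < 9; i++) {
            rows.add(new String[] {"singh", "2"});
        }
        Map<String, Double> fractions = new HashMap<>();
        fractions.put("vivek", 0.1);
        fractions.put("singh", 0.1);
        // ceil(0.1 * 8) + ceil(0.1 * 9) = 1 + 1 = 2 rows.
        System.out.println("sampled " + sampleExact(rows, fractions, 7L).size() + " rows");
    }
}
```

Grouping first is what makes the per-key count exact; the price is the extra work over the data, which mirrors the multi-pass cost mentioned above.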
Keylist Sampling: This is like stratified sampling, but it makes only one pass over the data, so the fraction for each key is met only approximately.
JavaRDD<Row> sampledRDD = dataPairRDD.sampleByKey(false, fractionsMap).values();
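The one-pass behaviour can be pictured with the plain-Java sketch below: each row is visited once and kept with the probability assigned to its key, so no grouping or counting is needed, but the number of rows kept per key varies from run to run. The names KeylistSampleSketch and sampleOnePass are hypothetical, treating keys without a fraction as probability 0 is an assumption, and the code only mirrors the idea rather than Spark's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class KeylistSampleSketch {

    // Rows are {key, value} string arrays. Each row is kept independently with
    // the probability mapped to its key; unmapped keys are dropped (assumption).
    static List<String[]> sampleOnePass(List<String[]> rows,
                                        Map<String, Double> fractions,
                                        long seed) {
        Random random = new Random(seed);
        List<String[]> sample = new ArrayList<>();
        for (String[] row : rows) {
            double fraction = fractions.getOrDefault(row[0], 0.0);
            if (random.nextDouble() < fraction) {
                sample.add(row);
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            rows.add(new String[] {"vivek", "1"});
            rows.add(new String[] {"singh", "2"});
        }
        Map<String, Double> fractions = new HashMap<>();
        fractions.put("vivek", 0.1);
        fractions.put("singh", 0.3);
        // Roughly 50 vivek rows and 150 singh rows, varying with the seed.
        System.out.println("sampled " + sampleOnePass(rows, fractions, 11L).size() + " rows");
    }
}
```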
Command Line Options:
-c,--keyColumn <arg> Key Column for stratified/keylist sampling
-f,--fraction <arg> Sample fraction size
-fm,--fractionMapping <arg> comma separated pairs of key,fraction size
-h,--help show this help.
-i,--input <arg> Input Folder/File path
-if,--inputFormat <arg> input file format
-o,--output <arg> Output Folder path
-of,--outFormat <arg> output file format
-t,--type <arg> Sampling type - random/stratified/keylist
Example: "-f 0.2 -i ./testfile -if csv -o ./outputFile -of csv -t stratified -c key1 -fm ./keymapping"
testfile:
key1,key2
vivek,1
vivek,1
vivek,1
vivek,1
vivek,1
vivek,1
vivek,1
vivek,1
singh,2
singh,2
singh,2
singh,2
singh,2
singh,2
singh,2
singh,2
singh,2
keymapping:
vivek,0.1
singh,0.1
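For readers who want to experiment with the keymapping content above, here is a small plain-Java sketch that turns key,fraction pairs into a key-to-fraction map of the kind passed to sampleByKey and sampleByKeyExact. It assumes the pairs are separated by whitespace or line breaks; that layout, and the names KeyMappingSketch and parseFractions, are assumptions for illustration and not the project's actual parsing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeyMappingSketch {

    // Parses whitespace- or newline-separated "key,fraction" pairs.
    // Hypothetical helper; the real tool may read the mapping differently.
    static Map<String, Double> parseFractions(String content) {
        Map<String, Double> fractions = new LinkedHashMap<>();
        for (String pair : content.trim().split("\\s+")) {
            if (pair.isEmpty()) {
                continue; // empty input
            }
            // Split on the last comma so keys may themselves contain commas.
            int comma = pair.lastIndexOf(',');
            if (comma <= 0) {
                throw new IllegalArgumentException("expected key,fraction but got: " + pair);
            }
            String key = pair.substring(0, comma);
            double fraction = Double.parseDouble(pair.substring(comma + 1));
            if (fraction < 0.0 || fraction > 1.0) {
                throw new IllegalArgumentException("fraction must be in [0, 1]: " + pair);
            }
            fractions.put(key, fraction);
        }
        return fractions;
    }

    public static void main(String[] args) {
        Map<String, Double> fractions = parseFractions("vivek,0.1\nsingh,0.1");
        // Prints {vivek=0.1, singh=0.1}
        System.out.println(fractions);
    }
}
```

With the testfile above (8 vivek rows and 9 singh rows) and a fraction of 0.1 for each key, an exact stratified sample would contain ceil(0.8) = 1 vivek row and ceil(0.9) = 1 singh row.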
About Vivek Singh: data architect, open source evangelist, and chief contributor to the Open Source Data Quality project http://sourceforge.net/projects/dataquality/
Author of the fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/