Apache spark based classification and prediction model

We have open sourced apache spark based random forest and multilevel perceptron algorithms which can be used for classification and prediction. It can be downloaded from


Use cases :

a.) if you are using huge volume of data (Big data problem) which has large set feature columns. For smaller dataset, you can run apache spark in local mode.

b.) If you want run multiple multi class models together to see which one gives better result. Right now random forest and multilevel perceptron algorithms are implemented. But framework is there to take other algorithms also.

c.) No coding required. Just change the config file and good to go – both for training and classifying/predicting. All you need is java 8

d.) Restful APIs are there to predict/classify along with probability. Easily integratabtle.

e.) If you want to see the accuracy of multiple label columns for dimension reducibility

Overview: This program can be used for both training and classifying purpose. You can train the model and use RESTFul web service to query the model.This program also exposes a RESTFul web service to (jetty and javaspark based) expose classification/prediction as a service.


1. Download the package the package “spark-classifier_1.3-SNAPSHOT.zip”

2. Unzip the pre-built distribution and follow the below details

3. Understand the folder structure of release upon unzipping

* spark-classifier_\<version>

* /lib: contains all dependent jar

* /conf: contains classifier.properties, please review this file before running the program

* /model: the default model path where both model would saved (after training) and read (during classification service). You should have write access to this folder

* /spark-classifier-\<version>.jar: the main driver jar

Configuration:Currently it supports Random Forest and Multilayer Perceptron classifiers. Please set the same under “conf/classifier.properties”

# Currently supported algorithm RANDOM_FOREST or MULTILEVEL_PERCEPTRON



It takes Comma(,) separated list of columns for Feature and Label. * in label means it will take all columns to predict. It will skip feature columns if they in in predict or label column too.

classifier.featurecols=Number,Follow up

####list of labels to be predicted

#### '*' will process all the columns

classifier.labelcols=Root Cause

#classifier.labelcols=L1, L2, L3..

Train the model:

cmd > java -cp spark-classifier-<version>-SNAPSHOT.jar:lib/*:conf org.arrahtech.classifier.ClassifierTrainer

The input file name and output model location can be defined inside `conf/classifier.properties` By default, above command would assume that `conf/classifier.properties` file is correctly setup.

Use the model to predict or classify

cmd > java -cp spark-classifier-<version>-SNAPSHOT.jar:lib/*:conf org.arrahtech.service.ClassifierService

It will start default jetty server which will accept post requests. After this you may post the RESTFul API http://localhost:4567/classify/<algorithm_name>/<label_name&gt; -d jsonfile

Where \<algorithm_name> can be “randon_forest” or “multilevel_perceptron” and \<label_name> would be the label column name (column for which model was trained) in your training dataset and json file will have feature column and values which are input for prediction or classification

cmd > curl -XPOST http://localhost:4567/classify/random_forest/LABEL1 -d '[{


       "FeatureField2":" FeatureField2VALUE",

       "FeatureField3":" FeatureField3VALUE"}]'
> Response JSON


       "classifiedLabel": "PredictedValue",

       "probability": "0.951814884316891"


Things to Remember

1.)   Presently it takes only txt file with field separator

2.)   Null is replaced by NULLVALUE as null cannot be used in model

3.)   multilevel_perceptron does not give probability of predicted value. This feature is available in latest apache spark version.

4.)   Currently label_name shouldn’t have hyphen ‘-‘ character

5.)   If there is space in label column name use ‘%20’ for space.

If you face any issue feel free to contact us or raise a bug. We are developing an open source platform for integrated data life cycle – with ingestion, DQ, Profiling, Analytics and Prediction , all in one.

About Me : Vivek Singh – data architect, open source evangelist , chief contributor of Open Source Data Quality project http://sourceforge.net/projects/dataquality/

Author of fiction book “The Reverse Journey” http://www.amazon.com/Reverse-Journey-Vivek-Kumar-Singh/dp/9381115354/


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s