FEATURE
Column abstraction in Apache Spark 2.1
Classification is a common technique in machine learning and statistics used to categorize an entity based on other properties observed (or measured) for that entity. It is a simple but powerful technique: a model can be built from a training set, verified against a test set, and then put to use on production data.
Apache Spark, as a data processing engine, provides a rich set of features for developing a classification model and productionizing it to categorize datasets. This article highlights the simplicity and richness of the Apache Spark APIs for accomplishing this task. Searching the web for a classifier implementation in Apache Spark yields plenty of results that are useful resources for a rookie, but they lack the sophistication required to build a complex real-world solution. Most of them walk the reader through an example that builds a classifier based on a single property. These are good starting aids, but they do not provide the guidance necessary to build on those first baby steps.
EXAMPLE
This article illustrates the implementation of a 'real-world' classifier using Apache Spark 2.1. Consider a "subject" on which tests are conducted, with each test result recorded as an integer value between 0 and 100. The classifier categorizes each subject by assigning it a discrete value based on all of its recorded results. The 'Subject' Scala case class models this example, capturing two test results for the purpose of this illustration. The subsequent code snippet sets up some test data.

'Subject' case class:
case class Subject(id: String, test1: Int, test2: Int)
Setup test data:
import spark.implicits._

val subjects = Seq(Subject("1", 90, 92), Subject("2", 94, 88))
val dataset: Dataset[Subject] = spark.createDataset[Subject](subjects)
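As a quick sanity check, the schema Spark derives from the case class and the dataset contents can be inspected with the standard Dataset API:

// Print the schema inferred from the Subject case class, then the rows
dataset.printSchema()
dataset.show()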
CLASSIFIER
Apache Spark provides an abstraction to represent a column of tabular data: the Column class. The Column class typifies the rich API provided by Apache Spark. An instance of the Column class not only represents a single column of data; it can also be an expression composed of multiple columns.

Continuing with the example, the following Scala code defines a classifier of type Column: a compound expression that checks the values of the two test results and evaluates to a Boolean. The next line defines a UDF (User Defined Function) that classifies the dataset by mapping that Boolean to a discrete 'grade' for each subject. The classified dataset is the end result of the process, with the actual category (grade) assigned to each subject.
Classify the test data:
val classifier: Column = dataset("test1") >= 90 && dataset("test2") >= 90
val grade = udf((result: Boolean) => if (result) { "A" } else { "B" })
val classifiedDataset = dataset.withColumn("grade", grade(classifier))
Output of the test run:
+---+-----+-----+-----+
| id|test1|test2|grade|
+---+-----+-----+-----+
|  1|   90|   92|    A|
|  2|   94|   88|    B|
+---+-----+-----+-----+
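As an aside, the same categorization can be expressed without a UDF by using the when/otherwise functions from org.apache.spark.sql.functions. This is a minimal alternative sketch, not the approach used above; its advantage is that the grading logic remains a pure Column expression that Spark's optimizer can inspect:

import org.apache.spark.sql.functions.when

// Equivalent grading expressed purely as a Column expression,
// reusing the 'classifier' Column defined earlier
val gradeColumn: Column = when(classifier, "A").otherwise("B")
val classifiedWithoutUdf = dataset.withColumn("grade", gradeColumn)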
The code in its entirety:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.{Column, Dataset, SparkSession}

case class Subject(id: String, test1: Int, test2: Int)

val spark = SparkSession.builder().appName("classifier").master("local[*]").getOrCreate()
import spark.implicits._

// Set up the test data
val subjects = Seq(Subject("1", 90, 92), Subject("2", 94, 88))
val dataset: Dataset[Subject] = spark.createDataset[Subject](subjects)

// A Column expression that evaluates both test results to a Boolean
val classifier: Column = dataset("test1") >= 90 && dataset("test2") >= 90
// A UDF that maps the Boolean classification to a discrete grade
val grade = udf((result: Boolean) => if (result) { "A" } else { "B" })
val classifiedDataset = dataset.withColumn("grade", grade(classifier))

classifiedDataset.show()
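Because the classifier is an ordinary Column, it is not tied to the withColumn call above. As a hypothetical extension of the example, the same expression could drive a filter directly:

// Reuse the same Column expression to keep only the subjects graded 'A'
val topSubjects = dataset.filter(classifier)
topSubjects.show()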