Saturday, December 31, 2016

Getting Started with Apache Spark 2.1 and Docker

TUTORIAL

An introductory guide to development with Apache Spark 2.1 and Docker



Apache Spark is a fast and general engine for large-scale data processing that runs on distributed computing frameworks such as Hadoop and Mesos, in the cloud, or in standalone mode. Docker is an open source platform for building, deploying, and running distributed applications.

This installment of Distributed Computing provides an introduction to developing applications for the Spark data processing engine using the Scala programming language. In this gentle introduction, we will build a simple application, a "Hello World" for Spark. Readers familiar with the Java programming language should be able to follow the steps in this tutorial.

We will package the Spark application, then deploy and execute it in a Docker container. Dockerized applications can be run locally on the developer's machine in much the same fashion as production applications deployed on cloud infrastructure, which lends itself to rapid development and continuous integration.

PREREQUISITES

In this tutorial, we will be working with Apache Spark version 2.1 (the latest version at the time of writing this article). You don't need to download and install Apache Spark yourself, as this will be handled by the Docker container; we will cover that later on in this article. To compile and package the Spark application, we will be using sbt, a build tool for Scala. Likewise, you don't need to download and install the Spark libraries (jars), as sbt will take care of that, which will also be covered later in this article.

As a prerequisite for this tutorial, you will need to download and install Docker on your development machine. The Docker web site has quick and easy instructions for installing Docker on various platforms: here. Follow the instructions to install Docker and verify the installation to ensure things are working; this is crucial before continuing with this tutorial.
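For example, one quick way to verify the installation (any of the checks from Docker's own documentation will do) is to run the hello-world image:

docker run hello-world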

Another prerequisite for this tutorial is a Java Development Kit (JDK), version 8. Install the Oracle JDK by downloading it from here, or OpenJDK from here.

The final prerequisite is sbt, version 0.13.x, which can be downloaded here. Instructions for installing it are on that site.

BUILD

We will start by building the Spark application, a simple app that computes the average of a sequence of numbers. The source code for the Spark application can be downloaded from GitHub: here

Or you can create the files as we walk through the code below. Create a directory for the project called "spark-app" and the sub-directory hierarchy as presented below:

# Your directory layout should look like this
$ find spark-app
spark-app
spark-app/build.sbt
spark-app/src
spark-app/src/main
spark-app/src/main/docker
spark-app/src/main/docker/deploy
spark-app/src/main/docker/Dockerfile
spark-app/src/main/scala
spark-app/src/main/scala/SparkApp.scala

Create a build definition file called "build.sbt" at the root level of the project directory (if not downloading from Github).

Listing for build.sbt

name := "Spark App"

version := "0.0.1"

scalaVersion := "2.11.8"

val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
)

The build definition file is used by sbt to compile the source code and package the binaries into a jar file. The build definition provided in the listing above is minimal. It includes the name of the application, "Spark App", and the version of the application, "0.0.1". It also pins the versions of the Scala compiler and the Spark libraries that sbt uses to compile and build the application. Lastly, the spark-core and spark-hive libraries are included as library dependencies. The scope of these dependencies is set to "provided" because we do not want to package them into our application jar; the Spark runtime environment supplies these jars when the application executes.
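As an aside, the SparkSession API used below lives in the spark-sql module, which spark-hive pulls in transitively. If you prefer to declare it explicitly, a line such as the following (still scoped as "provided") could be added alongside the other dependencies in build.sbt:

libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"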

Create a Scala source file called "SparkApp.scala" under 'src/main/scala' sub-directory (if not downloading from Github).

Listing for SparkApp.scala

/* SparkApp.scala */
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Dataset, SparkSession}

object SparkApp {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark App")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    import spark.implicits._
    val dataset = spark.range(1, 101, 1, 2)
    val avg = dataset.agg("id" -> "avg").head.getAs[Double](0)
    spark.stop()

    println(s"Spark App average : $avg")
  }
}

The 'SparkApp.scala' source file declares an object named "SparkApp". For those new to Scala, an object declares a singleton, i.e., a class with a single instance. Readers familiar with the Java programming language but new to Scala can refer to this introduction to Scala for Java programmers: here. The SparkApp object declaration includes a main method, which is invoked when the Spark application is executed.
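For readers new to Scala, here is a minimal, self-contained example of a singleton object with a main method (not part of the tutorial's code); it plays the role that a class with a static main method plays in Java:

// A singleton object: Scala's counterpart to a Java class with static members.
object Hello {
  def main(args: Array[String]): Unit = println("Hello from a singleton object")
}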

A quick overview of the Scala source code: the first two lines in the main method declare two values (final variables). The first value, sparkConf, defines a Spark configuration that includes the application name, "Spark App". The second value, spark, instantiates a Spark session, the main entry point to Spark. Starting with Apache Spark 2.0, a Spark session combines the previously separate entry points, such as the SQL context and Hive context, and also exposes the underlying Spark context. Once a session to Spark is established, the APIs provided by these contexts are available through it.
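As an illustrative sketch (separate from SparkApp.scala), the snippet below shows how those older entry points are reachable from the unified session; the object name and application name here are made up for the example:

import org.apache.spark.sql.SparkSession

object SessionSketch {
  def main(args: Array[String]): Unit = {
    // Build a local session just for illustration.
    val spark = SparkSession.builder().appName("Session Sketch").master("local[2]").getOrCreate()

    val sc = spark.sparkContext              // the underlying SparkContext
    println(s"Spark version ${sc.version}")  // e.g. 2.1.0
    spark.sql("SELECT 1 AS one").show()      // SQL without creating a separate SQLContext

    spark.stop()
  }
}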

Back in SparkApp.scala, the following three lines provide the main functionality we are after: generating a sequence of numbers and computing their average. The code is concise and straightforward. It should be noted, though, that our goal in this tutorial is not lofty; we are, after all, building a very basic Spark application, a "Hello World" for Spark development.

The first of these three lines imports the members of the session's implicits object (of type SQLImplicits). The next line generates a dataset of numbers (longs, to be precise) in the range from 1 (inclusive) to 101 (exclusive) with a step value of 1. The final parameter to the range method, 2, indicates the number of partitions for the dataset. Partitioning is a common technique in distributed computing for dividing a large dataset into smaller partitions, which facilitates distributing the workload across multiple worker nodes. In our example, we have created a small dataset of just 100 values, so partitioning is immaterial. For larger datasets, partitioning is key to distributing and parallelizing the workload across the nodes of the cluster.
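To make the partitioning concrete, the short, self-contained sketch below (not part of SparkApp.scala) inspects and then changes the number of partitions of the same range:

import org.apache.spark.sql.SparkSession

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Partition Sketch").master("local[2]").getOrCreate()

    val dataset = spark.range(1, 101, 1, 2)               // same range as SparkApp, 2 partitions
    println(dataset.rdd.getNumPartitions)                  // prints 2

    // repartition redistributes the rows; useful when a dataset is large enough
    // to be spread across many worker nodes.
    println(dataset.repartition(4).rdd.getNumPartitions)   // prints 4

    spark.stop()
  }
}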

Returning to the listing, the last of these three lines calculates the average of the numbers in the dataset by calling the `agg` function (short for aggregate). The `agg` function aggregates the dataset's "id" column (named in the function call) by computing its average (indicated by "avg" in the function call). The `agg` function returns another dataset (a DataFrame) with a single row and a single column containing the computed average. The chained calls to the `head` and `getAs` functions retrieve this value.
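For comparison, the same average can be computed with the column-based avg function from org.apache.spark.sql.functions; the sketch below (again separate from SparkApp.scala) shows both forms side by side:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object AverageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Average Sketch").master("local[2]").getOrCreate()

    val dataset = spark.range(1, 101, 1, 2)

    val viaMap    = dataset.agg("id" -> "avg").head.getAs[Double](0)  // the form used in SparkApp
    val viaColumn = dataset.agg(avg("id")).head.getAs[Double](0)      // column-function form

    println(s"$viaMap == $viaColumn")  // both are 50.5

    spark.stop()
  }
}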

Finally, back in SparkApp.scala, the Spark session is closed by calling the `stop` method, which ends the session and releases the resources it used. The last line of code prints the computed average, which we will see later on.

Create an instructions file for Docker called "Dockerfile" under 'src/main/docker' sub-directory (if not downloading from Github). We will cover this file piecewise.

The first instruction in the Dockerfile (shown below) creates a new Docker image starting from the official Docker image of OpenJDK, version 8, built on top of Alpine Linux. Alpine Linux is a lightweight Linux distribution that is a popular choice for Docker base images.

FROM openjdk:8-jdk-alpine

The next set of instructions in the Dockerfile (shown below) defines environment variables for downloading and installing Apache Spark, version 2.1, in the Docker container.

ENV INSTALL_DIR /usr/local
ENV SPARK_HOME $INSTALL_DIR/spark
ENV SPARK_VERSION 2.1.0
ENV HADOOP_VERSION 2.7
ENV SPARK_TGZ_URL https://www.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz

The subsequent instruction in the Dockerfile updates the Alpine Linux packages. Additionally, bash and curl are installed: curl is needed to download Spark, and bash is needed by Spark's launch scripts.

RUN apk update \
      && apk upgrade \
      && apk add --update bash \
      && apk add --update curl \
      && rm -rf /var/cache/apk/*

The next set of instructions in the Dockerfile (shown below) downloads and installs Spark. Additionally, the Spark installation directory is renamed to 'spark' and the downloaded tarball is deleted.

WORKDIR $INSTALL_DIR

RUN set -x \
      && curl -fSL "$SPARK_TGZ_URL" -o spark.tgz \
      && tar -xzf spark.tgz \
      && mv spark-* spark \
      && rm spark.tgz

The following set of instructions in the Dockerfile (shown below) defines environment variables used to deploy our Spark application in the Docker container.

ENV APP_HOME /opt/sparkapp
ENV APP_VERSION 0.0.1
ENV APP_SCALA_VERSION 2.11
ENV APP_JAR spark-app_$APP_SCALA_VERSION-$APP_VERSION.jar

The set of instructions shown below deploys our Spark application into the Docker container. Finally, the command to be run when the container starts is defined by the `ENTRYPOINT` instruction, which executes the Spark application in local mode using two cores.

WORKDIR $APP_HOME
ADD deploy $APP_HOME
ENTRYPOINT "$SPARK_HOME/bin/spark-submit" --class SparkApp --master local[2] "$APP_JAR"

Next, we build the Spark application and deploy it to the container. Change to the top-level project directory and execute the following command to compile the Scala source code and package it into a jar file.

sbt package

The above command will generate the application jar, "spark-app_2.11-0.0.1.jar", in the "spark-app/target/scala-2.11/" directory. Copy the application jar to the "spark-app/src/main/docker/deploy/" directory.
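From the top-level project directory, the copy amounts to something like the following (the jar name reflects the project name and versions chosen above):

cp target/scala-2.11/spark-app_2.11-0.0.1.jar src/main/docker/deploy/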

Now, we are ready to deploy the application jar and execute the Spark application. Change to the "spark-app/src/main/docker/" directory and execute the command shown below, which builds a Docker image named "my-spark-app" using our Dockerfile.

docker build -t my-spark-app .

The next command creates a container from the image we built in the step above and runs it, executing the command declared in the `ENTRYPOINT` instruction. As mentioned above, the `ENTRYPOINT` instruction is declared to execute the Spark application.

docker run -it --rm my-spark-app

The output from the execution of the Spark application will be displayed in the terminal window and will conclude with the following lines:

INFO SparkContext: Successfully stopped SparkContext
Spark App average : 50.5

The computed average is printed to the terminal window, as shown above. We have now built, deployed, and executed a Spark application in a Docker container.
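As an optional aside, you can poke around inside the image without running the application by overriding the entrypoint (bash is available because we installed it in the Dockerfile):

docker run -it --rm --entrypoint /bin/bash my-spark-app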

NEXT STEPS

The next logical step, building on this simple introduction to Apache Spark development, is to set up a cluster of Apache Spark nodes deployed as Docker containers. Watch out for the next article in this series, which will cover that topic.
