TUTORIAL
An introductory guide to development with Apache Spark 2.1 and Docker
Apache Spark is a fast, general-purpose engine for large-scale data processing that runs on cluster managers such as Hadoop YARN or Mesos, in the cloud, or in standalone mode. Docker is an open-source platform for building, deploying, and running distributed applications.
This installment of Distributed Computing introduces application development for the Spark data processing engine using the Scala programming language. In this gentle introduction, we will build a simple "Hello World" application for Spark. Readers familiar with the Java programming language should be able to follow the steps in this tutorial.
We will package the Spark application, then deploy and execute it in a Docker container. Dockerized applications can be run locally on a developer's machine in much the same way as production applications deployed to cloud infrastructure, which lends itself to rapid development and continuous integration.
PREREQUISITES
In this tutorial, we will be working with Apache Spark version 2.1 (the latest version at the time of writing). You do not need to download and install Apache Spark yourself; that will be handled by the Docker container, as covered later in this article. To compile and package the Spark application, we will use sbt, a build tool for Scala. Likewise, we do not need to download the Spark libraries (jars) manually, as sbt will fetch them; this too is covered later in this article.

As a prerequisite for this tutorial, you will need to download and install Docker on your development machine. The Docker website has quick and easy instructions for installing Docker on various platforms: here. Follow the instructions to install Docker and verify the installation to ensure things are working. This is crucial before continuing with this tutorial.
Another prerequisite for this tutorial is a Java Development Kit (JDK), version 8. Install the Oracle JDK by downloading it from here, or OpenJDK from here.
The last prerequisite is sbt, version 0.13.x, which can be downloaded here; installation instructions are on that site.
BUILD
We will start by building the Spark application, a simple app that computes the average of a sequence of numbers. The source code for the Spark application can be downloaded from GitHub: here. Alternatively, you can create the files as we walk through the code below. Create a directory for the project called "spark-app" with the sub-directory hierarchy presented below:
# Your directory layout should look like this
$ find spark-app
spark-app
spark-app/build.sbt
spark-app/src
spark-app/src/main
spark-app/src/main/docker
spark-app/src/main/docker/deploy
spark-app/src/main/docker/Dockerfile
spark-app/src/main/scala
spark-app/src/main/scala/SparkApp.scala
Create a build definition file called "build.sbt" at the root level of the project directory (if not downloading from Github).
Listing for build.sbt
name := "Spark App"

version := "0.0.1"

scalaVersion := "2.11.8"

val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
)
The build definition file is used by sbt to compile the source code and package the binaries into a jar file. The build definition provided in the listing above is minimal. It includes the name of the application, "Spark App", and its version, "0.0.1". It also pins the versions of the Scala compiler and the Spark libraries that sbt uses to compile and build the application. Lastly, the spark-core and spark-hive libraries are included as library dependencies. The scope of these dependencies is set to "provided" because we do not want to package them into our application jar; the Spark runtime environment supplies these jars when the application executes.
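To illustrate the effect of the "provided" scope, compare an ordinary dependency, which fat-jar tools such as sbt-assembly would bundle into the application jar, with a provided one, which stays on the compile classpath but is excluded from packaging. This is a hedged sketch; the scopt coordinates below are hypothetical and not part of this project's build:

```scala
// Ordinary (default) scope: bundled by fat-jar tools such as sbt-assembly.
// (Hypothetical dependency, for illustration only.)
libraryDependencies += "com.github.scopt" %% "scopt" % "3.5.0"

// "provided" scope: available at compile time, but assumed to be supplied
// by the runtime (here, the Spark installation) and excluded from packaging.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
```

Note that plain `sbt package` never bundles dependencies into the jar; the "provided" scope matters when a fat-jar plugin is used, and it documents the intent either way.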
Create a Scala source file called "SparkApp.scala" under 'src/main/scala' sub-directory (if not downloading from Github).
Listing for SparkApp.scala
/* SparkApp.scala */
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Dataset, SparkSession}

object SparkApp {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark App")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    import spark.implicits._

    val dataset = spark.range(1, 101, 1, 2)
    val avg = dataset.agg("id" -> "avg").head.getAs[Double](0)

    spark.stop()
    println(s"Spark App average : $avg")
  }
}
The 'SparkApp.scala' source file declares an object named "SparkApp". For those new to Scala, an object declares a singleton, i.e., a class with exactly one instance. Readers familiar with the Java programming language and new to Scala can refer to this introduction to Scala for Java programmers: here. The SparkApp object declaration includes a main method, which is invoked when the Spark application is executed.
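As a minimal illustration of the singleton pattern (unrelated to Spark), the snippet below defines a hypothetical Greeter object and calls a method on it directly by name, with no `new` involved:

```scala
// A Scala object is a lazily initialized singleton: exactly one instance,
// created the first time the object is referenced.
object Greeter {
  val greeting = "Hello"
  def greet(name: String): String = s"$greeting, $name!"
}

// Callers reference the single instance directly by its name.
println(Greeter.greet("Spark"))  // prints "Hello, Spark!"
```

Our SparkApp object works the same way: the runtime invokes `SparkApp.main` on the one and only instance.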
Now for a quick overview of the Scala source code. The first two lines in the main method declare two values (final variables). The first, sparkConf, defines a Spark configuration that includes the application name, "Spark App". The second, spark, instantiates a Spark session, the main entry point to Spark. Starting with Apache Spark 2.0, a Spark session combines the previously separate contexts (such as the SQL context and Hive context). Once a session to Spark is established, the APIs of those contexts are available through it.
The following three lines in the source file provide the main functionality we are after: generating a sequence of numbers and computing their average. The code is concise and straightforward. It should be noted that our goal in this tutorial is modest; we are, after all, building a very basic Spark application, a "Hello World" of Spark development.
The first of these three lines imports the implicit conversions (the `implicits` object) from the Spark session. The next line generates a dataset of numbers (longs, to be precise) in the range from 1 (inclusive) to 101 (exclusive) with a step of 1. The final parameter to the range method, 2, indicates the number of partitions for the dataset. Partitioning is a common technique in distributed computing: a large dataset is divided into smaller partitions, which facilitates distributing the workload among multiple worker nodes. In our example, we have created a small dataset of just 100 values, so partitioning is immaterial. For larger datasets, partitioning is key to distributing and parallelizing the workload across the nodes of the cluster.
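The idea behind those two partitions can be sketched in plain Scala, with no Spark involved. This mirrors the concept only, not Spark's internal partitioning code:

```scala
// Plain-Scala sketch of the idea behind spark.range(1, 101, 1, 2):
// the values 1..100 are split into 2 partitions, each of which Spark
// could hand to a different worker for parallel processing.
val values = (1L until 101L).toVector            // 1 to 100, like range(1, 101, 1)
val numPartitions = 2
val partitions = values.grouped(values.size / numPartitions).toVector

println(s"partition 0: ${partitions(0).head}..${partitions(0).last}")  // 1..50
println(s"partition 1: ${partitions(1).head}..${partitions(1).last}")  // 51..100
```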
The last of these three lines calculates the average of the numbers in the dataset by calling the `agg` function (short for aggregate). The `agg` function aggregates the "id" column of the dataset (the label in the function call) by computing the average (indicated by "avg"). It returns a new DataFrame, i.e., another dataset, with a single row and a single column containing the computed average. Chaining calls to `head` and `getAs` retrieves this value.
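For intuition, the aggregation is equivalent to the following local computation in plain Scala (a sketch with no Spark involved):

```scala
// Locally compute what dataset.agg("id" -> "avg") computes in Spark:
// the mean of the longs 1 through 100.
val numbers = (1L to 100L).toVector
val avg = numbers.sum.toDouble / numbers.size
println(s"average = $avg")  // average = 50.5
```

Spark performs the same arithmetic, but each partition computes a partial sum and count, and the partial results are combined, which is what lets the computation scale across a cluster.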
Finally, the Spark session is closed by calling the `stop` method, which releases the resources used during the session. The last line of code prints the computed average, which we will see in the output later on.
Create an instructions file for Docker called "Dockerfile" under 'src/main/docker' sub-directory (if not downloading from Github). We will cover this file piecewise.
The first instruction in the Dockerfile (shown below) starts the new Docker image from the official OpenJDK 8 image built on Alpine Linux. Alpine Linux is a lightweight Linux distribution that is currently a popular choice for Docker containers.
FROM openjdk:8-jdk-alpine
The next set of instructions in the Dockerfile (shown below) define environment variables for downloading and installing Apache Spark, version 2.1, in the Docker container.
ENV INSTALL_DIR /usr/local
ENV SPARK_HOME $INSTALL_DIR/spark
ENV SPARK_VERSION 2.1.0
ENV HADOOP_VERSION 2.7
ENV SPARK_TGZ_URL https://www.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz
The next instruction in the Dockerfile upgrades the Alpine Linux packages. Additionally, bash and curl are installed, as these are needed later for installing and running Spark.
RUN apk update \
 && apk upgrade \
 && apk add --update bash \
 && apk add --update curl \
 && rm -rf /var/cache/apk/*
The next set of instructions in the Dockerfile (shown below) download and install Spark. Additionally, the Spark installation directory is renamed to 'spark' and the downloaded tarball is deleted.
WORKDIR $INSTALL_DIR

RUN set -x \
 && curl -fSL "$SPARK_TGZ_URL" -o spark.tgz \
 && tar -xzf spark.tgz \
 && mv spark-* spark \
 && rm spark.tgz
The following set of instructions in the Dockerfile (shown below) define environment variables useful for deploying our Spark application in the Docker container.
ENV APP_HOME /opt/sparkapp
ENV APP_VERSION 0.0.1
ENV APP_SCALA_VERSION 2.11
ENV APP_JAR spark-app_$APP_SCALA_VERSION-$APP_VERSION.jar
The set of instructions shown below deploys our Spark application into the Docker container. Finally, the `ENTRYPOINT` instruction defines the command performed when the container is run: it executes the Spark application in local mode using two cores.
WORKDIR $APP_HOME

ADD deploy $APP_HOME

ENTRYPOINT "$SPARK_HOME/bin/spark-submit" --class SparkApp --master local[2] "$APP_JAR"
Next, we build the Spark application and deploy it to the container. Change to the top-level project directory and execute the following command to compile the Scala source code and package it into a jar file.
sbt package
The above command will generate the application jar, "spark-app_2.11-0.0.1.jar", in the "spark-app/target/scala-2.11/" directory. Copy the application jar to the "spark-app/src/main/docker/deploy/" directory.
Now, we are ready to deploy the application jar and execute the Spark application. Change to the "spark-app/src/main/docker/" directory and execute the command shown below, which builds a Docker image named "my-spark-app" using our Dockerfile.
docker build -t my-spark-app .
Then run the container with the command below, which executes the Spark application and removes the container when it exits.

docker run -it --rm my-spark-app
The output from the execution of the Spark application will be displayed in the terminal window and will conclude with the following output:
INFO SparkContext: Successfully stopped SparkContext
Spark App average : 50.5
The computed average is printed to the terminal window, as shown above. We have built, deployed, and executed a Spark application in a Docker container.