SPARK-GRADLE-RUNNER — A Gradle plugin to simplify submitting Spark applications

Problem Statement

Launching a Spark application onto a cluster (YARN/Mesos) typically requires the following common tasks:

  1. Build the jar.
  2. Bundle dependencies.
  3. Set the required system/environment variables such as SPARK_HOME and HADOOP_HOME.
  4. Compose the spark-submit command (with all its options) and hope for the best (a typical example is shown below).
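For context, a hand-rolled submission of the sample application used later in this article would look roughly like the command below; the jar path and the exact options are illustrative, not something produced by the plugin.

spark-submit \
  --class com.hari.gradle.spark.plugin.test.cluster.SrcToTgt \
  --name SrcToTgt \
  --master yarn \
  --deploy-mode cluster \
  build/libs/my-spark-app.jar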

Prerequisites

If you are reading this, I assume you are already a Spark developer who writes Spark applications in Scala or Java and has at least basic knowledge of Gradle, so I will go straight into the system/software requirements for using this tool.

  1. Gradle (version 5.6.1 and above). Installation is simple: download the zip and add its bin directory to the PATH environment variable (see https://gradle.org/install/ for details). The Gradle plugin for Eclipse (https://projects.eclipse.org/projects/tools.buildship/downloads) is also very useful for creating Gradle projects in Eclipse and for running the “spark-gradle-plugin”.
  2. winutils.exe (Windows developers only) — Windows binaries for Hadoop. https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe is the binary for Hadoop 2.7.1; substitute the corresponding version in the hadoop-{version} part of the URL for other releases. This is required for a Windows machine to interact with a Hadoop cluster.
  3. A Spark distribution built with Hadoop support, from https://archive.apache.org/dist/spark/ (preferred: https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz).
  4. M2_HOME — Point it to the location containing the Maven settings.xml file and make sure a localRepository is configured there, so dependencies are downloaded and cached in a local repository for faster builds (a sketch of the resulting environment setup follows this list).
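As a rough sketch, on a Windows machine these prerequisites boil down to a few environment variables such as the ones below. The Spark and Hadoop paths mirror the settings block later in this article; the M2_HOME location is only an example of a directory containing settings.xml, and winutils.exe goes under %HADOOP_HOME%\bin.

setx SPARK_HOME "C:\Softwares\Spark2.4.0"
setx HADOOP_HOME "C:\Softwares\Hadoop_Home"
setx M2_HOME "C:\Users\harim\.m2"

Also add %HADOOP_HOME%\bin and the Gradle bin directory to the PATH environment variable.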

Solution overview

The utility is built on top of the Gradle build framework (https://docs.gradle.org/current/userguide/what_is_gradle.html), an absolute winner compared to Maven and truly a programmer’s build system. “spark-gradle-plugin” is a custom Gradle plugin that provides the following tasks, which run in the order listed below.

  1. downloadDependencies — Downloads the dependency jars of the project into an output directory (${PROJECT_HOME}/build/jobDeps/), so that they can be bundled and shipped with the application.
  2. prepareClusterSubmit — Creates a zip file from the output directory of the “downloadDependencies” task, then uses the site configuration files provided as part of “hadoopConf” (we will talk about the configuration part later) and uploads the zip file to the location specified as “jarZipDestPath”, which is set up as YARN’s distributed cache and used as the classpath for the driver and executors.
  3. launchSpark — Launches a Spark application (in local mode or YARN cluster mode). This is the last task in the task hierarchy, so running it also runs the previous tasks, downloadDependencies and prepareClusterSubmit (see the example commands below).
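Since launchSpark sits at the end of the chain, a single invocation drives the whole flow; the tasks can also be run individually (the same commands reappear in the Cheatcode section):

gradlew -i downloadDependencies
gradlew -i prepareClusterSubmit
gradlew -b build.gradle -i launchSpark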

Build.gradle

build.gradle is at the heart of the Gradle build system; it contains the configuration that determines how Gradle tasks run. The consumer of this plugin is required to create a Gradle project through the IDE of their choice.

buildscript {
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath 'io.github.krishari2020:spark-gradle-plugin:1.1.7'
    }
}

apply plugin: 'scala'
apply plugin: 'spark-gradle-plugin'
apply plugin: 'eclipse'

// Project-level repository so the application dependencies below can be resolved.
repositories {
    mavenCentral()
}

dependencies {
    compile 'org.apache.spark:spark-sql_2.11:2.4.0'
    implementation 'org.scala-lang:scala-library:2.11.8'
    runtime 'org.apache.spark:spark-launcher_2.11:2.4.0',
            'com.sun.jersey:jersey-client:1.9',
            'org.apache.spark:spark-yarn_2.11:2.4.0',
            'io.github.krishari2020:spark-gradle-plugin:1.1.7'
}

settings {
    mainClass 'com.hari.gradle.spark.plugin.test.cluster.SrcToTgt'
    sparkHome 'C:\\Softwares\\Spark2.4.0'
    hadoopHome 'C:\\Softwares\\Hadoop_Home'
    scalaVersion '2.11'
    hadoopConf 'C:\\Users\\harim\\Documents\\ClusterClientConfigs\\TestHadoopHome'
    master 'yarn'
    mode 'cluster'
    jarZipDestPath '/tmp/spark_gradle_plugin/jars.zip'
    appName 'SrcToTgt'
}
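For completeness, here is a minimal sketch of what a main class such as SrcToTgt could look like; the actual test class in the plugin repository may differ, and the input/output paths below are purely illustrative.

package com.hari.gradle.spark.plugin.test.cluster

import org.apache.spark.sql.SparkSession

object SrcToTgt {

  def main(args: Array[String]): Unit = {
    // master, deploy mode and app name come from the plugin's settings block,
    // so the application only asks for a plain SparkSession.
    val spark = SparkSession.builder().getOrCreate()
    try {
      // Illustrative source-to-target copy: read from one path, write to another.
      val src = spark.read.parquet("/tmp/source")
      src.write.mode("overwrite").parquet("/tmp/target")
    } finally {
      spark.stop()
    }
  }
}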

Cheatcode

  1. If you ever see a NoClassDefFoundError in stdErr.txt, it means ${SPARK_HOME}/jars is missing some dependency jars; the most common missing ones I have encountered are the Jersey jars. This can be fixed by running “gradlew -i downloadDependencies” (task 1 of the Solution overview) and copying “jersey-core-1.9.jar”, “jersey-client-1.9.jar”, “jersey-json-1.9.jar” and “jersey-server-1.9.jar” into the ${SPARK_HOME}/jars/ folder. The easier hack is to copy all jars from the ${PROJECT_HOME}/build/jobDeps/ folder into the ${SPARK_HOME}/jars folder, but do ensure that the versions of the added/replaced jars match.
  2. To skip “prepareClusterSubmit”, which gets invoked when “gradlew -b build.gradle -i launchSpark” is executed (only if you are confident the distributed cache is up to date and none of the required jars have changed), use “gradlew -x prepareClusterSubmit -i launchSpark” instead.
  3. If there are inconsistencies, just brute-force it with “gradlew -i launchSpark --rerun-tasks”, which forces all tasks to run again.
  4. Jar conflicts can be avoided by adding a dependency exclusion. For example, if the jackson-core module conflicts with an existing jar, an exclusion can be added to build.gradle as follows: configurations.compile.exclude group: 'com.fasterxml.jackson.core', module: 'jackson-core'

Future Enhancements

  1. Support Kerberos-enabled clusters.
  2. Support launching Spark applications on Kubernetes (k8s) clusters.

License

https://www.apache.org/licenses/LICENSE-2.0
