Now you can save all configurations and start the VM.
Firewall Rules:
You can define firewall rules to open the network for a specific port, such as 8090, on which Spark Job Server runs.
Go to the VPC Network menu in the left sidebar of the GCP console and click the “Firewall Rules” link:

Here you can define the firewall rule name and target tags as required; in this example both are set to jobserver.
The source IP range should be 0.0.0.0/0 and the protocol and ports should be tcp:8090.
Now save the details.
Finally, the jobserver rule can be used as a network tag while defining the VM instance, since this rule opens the network for port 8090 on the VM.
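If you prefer the command line, roughly the same rule can be created with the gcloud CLI. This is only a sketch; the rule name, network, and tag below are assumptions that mirror the console steps above:
gcloud compute firewall-rules create jobserver \
    --network=default \
    --direction=INGRESS \
    --allow=tcp:8090 \
    --source-ranges=0.0.0.0/0 \
    --target-tags=jobserver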
You can access the VM's SSH terminal from the console. This opens a PuTTY-like command prompt in the browser itself.
The Dataproc cluster master VM was created with the 1.2-debian9 image, which ships a specific set of software versions (https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.2).
Hence, we need to install matching versions of this software on our Ubuntu 18.04 LTS VM.
The version details are given below:
· Java 1.8
· Hadoop 2.8.5
· SBT 0.13.11
· Spark 2.2.3 (prebuilt without Hadoop)
· Spark Job Server 0.8.0
First of all, log in as the root user.
Install wget:
sudo apt-get install wget
Install OpenJDK 1.8:
sudo apt-get install openjdk-8-jdk
Set JAVA_HOME in the bashrc file:
vi ~/.bashrc
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
Now load the bashrc file:
source ~/.bashrc
Install Hadoop 2.8.5 under /opt/.
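The download itself can be done, for example, from the Apache archive (the mirror URL is an assumption; any Hadoop 2.8.5 mirror works):
wget -P /tmp "https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz"
tar -xzf /tmp/hadoop-2.8.5.tar.gz -C /opt/
rm /tmp/hadoop-2.8.5.tar.gz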
Set the environment variables for Hadoop:
vi ~/.bashrc
export HADOOP_HOME="/opt/hadoop-2.8.5"
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
export HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
source ~/.bashrc
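As a quick sanity check (hadoop is not on the PATH yet at this point, so call it through HADOOP_HOME):
$HADOOP_HOME/bin/hadoop version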
Set the environment variable for SBT:
export SBT_VERSION=0.13.11
Install SBT 0.13.11:
wget -P /tmp "http://dl.bintray.com/sbt/native-packages/sbt/${SBT_VERSION}/sbt-${SBT_VERSION}.tgz"
tar -xzf /tmp/sbt-${SBT_VERSION}.tgz -C /usr/local/
ln -sf /usr/local/sbt/bin/sbt /usr/local/bin/sbt
rm /tmp/sbt-${SBT_VERSION}.tgz
Set the environment variables for the Spark installation:
export HADOOP_VERSION=2.8.5
export SPARK_VERSION=2.2.3
export SPARK_HADOOP_BIN=without-hadoop
Install Spark 2.2.3 (the prebuilt package without Hadoop) under /opt/spark-2.2.3.
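A sketch of the download, using the Apache archive and the "without Hadoop" package listed in the version list above (the mirror URL is an assumption), renaming the extracted directory so it matches SPARK_HOME:
wget -P /tmp "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz"
tar -xzf /tmp/spark-${SPARK_VERSION}-bin-without-hadoop.tgz -C /opt/
mv /opt/spark-${SPARK_VERSION}-bin-without-hadoop /opt/spark-${SPARK_VERSION}
rm /tmp/spark-${SPARK_VERSION}-bin-without-hadoop.tgz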
Now set SPARK_HOME and the other variables in bashrc:
vi ~/.bashrc
export SPARK_HOME=/opt/spark-2.2.3
export SPARK_DIST_CLASSPATH=/opt/hadoop-2.8.5/etc/hadoop:/opt/hadoop-2.8.5/share/hadoop/common/lib/*:/opt/hadoop-2.8.5/share/hadoop/common/*:/opt/hadoop-2.8.5/share/hadoop/hdfs:/opt/hadoop-2.8.5/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.5/share/hadoop/hdfs/*:/opt/hadoop-2.8.5/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.5/share/hadoop/yarn/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/*:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/opt/hadoop-2.8.5/contrib/capacity-scheduler/*.jar:$SPARK_HOME/jars/*.jar
source ~/.bashrc
Set the environment variables in bashrc for Spark Job Server:
export JOBSERVER_VERSION=v0.8.0
export SPARK_JOBSERVER_VERSION=0.8.0
Install Spark Job Server 0.8.0:
wget -q https://github.com/spark-jobserver/spark-jobserver/archive/v${SPARK_JOBSERVER_VERSION}.tar.gz
tar -xzf v${SPARK_JOBSERVER_VERSION}.tar.gz
mv spark-jobserver-${SPARK_JOBSERVER_VERSION} /opt/spark-jobserver
rm v${SPARK_JOBSERVER_VERSION}.tar.gz
Now extend the PATH variable in the bashrc file:
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/sbin:$SPARK_HOME/bin
To check whether the environment variables are set correctly:
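For example, reload the file and print a few of the variables (these exact checks are only a suggestion):
source ~/.bashrc
echo $JAVA_HOME
echo $HADOOP_HOME
echo $SPARK_HOME
which hadoop spark-submit sbt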
Create a file named spark-defaults.conf in the /opt/spark-2.2.3/conf directory and add the lines below to it.
Make sure the master is yarn and the deploy mode is client.
The Spark jars location should be the HDFS address of the cluster master VM, where all the Spark jars are located.
Note: Replace <cluster-master-name> with your actual master VM instance name.
spark.master yarn
spark.submit.deployMode client
spark.yarn.jars=hdfs://<cluster-master-name>/user/spark/jars/*.jar
spark.eventLog.enabled true
spark.eventLog.dir hdfs://<cluster-master-name>/user/spark/eventlog
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.executor.instances 10000
spark.dynamicAllocation.maxExecutors 10000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0
spark.yarn.historyServer.address <cluster-master-name>:18080
spark.history.fs.logDirectory hdfs://<cluster-master-name>/user/spark/eventlog
spark.driver.extraJavaOptions -Dflogger.backend_factory=com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance
spark.executor.extraJavaOptions -Dflogger.backend_factory=com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance
spark.sql.parquet.cacheMetadata=false
spark.executor.cores=1
spark.executor.memory=3712m
spark.driver.memory=1920m
spark.driver.maxResultSize=960m
spark.yarn.am.memory=640m
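Once the Hadoop client configuration described below is in place, you can verify that the spark.yarn.jars location actually exists on the cluster master's HDFS (a suggested sanity check; it assumes <cluster-master-name> resolves from this VM):
hadoop fs -ls hdfs://<cluster-master-name>/user/spark/jars/ | head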
Add HADOOP_CONF_DIR to the spark-env.sh file located at /opt/spark-2.2.3/conf:
export HADOOP_CONF_DIR="/opt/hadoop-2.8.5/etc/hadoop"
Add the YARN ResourceManager hostname (in our case, the cluster master VM hostname) and other properties to the yarn-site.xml file located at /opt/hadoop-2.8.5/etc/hadoop:
Note:
Replace <cluster-master-name> with your actual master VM instance name.
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value><cluster-master-name></value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
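With this in place you can verify connectivity from the VM to the cluster's ResourceManager, for example (a suggested check, assuming the PATH changes above have been applied):
yarn node -list
yarn application -list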
Add the properties below to the core-site.xml file located at /opt/hadoop-2.8.5/etc/hadoop.
Make sure fs.default.name is hdfs://<cluster-master-name>, which is the HDFS address of the master VM in the cluster.
Note:
Replace <cluster-master-name> with your actual master VM instance name.
<configuration>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://<cluster-master-name></value>
<description>The old FileSystem used by FsShell.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://<cluster-master-name></value>
<description>
The name of the default file system. A URI whose scheme and authority
determine the FileSystem implementation. The uri's scheme determines
the config property (fs.SCHEME.impl) naming the FileSystem
implementation class. The uri's authority is used to determine the
host, port, etc. for a filesystem.
</description>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<property>
<name>fs.gs.working.dir</name>
<value>/</value>
<description>
The directory relative gs: uris resolve in inside of the default bucket.
</description>
</property>
<property>
<name>fs.gs.system.bucket</name>
<value><Bucket-name></value>
<description>
GCS bucket to use as a default bucket if fs.default.name is not a gs: uri.
</description>
</property>
<property>
<name>fs.gs.metadata.cache.directory</name>
<value>/hadoop_gcs_connector_metadata_cache</value>
<description>
Only used if fs.gs.metadata.cache.type is FILESYSTEM_BACKED, specifies
the local path to use as the base path for storing mirrored GCS metadata.
Must be an absolute path, must be a directory, and must be fully
readable/writable/executable by any user running processes which use the
GCS connector.
</description>
</property>
<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
<description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
<name>fs.gs.project.id</name>
<value><name-of-project></value>
<description>
Google Cloud Project ID with access to configured GCS buckets.
</description>
</property>
<property>
<name>fs.gs.metadata.cache.enable</name>
<value>false</value>
<final>false</final>
<source>Dataproc Cluster Properties</source>
</property>
<property>
<name>fs.gs.implicit.dir.infer.enable</name>
<value>true</value>
<description>
If set, we create and return in-memory directory objects on the fly when
no backing object exists, but we know there are files with the same
prefix.
</description>
</property>
<property>
<name>fs.gs.application.name.suffix</name>
<value>-dataproc</value>
<description>
Appended to the user-agent header for API requests to GCS to help identify
the traffic as coming from Dataproc.
</description>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: (GCS) uris.</description>
</property>
<property>
<name>fs.gs.metadata.cache.type</name>
<value>FILESYSTEM_BACKED</value>
<description>
Specifies which implementation of DirectoryListCache to use for
supplementing GCS API &quot;list&quot; requests. Supported
implementations: IN_MEMORY: Enforces immediate consistency within
same Java process. FILESYSTEM_BACKED: Enforces consistency across
all cooperating processes pointed at the same local mirror
directory, which may be an NFS directory for massively-distributed
coordination.
</description>
</property>
<property>
<name>fs.gs.block.size</name>
<value>134217728</value>
<final>false</final>
<source>Dataproc Cluster Properties</source>
</property>
<property>
<name>hadoop.ssl.enabled.protocols</name>
<value>TLSv1,TLSv1.1,TLSv1.2</value>
<final>false</final>
<source>Dataproc Cluster Properties</source>
</property>
</configuration>
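After saving core-site.xml, a quick way to confirm that the HDFS side works is to list a path on the master. (The fs.gs.* properties additionally require the GCS connector jar on the Hadoop classpath, so gs:// paths may not resolve on this VM unless that jar is installed.)
hadoop fs -ls /
hadoop fs -ls hdfs://<cluster-master-name>/user/spark/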
Create a file named yarn.sh at the /opt/spark-jobserver/config location of spark-jobserver and paste the content below into it.
Here it is important that YARN_CONF_DIR and HADOOP_CONF_DIR are both set to /opt/hadoop-2.8.5/etc/hadoop, the Hadoop directory where all the configuration files are available.
#!/usr/bin/env bash
# Environment and deploy file
# For use with bin/server_deploy, bin/server_package etc.
DEPLOY_HOSTS="spark-job-server-ubuntu"
APP_USER=spark
APP_GROUP=spark
JMX_PORT=9999
# optional SSH Key to login to deploy server
#SSH_KEY=/path/to/keyfile.pem
INSTALL_DIR=/usr/local/spark-jobserver
LOG_DIR=/var/log/job-server
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=2.2.3
MAX_DIRECT_MEMORY=512M
SPARK_HOME=/opt/spark-2.2.3
SPARK_CONF_DIR=$SPARK_HOME/conf
YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
SCALA_VERSION=2.11.6
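config/yarn.sh is the environment file read by spark-jobserver's deploy scripts. If you want to build a deployable package from this environment, the usual invocation (run from the spark-jobserver checkout; shown here only as a sketch) is along these lines:
cd /opt/spark-jobserver
bin/server_package.sh yarn
# or, to copy the package to the DEPLOY_HOSTS defined above over SSH:
# bin/server_deploy.sh yarn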
Create a file named yarn.conf at the /opt/spark-jobserver/bin location of spark-jobserver and paste the content below into it. Make sure of the following points:
master is set to "yarn"
submit.deployMode is set to "client"
context-per-jvm is set to true
# Template for a Spark Job Server configuration file
# When deployed these settings are loaded when job server starts
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  # local[...], yarn, mesos://... or spark://...
  master = "yarn"
  # client or cluster deployment
  submit.deployMode = "client"
  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4
  jobserver {
    port = 8090
    context-per-jvm = true
    # Note: JobFileDAO is deprecated from v0.7.0 because of issues in
    # production and will be removed in future, now defaults to H2 file.
    jobdao = spark.jobserver.io.JobSqlDAO
    filedao {
      rootdir = /tmp/spark-jobserver/filedao/data
    }
    datadao {
      # storage directory for files that are uploaded to the server
      # via POST/data commands
      rootdir = /tmp/spark-jobserver/upload
    }
    sqldao {
      # Slick database driver, full classpath
      slick-driver = slick.driver.H2Driver
      # JDBC driver, full classpath
      jdbc-driver = org.h2.Driver
      # Directory where default H2 driver stores its data. Only needed for H2.
      rootdir = /tmp/spark-jobserver/sqldao/data
      # Full JDBC URL / init string, along with username and password. Sorry, needs to match above.
      # Substitutions may be used to launch job-server, but leave it out here in the default or tests won't pass
      jdbc {
        url = "jdbc:h2:file:/tmp/spark-jobserver/sqldao/data/h2-db"
        user = ""
        password = ""
      }
      # DB connection pool settings
      dbcp {
        enabled = false
        maxactive = 20
        maxidle = 10
        initialsize = 10
      }
    }
    # When using chunked transfer encoding with scala Stream job results, this is the size of each chunk
    result-chunk-size = 1m
  }
  # Predefined Spark contexts
  # contexts {
  #   my-low-latency-context {
  #     num-cpu-cores = 1          # Number of cores to allocate. Required.
  #     memory-per-node = 512m     # Executor memory per node, -Xmx style eg 512m, 1G, etc.
  #   }
  #   # define additional contexts here
  # }
  # Universal context configuration. These settings can be overridden, see README.md
  context-settings {
    num-cpu-cores = 2          # Number of cores to allocate. Required.
    memory-per-node = 512m     # Executor memory per node, -Xmx style eg 512m, 1G, etc.
    # In case spark distribution should be accessed from HDFS (as opposed to being installed on every Mesos slave)
    # spark.executor.uri = "hdfs://namenode:8020/apps/spark/spark.tgz"
    # URIs of Jars to be loaded into the classpath for this context.
    # Uris is a string list, or a string separated by commas ','
    # dependent-jar-uris = ["file:///some/path/present/in/each/mesos/slave/somepackage.jar"]
    # Add settings you wish to pass directly to the sparkConf as-is such as Hadoop connection
    # settings that don't use the "spark." prefix
    passthrough {
      #es.nodes = ""
    }
  }
  # This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
  # home = "/home/spark/spark"
}
# Note that you can use this file to define settings not only for job server,
# but for your Spark jobs as well. Spark job configuration merges with this configuration file as defaults.
akka {
  remote.netty.tcp {
    #hostname = ""
    # This controls the maximum message size, including job results, that can be sent
    maximum-frame-size = 100 MiB
  }
}
At the /opt/spark-jobserver/bin location of spark-jobserver, paste the content below into the settings.sh file:
#!/usr/bin/env bash
# Environment and deploy file
# For use with bin/server_deploy, bin/server_package etc.
DEPLOY_HOSTS="spark-jobserver"
APP_USER=spark
APP_GROUP=spark
JMX_PORT=9999
# optional SSH Key to login to deploy server
#SSH_KEY=/path/to/keyfile.pem
INSTALL_DIR=/home/spark/job-server
LOG_DIR=/var/log/job-server
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=2.2.3
MAX_DIRECT_MEMORY=512M
SPARK_HOME=/opt/spark-2.2.3
SPARK_CONF_DIR=$SPARK_HOME/conf
YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
SCALA_VERSION=2.11.6
At the /opt/spark-jobserver/bin location of spark-jobserver, paste the content below into the log4j-server.properties file.
This writes the logs to the /var/log/job-server/ location:
# Rotating log file configuration for server deploys
# Root logger option
log4j.rootLogger=INFO,LOGFILE
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.LOGFILE.File=${LOG_DIR}/spark-job-server.log
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
# log4j.appender.LOGFILE.layout.ConversionPattern=%d %-5p %c - %m%n
log4j.appender.LOGFILE.layout.ConversionPattern=[%d] %-5p %.26c [%X{testName}] [%X{akkaSource}] - %m%n
log4j.appender.LOGFILE.maxFileSize=20MB
log4j.appender.LOGFILE.maxBackupIndex=30
# Settings to quiet spark logs that are too verbose
log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=WARN
Now, as the final step, to run Spark Job Server go to the /opt/spark-jobserver/bin location and run the command below:
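The start script shipped in the spark-jobserver bin/ directory is server_start.sh; with settings.sh and yarn.conf placed in bin/ as described above, the invocation is roughly as follows (the exact arguments may differ between spark-jobserver versions):
cd /opt/spark-jobserver/bin
./server_start.sh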
You can check whether the server has started by using the jps command.
This lists all Java processes running on the system; if one of them is named SparkSubmit, you can confirm that Spark Job Server has started.
To stop Spark Job Server, run the command below from the /opt/spark-jobserver/bin location:
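Assuming the stock spark-jobserver scripts, the corresponding stop script is:
cd /opt/spark-jobserver/bin
./server_stop.sh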
1) While running Spark Job Server, if you get the error below:
Failed to find Spark jars directory (/opt/spark-2.2.3/assembly/target/scala-2.10/jars).
You need to build Spark with the target "package" before running this program.
Solution: This means the jars expected by Spark are missing, so there are two options:
Option 1: Copy the required jars from the Hadoop distribution into the Spark directory.
Option 2 (Recommended): Download the Spark distribution without Hadoop (spark-2.2.3-bin-without-hadoop.tgz) from the Spark website. This package already contains all the required jar files.
2) While starting spark-shell in YARN mode, i.e. running the following command at the terminal:
spark-shell --master yarn
you may get the error below:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/06/26 07:21:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/06/26 07:22:01 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/06/26 07:22:07 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch an application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
Solution: Step 1. Set SPARK_HOME and add $SPARK_HOME/sbin:$SPARK_HOME/bin to PATH:
export SPARK_HOME=/opt/spark-2.2.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/sbin:$SPARK_HOME/bin
Step 2. Add $SPARK_HOME/jars/*.jar to the SPARK_DIST_CLASSPATH variable:
export SPARK_DIST_CLASSPATH=/opt/hadoop-2.8.5/etc/hadoop:/opt/hadoop-2.8.5/share/hadoop/common/lib/*:/opt/hadoop-2.8.5/share/hadoop/common/*:/opt/hadoop-2.8.5/share/hadoop/hdfs:/opt/hadoop-2.8.5/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.5/share/hadoop/hdfs/*:/opt/hadoop-2.8.5/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.5/share/hadoop/yarn/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/*:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/opt/hadoop-2.8.5/contrib/capacity-scheduler/*.jar:$SPARK_HOME/jars/*.jar
==================================================================
3) How to set SPARK_DIST_CLASSPATH as an environment variable in the bashrc file:
Solution:
Option 1 (Recommended):
Run echo $(hadoop classpath); this gives an output like the one below:
/opt/hadoop-2.8.5/etc/hadoop:/opt/hadoop-2.8.5/share/hadoop/common/lib/*:/opt/hadoop-2.8.5/share/hadoop/common/*:/opt/hadoop-2.8.5/share/hadoop/hdfs:/opt/hadoop-2.8.5/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.5/share/hadoop/hdfs/*:/opt/hadoop-2.8.5/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.5/share/hadoop/yarn/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/*:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/opt/hadoop-2.8.5/contrib/capacity-scheduler/*.jar
Now copy this complete classpath and paste it into the bashrc file as below:
export SPARK_DIST_CLASSPATH=/opt/hadoop-2.8.5/etc/hadoop:/opt/hadoop-2.8.5/share/hadoop/common/lib/*:/opt/hadoop-2.8.5/share/hadoop/common/*:/opt/hadoop-2.8.5/share/hadoop/hdfs:/opt/hadoop-2.8.5/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.5/share/hadoop/hdfs/*:/opt/hadoop-2.8.5/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.5/share/hadoop/yarn/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/*:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/opt/hadoop-2.8.5/contrib/capacity-scheduler/*.jar:$SPARK_HOME/jars/*.jar
Option 2:
Add the line below to the bashrc file:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
==================================================================
4) To run Spark Job Server in yarn-client mode, the following variables are necessary in the bashrc file:
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export HADOOP_HOME="/opt/hadoop-2.8.5"
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
export HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop