Tuesday, December 8, 2020

GCP Spark Job Server Setup

Create a GCP VM Instance:


First, create a VM instance on GCP with the configuration given below:
    ·        Machine type: 2 vCPUs, 13 GB memory. You can choose a larger machine type, but the price will vary.
    ·        Image: Ubuntu 18.04 LTS.
    ·        Disk size: at least 50 GB.
    ·        Zone: us-central1-c, or choose the same zone as the Dataproc cluster master VM.
    ·        Firewalls: If you want to reach this VM over HTTP or HTTPS, select Allow for both kinds of traffic.
    ·        Network tags: jobserver and spark-jobserver. These tags open the network for specific ports on the VM. (Note: Creating the firewall rule behind the network tag is described later in this document.)




Cloud API Access Scopes: You can choose which Cloud APIs the VM may access. By default, only a limited set of APIs is accessible. If you want access to all Cloud APIs, choose the “Allow full access to all Cloud APIs” option.



Now save the configuration and start the VM.
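If you prefer the gcloud CLI over the console, a roughly equivalent command is sketched below. The instance name spark-jobserver-vm is only a placeholder; n1-highmem-2 is the 2 vCPU / 13 GB machine type, and --scopes=cloud-platform corresponds to the “Allow full access to all Cloud APIs” option.

# Sketch: create the VM from the command line (adjust names and zone to your project)
gcloud compute instances create spark-jobserver-vm \
    --zone=us-central1-c \
    --machine-type=n1-highmem-2 \
    --image-family=ubuntu-1804-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=50GB \
    --tags=jobserver,spark-jobserver \
    --scopes=cloud-platform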

Firewall Rules:

You can define firewall rules to open the network for a specific port, such as 8090, on which Spark Job Server listens.

Go to the VPC Network menu in the left sidebar of the GCP console and click the “Firewall Rules” link:


Here you define the firewall rule's name and target tags as per your requirement; in this case the rule is named jobserver.

The source IP range should be 0.0.0.0/0 and the protocol and ports should be tcp:8090.


Now save the details.

Finally, this jobserver rule can be used as a network tag when defining the VM instance, since the rule opens the network on port 8090 of the VM.
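The same firewall rule can also be created from the gcloud CLI; a minimal sketch, assuming the default VPC network, is given below.

# Sketch: open tcp:8090 for VMs carrying the jobserver network tag
gcloud compute firewall-rules create jobserver \
    --network=default \
    --allow=tcp:8090 \
    --source-ranges=0.0.0.0/0 \
    --target-tags=jobserver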

SSH Terminal Access:

You can access the VM's SSH terminal from the GCP console. This opens a PuTTY-like command prompt in the browser itself.
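If you prefer a local terminal instead of the browser window, the gcloud CLI can open the same SSH session (instance name and zone below are placeholders):

gcloud compute ssh <instance-name> --zone=us-central1-c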




Software Installation:

The Dataproc cluster master VM was created with the 1.2-debian9 image, which ships specific software versions (https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.2).

Hence, we need to install similar versions of the software on our Ubuntu 18.04 LTS VM.

Version details are given below:

        ·        Java 1.8

        ·        Hadoop 2.8.5

        ·        sbt 0.13.11

        ·        Spark 2.2.3 (build without bundled Hadoop)

        ·        Spark Job Server 0.8.0

Java Installation:

First of all, log in as the root user:

sudo su -

Install wget:

sudo apt-get install wget

Install OpenJDK 8:

sudo apt-get install openjdk-8-jdk

Set JAVA_HOME in the .bashrc file:

vi ~/.bashrc
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

Now reload the .bashrc file:

source ~/.bashrc 
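To confirm the Java installation before moving on, you can check the version and the JAVA_HOME value (the expected output shown in the comments is indicative only):

java -version          # should report openjdk version "1.8.0_..."
echo $JAVA_HOME        # should print /usr/lib/jvm/java-8-openjdk-amd64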


Hadoop Installation:

Install Hadoop 2.8.5 under /opt/:

cd /opt/
curl -O https://archive.apache.org/dist/hadoop/core/hadoop-2.8.5/hadoop-2.8.5.tar.gz
tar -xzf hadoop-2.8.5.tar.gz -C /opt/
rm /opt/hadoop-2.8.5.tar.gz

Set the environment variables for Hadoop:

vi ~/.bashrc
export HADOOP_HOME="/opt/hadoop-2.8.5"
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
export HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
source ~/.bashrc
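A quick sanity check that the Hadoop binaries unpacked correctly:

/opt/hadoop-2.8.5/bin/hadoop version   # should report Hadoop 2.8.5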


SBT Installation:

Set the environment variable for SBT:

export SBT_VERSION=0.13.11

Install SBT 0.13.11:

 wget -P /tmp "http://dl.bintray.com/sbt/native-packages/sbt/0.13.11/sbt-${SBT_VERSION}.tgz"
 tar -xzf /tmp/sbt-${SBT_VERSION}.tgz -C /usr/local/
 ln -sf /usr/local/sbt/bin/sbt /usr/local/bin/sbt 
 rm /tmp/sbt-${SBT_VERSION}.tgz
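To verify the sbt launcher is on the PATH, you can ask it to print its own version (this may take a moment the first time while sbt resolves its dependencies):

sbt sbtVersion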


Spark Installation:

Set the environment variables for the Spark installation:

 export HADOOP_VERSION=2.8.5
 export SPARK_VERSION=2.2.3
 export SPARK_HADOOP_BIN=without-hadoop

Install Spark 2.2.3:

wget -P /tmp "http://archive.apache.org/dist/spark/spark-2.2.3/spark-2.2.3-bin-without-hadoop.tgz"
tar -xzf /tmp/spark-${SPARK_VERSION}-bin-${SPARK_HADOOP_BIN}.tgz
mv spark-${SPARK_VERSION}-bin-${SPARK_HADOOP_BIN} /opt/spark-${SPARK_VERSION}
rm /tmp/spark-${SPARK_VERSION}-bin-${SPARK_HADOOP_BIN}.tgz

Now set SPARK_HOME and the other variables in .bashrc:

vi ~/.bashrc
export SPARK_HOME=/opt/spark-2.2.3
export SPARK_DIST_CLASSPATH=/opt/hadoop-2.8.5/etc/hadoop:/opt/hadoop-2.8.5/share/hadoop/common/lib/*:/opt/hadoop-2.8.5/share/hadoop/common/*:/opt/hadoop-2.8.5/share/hadoop/hdfs:/opt/hadoop-2.8.5/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.5/share/hadoop/hdfs/*:/opt/hadoop-2.8.5/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.5/share/hadoop/yarn/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/*:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/opt/hadoop-2.8.5/contrib/capacity-scheduler/*.jar:$SPARK_HOME/jars/*.jar
source ~/.bashrc
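At this point you can verify that Spark picks up the Hadoop jars from SPARK_DIST_CLASSPATH by checking the version and, optionally, running the bundled SparkPi example in local mode (the examples jar name is matched with a wildcard, assuming the examples jars are present in the distribution):

spark-submit --version
spark-submit --master local[2] \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 10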

Spark Job Server Installation:

Set the environment variables for the Spark Job Server installation:

export JOBSERVER_VERSION=v0.8.0
export SPARK_JOBSERVER_VERSION=0.8.0 

Install Spark Job Server 0.8.0:

wget -q https://github.com/spark-jobserver/spark-jobserver/archive/v${SPARK_JOBSERVER_VERSION}.tar.gz
tar -xzf v${SPARK_JOBSERVER_VERSION}.tar.gz
mv spark-jobserver-${SPARK_JOBSERVER_VERSION} /opt/spark-jobserver
rm v${SPARK_JOBSERVER_VERSION}.tar.gz

Now add the bin directories to PATH in the .bashrc file:

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/sbin:$SPARK_HOME/bin

To check whether the environment variables are set:

env
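To check only the variables that matter here rather than the whole environment, you can filter the output, for example:

env | grep -E 'JAVA_HOME|HADOOP_HOME|HADOOP_CONF_DIR|YARN_CONF_DIR|SPARK_HOME|PATH'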


Steps to run Spark Job Server when Spark runs in yarn-client mode:

Make a file spark-defaults.conf in the /opt/spark-2.2.3/conf directory and add the lines below to the file.

Make sure the master is yarn and the deploy mode is client.

The Spark jars location should be the HDFS address of the cluster master VM where all the Spark jars are located.

Note: Replace <cluster-master-name> with your actual master VM instance name.

spark.master yarn
spark.submit.deployMode client
spark.yarn.jars=hdfs://<cluster-master-name>/user/spark/jars/*.jar
spark.eventLog.enabled true
spark.eventLog.dir hdfs://<cluster-master-name>/user/spark/eventlog
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.executor.instances 10000
spark.dynamicAllocation.maxExecutors 10000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0
spark.yarn.historyServer.address <cluster-master-name>:18080
spark.history.fs.logDirectory hdfs://<cluster-master-name>/user/spark/eventlog
spark.driver.extraJavaOptions -Dflogger.backend_factory=com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance
spark.executor.extraJavaOptions -Dflogger.backend_factory=com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance
spark.sql.parquet.cacheMetadata=false
spark.executor.cores=1
spark.executor.memory=3712m
spark.driver.memory=1920m
spark.driver.maxResultSize=960m
spark.yarn.am.memory=640m
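Before starting the job server, it is worth confirming that the spark.yarn.jars location actually exists on the cluster's HDFS (with <cluster-master-name> replaced by your master VM name, as above):

hdfs dfs -ls hdfs://<cluster-master-name>/user/spark/jars/ | head -n 5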

Add HADOOP_CONF_DIR to the spark-env.sh file at /opt/spark-2.2.3/conf:

export HADOOP_CONF_DIR="/opt/hadoop-2.8.5/etc/hadoop"

Add the YARN ResourceManager hostname (in our case the cluster master VM hostname) and other properties to the yarn-site.xml file at /opt/hadoop-2.8.5/etc/hadoop:

Note: Replace <cluster-master-name> with your actual master VM instance name.

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><cluster-master-name></value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
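With yarn-site.xml pointing at the cluster master, you can confirm that this VM can reach the YARN ResourceManager (this assumes the network between this VM and the Dataproc cluster allows the YARN ports):

yarn node -list          # should list the Dataproc worker NodeManagers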

Add the properties below to the core-site.xml file at /opt/hadoop-2.8.5/etc/hadoop. Here make sure fs.default.name is hdfs://<cluster-master-name>, which is the HDFS address of the master VM in the cluster.

Note: Replace <cluster-master-name> with your actual master VM instance name.

 

<configuration>
  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<cluster-master-name></value>
    <description>The old FileSystem used by FsShell.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://<cluster-master-name></value>
    <description>
      The name of the default file system. A URI whose scheme and authority
      determine the FileSystem implementation. The uri's scheme determines
      the config property (fs.SCHEME.impl) naming the FileSystem
      implementation class. The uri's authority is used to determine the
      host, port, etc. for a filesystem.
    </description>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>fs.gs.working.dir</name>
    <value>/</value>
    <description>
      The directory relative gs: uris resolve in inside of the default bucket.
    </description>
  </property>
  <property>
    <name>fs.gs.system.bucket</name>
    <value><Bucket-name></value>
    <description>
      GCS bucket to use as a default bucket if fs.default.name is not a gs: uri.
    </description>
  </property>
  <property>
    <name>fs.gs.metadata.cache.directory</name>
    <value>/hadoop_gcs_connector_metadata_cache</value>
    <description>
      Only used if fs.gs.metadata.cache.type is FILESYSTEM_BACKED, specifies
      the local path to use as the base path for storing mirrored GCS metadata.
      Must be an absolute path, must be a directory, and must be fully
      readable/writable/executable by any user running processes which use the
      GCS connector.
    </description>
  </property>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value><name-of-project></value>
    <description>
      Google Cloud Project ID with access to configured GCS buckets.
    </description>
  </property>
  <property>
    <name>fs.gs.metadata.cache.enable</name>
    <value>false</value>
    <final>false</final>
    <source>Dataproc Cluster Properties</source>
  </property>
  <property>
    <name>fs.gs.implicit.dir.infer.enable</name>
    <value>true</value>
    <description>
      If set, we create and return in-memory directory objects on the fly when
      no backing object exists, but we know there are files with the same
      prefix.
    </description>
  </property>
  <property>
    <name>fs.gs.application.name.suffix</name>
    <value>-dataproc</value>
    <description>
      Appended to the user-agent header for API requests to GCS to help identify
      the traffic as coming from Dataproc.
    </description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>The AbstractFileSystem for gs: (GCS) uris.</description>
  </property>
  <property>
    <name>fs.gs.metadata.cache.type</name>
    <value>FILESYSTEM_BACKED</value>
    <description>
      Specifies which implementation of DirectoryListCache to use for
      supplementing GCS API &quot;list&quot; requests. Supported implementations:
      IN_MEMORY: Enforces immediate consistency within the same Java process.
      FILESYSTEM_BACKED: Enforces consistency across all cooperating processes
      pointed at the same local mirror directory, which may be an NFS directory
      for massively-distributed coordination.
    </description>
  </property>
  <property>
    <name>fs.gs.block.size</name>
    <value>134217728</value>
    <final>false</final>
    <source>Dataproc Cluster Properties</source>
  </property>
  <property>
    <name>hadoop.ssl.enabled.protocols</name>
    <value>TLSv1,TLSv1.1,TLSv1.2</value>
    <final>false</final>
    <source>Dataproc Cluster Properties</source>
  </property>
</configuration>
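A quick check that fs.defaultFS resolves to the Dataproc master's HDFS (note that the gs:// paths configured above will additionally require the GCS connector jar on the Hadoop classpath, which is not installed by the steps in this document):

hadoop fs -ls hdfs://<cluster-master-name>/user/spark/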

 

Make a file yarn.sh at the /opt/spark-jobserver/config location of spark-jobserver and paste the content below into the file:

Here it is important that YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop and HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop are set, which is the Hadoop location where all the configuration files are available.


#!/usr/bin/env bash
# Environment and deploy file
# For use with bin/server_deploy, bin/server_package etc.
DEPLOY_HOSTS="spark-job-server-ubuntu"
APP_USER=spark
APP_GROUP=spark
JMX_PORT=9999
# optional SSH Key to login to deploy server
#SSH_KEY=/path/to/keyfile.pem
INSTALL_DIR=/usr/local/spark-jobserver
LOG_DIR=/var/log/job-server
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=2.2.3
MAX_DIRECT_MEMORY=512M
SPARK_HOME=/opt/spark-2.2.3
SPARK_CONF_DIR=$SPARK_HOME/conf
YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
SCALA_VERSION=2.11.6

 

Make a file yarn.conf at the /opt/spark-jobserver/bin location of spark-jobserver and paste the content below into the file. Here make sure of the following points:

“master” should be “yarn”

“submit.deployMode” should be “client”

“context-per-jvm” should be “true”

# Template for a Spark Job Server configuration file
# When deployed these settings are loaded when job server starts
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  # local[...], yarn, mesos://... or spark://...
  master = "yarn"
  # client or cluster deployment
  submit.deployMode = "client"
  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4
  jobserver {
    port = 8090
    context-per-jvm = true
    # Note: JobFileDAO is deprecated from v0.7.0 because of issues in
    # production and will be removed in future, now defaults to H2 file.
    jobdao = spark.jobserver.io.JobSqlDAO
    filedao {
      rootdir = /tmp/spark-jobserver/filedao/data
    }
 
    datadao {
      # storage directory for files that are uploaded to the server
      # via POST/data commands
      rootdir = /tmp/spark-jobserver/upload
    }
    sqldao {
      # Slick database driver, full classpath
      slick-driver = slick.driver.H2Driver
      # JDBC driver, full classpath
      jdbc-driver = org.h2.Driver
           # Directory where default H2 driver stores its data. Only needed for H2.
      rootdir = /tmp/spark-jobserver/sqldao/data
      # Full JDBC URL / init string, along with username and password.  Sorry, needs to match above.
      # Substitutions may be used to launch job-server, but leave it out here in the default or tests won't pass
      jdbc {
        url = "jdbc:h2:file:/tmp/spark-jobserver/sqldao/data/h2-db"
        user = ""
        password = ""
      }
      # DB connection pool settings
      dbcp {
        enabled = false
        maxactive = 20
        maxidle = 10
        initialsize = 10
      }
    }
    # When using chunked transfer encoding with scala Stream job results, this is the size of each chunk
    result-chunk-size = 1m
  }
  # Predefined Spark contexts
  # contexts {
  #   my-low-latency-context {
  #     num-cpu-cores = 1           # Number of cores to allocate.  Required.
  #   memory-per-node = 512m # Executor memory per node, -Xmx style eg 512m, 1G, etc.
  #   }
  #   # define additional contexts here
  # }
  # Universal context configuration.  These settings can be overridden, see README.md
  context-settings {
    num-cpu-cores = 2           # Number of cores to allocate.  Required.
    memory-per-node = 512m         # Executor memory per node, -Xmx style eg 512m, #1G, etc.
    # In case spark distribution should be accessed from HDFS (as opposed to being installed on every Mesos slave)
    # spark.executor.uri = "hdfs://namenode:8020/apps/spark/spark.tgz"
    # URIs of Jars to be loaded into the classpath for this context.
    # Uris is a string list, or a string separated by commas ','
    # dependent-jar-uris = ["file:///some/path/present/in/each/mesos/slave/somepackage.jar"]
    # Add settings you wish to pass directly to the sparkConf as-is such as Hadoop connection
    # settings that don't use the "spark." prefix
    passthrough {
         #es.nodes = ""
    }
  }
  # This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
  # home = "/home/spark/spark"
}
# Note that you can use this file to define settings not only for job server,
# but for your Spark jobs as well.  Spark job configuration merges with this configuration file as defaults.
akka {
  remote.netty.tcp {
     #hostname = ""
    # This controls the maximum message size, including job results, that can be sent
     maximum-frame-size = 100 MiB
  }
}

 

At the /opt/spark-jobserver/bin location of spark-jobserver, paste the content below into the settings.sh file:

#!/usr/bin/env bash
# Environment and deploy file
# For use with bin/server_deploy, bin/server_package etc.
DEPLOY_HOSTS="spark-jobserver"
APP_USER=spark
APP_GROUP=spark
JMX_PORT=9999
# optional SSH Key to login to deploy server
#SSH_KEY=/path/to/keyfile.pem
INSTALL_DIR=/home/spark/job-server
LOG_DIR=/var/log/job-server
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=2.2.3
MAX_DIRECT_MEMORY=512M
SPARK_HOME=/opt/spark-2.2.3
SPARK_CONF_DIR=$SPARK_HOME/conf 
YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop
SCALA_VERSION=2.11.6

At the /opt/spark-jobserver/bin location of spark-jobserver, paste the content below into the log4j-server.properties file. This writes logs to the /var/log/job-server/ location:

# Rotating log file configuration for server deploys
# Root logger option
log4j.rootLogger=INFO,LOGFILE
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.LOGFILE.File=${LOG_DIR}/spark-job-server.log
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
# log4j.appender.LOGFILE.layout.ConversionPattern=%d %-5p %c - %m%n
log4j.appender.LOGFILE.layout.ConversionPattern=[%d] %-5p %.26c [%X{testName}] [%X{akkaSource}] - %m%n
log4j.appender.LOGFILE.maxFileSize=20MB
log4j.appender.LOGFILE.maxBackupIndex=30
# Settings to quiet spark logs that are too verbose
log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN
log4j.logger.org.apache.spark.scheduler.DAGScheduler=WARN

 

Start/Stop Spark Job Server:

Now for the final step: to run Spark Job Server, go to the /opt/spark-jobserver/bin location and run the command below:

./server_start.sh

You can check that the server has started with the jps command:

jps

This shows all Java processes running on the system; you should see one process named SparkSubmit, which confirms that Spark Job Server has started.
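You can also hit the job server's REST API on port 8090 to confirm it is responding; the endpoints below exist in Spark Job Server 0.8.0 and should return empty JSON lists on a fresh install:

curl http://localhost:8090/contexts    # lists running contexts
curl http://localhost:8090/binaries    # lists uploaded job binaries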

To stop Spark Job Server, run the command below from the /opt/spark-jobserver/bin location:

./server_stop.sh


Errors while running the software from the terminal:

        1)     While running Spark Job Server, if you get the error below:

Failed to find Spark jars directory (/opt/spark-2.2.3/assembly/target/scala-2.10/jars).

You need to build Spark with the target "package" before running this program.

Solution: This means the Hadoop distribution jars are missing, so we have two options:

Option 1:

Copy the jars from the Hadoop distribution into the Spark directory.

Option 2: (Recommended)

Download the Spark distribution without Hadoop (spark-2.2.3-bin-without-hadoop.tgz) from the Spark website. This package already contains all the required jar files.

        2)     While starting spark-shell in yarn mode, for example by running the command below at the terminal:

spark-shell --master yarn

you get the following error:

Error: Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

19/06/26 07:21:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

19/06/26 07:22:01 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

19/06/26 07:22:07 ERROR spark.SparkContext: Error initializing SparkContext.

org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch an application master.

        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)

        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)

        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)

        at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)

        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)

        at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)

 Solution: Step 1. Set SPARK_HOME and add $SPARK_HOME/sbin:$SPARK_HOME/bin to PATH:

               export SPARK_HOME=/opt/spark-2.2.3

               export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/sbin:$SPARK_HOME/bin

          Step 2. Add $SPARK_HOME/jars/*.jar to the SPARK_DIST_CLASSPATH variable:

export SPARK_DIST_CLASSPATH=/opt/hadoop-2.8.5/etc/hadoop:/opt/hadoop-2.8.5/share/hadoop/common/lib/*:/opt/hadoop-2.8.5/share/hadoop/common/*:/opt/hadoop-2.8.5/share/hadoop/hdfs:/opt/hadoop-2.8.5/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.5/share/hadoop/hdfs/*:/opt/hadoop-2.8.5/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.5/share/hadoop/yarn/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/*:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/opt/hadoop-2.8.5/contrib/capacity-scheduler/*.jar:$SPARK_HOME/jars/*.jar

                               

==================================================================            

        3)     How to set SPARK_DIST_CLASSPATH as an environment variable in the .bashrc file:

Solution:

Option 1: (Recommended)

Run echo $(hadoop classpath); this will give output like the following:

/opt/hadoop-2.8.5/etc/hadoop:/opt/hadoop-2.8.5/share/hadoop/common/lib/*:/opt/hadoop-2.8.5/share/hadoop/common/*:/opt/hadoop-2.8.5/share/hadoop/hdfs:/opt/hadoop-2.8.5/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.5/share/hadoop/hdfs/*:/opt/hadoop-2.8.5/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.5/share/hadoop/yarn/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/*:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/opt/hadoop-2.8.5/contrib/capacity-scheduler/*.jar

Now copy this complete classpath, append :$SPARK_HOME/jars/*.jar to it, and add it to the .bashrc file as below:

export SPARK_DIST_CLASSPATH=/opt/hadoop-2.8.5/etc/hadoop:/opt/hadoop-2.8.5/share/hadoop/common/lib/*:/opt/hadoop-2.8.5/share/hadoop/common/*:/opt/hadoop-2.8.5/share/hadoop/hdfs:/opt/hadoop-2.8.5/share/hadoop/hdfs/lib/*:/opt/hadoop-2.8.5/share/hadoop/hdfs/*:/opt/hadoop-2.8.5/share/hadoop/yarn/lib/*:/opt/hadoop-2.8.5/share/hadoop/yarn/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.8.5/share/hadoop/mapreduce/*:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/opt/hadoop-2.8.5/contrib/capacity-scheduler/*.jar:$SPARK_HOME/jars/*.jar

Option 2:

Add the line below to the .bashrc file:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

==================================================================            

        4)     To run Spark Job Server in yarn-client mode, the following variables are necessary in the .bashrc file:

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

export HADOOP_HOME="/opt/hadoop-2.8.5"

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

export YARN_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop

export HADOOP_CONF_DIR=/opt/hadoop-2.8.5/etc/hadoop