Add JAR files to a Spark job - spark-submit
Use the --jars option with the spark-submit command to incorporate multiple JARs, using commas as a delimiter.
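For example, a minimal invocation might look like the following sketch (the main class com.example.MyApp and app.jar are placeholder names):

    # dep1.jar and dep2.jar are shipped with the job and added to the classpath
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --jars dep1.jar,dep2.jar \
      app.jar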
The Spark job will then include dep1.jar and dep2.jar as dependencies.
For more complex dependency trees, SparkContext's addJar method or the --conf option can help tailor the driver and executor classpaths.
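As a rough sketch of the --conf route for the driver side (addJar is demonstrated in a later section; the /opt/libs paths and application names are illustrative):

    # prepend extra entries to the driver's classpath
    # (equivalent to the --driver-class-path flag)
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --conf "spark.driver.extraClassPath=/opt/libs/dep1.jar:/opt/libs/dep2.jar" \
      app.jar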
Quick guide to separators by platform
On Unix-like systems and macOS, use : to separate JAR paths. On the flip side, Windows relies on ;. This is crucial when working with classpath environment variables or certain build tools.
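For instance, the same extraClassPath fragment of a spark-submit command would be written differently on each platform (the paths below are illustrative):

    # Unix-like / macOS: colon-separated
    --conf "spark.executor.extraClassPath=/opt/libs/dep1.jar:/opt/libs/dep2.jar"
    # Windows: semicolon-separated
    --conf "spark.executor.extraClassPath=C:\libs\dep1.jar;C:\libs\dep2.jar"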
Classpath ordering in the driver
If you want your driver classpath to take precedence, set --conf spark.driver.userClassPathFirst=true. You can verify the updated paths in the Spark UI under the Environment tab.
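A minimal sketch of such a submission (the class and JAR names are placeholders):

    # give user-supplied JARs priority over Spark's own dependencies when loading classes in the driver
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --jars dep1.jar,dep2.jar \
      --conf spark.driver.userClassPathFirst=true \
      app.jar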
Cluster manager dependent optimizations
Spark supports different modes like yarn-client and yarn-cluster. Particularly in yarn-cluster mode, using spark.yarn.archive can distribute an archive of JARs across your YARN cluster faster when it is world-readable on HDFS.
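One way to set this up, sketched with assumed archive and HDFS names (yarn-cluster mode is spelled --master yarn --deploy-mode cluster in current spark-submit):

    # bundle the Spark runtime JARs into a single archive and publish it on HDFS
    jar cv0f spark-libs.jar -C "$SPARK_HOME/jars/" .
    hdfs dfs -mkdir -p /spark/archives
    hdfs dfs -put spark-libs.jar /spark/archives/
    hdfs dfs -chmod -R a+r /spark/archives
    # point Spark at the world-readable archive instead of uploading JARs with every job
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.archive=hdfs:///spark/archives/spark-libs.jar \
      --class com.example.MyApp \
      app.jar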
Complex dependencies: the staging directory approach
For applications with complex dependencies, consider using a staging directory on HDFS. Handy hint: directory expansion in yarn-cluster mode simplifies JAR specification.
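A sketch of the staging-directory approach, using hypothetical paths and names:

    # stage the dependencies once in HDFS
    hdfs dfs -mkdir -p /user/spark/staging/myapp
    hdfs dfs -put dep1.jar dep2.jar /user/spark/staging/myapp/
    # reference the staged copies by hdfs:// URI at submit time
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      --jars hdfs:///user/spark/staging/myapp/dep1.jar,hdfs:///user/spark/staging/myapp/dep2.jar \
      app.jar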
Working with different URI schemes
spark-submit can cater to different URI schemes like hdfs://, s3://, and file://. Make sure the JARs are placed strategically for quick accessibility.
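For illustration, a single --jars list can mix schemes, assuming the cluster's Hadoop configuration can resolve each of them (the bucket and paths below are made up):

    spark-submit \
      --master yarn \
      --class com.example.MyApp \
      --jars hdfs:///user/spark/staging/myapp/dep1.jar,file:///opt/libs/dep2.jar,s3://my-bucket/jars/dep3.jar \
      app.jar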
Careful combination of classpath options
Mixing multiple classpath options (--jars, --files, extraClassPath) needs due diligence to dodge classpath conflicts.
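As a sketch of such a mix (file names are placeholders), keeping the library versions in each location aligned is what avoids shadowing surprises:

    # --jars ships the listed JARs with the job and adds them to the classpath;
    # extraClassPath entries must already exist at those paths on the respective machines,
    # and the same library appearing in both places is a common source of conflicts
    spark-submit \
      --master yarn \
      --class com.example.MyApp \
      --jars dep1.jar,dep2.jar \
      --files app.conf \
      --conf "spark.executor.extraClassPath=/opt/libs/dep1.jar" \
      app.jar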
Centralizing regularly used libraries
For commonly used JARs, having a centralized cache on HDFS or a similar location ensures uniformity across all nodes and speeds up job initialization.
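One common pattern, sketched here with hypothetical paths, is a world-readable shared directory on HDFS so that YARN can typically cache the files as public resources:

    hdfs dfs -mkdir -p /libs/shared
    hdfs dfs -put common-utils.jar /libs/shared/
    hdfs dfs -chmod -R a+r /libs/shared
    # every job references the same cached copy instead of re-uploading a local JAR
    spark-submit --master yarn --class com.example.MyApp \
      --jars hdfs:///libs/shared/common-utils.jar app.jar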
Environment specific precautionary checks
Always ensure paths and classpaths are validated across different environments (especially dev/prod) to avoid discrepancies.
Fine-grained control with Spark configurations
When you need fine-grained control over Spark's behavior, use the --conf option with settings such as spark.executor.extraClassPath to add entries to the executor classpath on worker nodes.
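A minimal sketch (the /opt/libs paths are assumptions and must already exist on every worker node, since Spark does not ship them for you):

    # prepends the entries to each executor's classpath
    spark-submit \
      --master yarn \
      --class com.example.MyApp \
      --conf "spark.executor.extraClassPath=/opt/libs/dep1.jar:/opt/libs/dep2.jar" \
      app.jar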
Configuring classpaths from inside a Spark session
If you're launching a Spark session inside another application or notebook, use the --conf option or SparkConf() to configure the driver and executor classpaths accordingly.
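A Scala sketch of the programmatic route (the paths and app name are placeholders); note that spark.driver.extraClassPath generally cannot be set this way in client mode because the driver JVM is already running, so it belongs in spark-submit options or spark-defaults.conf:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // spark.jars behaves like --jars: a comma-separated list of JARs to ship with the job
    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.jars", "/opt/libs/dep1.jar,/opt/libs/dep2.jar")
      .set("spark.executor.extraClassPath", "/opt/libs/dep1.jar:/opt/libs/dep2.jar")

    val spark = SparkSession.builder().config(conf).getOrCreate()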
addJar or addFile: a choice of necessity
Use SparkContext.addJar for adding dependencies needed on the classpath. Conversely, SparkContext.addFile is appropriate for files required at runtime, which might not necessarily be on the classpath.
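A short Scala sketch of the difference, assuming an existing SparkSession named spark and illustrative HDFS paths:

    import org.apache.spark.SparkFiles

    val sc = spark.sparkContext

    // addJar: the JAR is fetched by executors and added to their classpath for future tasks
    sc.addJar("hdfs:///user/spark/staging/myapp/dep1.jar")

    // addFile: the file is downloaded to every node but is not placed on any classpath
    sc.addFile("hdfs:///user/spark/staging/myapp/lookup.csv")

    // inside a task, SparkFiles.get resolves the local copy of an added file
    val pathOnExecutor = sc.parallelize(Seq(1)).map(_ => SparkFiles.get("lookup.csv")).first()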
Consistency across all modes
Test your JAR deployment strategy across yarn-client, yarn-cluster, and local modes for uniform behavior.
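For example, the same dependency list can be exercised in each mode (class and JAR names are placeholders):

    # local mode
    spark-submit --master local[*] --class com.example.MyApp --jars dep1.jar,dep2.jar app.jar
    # yarn-client mode
    spark-submit --master yarn --deploy-mode client --class com.example.MyApp --jars dep1.jar,dep2.jar app.jar
    # yarn-cluster mode
    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp --jars dep1.jar,dep2.jar app.jar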
The last resort: the official documentation
Regular consultation of the official Spark documentation will provide up-to-date and comprehensive guidance for managing JAR dependencies in your Spark job.