
Add JAR files to a Spark job - spark-submit

java · spark-submit · classpath-management · spark-configurations
by Alex Kataev · Jan 28, 2025
TLDR

Use the --jars option of the spark-submit command to include multiple JARs, separated by commas:

spark-submit --class YourClass --jars dep1.jar,dep2.jar app.jar

The Spark job will include dep1.jar and dep2.jar as dependencies.

For more complex dependency trees, SparkContext's addJar method or the --conf option can help tailor the driver and executor classpaths.

Quick guide to separators by platform

On Unix-like systems and macOS, use : to separate classpath entries; Windows relies on ;. This is crucial when working with classpath environment variables, build tools, or Spark's extraClassPath settings. Note that the --jars list itself is always comma-separated, regardless of platform.
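A quick sketch of the difference (all paths are placeholders); the classpath entries use the platform separator, while --jars stays comma-separated:

# Linux / macOS: colon-separated classpath entries
spark-submit --class YourClass --jars dep1.jar,dep2.jar \
  --conf spark.executor.extraClassPath=/opt/libs/extra1.jar:/opt/libs/extra2.jar \
  app.jar

# Windows: semicolon-separated classpath entries (quote the value so the shell keeps the ;)
spark-submit --class YourClass --jars dep1.jar,dep2.jar --conf "spark.executor.extraClassPath=C:\libs\extra1.jar;C:\libs\extra2.jar" app.jar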

Classpath ordering in the driver

If you want your own JARs to take precedence over Spark's bundled dependencies on the driver, set --conf spark.driver.userClassPathFirst=true (spark.executor.userClassPathFirst does the same for executors). You can verify the resulting classpath in the Spark UI under the Environment tab.
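For example, to make a bundled library win over the version Spark ships with (JAR names are placeholders):

spark-submit --class YourClass \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars dep1.jar,dep2.jar \
  app.jar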

Cluster manager dependent optimizations

Spark on YARN runs in either client or cluster deploy mode (--master yarn --deploy-mode client|cluster). Particularly in cluster mode, setting spark.yarn.archive to an archive of Spark's runtime JARs on HDFS speeds up distribution: when the archive is world-readable, YARN can cache it on the nodes instead of uploading the JARs for every job.
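One possible workflow, with paths chosen purely for illustration: bundle Spark's runtime JARs once, publish them to HDFS, and point spark.yarn.archive at the result.

# build an uncompressed archive of Spark's runtime JARs and publish it
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put spark-libs.jar /spark/jars/

# every cluster-mode job can now reuse the cached archive
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///spark/jars/spark-libs.jar \
  --class YourClass --jars dep1.jar,dep2.jar app.jar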

Complex dependencies: the staging directory approach

For applications with complex dependencies, consider staging them in a dedicated directory on HDFS. Handy hint: Spark allows globs in its JAR settings (e.g. spark.jars, spark.yarn.jars), so in cluster mode a whole staged directory can be referenced at once; a sketch follows below.
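A sketch of the staging approach, with /user/me/app-libs as a hypothetical staging directory:

# stage the application's dependencies once
hdfs dfs -mkdir -p /user/me/app-libs
hdfs dfs -put dep1.jar dep2.jar /user/me/app-libs/

# reference the staged copies from every submission
spark-submit --master yarn --deploy-mode cluster \
  --jars hdfs:///user/me/app-libs/dep1.jar,hdfs:///user/me/app-libs/dep2.jar \
  --class YourClass app.jar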

Working with different URI schemes

spark-submit accepts JAR URIs in several schemes, such as hdfs://, s3://, and file://. Place the JARs where the driver and executors can fetch them quickly.
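For instance, mixing schemes in one submission (bucket and paths are placeholders; the s3a:// scheme assumes the hadoop-aws connector is available):

spark-submit --class YourClass \
  --jars hdfs:///libs/dep1.jar,s3a://my-bucket/libs/dep2.jar,file:///opt/libs/dep3.jar \
  app.jar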

Careful combination of classpath options

Mixing multiple distribution and classpath options (--jars, --files, extraClassPath) requires due diligence to dodge classpath conflicts: --jars ships JARs and adds them to classpaths, --files only distributes plain files, and extraClassPath entries are not shipped at all.
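A combined invocation might look like the sketch below (paths are placeholders). Note that the extraClassPath entry is not shipped anywhere; /opt/node-local/dep3.jar has to exist on every worker already.

spark-submit --class YourClass \
  --jars dep1.jar,dep2.jar \
  --files conf/app.properties \
  --conf spark.executor.extraClassPath=/opt/node-local/dep3.jar \
  app.jar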

Centralizing regularly used libraries

For commonly used JARs, having a centralized cache on HDFS or a similar location ensures uniformity across all nodes and speeds up job initialization.
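A minimal version of that pattern, with /libs/shared and common-utils.jar as placeholder names:

# publish the shared library once
hdfs dfs -mkdir -p /libs/shared
hdfs dfs -put common-utils.jar /libs/shared/

# every job points at the same copy
spark-submit --class YourClass --jars hdfs:///libs/shared/common-utils.jar app.jar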

Environment specific precautionary checks

Always ensure paths and classpaths are validated across different environments (especially dev/prod) to avoid discrepancies.
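A simple pre-flight check along those lines (the HDFS path is a placeholder); spark-submit --verbose also echoes the resolved configuration, which helps spot environment drift:

# fail fast if the dependency is missing in this environment
hdfs dfs -test -e /libs/shared/common-utils.jar || { echo "missing dependency"; exit 1; }

spark-submit --verbose --class YourClass --jars hdfs:///libs/shared/common-utils.jar app.jar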

Fine-grained control with Spark configurations

When you need fine-grained control over Spark's behavior, use the --conf option with settings such as spark.executor.extraClassPath to prepend entries to the executor classpath on worker nodes. Keep in mind that extraClassPath does not ship JARs anywhere; the paths must already exist on each worker.
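For example, to pick up a JDBC driver that is already installed on every node (the path is a placeholder):

spark-submit --class YourClass \
  --conf spark.driver.extraClassPath=/opt/jdbc/postgresql.jar \
  --conf spark.executor.extraClassPath=/opt/jdbc/postgresql.jar \
  app.jar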

Configuring classpaths from inside a Spark session

If you're launching a Spark session inside another application or notebook, use the --conf option or a SparkConf object to configure the driver and executor classpaths. These settings must be in place before the SparkContext is created; they have no effect on an already-running session.
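A minimal Scala sketch along those lines; the paths are placeholders, and the settings only work because they are supplied before getOrCreate() builds the session:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp")
  // shipped to executors and added to both classpaths
  .config("spark.jars", "/opt/libs/dep1.jar,/opt/libs/dep2.jar")
  // must already exist on each worker node
  .config("spark.executor.extraClassPath", "/opt/node-local/dep3.jar")
  .getOrCreate()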

addJar or addFile: a choice of necessity

Use SparkContext.addJar to ship a dependency and put it on the classpath of tasks running on executors. Conversely, SparkContext.addFile is appropriate for plain files required at runtime (configs, lookup data) that don't belong on the classpath; tasks locate their local copies via SparkFiles.get.
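A short Scala sketch of both calls, assuming an existing session named spark and placeholder URIs:

import org.apache.spark.SparkFiles

val sc = spark.sparkContext

// ships the JAR and puts it on the classpath of tasks run on executors
sc.addJar("hdfs:///libs/dep1.jar")

// ships a plain data file; tasks resolve the local copy with SparkFiles.get
sc.addFile("hdfs:///configs/lookup.csv")
val localPath = SparkFiles.get("lookup.csv")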

Consistency across all modes

Test your JAR deployment strategy across YARN client mode, YARN cluster mode, and local mode for uniform behavior.
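One way to smoke-test the same artifact set in each mode (class and JAR names are placeholders):

spark-submit --master local[*] --class YourClass --jars dep1.jar,dep2.jar app.jar
spark-submit --master yarn --deploy-mode client --class YourClass --jars dep1.jar,dep2.jar app.jar
spark-submit --master yarn --deploy-mode cluster --class YourClass --jars dep1.jar,dep2.jar app.jar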

The last resort: the official documentation

Regular consultation of the official Spark documentation will provide up-to-date and comprehensive guidance for managing JAR dependencies in your Spark job.