How to Start a Spark Cluster on Kestrel

This guide explains how to start an Apache Spark cluster on NLR’s Kestrel HPC system for running dsgrid queries.

Prerequisites

Install sparkctl, a Python package for managing Spark clusters on HPC systems:

pip install "sparkctl[pyspark]"

Refer to the sparkctl documentation for more details.

Compute Node Types

Spark works best with fast local storage. Most standard Kestrel nodes have no local storage, so the best candidates are the 256 standard nodes (no GPUs) that have 1.92 TB NVMe M.2 drives.

Refer to the Kestrel system configuration page for specific hardware information. The GPU nodes will work as well, but they cost more allocation units (AUs).

Tip

If the NVMe nodes are not available, you may still be able to complete your queries on the standard nodes by pointing Spark at a path on the Lustre filesystem. In the Spark configuration file conf/spark-env.sh, set SPARK_LOCAL_DIRS and SPARK_WORKER_DIR to that path.
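
As a rough illustration of this tip, the lines below show what the two settings might look like in conf/spark-env.sh. The directory under /scratch is a hypothetical location chosen for the example, and SPARK_SCRATCH is a helper variable invented here; any writable path on Lustre should serve the same purpose.

```shell
# Sketch of the relevant lines in conf/spark-env.sh.
# SPARK_SCRATCH is a hypothetical helper variable; its default path is an assumption.
SPARK_SCRATCH="${SPARK_SCRATCH:-/scratch/${USER:-unknown}/dsgrid-work/spark-tmp}"

# Scratch space Spark uses for shuffle and spill files.
export SPARK_LOCAL_DIRS="${SPARK_SCRATCH}/local"
# Directory where workers keep application logs and scratch files.
export SPARK_WORKER_DIR="${SPARK_SCRATCH}/worker"
```

Expect slower shuffles than on local NVMe, since every spill goes over the network filesystem.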

Steps

1. Create a Work Directory

From the HPC login node, create a work directory somewhere in /scratch/$USER:

cd /scratch/$USER
mkdir dsgrid-work
cd dsgrid-work

2. Allocate Compute Nodes

Request one or more nodes from the Slurm scheduler. Adjust the parameters to suit your needs:

salloc -t 01:00:00 -N1 --account=dsgrid --partition=debug --tmp=1600G --mem=240G

Parameter guide:

-t 01:00:00: Time limit (1 hour in this example)
-N1: Number of nodes (1 in this example; increase for larger datasets)
--account=dsgrid: Your allocation account
--partition=debug: Queue partition (use standard for longer jobs)
--tmp=1600G: Local scratch space (use with NVMe nodes)
--mem=240G: Memory per node
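
Once salloc returns a shell, it can help to confirm what was actually granted. The sketch below reads environment variables that Slurm sets inside a job (SLURM_JOB_ID, SLURM_JOB_NUM_NODES, SLURM_JOB_NODELIST). The function name describe_allocation is made up for this example.

```shell
# Print a one-line summary of the current Slurm allocation.
# Returns non-zero when run outside a job (SLURM_JOB_ID unset).
describe_allocation() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "no active Slurm allocation" >&2
        return 1
    fi
    echo "job=${SLURM_JOB_ID} nodes=${SLURM_JOB_NUM_NODES:-?} nodelist=${SLURM_JOB_NODELIST:-?}"
}

describe_allocation || true
```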

3. Configure and Start the Cluster

Configure the Spark settings and start the cluster:

sparkctl configure --start

Run sparkctl --help to see all available options.

4. Set Environment Variables

Set the Spark configuration and Java environment variables:

export SPARK_CONF_DIR=$(pwd)/conf
export JAVA_HOME=/datasets/images/apache_spark/jdk-21.0.7
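
Before submitting a query, a quick sanity check of these two variables can catch a wrong path early. This is a minimal sketch; check_spark_env is a hypothetical helper and is not part of sparkctl.

```shell
# Verify that SPARK_CONF_DIR is a directory and that JAVA_HOME contains an executable java.
check_spark_env() {
    local ok=0
    if [ ! -d "${SPARK_CONF_DIR:-}" ]; then
        echo "SPARK_CONF_DIR is not a directory: '${SPARK_CONF_DIR:-}'" >&2
        ok=1
    fi
    if [ ! -x "${JAVA_HOME:-}/bin/java" ]; then
        echo "No executable java under JAVA_HOME: '${JAVA_HOME:-}'" >&2
        ok=1
    fi
    return "$ok"
}

check_spark_env || echo "fix the variables above before running spark-submit"
```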

5. Verify Cluster is Running

The Spark cluster is now ready to use at spark://$(hostname):7077.

You can verify it’s running by checking the Spark master UI:

echo "Spark Master UI: http://$(hostname):8080"

Run all query scripts from this node using spark-submit as described in Run dsgrid on Kestrel.
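
To go one step beyond printing the URL, you could probe the two ports mentioned above (7077 for the master, 8080 for its UI). The sketch below relies on bash's /dev/tcp redirection and the coreutils timeout command; port_open is a hypothetical helper name. An open port only shows that something is listening, not that the cluster is healthy.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp redirection, so bash must be installed.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

for port in 7077 8080; do
    if port_open "$(hostname)" "$port"; then
        echo "port $port is accepting connections"
    else
        echo "port $port is not reachable"
    fi
done
```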

Example: Multi-Node Cluster

For larger datasets, allocate multiple nodes:

# Allocate 4 nodes for 2 hours
salloc -t 02:00:00 -N4 --account=dsgrid --partition=standard --tmp=1600G --mem=240G

# Configure and start (sparkctl detects all allocated nodes)
sparkctl configure --start

# Set environment
export SPARK_CONF_DIR=$(pwd)/conf
export JAVA_HOME=/datasets/images/apache_spark/jdk-21.0.7

# Run dsgrid query
spark-submit --master=spark://$(hostname):7077 $(which dsgrid-cli.py) query project run query.json5

Configuration Tips

Adjust Spark Partitions

For better performance with large datasets, set the number of shuffle partitions:

spark-submit --master=spark://$(hostname):7077 \
    --conf spark.sql.shuffle.partitions=2400 \
    $(which dsgrid-cli.py) query project run query.json5

Rule of thumb: use 2 to 4 times the total number of CPU cores in your cluster.
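
The rule of thumb can be turned into a small calculation. The sketch below multiplies node count, cores per node, and a multiplier in the 2 to 4 range. It assumes every node in the allocation has the same core count as the node you run it on; shuffle_partitions is a hypothetical helper name.

```shell
# Suggest a shuffle partition count: nodes * cores per node * multiplier.
shuffle_partitions() {
    local nodes="$1" cores_per_node="$2" multiplier="$3"
    echo $(( nodes * cores_per_node * multiplier ))
}

# Inside an allocation, Slurm sets SLURM_JOB_NUM_NODES; fall back to 1 otherwise.
# nproc reports the processing units available on the current node.
nodes="${SLURM_JOB_NUM_NODES:-1}"
cores="$(nproc)"
echo "suggested spark.sql.shuffle.partitions: $(shuffle_partitions "$nodes" "$cores" 3)"
```

For a hypothetical 4-node cluster with 104 cores per node and a multiplier of 3, this gives 4 × 104 × 3 = 1248 partitions. Treat the result as a starting point and adjust it based on observed task sizes.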

Memory Settings

If you encounter out-of-memory errors, adjust the executor and driver memory:

spark-submit --master=spark://$(hostname):7077 \
    --executor-memory 50g \
    --driver-memory 50g \
    $(which dsgrid-cli.py) query project run query.json5

Troubleshooting

Cluster Won’t Start

Check that your allocation is active: squeue -u $USER
Verify Java is available: java --version
Review the logs in the ./logs/ directory

Out of Memory Errors

Increase --mem when allocating nodes
Add more nodes with -N
Reduce data processed per partition

Slow Performance

Use nodes with local NVMe storage
Increase shuffle partitions for better parallelism
Review Spark UI (port 8080) for task distribution

Cleaning Up

When finished, stop the Spark cluster and release your allocation:

sparkctl stop
exit  # Exit salloc session