# How to Start a Spark Cluster on Kestrel

This guide explains how to start an Apache Spark cluster on NLR's Kestrel HPC system for running dsgrid queries.

## Prerequisites

Install the Python package `sparkctl`, a tool for managing Spark clusters on HPC systems:

```bash
pip install "sparkctl[pyspark]"
```

Refer to the [sparkctl documentation](https://nrel.github.io/sparkctl/) for more details.

## Compute Node Types

Spark works best with fast local storage, but most standard Kestrel nodes have none. The best candidates are the **256 standard nodes (no GPUs) with 1.92 TB NVMe M.2 drives**.

Refer to the [Kestrel system configuration page](https://www.nrel.gov/hpc/kestrel-system-configuration.html) for specific hardware information. The GPU nodes will work as well, but at a greater cost in allocation units (AUs).

:::{tip}
If those nodes are not available, you may be able to complete your queries by using the standard nodes and specifying a path on the Lustre filesystem in the Spark configuration file `conf/spark-env.sh`. Change `SPARK_LOCAL_DIRS` and `SPARK_WORKER_DIR`.
:::
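
As a sketch of that fallback, the relevant lines in `conf/spark-env.sh` might look like the following. The directory names are hypothetical, and the example assumes that `/scratch` is on the Lustre filesystem. These settings need to be in place before the Spark workers start.

```bash
# conf/spark-env.sh (excerpt)
# Hypothetical Lustre paths; substitute directories that you own.
export SPARK_LOCAL_DIRS="/scratch/$USER/dsgrid-work/spark-local"
export SPARK_WORKER_DIR="/scratch/$USER/dsgrid-work/spark-worker"
```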

## Steps

### 1. Create a Work Directory

From the HPC login node, create a work directory somewhere in `/scratch/$USER`:

```bash
cd /scratch/$USER
mkdir dsgrid-work
cd dsgrid-work
```

### 2. Allocate Compute Nodes

Request one or more nodes from the SLURM scheduler. Adjust the parameters based on your needs:

```bash
salloc -t 01:00:00 -N1 --account=dsgrid --partition=debug --tmp=1600G --mem=240G
```

**Parameter guide:**
- `-t 01:00:00`: Time limit (1 hour in this example)
- `-N1`: Number of nodes (1 in this example; increase for larger datasets)
- `--account=dsgrid`: Your allocation account
- `--partition=debug`: Queue partition (use `standard` for longer jobs)
- `--tmp=1600G`: Minimum local scratch space per node (use this to land on the NVMe nodes)
- `--mem=240G`: Memory per node
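
If you change these values often, it can help to assemble the command from named settings. The following is only a sketch: `build_salloc_cmd` is a hypothetical helper, and it prints the command rather than running it so that you can review it first.

```bash
#!/usr/bin/env bash
# Print a salloc command built from named settings.
# build_salloc_cmd is a hypothetical helper, not part of SLURM or sparkctl.
build_salloc_cmd() {
    local walltime=$1 nodes=$2 account=$3 partition=$4
    echo "salloc -t ${walltime} -N${nodes} --account=${account}" \
         "--partition=${partition} --tmp=1600G --mem=240G"
}

# Example: 2 nodes for 4 hours in the standard partition.
build_salloc_cmd 04:00:00 2 dsgrid standard
```

Copy the printed line into your shell to request the allocation.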

### 3. Configure and Start the Cluster

Configure the Spark settings and start the cluster:

```bash
sparkctl configure --start
```

Run `sparkctl --help` to see all available options.

### 4. Set Environment Variables

From the same work directory, point Spark at the `conf` directory and set the Java home:

```bash
export SPARK_CONF_DIR=$(pwd)/conf
export JAVA_HOME=/datasets/images/apache_spark/jdk-21.0.7
```

### 5. Verify Cluster is Running

The Spark cluster is now ready to use at `spark://$(hostname):7077`.

To verify that it is running, open the Spark master UI in a browser. This command prints its URL:

```bash
echo "Spark Master UI: http://$(hostname):8080"
```
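
If you do not have a browser handy, a quick check from the command line is to test whether the master's ports accept connections. This is a minimal sketch: `check_port` is a hypothetical helper (not part of sparkctl or Spark), and it relies on bash's `/dev/tcp` feature and the `timeout` utility being available on the node.

```bash
#!/usr/bin/env bash
# Return success if a TCP connection to host:port opens within 2 seconds.
# check_port is a hypothetical helper for illustration.
check_port() {
    local host=$1 port=$2
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# 7077 is the master port and 8080 is the master UI port used in this guide.
for port in 7077 8080; do
    if check_port "$(hostname)" "$port"; then
        echo "port ${port}: open"
    else
        echo "port ${port}: not reachable"
    fi
done
```

An open port only shows that something is listening; the master UI remains the authoritative view of the workers.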

Run all query scripts from this node using `spark-submit` as described in [Run dsgrid on Kestrel](run_on_kestrel).

## Example: Multi-Node Cluster

For larger datasets, allocate multiple nodes:

```bash
# Allocate 4 nodes for 2 hours
salloc -t 02:00:00 -N4 --account=dsgrid --partition=standard --tmp=1600G --mem=240G

# Configure and start (sparkctl detects all allocated nodes)
sparkctl configure --start

# Set environment
export SPARK_CONF_DIR=$(pwd)/conf
export JAVA_HOME=/datasets/images/apache_spark/jdk-21.0.7

# Run dsgrid query
spark-submit --master=spark://$(hostname):7077 $(which dsgrid-cli.py) query project run query.json5
```

## Configuration Tips

### Adjust Spark Partitions

For better performance with large datasets, set the number of shuffle partitions:

```bash
spark-submit --master=spark://$(hostname):7077 \
    --conf spark.sql.shuffle.partitions=2400 \
    $(which dsgrid-cli.py) query project run query.json5
```

Rule of thumb: Use 2-4x the number of CPU cores across your cluster.
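
The rule of thumb can be turned into a number with a little shell arithmetic. The sketch below assumes that you run it inside your allocation, where SLURM sets `SLURM_JOB_NUM_NODES` and `SLURM_CPUS_ON_NODE`. The helper `shuffle_partitions` and the fallback values (1 node, 104 cores) are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Suggest a shuffle partition count: nodes x cores per node x multiplier.
# shuffle_partitions is a hypothetical helper for illustration.
shuffle_partitions() {
    local nodes=$1 cores_per_node=$2 multiplier=$3
    echo $(( nodes * cores_per_node * multiplier ))
}

# Inside a SLURM allocation the scheduler sets these variables.
# The fallbacks (1 node, 104 cores) are assumptions for use outside a job.
nodes=${SLURM_JOB_NUM_NODES:-1}
cores=${SLURM_CPUS_ON_NODE:-104}

# A multiplier of 3 sits in the middle of the 2-4x range.
partitions=$(shuffle_partitions "$nodes" "$cores" 3)
echo "--conf spark.sql.shuffle.partitions=${partitions}"
```

Treat the result as a starting point and adjust it based on how your queries behave.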

### Memory Settings

If you encounter out-of-memory errors, adjust the executor and driver memory:

```bash
spark-submit --master=spark://$(hostname):7077 \
    --executor-memory 50g \
    --driver-memory 50g \
    $(which dsgrid-cli.py) query project run query.json5
```

## Troubleshooting

### Cluster Won't Start

- Check that your allocation is active: `squeue -u $USER`
- Verify Java is available: `java --version`
- Review the logs in the `./logs/` directory

### Out of Memory Errors

- Increase `--mem` when allocating nodes
- Add more nodes with `-N`
- Reduce data processed per partition

### Slow Performance

- Use nodes with local NVMe storage
- Increase shuffle partitions for better parallelism
- Review the Spark master UI (port 8080) and follow the link to your running application to check task distribution

## Cleaning Up

When finished, stop the Spark cluster and release your allocation:

```bash
sparkctl stop
exit  # Exit salloc session
```

## Next Steps

- Learn how to [run dsgrid on Kestrel](run_on_kestrel)
- Understand [Spark configuration options](../apache_spark/overview)
- Follow the [query project tutorial](../tutorials/query_project)