Run datasight on an HPC compute node¶
If your data lives on an HPC filesystem, the simplest setup is to run datasight itself on a compute node and query it from your laptop. No Flight SQL server, no extra infrastructure. For the browser UI, prefer an SSH-forwarded UNIX domain socket over exposing a TCP port on the compute node.
For a multi-user shared backend or a non-DuckDB engine, see Connect to a remote Flight SQL backend instead.
Because datasight itself runs on the compute node in this setup, the LLM call originates from the compute node too. That's the right fit if you want to use a local model on the compute node's GPU (e.g. Ollama on a GPU node), or if a hosted API key configured on the compute node is fine. If policy or preference requires the LLM to run from your laptop — including using your laptop's GPU — use the Flight SQL setup instead, which keeps the agent client-side.
1. Install datasight on the HPC¶
From the login node, create a uv environment and install:
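A minimal sketch, assuming datasight is installable from your package index under that name (check your site's actual instructions):

```bash
# On the login node: create a uv-managed environment and install datasight.
uv venv ~/datasight-env
source ~/datasight-env/bin/activate
uv pip install datasight
```

Because compute nodes share the filesystem, activating `~/datasight-env` inside a job picks up the same install.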
If uv isn't available, install it first with `curl -LsSf https://astral.sh/uv/install.sh | sh`. Install once from the login node; compute nodes share the same filesystem.
2. Allocate a compute node¶
DuckDB benefits from plenty of memory and cores for large aggregations. Adjust to match your dataset.
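For example, an interactive Slurm allocation (the resource numbers are illustrative; your site may also require a partition or account flag):

```bash
# 16 cores, 64 GB RAM, 4 hours: adjust to your dataset and site limits.
salloc --cpus-per-task=16 --mem=64G --time=04:00:00
```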
3. Point datasight at your data¶
You have two low-friction options on the compute node:
`datasight inspect` runs deterministic profile/quality/measures/dimensions analyses on raw files, with no project or LLM required:
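For example (the path is a placeholder, and whether the tool expands globs itself is an assumption; check `datasight inspect --help`):

```bash
# Deterministic analyses straight off the parquet files; no .env needed.
datasight inspect '/scratch/project/data/generation/**/*.parquet'
```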
Create a project directory (on the login node is fine, since it's a one-time setup) with a `.duckdb` file of views over your parquet:
```sql
CREATE VIEW generation AS
SELECT * FROM read_parquet('/scratch/project/data/generation/**/*.parquet');

CREATE VIEW stations AS
SELECT * FROM read_parquet('/scratch/project/data/stations.parquet');
```
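One way to build the database file, assuming the view statements are saved as `views.sql` and the DuckDB CLI is available (the `project.duckdb` filename is a placeholder):

```bash
# Run once; the resulting file contains only view definitions, not data.
duckdb project.duckdb < views.sql
```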
Then point `.env` at it:
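For example, assuming a `DATASIGHT_DATABASE` variable (the variable name is hypothetical; check datasight's configuration reference for the real one):

```bash
# .env in the project directory: variable name is an assumption.
DATASIGHT_DATABASE=/scratch/project/analytics/project.duckdb
```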
See Set up a project for schema descriptions and example queries.
4. Pick a workflow¶
For headless runs, scripts, and batch work, stay in the compute node shell:
```bash
datasight ask "Monthly generation trend" --format csv -o trend.csv
datasight ask --file questions.txt --output-dir batch-output
datasight profile
datasight audit-report -o audit.html
```
See Ask questions from the CLI for the full CLI reference.
Start the server on the compute node with a UNIX domain socket:
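A hypothetical invocation: the subcommand and flag names here are assumptions, so check datasight's server documentation for the real ones. The point is that a UNIX socket opens no TCP port on the compute node:

```bash
# Hypothetical flags; serve the UI over a UNIX domain socket.
datasight serve --uds /tmp/datasight.sock
```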
From your laptop, forward a local TCP port to that remote socket through the login node. Replace `compute-node-42` with your actual compute hostname (run `hostname` on the compute node, or `squeue --me --format="%N" --noheader` from the login node):
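One way to build the tunnel, assuming the server is listening on `/tmp/datasight.sock` (both hostnames and the socket path are placeholders; the path must match whatever the server was started with). OpenSSH has supported forwarding a local TCP port to a remote UNIX socket since 6.7:

```bash
# Local port 8084 -> remote UNIX socket, jumping through the login node.
ssh -J login.cluster.example -L 8084:/tmp/datasight.sock compute-node-42
```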
Open http://localhost:8084 in your browser.
This keeps datasight off the compute node's TCP listener table entirely. It also avoids choosing and coordinating an open port for each interactive job.
If your local OpenSSH build or site policy does not support UNIX-socket forwarding, fall back to plain TCP: datasight binds to loopback by default, so nothing is exposed beyond the node itself. Start it on the compute node:
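Assuming a serve-style subcommand and port flag (both names are assumptions; the port matches the `localhost:8084` URL used below):

```bash
# Hypothetical flags; binds 127.0.0.1 only, reachable solely via SSH.
datasight serve --port 8084
```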
Then forward a TCP port through the compute node itself:
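For example (hostnames are placeholders for your site):

```bash
# Local port 8084 -> the compute node's loopback 8084, via the login node.
ssh -J login.cluster.example -L 8084:localhost:8084 compute-node-42
```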
Open http://localhost:8084 in your browser.
Tips¶
- Since datasight runs on the compute node in this setup, set `ANTHROPIC_API_KEY` (or equivalent) in the compute node shell or the project's `.env`; a key exported only in your laptop environment never reaches it.
- Pre-aggregate in views for common queries. A `daily_summary` view is much faster than scanning raw parquet every time.
- Write a `schema_description.md`: the AI discovers table structure automatically, but domain context dramatically improves SQL quality.
- Watch the Slurm time limit. When the job ends, the datasight server stops and the tunnel breaks. Use `salloc` for interactive sessions you may want to extend.
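The pre-aggregation tip might look like this among the project's view definitions; the column names (`ts`, `output_mwh`) are assumptions about the raw data:

```sql
-- Illustrative sketch: one row per station per day instead of raw parquet scans.
CREATE VIEW daily_summary AS
SELECT
    station_id,
    date_trunc('day', ts) AS day,
    sum(output_mwh) AS total_mwh
FROM generation
GROUP BY station_id, day;
```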