Choosing an AI provider¶
datasight uses an AI model — also called an LLM (Large Language Model) — to translate your questions into SQL. You can use a hosted service like Anthropic Claude or OpenAI's GPT, or run a model locally with Ollama. The right choice depends on your data sensitivity, your budget, and the hardware you have available. This page helps you pick one without reading every provider's pricing page.
Quick decision guide¶
| Your situation | Start with |
|---|---|
| Trying datasight for the first time, non-sensitive data | Anthropic Claude Haiku or OpenAI GPT-4o-mini |
| Want zero cost, don't mind rate limits | GitHub Models (free tier, recommended over Ollama for most users) |
| Already have an OpenAI key | OpenAI (gpt-4o-mini or gpt-4.1-mini) |
| Data is sensitive and must not leave your network | Local Ollama (laptop or HPC GPU node) |
| Data is sensitive but you have a secure hosted endpoint | Anthropic on Bedrock or Azure OpenAI (custom base_url) |
| Writing SQL for a well-documented schema | Haiku / GPT-4o-mini is usually enough |
| Complex multi-step analytical questions, poor results from the cheap tier | Step up to Sonnet or GPT-4o |
When in doubt, start with Haiku. datasight's main job — turning a question into SQL against a documented schema — is not a frontier-model task, and Haiku handles it well for most projects.
Most users can stop reading here¶
Pick the option that matches your row in the table above, then head to Install datasight. The sections below cover advanced tradeoffs — data sensitivity policy, cost modeling, local GPU sizing, and network configuration — and are only worth reading if the table left you uncertain.
Factor 1: data sensitivity¶
This is the first question to answer, because it rules some options out.
- Non-sensitive or already-public data. Any hosted provider is fine. Only the SQL and sampled result rows leave your machine; datasight does not upload raw files.
- Sensitive data where a hosted API is acceptable under a BAA or enterprise agreement. Use a secure endpoint such as Anthropic on AWS Bedrock, Azure OpenAI, or a corporate gateway. Configure datasight with the provider's base_url pointing at your endpoint.
- Sensitive data that must not traverse the public internet at all. Run a local model via Ollama — on your laptop, or on an HPC GPU node (see below).
Note: even with a hosted API, the data values that reach the LLM are limited to column names, schema descriptions, example queries, and small result samples used for summarization. Full tables are never uploaded. That said, column names and sample rows can themselves be sensitive, so treat them accordingly.
Factor 2: cost¶
For the hosted options, rough order of magnitude (check current pricing — these move):
- GitHub Models — free for a generous monthly quota, rate-limited. Great for evaluation and light use. Provides access to GPT, Llama, and other open models through a single GitHub token. Note: the free tier caps requests at 8,000 tokens, which is easy to exceed on databases with many tables or wide tables. If you hit context-length errors, see Limit schema sent to the LLM.
- Cheap hosted tier (Anthropic Haiku, OpenAI GPT-4o-mini / GPT-4.1-mini) — typical datasight sessions cost a fraction of a penny.
- Mid hosted tier (Anthropic Sonnet, OpenAI GPT-4o) — roughly 5× the cheap tier. Noticeably better at ambiguous questions and multi-step reasoning.
- Top hosted tier (Anthropic Opus, OpenAI's largest model) — roughly 5× the mid tier. Rarely needed for datasight's workload. If the mid tier is struggling on your schema, better schema descriptions and example queries usually help more than jumping a tier.
A practical starting rule: use the cheap tier until you can point to specific questions it gets wrong, then try the mid tier on just those.
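To make the multipliers concrete, here is a back-of-envelope sketch. The token counts and per-million-token prices below are placeholders, not current quotes; substitute numbers from your provider's pricing page before relying on the result.
# Hypothetical session: ~20K input tokens (schema + question), ~2K output tokens
# Prices are illustrative USD per million tokens, not current quotes
IN_TOK=20000; OUT_TOK=2000
IN_PRICE=0.25; OUT_PRICE=1.25
echo "scale=4; ($IN_TOK*$IN_PRICE + $OUT_TOK*$OUT_PRICE) / 1000000" | bc   # ≈ .0075, well under a cent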
Factor 3: local models with Ollama¶
Local models cost nothing per query, keep data on your hardware, and work offline — at the price of needing GPU memory and slower inference than hosted APIs.
Sizing rule of thumb¶
VRAM needed ≈ model parameter count × bytes per parameter, plus some overhead for context.
- 4-bit quantized (Ollama default): ~0.5 GB per billion parameters
- 8-bit: ~1 GB per billion parameters
- 16-bit (fp16): ~2 GB per billion parameters
So a Llama 3.1 8B model fits in ~5 GB VRAM at 4-bit, a 70B model needs ~40 GB, and a 405B model needs ~200+ GB.
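The same rule is easy to apply by hand for any model you're considering; the lines below just restate the arithmetic (weights only, before the context overhead):
# Weights-only estimate: parameters (billions) x bytes per parameter
echo "8 * 0.5" | bc     # Llama 3.1 8B at 4-bit: ~4 GB
echo "70 * 0.5" | bc    # 70B at 4-bit: ~35 GB
echo "405 * 0.5" | bc   # 405B at 4-bit: ~202 GB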
On a laptop¶
| Laptop hardware | What fits comfortably |
|---|---|
| Apple Silicon with 16 GB unified memory | 7–8B models at 4-bit |
| Apple Silicon with 32 GB | 13B at 4-bit, or 8B at 8-bit |
| Apple Silicon with 64 GB+ | 34–70B at 4-bit, or sparse-MoE models like Qwen3.6 35B-A3B |
| NVIDIA laptop GPU, 8 GB VRAM | 7–8B at 4-bit |
| NVIDIA laptop GPU, 16 GB VRAM | 13B at 4-bit |
For datasight's SQL-generation workload, qwen2.5:7b is the recommended
starting point for CLI queries (datasight ask). For the web UI with
visualizations, step up to qwen2.5:14b — the 7B model struggles with
the more complex multi-step agent interactions required for chart
generation. Smaller models often struggle with realistic schemas.
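If you haven't pulled these models yet, the Ollama commands are (assuming Ollama is installed and its server is running):
ollama pull qwen2.5:7b    # CLI queries (datasight ask)
ollama pull qwen2.5:14b   # web UI with visualizations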
Apple Silicon: MLX-native models¶
If you're on Apple Silicon, models tagged -mlx-* use Apple's MLX
runtime and Metal compute. They typically decode 10–30% faster than the
equivalent GGUF model, but the resident memory can be much larger than
the weight size alone suggests because MLX allocates a large KV-cache
buffer for the model's default context window (often 256K tokens).
Measure before recommending to users — the model card's parameter count
is not a reliable predictor of laptop fit.
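A quick way to measure is to load the model once and then ask Ollama what it is actually holding resident; on recent Ollama versions the SIZE column of ollama ps reflects the loaded allocation (including the context buffer), not just the weight file:
ollama run gemma4:e2b-mlx-bf16 "hello"   # load the model with a trivial prompt
ollama ps                                # SIZE column shows resident memory, not weight size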
Measured on a single benchmark dataset (5 questions, agent loop with tool calls, Ollama server keep-alive at default 5 min) on a Mac with unified memory:
| Model | Decode (tok/s) | Resident memory (incl. KV cache) | Answer style |
|---|---|---|---|
| qwen2.5:7b (q4_K_M, GGUF) | ~85 | ~2 GB | Middle: substantive but can hit max_tokens |
| gemma4:e2b-mlx-bf16 | ~95 | ~11 GB | Tersest: dumps data tables, minimal analysis |
| qwen3.6:35b-a3b-coding-mxfp8 | ~90 | ~38 GB | Richest: includes slopes, R², regional context |
The headline surprise: gemma4:e2b-mlx-bf16 is not a low-memory
option, despite the "e2b" (effective 2B) naming. Its weights are
small but the default 256K-token context allocation dominates resident
memory. Use it on 32 GB+ Macs only.
The benchmark above measures datasight ask only. The other LLM-using
commands have very different shapes, and observed behavior on the
two qwen3.6 variants splits cleanly along those shapes:
| Workload | Calls a tool? | Output budget | Best of the two |
|---|---|---|---|
| datasight ask | yes (multi-turn agent) | small per turn | either; coding MoE for richer prose |
| datasight tidy review (LLM advisor) | yes (single propose_reshapes call) | 4 K | general qwen3.6 |
| datasight grounding repair | no (long-form file rewrite) | 16 K | qwen3.6:35b-a3b-coding-mxfp8 |
The split is consistent with what code-specialized fine-tunes are known
to trade: better long-form structured generation (winning grounding
repair, where the prompt and the output are both large) at the cost of
weaker tool-call adherence (losing tidy review, where the model has
to emit a structured tool call instead of free text). Observed in
practice: the coding variant silently emitted zero proposals on
tidy review's propose_reshapes, while the general variant timed out
on grounding repair against the same database.
Practical setup: pull both models. Use qwen3.6 as your default
OLLAMA_MODEL, and override per-call where the coding variant wins:
datasight grounding repair --model qwen3.6:35b-a3b-coding-mxfp8
datasight tidy review --model qwen3.6 # explicit default; useful in scripts
Both tidy review and grounding repair accept --model, as do
ask, verify, and run.
Apple Silicon recommendations by RAM tier:
| Unified memory | Recommended model | Why |
|---|---|---|
| 16 GB | qwen2.5:7b (GGUF) | Only option that fits with headroom for the OS, browser, and IDE. |
| 32 GB | qwen2.5:7b or gemma4:e2b-mlx-bf16 | Either fits. Gemma is faster but its answers are tersest; pick based on whether you want interpretation or just raw data. |
| 48 GB+ | Both qwen3.6 and qwen3.6:35b-a3b-coding-mxfp8 (switch per command) | Sparse MoE (3B active params) — properly leverages Apple Silicon's unified memory + Metal. The two variants are complementary, not interchangeable: general for tool-use commands (ask, tidy review), coding for long-form generation (grounding repair, ask when you want richer prose). See workload table above. |
If you have an Apple Silicon machine but aren't sure which tag to use,
start with qwen2.5:7b (the cross-platform recommendation above). It
works on every backend and has the smallest memory footprint by far.
On an HPC GPU node¶
If your HPC has GPU nodes, they typically unlock much larger models.
NLR Kestrel as a concrete example. Kestrel has 156 GPU nodes, each with 4 NVIDIA H100 SXM GPUs (80 GB VRAM each, 320 GB per node) and 384–1536 GB system RAM. On a single Kestrel GPU node you can run:
- Llama 3.1 70B at fp16 (~140 GB) with headroom to spare
- Llama 3.1 405B at 4-bit quantization (~200 GB) across the 4 GPUs
- Multiple mid-sized models concurrently
Kestrel's debug partition lets you request up to half a GPU node for
4 hours without a large allocation — a practical way to try local
models before committing resources.
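The exact request syntax is site-specific. As a rough sketch, assuming Slurm (which Kestrel uses), an interactive debug-partition session might look like the following; check your site's documentation for the real partition, account, and GPU flags:
# Illustrative only: partition, GPU count, and account syntax vary by site
salloc --partition=debug --gpus=2 --time=4:00:00 --account=<your_allocation>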
See Run on an HPC compute node for the deployment pattern (datasight runs on the compute node and the web UI is typically forwarded back to your laptop over SSH, preferably via a UNIX domain socket).
When hosted beats local¶
A hosted cheap-tier call (Haiku or GPT-4o-mini) often produces better SQL than a locally-run 8B model, at a fraction of a cent. GitHub Models offers a free tier that handles the full datasight feature set — including visualizations — better than most local models. Don't reach for local models just to avoid hosted costs — reach for them when data sensitivity or offline use requires it.
Factor 4: where the LLM call originates¶
datasight makes its LLM calls from wherever the datasight process is running. That matters when you're combining a remote data backend with any kind of policy or network constraint:
- datasight on your laptop + local data — LLM call from laptop.
- datasight on an HPC compute node — LLM call from the compute node. Good fit if you want to use the compute node's GPU for a local model, or if hosted API keys are configured there.
- datasight on your laptop + remote Flight SQL backend on HPC — LLM call from laptop, SQL executed on HPC. Good fit if your laptop has the GPU you want to use, or if compute-node egress to hosted APIs is blocked.
See the two HPC how-tos for the tradeoffs: Run on an HPC compute node and Connect to a remote Flight SQL backend.
Configuring your choice¶
Once you've picked a provider, see the Install and configure an LLM how-to for the exact environment variables. The short version:
# Anthropic (default)
ANTHROPIC_API_KEY=sk-ant-...
# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
# GitHub Models
LLM_PROVIDER=github
GITHUB_TOKEN=ghp-...
# Ollama (local — use for cost/data-security reasons)
LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b # CLI queries; use qwen2.5:14b for web UI with viz
A secure hosted endpoint (Bedrock, Azure OpenAI, corporate proxy) is
configured by setting ANTHROPIC_BASE_URL or OPENAI_BASE_URL alongside
the credentials.
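For example, an OpenAI-compatible gateway might be configured like this (the URL is a placeholder for whatever your endpoint exposes):
# Secure hosted endpoint (hypothetical gateway URL)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://llm-gateway.example.internal/v1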