Choosing an AI provider¶
datasight uses an AI model — also called an LLM (Large Language Model) — to translate your questions into SQL. You can use a hosted service like Anthropic Claude or OpenAI's GPT, or run a model locally with Ollama. The right choice depends on your data sensitivity, your budget, and the hardware you have available. This page helps you pick one without reading every provider's pricing page.
Quick decision guide¶
| Your situation | Start with |
|---|---|
| Trying datasight for the first time, non-sensitive data | Anthropic Claude Haiku or OpenAI GPT-4o-mini |
| Want zero cost, don't mind rate limits | GitHub Models (free tier, recommended over Ollama for most users) |
| Already have an OpenAI key | OpenAI (gpt-4o-mini or gpt-4.1-mini) |
| Data is sensitive and must not leave your network | Local Ollama (laptop or HPC GPU node) |
| Data is sensitive but you have a secure hosted endpoint | Anthropic on Bedrock or Azure OpenAI (custom base_url) |
| Writing SQL for a well-documented schema | Haiku / GPT-4o-mini is usually enough |
| Complex multi-step analytical questions, poor results from the cheap tier | Step up to Sonnet or GPT-4o |
When in doubt, start with Haiku. datasight's main job — turning a question into SQL against a documented schema — is not a frontier-model task, and Haiku handles it well for most projects.
Most users can stop reading here¶
Pick the option that matches your row in the table above, then head to Install datasight. The sections below cover advanced tradeoffs — data sensitivity policy, cost modeling, local GPU sizing, and network configuration — and are only worth reading if the table left you uncertain.
Factor 1: data sensitivity¶
This is the first question to answer, because it rules some options out.
- Non-sensitive or already-public data. Any hosted provider is fine. Only the SQL and sampled result rows leave your machine; datasight does not upload raw files.
- Sensitive data where a hosted API is acceptable under a BAA or enterprise agreement. Use a secure endpoint such as Anthropic on AWS Bedrock, Azure OpenAI, or a corporate gateway. Configure datasight with the provider's base_url pointing at your endpoint.
- Sensitive data that must not traverse the public internet at all. Run a local model via Ollama — on your laptop, or on an HPC GPU node (see below).
Note: even with a hosted API, the data values that reach the LLM are limited to column names, schema descriptions, example queries, and small result samples used for summarization. Full tables are never uploaded. That said, column names and sample rows can themselves be sensitive, so treat them accordingly.
Factor 2: cost¶
For the hosted options, rough order of magnitude (check current pricing — these move):
- GitHub Models — free for a generous monthly quota, rate-limited. Great for evaluation and light use. Provides access to GPT, Llama, and other open models through a single GitHub token. Note: the free tier caps requests at 8,000 tokens, which is easy to exceed on databases with many tables or wide tables. If you hit context-length errors, see Limit schema sent to the LLM.
- Cheap hosted tier (Anthropic Haiku, OpenAI GPT-4o-mini / GPT-4.1-mini) — typical datasight sessions cost a fraction of a penny.
- Mid hosted tier (Anthropic Sonnet, OpenAI GPT-4o) — roughly 5× the cheap tier. Noticeably better at ambiguous questions and multi-step reasoning.
- Top hosted tier (Anthropic Opus, OpenAI's largest model) — roughly 5× the mid tier. Rarely needed for datasight's workload. If the mid tier is struggling on your schema, better schema descriptions and example queries usually help more than jumping a tier.
A practical starting rule: use the cheap tier until you can point to specific questions it gets wrong, then try the mid tier on just those.
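To make the multipliers concrete, here is a back-of-envelope sketch. The token counts and per-million-token prices below are placeholders, not current quotes; substitute numbers from your provider's pricing page before relying on the result.
# Hypothetical session: ~20K input tokens (schema + question), ~2K output tokens
# Prices are illustrative USD per million tokens, not current quotes
IN_TOK=20000; OUT_TOK=2000
IN_PRICE=0.25; OUT_PRICE=1.25
echo "scale=4; ($IN_TOK*$IN_PRICE + $OUT_TOK*$OUT_PRICE) / 1000000" | bc   # ≈ .0075, well under a cent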
Factor 3: local models with Ollama¶
Local models cost nothing per query, keep data on your hardware, and work offline — at the price of needing GPU memory and slower inference than hosted APIs.
Sizing rule of thumb¶
VRAM needed ≈ model parameter count × bytes per parameter, plus some overhead for context.
- 4-bit quantized (Ollama default): ~0.5 GB per billion parameters
- 8-bit: ~1 GB per billion parameters
- 16-bit (fp16): ~2 GB per billion parameters
So a Llama 3.1 8B model fits in ~5 GB VRAM at 4-bit, a 70B model needs ~40 GB, and a 405B model needs ~200+ GB.
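The same rule is easy to apply by hand for any model you're considering; the lines below just restate the arithmetic (weights only, before the context overhead):
# Weights-only estimate: parameters (billions) x bytes per parameter
echo "8 * 0.5" | bc     # Llama 3.1 8B at 4-bit: ~4 GB
echo "70 * 0.5" | bc    # 70B at 4-bit: ~35 GB
echo "405 * 0.5" | bc   # 405B at 4-bit: ~202 GB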
On a laptop¶
| Laptop hardware | What fits comfortably |
|---|---|
| Apple Silicon with 16 GB unified memory | 7–8B models at 4-bit |
| Apple Silicon with 32 GB | 13B at 4-bit, or 8B at 8-bit |
| Apple Silicon with 64 GB+ | 34–70B at 4-bit, or sparse-MoE models like Qwen3.6 35B-A3B |
| NVIDIA laptop GPU, 8 GB VRAM | 7–8B at 4-bit |
| NVIDIA laptop GPU, 16 GB VRAM | 13B at 4-bit |
For datasight's SQL-generation workload, qwen2.5:7b is the recommended
starting point for CLI queries (datasight ask). For the web UI with
visualizations, step up to qwen2.5:14b — the 7B model struggles with
the more complex multi-step agent interactions required for chart
generation. Smaller models often struggle with realistic schemas.
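If you haven't pulled these models yet, the Ollama commands are (assuming Ollama is installed and its server is running):
ollama pull qwen2.5:7b    # CLI queries (datasight ask)
ollama pull qwen2.5:14b   # web UI with visualizations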
Apple Silicon: MLX-native models¶
If you're on Apple Silicon, models tagged -mlx-* use Apple's MLX
runtime and Metal compute. They typically decode 10–30% faster than the
equivalent GGUF model, but the resident memory can be much larger than
the weight size alone suggests because MLX allocates a large KV-cache
buffer for the model's default context window (often 256K tokens).
Measure before recommending to users — the model card's parameter count
is not a reliable predictor of laptop fit.
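A quick way to measure is to load the model once and then ask Ollama what it is actually holding resident; on recent Ollama versions the SIZE column of ollama ps reflects the loaded allocation (including the context buffer), not just the weight file:
ollama run gemma4:e2b-mlx-bf16 "hello"   # load the model with a trivial prompt
ollama ps                                # SIZE column shows resident memory, not weight size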
Measured on a single benchmark dataset (5 questions, agent loop with tool calls, Ollama server keep-alive at default 5 min) on a Mac with unified memory:
| Model | Decode (tok/s) | Resident memory (incl. KV cache) | Answer style |
|---|---|---|---|
| qwen2.5:7b (q4_K_M, GGUF) | ~85 | ~2 GB | Middle: substantive but can hit max_tokens |
| gemma4:e2b-mlx-bf16 | ~95 | ~11 GB | Tersest: dumps data tables, minimal analysis |
| qwen3.6:35b-a3b-coding-mxfp8 | ~90 | ~38 GB | Richest: includes slopes, R², regional context |
The headline surprise: gemma4:e2b-mlx-bf16 is not a low-memory
option, despite the "e2b" (effective 2B) naming. Its weights are
small but the default 256K-token context allocation dominates resident
memory. Use it on 32 GB+ Macs only.
The benchmark above measures datasight ask only. The other LLM-using
commands have very different shapes, and observed behavior on the
two qwen3.6 variants splits cleanly along those shapes:
| Workload | Calls a tool? | Output budget | Best of the two |
|---|---|---|---|
| datasight ask | yes (multi-turn agent) | small per turn | either; coding MoE for richer prose |
| datasight tidy review (LLM advisor) | yes (single propose_reshapes call) | 4 K | general qwen3.6 |
| datasight grounding repair | no (long-form file rewrite) | 16 K | qwen3.6:35b-a3b-coding-mxfp8 |
The split is consistent with what code-specialized fine-tunes are known
to trade: better long-form structured generation (winning grounding
repair, where the prompt and the output are both large) at the cost of
weaker tool-call adherence (losing tidy review, where the model has
to emit a structured tool call instead of free text). Observed in
practice: the coding variant silently emitted zero proposals on
tidy review's propose_reshapes, while the general variant timed out
on grounding repair against the same database.
Practical setup: pull both models. Use qwen3.6 as your default
OLLAMA_MODEL, and override per-call where the coding variant wins:
datasight grounding repair --model qwen3.6:35b-a3b-coding-mxfp8
datasight tidy review --model qwen3.6 # explicit default; useful in scripts
Both tidy review and grounding repair accept --model, as do
ask, verify, and run.
Apple Silicon recommendations by RAM tier:
| Unified memory | Recommended model | Why |
|---|---|---|
| 16 GB | qwen2.5:7b (GGUF) | Only option that fits with headroom for the OS, browser, and IDE. |
| 32 GB | qwen2.5:7b or gemma4:e2b-mlx-bf16 | Either fits. Gemma is faster but its answers are tersest; pick based on whether you want interpretation or just raw data. |
| 48 GB+ | Both qwen3.6 and qwen3.6:35b-a3b-coding-mxfp8 (switch per command) | Sparse MoE (3B active params) — properly leverages Apple Silicon's unified memory + Metal. The two variants are complementary, not interchangeable: general for tool-use commands (ask, tidy review), coding for long-form generation (grounding repair, ask when you want richer prose). See workload table above. |
If you have an Apple Silicon machine but aren't sure which tag to use,
start with qwen2.5:7b (the cross-platform recommendation above). It
works on every backend and has the smallest memory footprint by far.
On an HPC GPU node¶
If your HPC has GPU nodes, they typically unlock much larger models.
NLR Kestrel as a concrete example. Kestrel has 156 GPU nodes, each with 4 NVIDIA H100 SXM GPUs (80 GB VRAM each, 320 GB per node) and 384–1536 GB system RAM. On a single Kestrel GPU node you can run:
- Llama 3.1 70B at fp16 (~140 GB) with headroom to spare
- Llama 3.1 405B at 4-bit quantization (~200 GB) across the 4 GPUs
- Multiple mid-sized models concurrently
Kestrel's debug partition lets you request up to half a GPU node for
4 hours without a large allocation — a practical way to try local
models before committing resources.
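The exact request syntax is site-specific. As a rough sketch, assuming Slurm (which Kestrel uses), an interactive debug-partition session might look like the following; check your site's documentation for the real partition, account, and GPU flags:
# Illustrative only: partition, GPU count, and account syntax vary by site
salloc --partition=debug --gpus=2 --time=4:00:00 --account=<your_allocation>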
See Run on an HPC compute node for the deployment pattern (datasight runs on the compute node and the web UI is typically forwarded back to your laptop over SSH, preferably via a UNIX domain socket).
When hosted beats local¶
A hosted cheap-tier call (Haiku or GPT-4o-mini) often produces better SQL than a locally-run 8B model, at a fraction of a cent. GitHub Models offers a free tier that handles the full datasight feature set — including visualizations — better than most local models. Don't reach for local models just to avoid hosted costs — reach for them when data sensitivity or offline use requires it.
Factor 4: where the LLM call originates¶
datasight makes its LLM calls from wherever the datasight process is running. That matters when you're combining a remote data backend with any kind of policy or network constraint:
- datasight on your laptop + local data — LLM call from laptop.
- datasight on an HPC compute node — LLM call from the compute node. Good fit if you want to use the compute node's GPU for a local model, or if hosted API keys are configured there.
- datasight on your laptop + remote Flight SQL backend on HPC — LLM call from laptop, SQL executed on HPC. Good fit if your laptop has the GPU you want to use, or if compute-node egress to hosted APIs is blocked.
See the two HPC how-tos for the tradeoffs: Run on an HPC compute node and Connect to a remote Flight SQL backend.
Configuring your choice¶
Once you've picked a provider, see the Install and configure an LLM how-to for the exact environment variables. The short version:
# Anthropic (default)
ANTHROPIC_API_KEY=sk-ant-...
# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
# GitHub Models
LLM_PROVIDER=github
GITHUB_TOKEN=ghp-...
# Ollama (local — use for cost/data-security reasons)
LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b # CLI queries; use qwen2.5:14b for web UI with viz
A secure hosted endpoint (Bedrock, Azure OpenAI, corporate proxy) is
configured by setting ANTHROPIC_BASE_URL or OPENAI_BASE_URL alongside
the credentials.
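For example, an OpenAI-compatible gateway might be configured like this (the URL is a placeholder for whatever your endpoint exposes):
# Secure hosted endpoint (hypothetical gateway URL)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://llm-gateway.example.internal/v1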