# Set up a project
This guide walks you through creating a datasight project for your database.
## Prerequisites
- Python 3.11+
- A database: DuckDB file, SQLite file, PostgreSQL server, or CSV/Parquet files
- An LLM provider: Anthropic API key, OpenAI API key, GitHub account (free, public), or Ollama (free, local). Not sure which? See Choosing an LLM.
## Install datasight
Don't have uv yet? See Install datasight. All database backends (DuckDB, SQLite, PostgreSQL) and LLM providers (Anthropic, GitHub Models, Ollama) are included.
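The install command itself is missing from this extract. Assuming datasight is installable as a uv tool under that package name (an assumption; see the install page for the authoritative command), it would look like:

```shell
# Assumed package name and install method; check "Install datasight"
# for the authoritative command.
uv tool install datasight
```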
## Store API keys once (recommended)
If you haven't already, create a user-global config file for your API keys and tokens. You only do this once per machine:
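The command is `datasight config init` (it is referenced by name later in this guide):

```shell
datasight config init
```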
This writes ~/.config/datasight/.env (honors XDG_CONFIG_HOME). Edit it
and uncomment the credentials you actually use — ANTHROPIC_API_KEY,
OPENAI_API_KEY, and/or GITHUB_TOKEN. Every datasight project on this
machine will pick up these values automatically, so each project's .env
only has to set provider, model, and database. Per-project .env values
still override the global file when you need a different key for one
project. See Configuration reference for
the full precedence rules.
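The layering can be pictured as a two-level merge: the global file supplies defaults, and the project `.env` wins on conflicts. A minimal sketch in Python (the `LLM_MODEL` key name is illustrative, not a documented variable):

```python
# Sketch of the described precedence: project .env values override the
# user-global ~/.config/datasight/.env values; everything else falls through.
def resolve_env(global_env: dict, project_env: dict) -> dict:
    merged = dict(global_env)   # start from the global defaults
    merged.update(project_env)  # project-level values win on conflict
    return merged

global_env = {"ANTHROPIC_API_KEY": "sk-global-key", "LLM_MODEL": "default"}
project_env = {"LLM_MODEL": "claude-sonnet"}  # override just the model

resolved = resolve_env(global_env, project_env)
```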
## Create a project
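The scaffolding command is not shown in this extract; assuming the conventional subcommand name (unverified), it would be:

```shell
# Assumed subcommand name; check `datasight --help` for the real one.
datasight init
```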
This creates four template files:
- `.env`: API key and database connection settings.
- `schema_description.md`: Describe your database for the AI.
- `queries.yaml`: Example question/SQL pairs.
- `time_series.yaml`: Declare temporal structure for completeness checks. See Declare time series.
## Configure
Edit .env with your database path and LLM settings. The examples below
assume you've already run datasight config init and put your API keys in
~/.config/datasight/.env. If not, you can also paste the key directly
into the project .env — both work. For guidance on picking a provider —
data sensitivity, cost tiers, local vs hosted — see
Choosing an LLM.
Option A — Anthropic (cloud API):
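A sketch of the relevant `.env` lines (the `LLM_PROVIDER` and `LLM_MODEL` key names and the model id are assumptions; your generated template shows the exact keys):

```ini
LLM_PROVIDER=anthropic          # assumed key name
LLM_MODEL=claude-sonnet-4-5     # assumed; any current Anthropic model
# ANTHROPIC_API_KEY=sk-ant-...  # only if not set in the global config
```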
Option B — OpenAI (cloud API):
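A sketch of the relevant `.env` lines (key names and model id are assumptions; your generated template shows the exact keys):

```ini
LLM_PROVIDER=openai             # assumed key name
LLM_MODEL=gpt-4o                # assumed; any current OpenAI model
# OPENAI_API_KEY=sk-...         # only if not set in the global config
```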
Option C — GitHub Models (free, public):
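A sketch of the relevant `.env` lines (key names and model id are assumptions; your generated template shows the exact keys):

```ini
LLM_PROVIDER=github             # assumed key name
LLM_MODEL=gpt-4o                # assumed; a model from the GitHub Models catalog
# GITHUB_TOKEN=ghp_...          # only if not set in the global config
```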
Option D — Ollama (local, no API key):
First, install and start Ollama, then pull a model with tool-calling support:
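For example, pulling the 7B model recommended in this guide:

```shell
ollama pull qwen2.5:7b
```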
Then configure .env:
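A sketch of the relevant `.env` lines (the key names are assumptions; your generated template shows the exact keys):

```ini
LLM_PROVIDER=ollama                   # assumed key name
LLM_MODEL=qwen2.5:7b
# OLLAMA_HOST=http://localhost:11434  # assumed; Ollama's default endpoint
```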
qwen2.5:7b works well for CLI queries. For the web UI with visualizations,
use qwen2.5:14b (ollama pull qwen2.5:14b). For the best experience,
consider GitHub Models (free tier) instead — only use Ollama when cost or
data-security requirements demand local inference.
Using SQLite or PostgreSQL? Set DB_MODE accordingly:
```ini
# SQLite
DB_MODE=sqlite
DB_PATH=./my_database.sqlite

# PostgreSQL
DB_MODE=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=mydb
POSTGRES_USER=datasight
POSTGRES_PASSWORD=secret
# Or use a connection string instead:
# POSTGRES_URL=postgresql://user:pass@host:5432/dbname
```
See Configuration reference for all PostgreSQL options.
## Auto-generate documentation
Instead of writing schema_description.md and queries.yaml by hand,
you can let the AI generate them from your database:
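With the database already configured in `.env`, the bare invocation (consistent with the file-based forms shown next) is presumably:

```shell
datasight generate
```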
You can also pass existing database files or tabular files directly instead of configuring a database first:
```shell
# Existing DuckDB database: referenced directly
datasight generate generation.duckdb

# Existing SQLite database: referenced directly
datasight generate generation.sqlite

# CSV inputs: datasight creates ./database.duckdb
datasight generate generation.csv plants.csv

# Parquet inputs: datasight creates ./database.duckdb
datasight generate generation.parquet plants.parquet

# CSV inputs with a custom output DuckDB path
datasight generate generation.csv plants.csv --db-path db/project.duckdb

# Parquet inputs with a custom output DuckDB path
datasight generate generation.parquet plants.parquet --db-path db/project.duckdb
```
For a single existing DuckDB or SQLite file, datasight creates or updates
.env to point at that database directly.
When CSV, Parquet, or mixed file inputs are passed, datasight also:
- Writes a persistent DuckDB file (`database.duckdb` by default in the project directory) with views pointing at each input file.
- Creates or updates `.env` with `DB_MODE=duckdb` and `DB_PATH` set to the new database. Existing entries are replaced in place; other env values (like `ANTHROPIC_API_KEY`) are preserved.
Use --db-path <path> to write that generated DuckDB somewhere else
(for example, --db-path db/project.duckdb). The path may be absolute or
relative to --project-dir. Do not use --db-path with a single existing
DuckDB or SQLite database; those inputs are referenced directly.
The command connects to your database (or creates an ephemeral one from
the given files), inspects tables and columns, samples code/enum columns
to identify their meanings, and produces draft versions of
schema_description.md and queries.yaml. Review and edit the results — the AI
gives you a solid starting point, but you know your data best.
It also seeds a measures.yaml file for project-specific semantic
overrides and a time_series.yaml file for temporal completeness
declarations (see Declare time series).
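The introspection pass is conceptually similar to what you could do by hand with the stdlib `sqlite3` module. An illustrative sketch (not datasight's actual implementation; the table and column names are made up):

```python
import sqlite3

# Illustrative only: mimic the introspect-and-sample pass described above,
# using Python's stdlib sqlite3 against an in-memory example database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE generation (plant_id INTEGER, fuel_code TEXT, net_gen_mwh REAL);
    INSERT INTO generation VALUES (1, 'NG', 120.5), (2, 'SUN', 33.1), (3, 'NG', 88.0);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

schema = {}
for table in tables:
    # PRAGMA table_info returns (cid, name, type, ...) per column
    cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
    # Sample distinct values so code/enum columns can be documented
    samples = {
        col: [r[0] for r in conn.execute(
            f"SELECT DISTINCT {col} FROM {table} LIMIT 5")]
        for col in cols
    }
    schema[table] = {"columns": cols, "samples": samples}
```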
To regenerate after making database changes:
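Presumably you re-run the same command; keep your manual edits under version control first, since this extract does not say whether regeneration overwrites or merges them:

```shell
datasight generate
```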
## Manual editing
You can also write these files by hand, or refine the generated versions.
Edit schema_description.md to explain your data — domain concepts, column
meanings, code lookups, and query tips. The AI uses this context to write
better SQL. See Write a schema description for guidance.
Edit queries.yaml with example questions and their correct SQL. See
Create example queries for guidance.
If your project contains energy metrics, rates, or project-specific formulas,
edit measures.yaml to lock in semantic behavior such as:
- default aggregation
- weighted-average columns
- display name and numeric format
- preferred chart types
- calculated measures such as `net_load_mw`
See Configure semantic measures for the full measures.yaml workflow.
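A hypothetical measures.yaml entry covering those overrides (the key names and formulas are illustrative, not the documented schema; see the measures guide for the real one):

```yaml
net_gen_mwh:
  display_name: "Net Generation (MWh)"   # display name and numeric format
  default_aggregation: sum
  format: ",.1f"
  preferred_chart: bar                   # preferred chart type

avg_price_usd_mwh:
  default_aggregation: weighted_average
  weight_column: net_gen_mwh             # weighted-average column

net_load_mw:                             # calculated measure named in the docs
  expression: "demand_mw - generation_mw"  # illustrative formula
```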
## Run
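Start the server from the project directory; the command name `datasight run` appears in the startup diagram below:

```shell
datasight run
```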
Open http://localhost:8084 in your browser. The sidebar shows your database tables, recipes, example queries, and saved artifacts. The landing page also lets you start with guided deterministic workflows such as:
- profiling the dataset
- surfacing key dimensions
- finding likely trend charts
- auditing nulls and suspicious ranges
After that first pass, type a question in plain English and the AI will write SQL, run it, and display the results. Ask for a chart and it will generate an interactive Plotly visualization.
## Headless mode
You can also ask questions from the command line without starting a web server:
```shell
datasight ask "What are the top 10 records by the largest numeric column?"
datasight ask "Show trends over time" --chart-format html -o chart.html
datasight ask "Top 5 states" --format csv -o results.csv
datasight ask --file questions.txt --output-dir batch-output
datasight profile
datasight quality --table generation_fuel
datasight dimensions --table generation_fuel
datasight trends --table generation_fuel
```
See Ask questions from the CLI for batch mode, export options, and diagnostics.
## What happens at startup
```mermaid
flowchart TB
    A[datasight run] --> B[Load .env: project values override global]
    B --> C[Connect to database]
    C --> D[Introspect schema]
    D --> E[Load schema_description.md]
    E --> F[Load queries.yaml]
    F --> G[Start FastAPI server]
    style A fill:#15a8a8,stroke:#023d60,color:#fff
    style B fill:#8a7d55,stroke:#6b6040,color:#fff
    style C fill:#2e7ebb,stroke:#1a5c8a,color:#fff
    style D fill:#2e7ebb,stroke:#1a5c8a,color:#fff
    style E fill:#8a7d55,stroke:#6b6040,color:#fff
    style F fill:#8a7d55,stroke:#6b6040,color:#fff
    style G fill:#fe5d26,stroke:#023d60,color:#fff
```
datasight auto-discovers your tables, columns, and row counts, then combines that with your description and example queries to give the AI full context about your database.