Database connections¶
datasight supports five database backends. This guide explains when to use each one and how to configure the connection.
Choosing a database¶
|  | DuckDB | SQLite | PostgreSQL | Flight SQL | Spark |
|---|---|---|---|---|---|
| Best for | Local analytics on Parquet/CSV files | Existing SQLite databases from other apps | Production databases, multi-user access | Remote HPC or distributed query engines | Multi-TB datasets on a Spark cluster |
| Install | Built in | Built in | Built in | Built in | `pip install 'datasight[spark]'` |
| `DB_MODE` | `duckdb` | `sqlite` | `postgres` | `flightsql` | `spark` |
| Connection | Local file path | Local file path | Host/port or connection string | gRPC URI | Spark Connect URI |
| Concurrent users | Single process | Single process | Multi-user | Multi-user | Multi-user |
| SQL dialect | DuckDB SQL (Postgres-like) | SQLite SQL | PostgreSQL | Depends on server | Spark SQL |
DuckDB is the default and recommended for most use cases. It is an embedded OLAP (Online Analytical Processing) database designed for analytical queries — aggregations, joins, window functions, and scans over large datasets are significantly faster than in SQLite or a typical PostgreSQL configuration. It reads Parquet and CSV files natively and requires no external server.
SQLite is useful when you already have a .sqlite or .db file from
another application (Django, mobile apps, embedded systems) and want to
explore it without converting to another format.
PostgreSQL is for connecting to an existing Postgres server — production databases, data warehouses, or managed services like RDS or Cloud SQL.
Flight SQL is for remote query engines that speak the Arrow Flight SQL protocol, such as GizmoSQL on an HPC cluster. See Connect to a remote Flight SQL backend for a full walkthrough.
Spark connects to an Apache Spark cluster via Spark Connect. Use this when your data is too large to fit on a single machine — typically multi-terabyte Parquet/Delta/Iceberg tables managed by Spark. The client streams Arrow batches and truncates the result once it exceeds a byte cap (default 100 MiB), so the web server stays responsive even when the agent forgets to aggregate.
DuckDB¶
DuckDB is the default — no extra install or configuration needed beyond pointing to a database file.
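A minimal configuration might look like the following. The path variable name shown here (`DUCKDB_PATH`) is an assumption — check the Configuration reference for the exact variable your version uses:

```
DB_MODE=duckdb
DUCKDB_PATH=/data/analytics.duckdb  # hypothetical variable name — see the Configuration reference
```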
Why DuckDB for data exploration¶
DuckDB is purpose-built for OLAP workloads — the kind of analytical queries that datasight generates (aggregations, GROUP BY, joins across large tables, window functions). Compared to row-oriented databases like SQLite and PostgreSQL:
- Columnar storage — reads only the columns a query needs, so `SELECT state, SUM(mwh) FROM generation GROUP BY state` scans far less data than a row store would.
- Vectorized execution — processes data in batches using SIMD instructions, making aggregations and scans significantly faster.
- Zero configuration — no server process, no connection management, no tuning. Just a file.
- Native file format support — queries Parquet, CSV, and JSON files directly via SQL without an import step.
- Rich SQL dialect — supports `DATE_TRUNC`, `UNNEST`, `PIVOT`, window functions, CTEs, and other analytical SQL features out of the box.
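A sketch of the kind of query these features enable, combining `DATE_TRUNC`, `GROUP BY`, and a window function (the `generation` table and its columns are illustrative):

```sql
-- Monthly generation per state with a cumulative running total
SELECT
    state,
    DATE_TRUNC('month', reading_ts) AS month,
    SUM(mwh) AS monthly_mwh,
    SUM(SUM(mwh)) OVER (
        PARTITION BY state
        ORDER BY DATE_TRUNC('month', reading_ts)
    ) AS cumulative_mwh
FROM generation
GROUP BY state, DATE_TRUNC('month', reading_ts);
```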
For datasets up to tens of gigabytes on a single machine, DuckDB will typically outperform PostgreSQL for analytical queries without any tuning.
Querying files directly¶
DuckDB can query Parquet and CSV files directly using SQL — no import step, no data duplication. Create lightweight views that point at your files and datasight treats them like regular tables. Excel workbooks are also supported, with each sheet materialized as a DuckDB table.
See Query files directly for a full guide covering globs, Hive partitioning, Excel workbooks, and tips.
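The view-based pattern can be sketched like this, assuming a hypothetical directory of Parquet files (`read_parquet` is DuckDB's native Parquet reader):

```sql
-- Expose Parquet files as a queryable view — no import, no data duplication
CREATE VIEW generation AS
SELECT * FROM read_parquet('data/generation/*.parquet');
```

Once the view exists, datasight sees `generation` like any other table.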
SQLite¶
SQLite support is built in — no extra install needed.
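A minimal configuration might look like the following. The path variable name shown here (`SQLITE_PATH`) is an assumption — check the Configuration reference for the exact variable your version uses:

```
DB_MODE=sqlite
SQLITE_PATH=/path/to/app.db  # hypothetical variable name — see the Configuration reference
```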
Limitations compared to DuckDB¶
- No `DATE_TRUNC`, `EXTRACT`, or other advanced date functions — SQLite stores dates as text, so date-based queries may need `strftime()` instead.
- No window function support before SQLite 3.25.
- No native Parquet or CSV reading.
Tip
If your schema description mentions date columns, note the storage format (e.g. "ISO 8601 text") and any preferred date functions. This helps the AI write correct date queries for SQLite.
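For instance, grouping by month on an ISO 8601 text column uses `strftime()` in SQLite (the table and column names here are illustrative):

```sql
-- SQLite equivalent of DATE_TRUNC('month', ...) for ISO 8601 text dates
SELECT strftime('%Y-%m', order_date) AS month,
       COUNT(*) AS orders
FROM orders
GROUP BY strftime('%Y-%m', order_date);
```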
PostgreSQL¶
Connect with individual fields¶
DB_MODE=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=mydb
POSTGRES_USER=datasight
POSTGRES_PASSWORD=secret
Connect with a URL¶
A connection string takes precedence over individual fields:
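For example:

```
DB_MODE=postgres
POSTGRES_URL=postgresql://datasight:secret@localhost:5432/mydb
```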
SSL configuration¶
| `POSTGRES_SSLMODE` | Behavior |
|---|---|
| `disable` | No SSL |
| `prefer` (default) | Use SSL if available, fall back to plain |
| `require` | Require SSL, don't verify certificate |
| `verify-ca` | Require SSL, verify the server certificate is signed by a trusted CA |
| `verify-full` | Require SSL, verify CA and that the server hostname matches the certificate |
Warning
For production databases, use `POSTGRES_SSLMODE=verify-full`. The default `prefer` mode does not protect against man-in-the-middle attacks.
Managed PostgreSQL services¶
For AWS RDS, Google Cloud SQL, Azure Database, and similar managed services, you typically need:
- The connection hostname from your provider's console
- SSL mode set to `require` or `verify-full`
- The CA certificate bundle (usually provided by the service)
DB_MODE=postgres
POSTGRES_URL=postgresql://datasight:secret@mydb.abc123.us-east-1.rds.amazonaws.com:5432/mydb
POSTGRES_SSLMODE=verify-full
Read-only access¶
datasight only runs SELECT queries — it never writes to your database.
For production safety, connect with a read-only database user:
CREATE USER datasight WITH PASSWORD 'secret';
GRANT CONNECT ON DATABASE mydb TO datasight;
GRANT USAGE ON SCHEMA public TO datasight;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datasight;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datasight;
Flight SQL¶
Flight SQL connects to remote query engines over gRPC. See Connect to a remote Flight SQL backend for a detailed guide using GizmoSQL.
DB_MODE=flightsql
FLIGHT_SQL_URI=grpc://localhost:31337
FLIGHT_SQL_USERNAME=gizmosql_user
FLIGHT_SQL_PASSWORD=your_password
For TLS-enabled servers, use grpc+tls:// as the URI scheme:
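For example:

```
FLIGHT_SQL_URI=grpc+tls://flight.example.com:31337
```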
Spark¶
Spark connects to an Apache Spark cluster over Spark Connect. Use this for multi-terabyte datasets that won't fit on a single machine. See Connect to an Apache Spark backend for the full walkthrough.
DB_MODE=spark
SPARK_REMOTE=sc://spark-connect.example.com:15002
SPARK_TOKEN=your_bearer_token # optional
SPARK_MAX_RESULT_BYTES=104857600 # optional, default 100 MiB
The client streams Arrow batches and truncates results above
SPARK_MAX_RESULT_BYTES to protect the web server from OOMing. Row
counts are skipped at introspection time (a naive SELECT COUNT(*) on
a multi-TB partitioned table would kick off a full-cluster job); document
your partition columns in schema_description.md so the agent uses them
as predicates.
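For example, if a table is partitioned by a date column — `event_date` here is a hypothetical partition column — documenting it lets the agent write queries that prune partitions instead of scanning the whole table:

```sql
-- A predicate on the partition column restricts the job to one partition
SELECT station_id, AVG(mwh) AS avg_mwh
FROM generation
WHERE event_date = '2024-06-01'
GROUP BY station_id;
```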
All environment variables¶
See the Configuration reference for a complete list of database-related environment variables.