Database connections¶
datasight supports five database backends. This guide explains when to use each one and how to configure the connection.
Choosing a database¶
|  | DuckDB | SQLite | PostgreSQL | Flight SQL | Spark |
|---|---|---|---|---|---|
| Best for | Local analytics on Parquet/CSV files | Existing SQLite databases from other apps | Production databases, multi-user access | Remote HPC or distributed query engines | Multi-TB datasets on a Spark cluster |
| Install | Built in | Built in | Built in | Built in | `pip install 'datasight[spark]'` |
| `DB_MODE` | `duckdb` | `sqlite` | `postgres` | `flightsql` | `spark` |
| Connection | Local file path | Local file path | Host/port or connection string | gRPC URI | Spark Connect URI |
| Concurrent users | Single process | Single process | Multi-user | Multi-user | Multi-user |
| SQL dialect | DuckDB SQL (Postgres-like) | SQLite SQL | PostgreSQL | Depends on server | Spark SQL |
DuckDB is the default and recommended for most use cases. It is an embedded OLAP (Online Analytical Processing) database designed for analytical queries — aggregations, joins, window functions, and scans over large datasets are significantly faster than in SQLite or a typical PostgreSQL configuration. It reads Parquet and CSV files natively and requires no external server.
SQLite is useful when you already have a .sqlite or .db file from
another application (Django, mobile apps, embedded systems) and want to
explore it without converting to another format.
PostgreSQL is for connecting to an existing Postgres server — production databases, data warehouses, or managed services like RDS or Cloud SQL.
Flight SQL is for remote query engines that speak the Arrow Flight SQL protocol, such as GizmoSQL on an HPC cluster. See Connect to a remote Flight SQL backend for a full walkthrough.
Spark connects to an Apache Spark cluster via Spark Connect. Use this when your data is too large to fit on a single machine — typically multi-terabyte Parquet/Delta/Iceberg tables managed by Spark. The client streams Arrow batches and truncates the result once it exceeds a byte cap (default 100 MiB), so the web server stays responsive even when the agent forgets to aggregate.
DuckDB¶
DuckDB is the default — no extra install or configuration needed beyond pointing to a database file.
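A minimal configuration might look like the following. The path variable name shown here (`DUCKDB_PATH`) is an assumption — check the Configuration reference for the exact variable your version uses:

```
DB_MODE=duckdb
DUCKDB_PATH=/data/analytics.duckdb  # hypothetical variable name — see the Configuration reference
```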
Why DuckDB for data exploration¶
DuckDB is purpose-built for OLAP workloads — the kind of analytical queries that datasight generates (aggregations, GROUP BY, joins across large tables, window functions). Compared to row-oriented databases like SQLite and PostgreSQL:
- Columnar storage — reads only the columns a query needs, so `SELECT state, SUM(mwh) FROM generation GROUP BY state` scans far less data than a row store would.
- Vectorized execution — processes data in batches using SIMD instructions, making aggregations and scans significantly faster.
- Zero configuration — no server process, no connection management, no tuning. Just a file.
- Native file format support — queries Parquet, CSV, and JSON files directly via SQL without an import step.
- Rich SQL dialect — supports `DATE_TRUNC`, `UNNEST`, `PIVOT`, window functions, CTEs, and other analytical SQL features out of the box.
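A sketch of the kind of query these features enable, combining `DATE_TRUNC`, `GROUP BY`, and a window function (the `generation` table and its columns are illustrative):

```sql
-- Monthly generation per state with a cumulative running total
SELECT
    state,
    DATE_TRUNC('month', reading_ts) AS month,
    SUM(mwh) AS monthly_mwh,
    SUM(SUM(mwh)) OVER (
        PARTITION BY state
        ORDER BY DATE_TRUNC('month', reading_ts)
    ) AS cumulative_mwh
FROM generation
GROUP BY state, DATE_TRUNC('month', reading_ts);
```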
For datasets up to tens of gigabytes on a single machine, DuckDB will typically outperform PostgreSQL for analytical queries without any tuning.
Querying files directly¶
DuckDB can query Parquet and CSV files directly using SQL — no import step, no data duplication. Create lightweight views that point at your files and datasight treats them like regular tables. Excel workbooks are also supported, with each sheet materialized as a DuckDB table.
See Query files directly for a full guide covering globs, Hive partitioning, Excel workbooks, and tips.
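The view-based pattern can be sketched like this, assuming a hypothetical directory of Parquet files (`read_parquet` is DuckDB's native Parquet reader):

```sql
-- Expose Parquet files as a queryable view — no import, no data duplication
CREATE VIEW generation AS
SELECT * FROM read_parquet('data/generation/*.parquet');
```

Once the view exists, datasight sees `generation` like any other table.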
SQLite¶
SQLite support is built in — no extra install needed.
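A minimal configuration might look like the following. The path variable name shown here (`SQLITE_PATH`) is an assumption — check the Configuration reference for the exact variable your version uses:

```
DB_MODE=sqlite
SQLITE_PATH=/path/to/app.db  # hypothetical variable name — see the Configuration reference
```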
Limitations compared to DuckDB¶
- No `DATE_TRUNC`, `EXTRACT`, or other advanced date functions — SQLite stores dates as text, so date-based queries may need `strftime()` instead.
- No window function support before SQLite 3.25.
- No native Parquet or CSV reading.
Tip
If your schema description mentions date columns, note the storage format (e.g. "ISO 8601 text") and any preferred date functions. This helps the AI write correct date queries for SQLite.
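For instance, grouping by month on an ISO 8601 text column uses `strftime()` in SQLite (the table and column names here are illustrative):

```sql
-- SQLite equivalent of DATE_TRUNC('month', ...) for ISO 8601 text dates
SELECT strftime('%Y-%m', order_date) AS month,
       COUNT(*) AS orders
FROM orders
GROUP BY strftime('%Y-%m', order_date);
```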
PostgreSQL¶
Connect with individual fields¶
DB_MODE=postgres
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=mydb
POSTGRES_USER=datasight
POSTGRES_PASSWORD=secret
Connect with a URL¶
A connection string takes precedence over individual fields:
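For example:

```
DB_MODE=postgres
POSTGRES_URL=postgresql://datasight:secret@localhost:5432/mydb
```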
SSL configuration¶
| `POSTGRES_SSLMODE` | Behavior |
|---|---|
| `disable` | No SSL |
| `prefer` (default) | Use SSL if available, fall back to plain |
| `require` | Require SSL, don't verify certificate |
| `verify-ca` | Require SSL, verify the server certificate is signed by a trusted CA |
| `verify-full` | Require SSL, verify CA and that the server hostname matches the certificate |
Warning
For production databases, use `POSTGRES_SSLMODE=verify-full`. The default `prefer` mode does not protect against man-in-the-middle attacks.
Managed PostgreSQL services¶
For AWS RDS, Google Cloud SQL, Azure Database, and similar managed services, you typically need:
- The connection hostname from your provider's console
- SSL mode set to `require` or `verify-full`
- The CA certificate bundle (usually provided by the service)
DB_MODE=postgres
POSTGRES_URL=postgresql://datasight:secret@mydb.abc123.us-east-1.rds.amazonaws.com:5432/mydb
POSTGRES_SSLMODE=verify-full
Read-only access¶
datasight only runs SELECT queries — it never writes to your database.
For production safety, connect with a read-only database user:
CREATE USER datasight WITH PASSWORD 'secret';
GRANT CONNECT ON DATABASE mydb TO datasight;
GRANT USAGE ON SCHEMA public TO datasight;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datasight;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datasight;
Flight SQL¶
Flight SQL connects to remote query engines over gRPC. See Connect to a remote Flight SQL backend for a detailed guide using GizmoSQL.
DB_MODE=flightsql
FLIGHT_SQL_URI=grpc://localhost:31337
FLIGHT_SQL_USERNAME=gizmosql_user
FLIGHT_SQL_PASSWORD=your_password
For TLS-enabled servers, use grpc+tls:// as the URI scheme:
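For example:

```
FLIGHT_SQL_URI=grpc+tls://flight.example.com:31337
```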
Spark¶
Spark connects to an Apache Spark cluster over Spark Connect. Use this for multi-terabyte datasets that won't fit on a single machine. See Connect to an Apache Spark backend for the full walkthrough.
DB_MODE=spark
SPARK_REMOTE=sc://spark-connect.example.com:15002
SPARK_TOKEN=your_bearer_token # optional
SPARK_MAX_RESULT_BYTES=104857600 # optional, default 100 MiB
The client streams Arrow batches and truncates results above
SPARK_MAX_RESULT_BYTES to protect the web server from OOMing. Row
counts are skipped at introspection time (a naive SELECT COUNT(*) on
a multi-TB partitioned table would kick off a full-cluster job); document
your partition columns in schema_description.md so the agent uses them
as predicates.
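For example, if a table is partitioned by a date column — `event_date` here is a hypothetical partition column — documenting it lets the agent write queries that prune partitions instead of scanning the whole table:

```sql
-- A predicate on the partition column restricts the job to one partition
SELECT station_id, AVG(mwh) AS avg_mwh
FROM generation
WHERE event_date = '2024-06-01'
GROUP BY station_id;
```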
All environment variables¶
See the Configuration reference for a complete list of database-related environment variables.