Query CSV and Parquet files directly¶
DuckDB can query CSV and Parquet files using SQL — without importing or copying any data. datasight creates lightweight views that point at your files and treats them like regular database tables.
Quick explore (no setup)¶
The fastest way to query files is the Explore Files feature on the
landing page. Run datasight run, enter a file or directory path, and click
Explore. datasight creates views automatically and you can start asking
questions immediately.
You can add more files at any time from the sidebar. When you're ready to save your work, click Save in the header to create a project.
See the EIA generation tutorial for a walkthrough.
Manual setup¶
For full control over view definitions, create a DuckDB file and define views yourself.
How it works¶
A DuckDB view over a file is just a pointer. The data stays on disk in its original format, and DuckDB reads it on the fly when a query runs. For Parquet files this is especially efficient because DuckDB can skip entire row groups and read only the columns a query needs.
Set up views¶
Create a .duckdb file and define views that point at your files:
Parquet files¶
CSV files¶
DuckDB's auto_detect option infers column names, types, delimiters,
and date formats automatically. If auto-detection gets something wrong,
you can specify options explicitly:
CREATE VIEW events AS
SELECT * FROM read_csv(
'data/events.csv',
header=true,
delim=',',
dateformat='%Y-%m-%d',
columns={
'event_id': 'INTEGER',
'event_date': 'DATE',
'category': 'VARCHAR',
'value': 'DOUBLE'
}
);
Multiple files with globs¶
You can point a single view at many files using glob patterns:
-- All Parquet files in a directory
CREATE VIEW sensor_data AS
SELECT * FROM read_parquet('data/sensors/*.parquet');
-- Recursive glob
CREATE VIEW all_logs AS
SELECT * FROM read_parquet('data/**/logs_*.parquet');
DuckDB also exposes the filename column so you can tell which file
each row came from:
Hive-partitioned datasets¶
If your files are organized in a Hive-style directory layout
(year=2024/month=01/data.parquet), DuckDB can read the partition keys
as columns:
CREATE VIEW generation AS
SELECT * FROM read_parquet('data/generation/**/*.parquet', hive_partitioning=true);
This avoids scanning partitions that a query doesn't need.
Remote files on S3¶
DuckDB can read Parquet files directly from S3 (and S3-compatible stores like GCS and MinIO) — no download step required. This is especially powerful for large datasets that you don't want to copy locally.
-- Public bucket (no credentials needed)
CREATE VIEW measurements AS
SELECT * FROM read_parquet('s3://my-bucket/data/measurements.parquet');
-- Hive-partitioned dataset on S3
CREATE VIEW generation AS
SELECT * FROM read_parquet('s3://my-bucket/data/generation/**/*.parquet', hive_partitioning=true);
For private buckets, configure credentials before creating views:
-- Set S3 credentials
SET s3_region = 'us-west-2';
SET s3_access_key_id = 'AKIA...';
SET s3_secret_access_key = '...';
-- Or use environment-based credentials (IAM roles, SSO, etc.)
CALL load_aws_credentials();
CREATE VIEW private_data AS
SELECT * FROM read_parquet('s3://private-bucket/data/*.parquet');
DuckDB handles predicate pushdown and column pruning over HTTP range requests, so only the data needed by each query is transferred. See the DuckDB S3 documentation for all configuration options.
Point datasight at the database¶
Once your views are defined, configure your .env to use the DuckDB file:
Run datasight run and your views will appear in the sidebar alongside
any regular tables. The AI can query them with natural language just like
any other table.
Write a schema description¶
Even though datasight auto-discovers view names, columns, and types, a
schema_description.md file helps the AI understand what the data
means. This is especially useful for file-backed views because column
names in CSV and Parquet files are often terse or ambiguous.
See Write a schema description for guidance.
Tips¶
Use Parquet when you can
Parquet files are columnar and compressed, so DuckDB can read only the columns and row groups a query touches. CSV files must be fully scanned for every query. If your data is large and you query it often, converting to Parquet once will make every subsequent query faster.
Relative paths are relative to where you start datasight
File paths in read_parquet() and read_csv() are resolved relative
to the working directory when datasight starts — not relative to the
.duckdb file. Keep your data files alongside your project directory,
or use absolute paths.
Views update automatically
Because views read the file on every query, changes to the underlying file are picked up immediately. There is nothing to reload or re-import.