Tidy a wide-month spreadsheet¶
Many spreadsheets arrive in this shape:
| plant_id | fuel_type | jan | feb | mar | … | dec |
|---|---|---|---|---|---|---|
| 1 | coal | 180 | 165 | 140 | … | 200 |
Each month is its own column. That makes "average generation by month" an awkward sum across twelve columns. The tidy shape datasight prefers looks like:
| plant_id | fuel_type | month | value |
|---|---|---|---|
| 1 | coal | jan | 180 |
| 1 | coal | feb | 165 |
Each row is one observation. Aggregations become simple GROUP BY queries.
This tutorial walks through detecting the first shape and reshaping it into the second. Allow about five minutes. No API key required — none of the steps call the AI.
Experimental
The datasight tidy commands are experimental and their interface is
subject to change. tidy suggest is read-only and safe to run.
tidy table and tidy view write to your database — use --dry-run
first to preview the DDL before applying.
Background reading
The tidy-data concept comes from Hadley Wickham's Tidy Data (J. Stat. Softw., 2014). The R for Data Science chapter on data tidying has more worked examples if you want the full picture.
1. Install datasight¶
Don't have uv yet? See Install datasight for the one-line installer.
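With uv already present, the install is a one-liner. A hedged sketch follows; the PyPI package name `datasight` is an assumption based on this page's commands, not confirmed here, so check the install page if it fails:

```shell
# Assumes the tool is published as "datasight" — see Install datasight
# for the canonical one-line installer if this name is wrong.
uv tool install datasight
datasight --version
```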
2. Save a wide CSV¶
The CSV below mirrors a common spreadsheet shape: one row per plant, with
twelve numeric columns named jan through dec holding monthly net
generation (MWh).
```shell
cat > monthly_generation.csv <<'EOF'
plant_id,fuel_type,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec
1,coal,180,165,140,120,110,100,95,105,130,160,175,200
2,gas,220,200,180,170,160,175,200,220,210,200,215,230
3,wind,150,145,160,175,190,200,195,180,170,155,145,140
4,solar,60,90,130,175,215,240,245,225,180,130,90,55
EOF
```
This is the untidy shape: the month dimension is hidden in the column
headers, so a question like "average generation by month across plants"
becomes an awkward sum across twelve columns instead of a GROUP BY.
3. Detect untidy column shapes — no setup required¶
Point datasight tidy suggest straight at the CSV. It runs in an
ephemeral DuckDB session, so no project directory or .env is needed:
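For example, against the CSV saved in step 2:

```shell
# Read-only scan; nothing is written anywhere.
datasight tidy suggest monthly_generation.csv
```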
```text
Suggestions
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Source                 ┃ Target                      ┃ Pattern             ┃ Period ┃ Columns ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ monthly_generation_mwh │ monthly_generation_mwh_long │ date_in_column_…    │ month  │ 12      │
└────────────────────────┴─────────────────────────────┴─────────────────────┴────────┴─────────┘
```
datasight has spotted that twelve column names look like month tokens —
jan, feb, … dec — and proposes to reshape the table into a long
form named monthly_generation_mwh_long.
Need AI help to fix the schema?
The detector here is regex-based: it recognizes time-period pivots
like jan/feb/dec or 2020/2021/2022 from the column names
alone. For category pivots like coal_mwh/gas_mwh/nuclear_mwh,
geography pivots like ca_capacity/tx_capacity, or multi-axis
pivots like coal_2020/gas_2020/coal_2021, see
Curate datasets with tidy review —
it asks an LLM to propose reshapes and lets you approve each one.
You can also pass several files at once, or parquet / xlsx / duckdb
sources:
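For example (the extra file names here are hypothetical placeholders, not files created in this tutorial):

```shell
# capacity.parquet and plants.xlsx are hypothetical; any mix of
# csv / parquet / xlsx / duckdb sources can be scanned in one run.
datasight tidy suggest monthly_generation.csv capacity.parquet plants.xlsx
```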
4. Set up a project to apply the reshape¶
tidy suggest is read-only and ephemeral, but applying the reshape
(tidy view or tidy table) needs a writable database. That means
loading the CSV into a DuckDB file and pointing a .env at it.
```shell
mkdir tidy-tutorial && mv monthly_generation.csv tidy-tutorial && cd tidy-tutorial
uv run --with duckdb python -c "
import duckdb
conn = duckdb.connect('generation.duckdb')
conn.execute('CREATE TABLE monthly_generation_mwh AS SELECT * FROM read_csv_auto(\"monthly_generation.csv\")')
"
cat > .env <<'EOF'
DB_MODE=duckdb
DB_PATH=generation.duckdb
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=stub
EOF
```
uv run --with duckdb runs Python with DuckDB available without
installing it system-wide. The tidy command doesn't call the AI,
so the ANTHROPIC_API_KEY value is a placeholder here — replace it with a real key only when you're ready to start asking
questions.
5. Preview the reshape SQL¶
Before writing anything to the database, see exactly what would run:
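The exact invocation isn't shown on this page; assuming `tidy table` takes the source table name as its argument, something like:

```shell
# --dry-run prints the generated DDL without touching generation.duckdb.
# The positional argument (source table name) is an assumption, not confirmed here.
datasight tidy table monthly_generation_mwh --dry-run
```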
You'll see a CREATE OR REPLACE TABLE statement built around a DuckDB
UNPIVOT. The id columns (plant_id, fuel_type) come through
automatically; the twelve wide measure columns are stacked into a
two-column (month, value) pair.
6. Apply the reshape¶
When the preview looks right, drop the --dry-run flag:
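Assuming the same invocation as the preview (the argument form is inferred, not shown on this page):

```shell
# Writes monthly_generation_mwh_long into generation.duckdb.
datasight tidy table monthly_generation_mwh
```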
datasight tidy view is also available if you want a view that
re-evaluates against the source on every query (useful when the source
table is updated periodically). Because of a regression in the Python
duckdb 1.5.2 binding that breaks UNPIVOT inside views, tidy view
emits a UNION ALL form instead — the result is identical but the SQL
is more verbose. Prefer tidy table unless you specifically need view
semantics.
7. Query the tidy form¶
```shell
uv run --with duckdb python -c "
import duckdb
conn = duckdb.connect('generation.duckdb', read_only=True)
rows = conn.execute('''
    SELECT month, ROUND(AVG(value), 1) AS avg_mwh
    FROM monthly_generation_mwh_long
    GROUP BY month
    -- month is stored as text, so ORDER BY month would sort alphabetically
    -- (apr, aug, dec, ...); order by calendar position instead
    ORDER BY list_position(
        ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'],
        month)
''').fetchall()
for month, avg in rows:
    print(f'{month}: {avg}')
"
```
The same question on the wide table would have required summing twelve
named columns by hand. With the long form, monthly aggregation is a
plain GROUP BY.
What's next¶
- **Point the agent at the long form.** Mention monthly_generation_mwh_long and its columns in your schema_description.md so the LLM agent prefers tidy queries. See Write a schema description.
- **Audit the rest of your data.** Audit data quality covers datasight quality, which surfaces tidy suggestions alongside null/range/date-coverage checks during routine audits.
- **Try a real dataset.** Explore US electricity generation (EIA) walks through the same loop on the PUDL EIA dataset.