# Software Architecture

This page describes the dsgrid software architecture.

:::{todo}
Add architecture diagrams
:::

## Overview

dsgrid is built on a distributed architecture designed to handle large-scale energy demand data:

- **Store project and dataset metadata in SQLite** - Centralized registry database
- **Store dimension and dimension-mapping metadata and records in SQLite** - Efficient lookups
- **Store dataset time-series data in Parquet files** - On a shared filesystem (e.g., Lustre, S3)
- **Store dependencies between registry components in SQLite** - Version tracking and relationships
- **Store version history of registry components in SQLite** - Full audit trail
- **Load all data tables in Apache Spark** - Use the DataFrame API for queries
- **Convert to Pandas DataFrames** - For post-processing and visualizations (see the sketch below)
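
The last two points form the typical consumption pattern: do the heavy reduction in Spark, then convert only the small result to pandas. Here is a minimal sketch of that pattern; the file path and column names are hypothetical and not part of any dsgrid registry.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Attach to the running Spark cluster (or start a local session for testing).
spark = SparkSession.builder.appName("dsgrid-architecture-example").getOrCreate()

# Load a dataset's time-series table from the shared filesystem.
# The path and column names below are hypothetical examples.
load_data = spark.read.parquet("/shared/dsgrid-data/load_data.parquet")

# Reduce in Spark, where the data is large and distributed...
hourly_totals = (
    load_data.groupBy("timestamp")
    .agg(F.sum("value").alias("total_load_mwh"))
    .orderBy("timestamp")
)

# ...and convert only the reduced result to pandas for post-processing.
df = hourly_totals.toPandas()
print(df.head())
```

Keeping the aggregation in Spark and calling `toPandas()` only on the reduced table avoids pulling the full dataset onto a single machine.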

## APIs

### Python

dsgrid applications use the dsgrid Python package to perform all operations. The Python API provides programmatic access to:

- Registry management
- Dataset registration and mapping
- Project creation and configuration
- Query execution and result processing
- Dimension operations

See the [Python API Reference](python_api) for complete documentation.

### HTTP

An HTTP API currently supports a limited number of dsgrid operations. It will expand in the future and may become the primary interface for user applications, which will require a persistent dsgrid server.

**Current capabilities:**

- Basic registry browsing
- Project metadata retrieval
- Dataset information queries

**Future plans:**

- Full registry operations
- Query execution via HTTP
- Real-time result streaming

## Applications

Users interact with dsgrid through these applications.

### CLI Toolkit

The CLI toolkit is the primary user interface to dsgrid. Users run CLI commands to register dsgrid components and run queries.

**Key features:**

- Hierarchical command structure
- Registry management (`dsgrid registry`)
- Query execution (`dsgrid query`)
- Configuration management (`dsgrid config`)

The CLI consumes the Python API. See the [CLI Reference](cli_reference) for complete documentation.

### Project Viewer

A web UI based on Plotly Dash that lets users browse and filter project and dataset components.

**Capabilities:**

- Browse registered projects and datasets
- View dimension records
- Filter and search metadata
- Explore project structure

The Project Viewer consumes the HTTP API. See [Browse Registry](../user_guide/how_tos/browse_registry) for usage instructions.

## Current Workflow

Future workflows may change significantly. We may have a persistent database and dsgrid API server running in the cloud with on-demand Spark clusters. For the foreseeable future, however, this is the user workflow we anticipate:

### Typical HPC Workflow

1. **Start a Spark cluster** on one or more compute nodes
   - See [Start Spark Cluster on Kestrel](../user_guide/how_tos/spark_cluster_on_kestrel)
2. **Connect to a dsgrid registry database** or start your own
   - Existing registry: Connect via database URL
   - New registry: Initialize with `dsgrid registry create`
3. **Run dsgrid CLI commands** from a compute node with access to:
   - Registry database
   - Registry data (Parquet files)
   - Spark compute nodes

When running on an HPC, this compute node is usually the Spark master node.

### Local Development Workflow

1. **Install dsgrid** in a conda environment
   - See [Installation](../getting_started/installation)
2. **Configure the local registry connection**
   - Use `dsgrid config create` to set up the database connection
3. **Run operations** directly from your local machine
   - Suitable for small datasets and testing
   - Limited by local computational resources

## Technology Stack

### Core Technologies

- **Apache Spark** - Distributed data processing
- **SQLite** - Registry metadata storage
- **Parquet** - Columnar data format for time-series data
- **Pydantic** - Data validation and configuration models
- **Click** - CLI framework

### Data Processing

- **PySpark** - Python interface to Spark
- **Pandas** - Post-processing and analysis
- **PyArrow** - Efficient data interchange

### Web Technologies

- **Plotly Dash** - Interactive web applications
- **FastAPI** - HTTP API framework (planned expansion)

## Data Flow

### Dataset Registration

```
User → CLI → Python API → Registry DB → Parquet Files
             (validation)
```

### Query Execution

```
User → CLI → Python API → Registry DB (metadata)
                        → Spark Cluster → Parquet Files (data)
                        → Result Files (Parquet)
```

Results are written as Parquet files; a short sketch of reading them back with pandas appears at the end of this page.

### Web UI Access

```
User → Web Browser → Dash App → HTTP API → Registry DB
```

## Next Steps

- Learn about [CLI fundamentals](cli_fundamentals)
- Explore the [Python API](python_api)
- Understand [data file formats](../user_guide/dataset_registration/data_file_formats)
- See [data models](data_models/index) for configuration schemas
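
To round out the query-execution flow above: results land as Parquet files, so they can be read back with pandas for post-processing and visualization. This is a minimal sketch assuming a hypothetical output path and column names, not the output of any particular dsgrid query.

```python
import pandas as pd

# Read a query's result table back from Parquet.
# The path and column names below are hypothetical examples.
results = pd.read_parquet("query_output/my_query/table.parquet")

# Example post-processing: total load by sector, ready for a bar chart.
by_sector = (
    results.groupby("sector", as_index=False)["total_load_mwh"]
    .sum()
    .sort_values("total_load_mwh", ascending=False)
)
print(by_sector)
```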