# ca-biositing

A geospatial bioeconomy project for biositing analysis in California. This repository provides tools for ETL pipelines that process data from Google Sheets into PostgreSQL databases, geospatial analysis using QGIS, and a REST API for data access.
## Project Structure

This project uses a PEP 420 namespace package structure with three main components:

- `ca_biositing.datamodels`: Hand-written SQLModel database models, materialized views, and database configuration
- `ca_biositing.pipeline`: ETL pipelines orchestrated with Prefect, deployed via Docker
- `ca_biositing.webservice`: FastAPI REST API for data access
### Directory Layout

```
ca-biositing/
├── src/ca_biositing/      # Namespace package root
│   ├── datamodels/        # Database models (SQLModel) and Alembic migrations
│   ├── pipeline/          # ETL pipelines (Prefect)
│   └── webservice/        # REST API (FastAPI)
├── resources/             # Deployment resources
│   ├── docker/            # Docker Compose configuration
│   └── prefect/           # Prefect deployment files
├── tests/                 # Integration tests
├── pixi.toml              # Pixi dependencies and tasks
└── pixi.lock              # Dependency lock file
```
## Quick Start

### Prerequisites

- Pixi (v0.55.0+): see the Pixi installation guide
- Docker: for running the ETL pipeline
- Google Cloud credentials: for Google Sheets access (optional)
### Installation

```bash
# Clone the repository
git clone https://github.com/sustainability-software-lab/ca-biositing.git
cd ca-biositing

# Install dependencies with Pixi
pixi install

# Install pre-commit hooks
pixi run pre-commit-install
```
### Running Components

#### ETL Pipeline (Prefect + Docker)
**Note:** Before starting the services for the first time, create the required environment file from the template:

```bash
cp resources/docker/.env.example resources/docker/.env
```
**CRITICAL (PostgreSQL 15 upgrade):** If you are upgrading from a version prior to Feb 2026, you must wipe your local volumes to support the PostgreSQL 15 image:

```bash
pixi run teardown-services-volumes
```
Then start and use the services:

```bash
# 1. Start all services (PostgreSQL, Prefect server, worker)
#    This also automatically applies any pending database migrations.
pixi run start-services

# 2. Deploy flows to Prefect
pixi run deploy

# 3. Run the ETL pipeline
pixi run run-etl

# Monitor via the Prefect UI: http://localhost:4200

# Apply new migrations after the initial setup
pixi run migrate

# Stop services
pixi run teardown-services
```
See `resources/README.md` for detailed pipeline documentation.
#### Web Service (FastAPI)

```bash
# Start the web service
pixi run start-webservice

# Access the API docs at http://localhost:8000/docs
```
#### QGIS (Geospatial Analysis)

```bash
pixi run qgis
```

**Note:** On macOS you may see a Python `faulthandler` error; this is expected and can be ignored. See QGIS Issue #52987.
## Development

### Running Tests

```bash
# Run all tests
pixi run test

# Run tests with coverage
pixi run test-cov
```
### Code Quality

```bash
# Run pre-commit checks on staged files
pixi run pre-commit

# Run pre-commit on all files (before opening a PR)
pixi run pre-commit-all
```
### Available Pixi Tasks

View all available tasks:

```bash
pixi task list
```

Key tasks:

- **Service Management:** `start-services`, `teardown-services`, `service-status`
- **ETL Operations:** `deploy`, `run-etl`
- **Development:** `test`, `test-cov`, `pre-commit`, `pre-commit-all`
- **Applications:** `start-webservice`, `qgis`
- **Database:** `access-db`, `check-db-health`
- **Schema Management:** `migrate`, `migrate-autogenerate`, `refresh-views`
- **Validation (pgschema):** `schema-plan`, `schema-analytics-plan`, `schema-dump`, `schema-analytics-list`
## Architecture

### Namespace Packages

This project uses PEP 420 namespace packages to organize code into independently installable components that share a common namespace:

- Each component has its own `pyproject.toml` and can be installed separately
- Shared models in `datamodels` are used by both `pipeline` and `webservice`
- Clear separation of concerns while maintaining type consistency
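The mechanics behind this layout can be sketched with the standard library alone. In the hypothetical example below, two separate component directories each contribute modules to one shared namespace package (named `ns_demo` here as a stand-in for `ca_biositing`); neither directory contains an `__init__.py`, which is exactly what makes PEP 420 treat them as halves of the same package:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build two separate "component" directories on disk, each contributing a
# module to the same top-level namespace package `ns_demo`.
root = Path(tempfile.mkdtemp())
for component, module in [("datamodels_dist", "models"), ("pipeline_dist", "flows")]:
    pkg_dir = root / component / "ns_demo"   # note: no __init__.py anywhere
    pkg_dir.mkdir(parents=True)
    (pkg_dir / f"{module}.py").write_text(f"NAME = '{module}'\n")
    sys.path.insert(0, str(root / component))

# Both halves import under one namespace, as if installed separately.
models = importlib.import_module("ns_demo.models")
flows = importlib.import_module("ns_demo.flows")
print(models.NAME, flows.NAME)  # models flows
```

In this project, installing `ca_biositing.datamodels` and `ca_biositing.pipeline` as separate distributions achieves the same effect: both land under the single `ca_biositing` namespace.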
### ETL Pipeline

The ETL pipeline uses:

- **Prefect**: Workflow orchestration and monitoring
- **Docker**: Containerized execution environment
- **PostgreSQL**: Data persistence
- **Google Sheets API**: Primary data source

Pipeline architecture:

1. **Extract**: Pull data from Google Sheets
2. **Transform**: Clean and normalize data with pandas
3. **Load**: Insert/update records in PostgreSQL via SQLAlchemy
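The three stages above can be sketched as plain functions. This is an illustrative, self-contained sketch only: the table name, columns, and sample rows are hypothetical, SQLite stands in for PostgreSQL, and plain dicts stand in for pandas DataFrames; in the real pipeline each stage would be a Prefect `@task` composed inside a `@flow`:

```python
import sqlite3

def extract():
    # Stand-in for a Google Sheets read: a header row plus value rows.
    return [["site", "biomass_tons"], ["Fresno", " 12.5 "], ["Merced", "8"]]

def transform(rows):
    # Clean and normalize: strip whitespace, coerce types, attach column names.
    header, *body = rows
    return [dict(zip(header, [r[0].strip(), float(r[1])])) for r in body]

def load(records, conn):
    # Persist the normalized records (SQLite here; PostgreSQL in production).
    conn.execute("CREATE TABLE IF NOT EXISTS biomass (site TEXT, biomass_tons REAL)")
    conn.executemany(
        "INSERT INTO biomass (site, biomass_tons) VALUES (:site, :biomass_tons)",
        records,
    )

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(biomass_tons) FROM biomass").fetchone()[0]
print(total)  # 20.5
```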
### Database Models

Database models are hand-written SQLModel classes organized into 15 domain subdirectories under `src/ca_biositing/datamodels/ca_biositing/datamodels/models/`. All schema changes are managed through Alembic migrations.

Development workflow:

1. Edit SQLModel classes in `models/`
2. Auto-generate a migration: `pixi run migrate-autogenerate -m "Description"`
3. Apply the migration: `pixi run migrate`
SQLModel-based models provide:

- Type-safe database operations (SQLAlchemy + Pydantic in one class)
- Versioned schema migrations (via Alembic)
- Shared models across ETL and API components
- Built-in Pydantic validation
Seven materialized views are defined in `views.py` and managed through Alembic migrations. Refresh them after loading data with `pixi run refresh-views`.
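Under the hood, refreshing a materialized view in PostgreSQL boils down to statements like the following (the view name `biomass_summary` is a hypothetical example, not one of the project's actual seven views):

```sql
-- Recompute the stored result set from the view's defining query.
REFRESH MATERIALIZED VIEW biomass_summary;

-- CONCURRENTLY avoids blocking readers during the refresh,
-- but requires a unique index on the materialized view.
REFRESH MATERIALIZED VIEW CONCURRENTLY biomass_summary;
```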
## Project Components

### 1. Data Models (`ca_biositing.datamodels`)

Database models for:

- Biomass data (field samples, measurements)
- Geographic locations
- Experiments and analysis
- Metadata and samples
- Organizations and contacts

Documentation: `datamodels/README.md`
### 2. ETL Pipeline (`ca_biositing.pipeline`)

Prefect-orchestrated workflows for:

- Data extraction from Google Sheets
- Data transformation and validation
- Database loading and updates
- Lookup table management

Documentation: `pipeline/README.md`

Guides:
### 3. Web Service (`ca_biositing.webservice`)

FastAPI REST API providing:

- Read access to database records
- Interactive API documentation (Swagger/OpenAPI)
- Type-safe endpoints using Pydantic

Documentation: `webservice/README.md`
### 4. Deployment Resources (`resources/`)

Docker and Prefect configuration for:

- Service orchestration (Docker Compose)
- Prefect deployments
- Database initialization

Documentation: `resources/README.md`
## Adding Dependencies

### For Local Development (Pixi)

```bash
# Add a conda package to the default environment
pixi add <package-name>

# Add a PyPI package to the default environment
pixi add --pypi <package-name>

# Add to a specific feature (e.g., pipeline)
pixi add --feature pipeline --pypi <package-name>
```
### For the ETL Pipeline (Docker)

The pipeline dependencies are managed by Pixi's `etl` environment feature in `pixi.toml`. When you add dependencies and rebuild the Docker images, they are automatically included:

```bash
# Add a dependency to the pipeline feature
pixi add --feature pipeline --pypi <package-name>

# Rebuild the Docker images
pixi run rebuild-services

# Restart services
pixi run start-services
```
## Environment Management

This project uses Pixi environments for different workflows:

- `default`: General development, testing, pre-commit hooks
- `gis`: QGIS and geospatial analysis tools
- `etl`: ETL pipeline (used in Docker containers)
- `webservice`: FastAPI web service
- `frontend`: Node.js/npm for frontend development
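In `pixi.toml`, environments like these are composed from features. The fragment below is an illustrative sketch of that pattern, not the project's actual configuration (the dependency pins are placeholders):

```toml
# Features declare dependencies...
[feature.etl.pypi-dependencies]
prefect = "*"

[feature.webservice.pypi-dependencies]
fastapi = "*"

# ...and environments compose one or more features.
[environments]
etl = ["etl"]
webservice = ["webservice"]
```

Running `pixi run -e etl <task>` then resolves and uses only that environment's dependency set.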
## Frontend Integration

This repository now includes the Cal Bioscape Frontend as a Git submodule located in the `frontend/` directory.

### Initializing the Submodule

When you first clone this repository, you can initialize and pull only the frontend submodule with:

```bash
pixi run submodule-frontend-init
```
## 📘 Documentation

This project uses MkDocs Material for documentation.

### Local Preview

You can preview the documentation locally using Pixi:

```bash
pixi install -e docs
pixi run -e docs docs-serve
```

Then open http://127.0.0.1:8000 in your browser.
### Contributing Documentation

Most documentation should live in the relevant directories within the `docs` folder. When adding new pages to the documentation, make sure you update the `mkdocs.yml` file so they can be rendered on the website.

If you need to add documentation referencing a file that lives elsewhere in the repository, do the following (this is an example, run from the package root directory):

```bash
# Symlink the file to its destination.
# Be sure to use relative paths here, otherwise it won't work!
ln -s ../../deployment/README.md docs/deployment/README.md

# Stage your new file
git add docs/deployment/README.md
```

Be sure to preview the documentation to make sure it's accurate before submitting a PR.