Schema Drift: SQLModel/Alembic vs pgschema Reference SQL

This document describes the drift between the current database schema (generated by SQLModel models + Alembic migrations) and the previous reference SQL (captured by pgschema from the old LinkML-generated schema on the main branch).

The drift was measured by running:

pgschema plan --file <old-main.sql> --db <new-alembic-db> --schema public

Summary

Category	Count	Direction
Tables removed (abstract bases)	4	Old SQL has tables the new DB does not
Column type changes (`text` -> `varchar`)	~250 columns across 80 tables	Narrower but functionally equivalent
Column added	1 (`contact.middle_name`)	Old SQL has a column the new DB does not
Index changes	1 (`polygon`)	Old SQL defines a different index expression
PostGIS system views	2	Not application-managed; always differ
New tables	0	No new tables introduced
Dropped columns	0	No columns removed (except on removed tables)

1. Removed Tables (4)

The old LinkML-generated ORM materialized abstract base classes as real database tables. SQLModel uses Python-only base classes (mixins with table=False), so these tables no longer exist in the database.

Table	Old Purpose	SQLModel Equivalent
`base_entity`	Shared `id`, `created_at`, `updated_at`, `etl_run_id`, `lineage_group_id` columns	`BaseEntity` mixin in `models/base.py`
`lookup_base`	Shared `name`, `description`, `uri` columns for lookup/enum tables	`LookupBase` mixin in `models/base.py`
`aim1_record_base`	Shared columns for Aim1 analysis record tables	`Aim1RecordBase` mixin in `models/base.py`
`aim2_record_base`	Shared columns for Aim2 analysis record tables	`Aim2RecordBase` mixin in `models/base.py`

Impact: None. These tables held no data and existed only as an artifact of LinkML's concrete table inheritance strategy. All child tables still have the same columns via Python mixin inheritance.

2. Column Type: `text` -> `varchar` (~250 columns, 80 tables)

The dominant drift. The old LinkML-generated schema used PostgreSQL text for all string columns. SQLModel's default mapping for Python str is VARCHAR (no length limit).

Example — autoclave_record:

Column	Old (LinkML)	New (SQLModel)
`record_id`	`text NOT NULL`	`varchar NOT NULL`
`qc_pass`	`text`	`varchar`
`note`	`text`	`varchar`

Affected table categories:

Lookup tables (20): ag_treatment, analysis_type, collection_method, dimension_type, equipment, harvest_method, etc. — name, description, uri columns
Aim1 record tables (12): autoclave_record, calorimetry_record, compositional_record, fermentation_record, etc. — record_id, qc_pass, note columns
Infrastructure tables (12): infrastructure_biodiesel_plants, infrastructure_biosolids_facilities, etc. — all string columns
USDA tables (8): usda_census_record, usda_commodity, usda_market_record, etc.
Core domain tables (28): contact, data_source, dataset, etl_run, experiment, field_sample, observation, resource, etc.

Impact: None in PostgreSQL. text and varchar (without a length specifier) are identical in PostgreSQL: both store variable-length strings with no length limit and the same performance characteristics. This is a cosmetic difference only.

3. Missing Column: `contact.middle_name`

The old LinkML schema defined a middle_name field on the Contact model. The new SQLModel Contact model does not include this field.

Column	Old (LinkML)	New (SQLModel)
`middle_name`	`text`, nullable	Not present

Impact: None. This field was added as a test change to validate the old pgschema SQL-first workflow (see plans/sql_first_workflow.md). It has zero references in Python code — no ETL pipeline, webservice, or test code uses it. It is safe to omit.

4. Index Change: `polygon`

The polygon table has a functional unique index that differs between the two schemas:

Index	Old (LinkML)	New (SQLModel)
`idx_polygon_geom`	Not present	`CREATE INDEX idx_polygon_geom ON polygon USING gist (geom)`
`unique_geom_dataset_md5`	`btree (md5(geom), dataset_id)` — where `geom` is `text`	`btree (md5((geom)::text), dataset_id)` — where `geom` is `geometry`

The old schema stored geom as text so md5(geom) worked directly. The new schema stores geom as a native PostGIS geometry type, which requires explicit ::text cast for the md5() function. Additionally, the new schema adds a GIST spatial index (idx_polygon_geom) that the old schema did not have.

Impact: Positive. The new schema is more correct — native geometry type enables spatial queries and the GIST index supports spatial operations.

5. PostGIS System Views (2)

The geography_columns and geometry_columns views are managed by the PostGIS extension, not by application code. pgschema detects minor differences in view definitions between the dump and the live database, but these are artifacts of PostGIS version differences.

Impact: None. These should be excluded from drift analysis.

Column Ordering

The old LinkML schema placed domain-specific columns before inherited base columns (e.g., record_id before id). The new SQLModel schema places inherited base columns first (id, created_at, updated_at, etl_run_id, lineage_group_id) followed by domain columns. pgschema does not flag this as drift since PostgreSQL column order has no semantic impact.

Recommendations

No action needed for text -> varchar drift. The types are identical in PostgreSQL. If strict parity with the old schema is desired, add sa_type=Text to string fields in the SQLModel models, but this adds verbosity with no functional benefit.
contact.middle_name — no action needed. The field was a SQL-first workflow test artifact with zero references in application code.
Keep the reference SQL updated — after any Alembic migration, run pixi run schema-dump to regenerate the reference SQL file.