Skip to content

Schema Drift: SQLModel/Alembic vs pgschema Reference SQL

This document describes the drift between the current database schema (generated by SQLModel models + Alembic migrations) and the previous reference SQL (captured by pgschema from the old LinkML-generated schema on the main branch).

The drift was measured by running:

pgschema plan --file <old-main.sql> --db <new-alembic-db> --schema public

Summary

Category Count Direction
Tables removed (abstract bases) 4 Old SQL has tables the new DB does not
Column type changes (text -> varchar) ~250 columns across 80 tables Narrower but functionally equivalent
Column added 1 (contact.middle_name) Old SQL has a column the new DB does not
Index changes 1 (polygon) Old SQL defines a different index expression
PostGIS system views 2 Not application-managed; always differ
New tables 0 No new tables introduced
Dropped columns 0 No columns removed (except on removed tables)

1. Removed Tables (4)

The old LinkML-generated ORM materialized abstract base classes as real database tables. SQLModel uses Python-only base classes (mixins with table=False), so these tables no longer exist in the database.

Table Old Purpose SQLModel Equivalent
base_entity Shared id, created_at, updated_at, etl_run_id, lineage_group_id columns BaseEntity mixin in models/base.py
lookup_base Shared name, description, uri columns for lookup/enum tables LookupBase mixin in models/base.py
aim1_record_base Shared columns for Aim1 analysis record tables Aim1RecordBase mixin in models/base.py
aim2_record_base Shared columns for Aim2 analysis record tables Aim2RecordBase mixin in models/base.py

Impact: None. These tables held no data and existed only as an artifact of LinkML's concrete table inheritance strategy. All child tables still have the same columns via Python mixin inheritance.

2. Column Type: text -> varchar (~250 columns, 80 tables)

The dominant drift. The old LinkML-generated schema used PostgreSQL text for all string columns. SQLModel's default mapping for Python str is VARCHAR (no length limit).

Example — autoclave_record:

Column Old (LinkML) New (SQLModel)
record_id text NOT NULL varchar NOT NULL
qc_pass text varchar
note text varchar

Affected table categories:

  • Lookup tables (20): ag_treatment, analysis_type, collection_method, dimension_type, equipment, harvest_method, etc. — name, description, uri columns
  • Aim1 record tables (12): autoclave_record, calorimetry_record, compositional_record, fermentation_record, etc. — record_id, qc_pass, note columns
  • Infrastructure tables (12): infrastructure_biodiesel_plants, infrastructure_biosolids_facilities, etc. — all string columns
  • USDA tables (8): usda_census_record, usda_commodity, usda_market_record, etc.
  • Core domain tables (28): contact, data_source, dataset, etl_run, experiment, field_sample, observation, resource, etc.

Impact: None in PostgreSQL. text and varchar (without a length specifier) are identical in PostgreSQL: both store variable-length strings with no length limit and the same performance characteristics. This is a cosmetic difference only.

3. Missing Column: contact.middle_name

The old LinkML schema defined a middle_name field on the Contact model. The new SQLModel Contact model does not include this field.

Column Old (LinkML) New (SQLModel)
middle_name text, nullable Not present

Impact: None. This field was added as a test change to validate the old pgschema SQL-first workflow (see plans/sql_first_workflow.md). It has zero references in Python code — no ETL pipeline, webservice, or test code uses it. It is safe to omit.

4. Index Change: polygon

The polygon table has a functional unique index that differs between the two schemas:

Index Old (LinkML) New (SQLModel)
idx_polygon_geom Not present CREATE INDEX idx_polygon_geom ON polygon USING gist (geom)
unique_geom_dataset_md5 btree (md5(geom), dataset_id) — where geom is text btree (md5((geom)::text), dataset_id) — where geom is geometry

The old schema stored geom as text so md5(geom) worked directly. The new schema stores geom as a native PostGIS geometry type, which requires explicit ::text cast for the md5() function. Additionally, the new schema adds a GIST spatial index (idx_polygon_geom) that the old schema did not have.

Impact: Positive. The new schema is more correct — native geometry type enables spatial queries and the GIST index supports spatial operations.

5. PostGIS System Views (2)

The geography_columns and geometry_columns views are managed by the PostGIS extension, not by application code. pgschema detects minor differences in view definitions between the dump and the live database, but these are artifacts of PostGIS version differences.

Impact: None. These should be excluded from drift analysis.

Column Ordering

The old LinkML schema placed domain-specific columns before inherited base columns (e.g., record_id before id). The new SQLModel schema places inherited base columns first (id, created_at, updated_at, etl_run_id, lineage_group_id) followed by domain columns. pgschema does not flag this as drift since PostgreSQL column order has no semantic impact.

Recommendations

  1. No action needed for text -> varchar drift. The types are identical in PostgreSQL. If strict parity with the old schema is desired, add sa_type=Text to string fields in the SQLModel models, but this adds verbosity with no functional benefit.

  2. contact.middle_name — no action needed. The field was a SQL-first workflow test artifact with zero references in application code.

  3. Keep the reference SQL updated — after any Alembic migration, run pixi run schema-dump to regenerate the reference SQL file.