Schema Drift: SQLModel/Alembic vs pgschema Reference SQL
This document describes the drift between the current database schema
(generated by SQLModel models + Alembic migrations) and the previous reference
SQL (captured by pgschema from the old LinkML-generated schema on the main
branch).
The drift was measured by running:
pgschema plan --file <old-main.sql> --db <new-alembic-db> --schema public
Summary
| Category | Count | Direction |
|---|---|---|
| Tables removed (abstract bases) | 4 | Old SQL has tables the new DB does not |
Column type changes (text -> varchar) |
~250 columns across 80 tables | Narrower but functionally equivalent |
| Column added | 1 (contact.middle_name) |
Old SQL has a column the new DB does not |
| Index changes | 1 (polygon) |
Old SQL defines a different index expression |
| PostGIS system views | 2 | Not application-managed; always differ |
| New tables | 0 | No new tables introduced |
| Dropped columns | 0 | No columns removed (except on removed tables) |
1. Removed Tables (4)
The old LinkML-generated ORM materialized abstract base classes as real database
tables. SQLModel uses Python-only base classes (mixins with table=False), so
these tables no longer exist in the database.
| Table | Old Purpose | SQLModel Equivalent |
|---|---|---|
base_entity |
Shared id, created_at, updated_at, etl_run_id, lineage_group_id columns |
BaseEntity mixin in models/base.py |
lookup_base |
Shared name, description, uri columns for lookup/enum tables |
LookupBase mixin in models/base.py |
aim1_record_base |
Shared columns for Aim1 analysis record tables | Aim1RecordBase mixin in models/base.py |
aim2_record_base |
Shared columns for Aim2 analysis record tables | Aim2RecordBase mixin in models/base.py |
Impact: None. These tables held no data and existed only as an artifact of LinkML's concrete table inheritance strategy. All child tables still have the same columns via Python mixin inheritance.
2. Column Type: text -> varchar (~250 columns, 80 tables)
The dominant drift. The old LinkML-generated schema used PostgreSQL text for
all string columns. SQLModel's default mapping for Python str is VARCHAR (no
length limit).
Example — autoclave_record:
| Column | Old (LinkML) | New (SQLModel) |
|---|---|---|
record_id |
text NOT NULL |
varchar NOT NULL |
qc_pass |
text |
varchar |
note |
text |
varchar |
Affected table categories:
- Lookup tables (20):
ag_treatment,analysis_type,collection_method,dimension_type,equipment,harvest_method, etc. —name,description,uricolumns - Aim1 record tables (12):
autoclave_record,calorimetry_record,compositional_record,fermentation_record, etc. —record_id,qc_pass,notecolumns - Infrastructure tables (12):
infrastructure_biodiesel_plants,infrastructure_biosolids_facilities, etc. — all string columns - USDA tables (8):
usda_census_record,usda_commodity,usda_market_record, etc. - Core domain tables (28):
contact,data_source,dataset,etl_run,experiment,field_sample,observation,resource, etc.
Impact: None in PostgreSQL. text and varchar (without a length
specifier) are
identical in PostgreSQL:
both store variable-length strings with no length limit and the same performance
characteristics. This is a cosmetic difference only.
3. Missing Column: contact.middle_name
The old LinkML schema defined a middle_name field on the Contact model. The
new SQLModel Contact model does not include this field.
| Column | Old (LinkML) | New (SQLModel) |
|---|---|---|
middle_name |
text, nullable |
Not present |
Impact: None. This field was added as a test change to validate the old
pgschema SQL-first workflow (see plans/sql_first_workflow.md). It has zero
references in Python code — no ETL pipeline, webservice, or test code uses it.
It is safe to omit.
4. Index Change: polygon
The polygon table has a functional unique index that differs between the two
schemas:
| Index | Old (LinkML) | New (SQLModel) |
|---|---|---|
idx_polygon_geom |
Not present | CREATE INDEX idx_polygon_geom ON polygon USING gist (geom) |
unique_geom_dataset_md5 |
btree (md5(geom), dataset_id) — where geom is text |
btree (md5((geom)::text), dataset_id) — where geom is geometry |
The old schema stored geom as text so md5(geom) worked directly. The new
schema stores geom as a native PostGIS geometry type, which requires
explicit ::text cast for the md5() function. Additionally, the new schema
adds a GIST spatial index (idx_polygon_geom) that the old schema did not have.
Impact: Positive. The new schema is more correct — native geometry type
enables spatial queries and the GIST index supports spatial operations.
5. PostGIS System Views (2)
The geography_columns and geometry_columns views are managed by the PostGIS
extension, not by application code. pgschema detects minor differences in view
definitions between the dump and the live database, but these are artifacts of
PostGIS version differences.
Impact: None. These should be excluded from drift analysis.
Column Ordering
The old LinkML schema placed domain-specific columns before inherited base
columns (e.g., record_id before id). The new SQLModel schema places
inherited base columns first (id, created_at, updated_at, etl_run_id,
lineage_group_id) followed by domain columns. pgschema does not flag this as
drift since PostgreSQL column order has no semantic impact.
Recommendations
-
No action needed for
text->varchardrift. The types are identical in PostgreSQL. If strict parity with the old schema is desired, addsa_type=Textto string fields in the SQLModel models, but this adds verbosity with no functional benefit. -
contact.middle_name— no action needed. The field was a SQL-first workflow test artifact with zero references in application code. -
Keep the reference SQL updated — after any Alembic migration, run
pixi run schema-dumpto regenerate the reference SQL file.