The boring parts of MLOps. — Subbusainath Rengasamy

There is a popular shape of MLOps post that goes: “we used MLflow + Airflow + Feast and now we ship models continuously.” This is not that post.

This is the post about the three things that quietly determine whether your retraining pipeline survives twelve months of organizational entropy: schemas, artifacts, and the audit trail.

Schemas

A model is downstream of a feature contract whether you’ve written one down or not. The question is whether the contract is checked at every layer or only by accident.

What “checked” looks like in practice:

A Pydantic / Protobuf / Avro schema for the inference payload, validated at the API boundary.
A separate schema for the training data — overlapping but not identical (training has labels, inference doesn’t).
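A reconciliation step that fails the pipeline if the training schema and inference schema diverge in incompatible ways: a feature present on one side but missing from the other, or the same field declared with different types.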

That last one is the boring one nobody builds until they get burned.
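
As a rough illustration of that reconciliation step, here is a minimal sketch. It assumes each schema has already been flattened to a mapping of field name to type name; the field names, the LABEL_FIELDS set, and the reconcile function are all hypothetical, and in a real pipeline the mappings would be derived from the Pydantic / Protobuf / Avro definitions rather than written by hand.

```python
# Hypothetical schemas, flattened to {field name: type name}.
TRAINING_SCHEMA = {
    "user_age": "int",
    "country": "str",
    "basket_value": "float",
    "churned": "bool",
}
INFERENCE_SCHEMA = {
    "user_age": "int",
    "country": "str",
    "basket_value": "float",
}
# Fields that are allowed to exist on the training side only.
LABEL_FIELDS = {"churned"}


def reconcile(training, inference, label_fields):
    """Return a list of incompatibilities; an empty list means the schemas agree."""
    problems = []
    for name in sorted(set(training) | set(inference)):
        if name in label_fields:
            # Labels belong to training only; seeing one at inference is a leak.
            if name in inference:
                problems.append(f"{name}: label field present in inference schema")
            continue
        if name not in inference:
            problems.append(f"{name}: in training but missing from inference")
        elif name not in training:
            problems.append(f"{name}: in inference but missing from training")
        elif training[name] != inference[name]:
            problems.append(
                f"{name}: type mismatch "
                f"(training={training[name]}, inference={inference[name]})"
            )
    return problems


def check_or_fail(training, inference, label_fields):
    """Raise if the schemas are incompatible, so the pipeline run stops here."""
    problems = reconcile(training, inference, label_fields)
    if problems:
        raise RuntimeError("schema reconciliation failed: " + "; ".join(problems))
```

Calling check_or_fail as the first step of a training run is one way to make the divergence loud instead of accidental. What counts as "incompatible" is a design choice; this sketch treats any missing field or type difference as fatal.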

Artifacts

An artifact is anything the pipeline produces that another part of the system reads. Models, of course — but also feature stats, calibration tables, evaluation reports, the prompt template for an LLM-judge eval.

The discipline that pays:

Every artifact has a stable ID: a hash of its inputs plus the version of the code that produced it. Re-running the same code with the same inputs produces the same ID.
Every artifact is immutable once produced. No overwrites. New version → new ID.
Every artifact has lineage — a record of which other artifacts it was produced from.

This is what lets you answer the question “why is the model that’s serving traffic right now this version?” in fewer than five steps.
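
The three rules above can be sketched in a few dozen lines. This is an illustration under assumptions, not a prescribed design: artifact_id, ArtifactStore, and the "stage@version" code-version strings are hypothetical names, and the in-memory dictionary stands in for whatever object store or database actually holds the records.

```python
import hashlib
import json


def artifact_id(input_ids, code_version, params):
    """Derive a deterministic ID from input artifact IDs, code version, and params."""
    payload = json.dumps(
        {"inputs": sorted(input_ids), "code": code_version, "params": params},
        sort_keys=True,  # stable key order so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class ArtifactStore:
    """In-memory stand-in for an artifact store that refuses overwrites."""

    def __init__(self):
        self._records = {}

    def put(self, art_id, kind, parents):
        record = {"kind": kind, "parents": list(parents)}
        existing = self._records.get(art_id)
        if existing is not None:
            if existing == record:
                return  # identical re-run: same ID, same record, nothing to do
            raise ValueError(f"artifact {art_id} already exists and is immutable")
        self._records[art_id] = record

    def lineage(self, art_id):
        """Return every ancestor ID of an artifact, nearest ancestors first."""
        ancestors = []
        queue = list(self._records[art_id]["parents"])
        while queue:
            current = queue.pop(0)
            if current in ancestors:
                continue
            ancestors.append(current)
            queue.extend(self._records.get(current, {"parents": []})["parents"])
        return ancestors


# Hypothetical three-stage chain: raw data -> features -> model.
store = ArtifactStore()
raw = artifact_id([], "ingest@1.0", {"partition": "example"})
features = artifact_id([raw], "features@2.3", {"window_days": 30})
model = artifact_id([features], "train@0.9", {"learning_rate": 0.01})
store.put(raw, "raw", [])
store.put(features, "features", [raw])
store.put(model, "model", [features])
```

With this in place, store.lineage(model) walks back from the serving model to the features and raw data it came from, which is the chain of steps the question above asks for.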

The audit trail

Six months from now, someone — possibly you — will need to answer one of:

“Which model was running on date X?”
“Why did we promote version Y over version Z?”
“What changed between the model that worked in staging and the one that’s misbehaving in prod?”

The audit trail is the answer. It is not a Slack scroll. It is:

A model registry with promotion events recorded as immutable rows.
An eval report attached to each promotion, with the dataset version it was evaluated against.
A deployment manifest that records which artifact ID is serving traffic, refreshed on every deploy.
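
To make "promotion events recorded as immutable rows" concrete, here is a minimal sketch of an append-only promotion log that can answer "which model was running on date X?". PromotionEvent, PromotionLog, and every field name are assumptions made for illustration; a real registry would more likely be a database table with insert-only permissions than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a recorded event cannot be mutated afterwards
class PromotionEvent:
    promoted_at: datetime
    model_id: str
    eval_report_id: str
    eval_dataset_version: str
    reason: str


class PromotionLog:
    """Append-only list of promotion events, kept in chronological order."""

    def __init__(self):
        self._events = []

    def record(self, event):
        if self._events and event.promoted_at < self._events[-1].promoted_at:
            raise ValueError("events must be appended in chronological order")
        self._events.append(event)

    def model_at(self, when):
        """Return the most recent promotion at or before `when`, or None."""
        current = None
        for event in self._events:
            if event.promoted_at <= when:
                current = event
            else:
                break
        return current


# Hypothetical history with two promotions.
log = PromotionLog()
log.record(PromotionEvent(
    promoted_at=datetime(2024, 1, 10, tzinfo=timezone.utc),
    model_id="model-a",
    eval_report_id="eval-a",
    eval_dataset_version="holdout-v3",
    reason="initial release",
))
log.record(PromotionEvent(
    promoted_at=datetime(2024, 3, 2, tzinfo=timezone.utc),
    model_id="model-b",
    eval_report_id="eval-b",
    eval_dataset_version="holdout-v4",
    reason="beat model-a on holdout-v4",
))
```

Because each event carries its eval report ID, dataset version, and a reason, the same rows also answer "why did we promote version Y over version Z?" without anyone scrolling through chat history.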

What it looks like, end-to-end

The simplest version of all of this that actually works:

Ingestion writes raw data with schema validation at the boundary. Bad records go to a quarantine bucket.
Feature pipeline produces feature artifacts with hash-based IDs. Stats artifacts produced alongside.
Training consumes feature artifact IDs (not paths), produces a model artifact ID + an eval artifact ID.
Promotion is a recorded event in the registry, gated on eval thresholds.
Deploy reads the registry, writes a deployment manifest with the artifact ID it pulled.
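
The last two steps, gated promotion and a deploy that writes a manifest, might look something like the sketch below. The metric names, threshold values, and the gate, promote, and deploy functions are all hypothetical, and the registry is a plain list standing in for the real thing.

```python
import json

# Hypothetical thresholds: each metric must be at or above its minimum.
EVAL_THRESHOLDS = {"auc": 0.80, "recall": 0.60}


def gate(metrics, thresholds):
    """Return the names of metrics that are missing or below their threshold."""
    return [
        name
        for name, minimum in sorted(thresholds.items())
        if metrics.get(name, float("-inf")) < minimum
    ]


def promote(registry, model_id, eval_id, metrics, thresholds):
    """Append a promotion event to the registry, or raise if the gate fails."""
    failures = gate(metrics, thresholds)
    if failures:
        raise ValueError(
            f"promotion of {model_id} blocked by: {', '.join(failures)}"
        )
    registry.append(
        {"model_id": model_id, "eval_id": eval_id, "metrics": dict(metrics)}
    )


def deploy(registry, deployed_at):
    """Read the latest promotion and return a deployment manifest as JSON."""
    if not registry:
        raise LookupError("no promoted model in the registry")
    latest = registry[-1]
    manifest = {
        "artifact_id": latest["model_id"],  # the ID actually pulled, not a path
        "eval_id": latest["eval_id"],
        "deployed_at": deployed_at,
    }
    return json.dumps(manifest, sort_keys=True)
```

A model that misses a threshold never reaches the registry, so deploy cannot pick it up by accident, and the manifest records exactly which artifact ID went out.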

None of this is novel. None of it requires a hyperscaler-grade platform. Three engineers can build the bones of it in two months.

The reason it’s worth building is that the day you need it — when an auditor asks, when a model misbehaves, when someone asks why version 7 outperformed version 8 — you have actual evidence instead of a theory.