On retrieval that survives the on-call rotation.
Why 'RAG' is a research toy until you treat ingestion, eval, and observability as first-class concerns. Field-tested patterns from three production deployments.
Most “RAG demos” survive about thirty seconds of real production traffic before the wheels come off. The model is fine. The retriever is fine. What breaks is the boring connective tissue between them — ingestion, schema, eval, and observability.
This is a rough field guide built from three deployments: an asset-management research tool, a customer-support assistant for a B2B SaaS, and an internal docs search at a 600-person company. Different shapes, same lessons.
Ingestion is half the system
The thing nobody writes about: getting good documents into your store is harder than the retrieval itself. PDFs are a war crime. Confluence pages have a half-life of six months. Slack threads are useful but contextually orphaned the moment they leave the channel.
The patterns that worked:
- Idempotent ingestion, keyed on a stable document ID, with a content hash so re-runs don’t churn the store.
- Layered chunking: ingest each document at three resolutions (page, section, paragraph) and choose the resolution at retrieval time.
- Metadata over cleverness: author, last-edited timestamp, ACL tags. Retrieval can filter on these before computing embedding similarity, which is faster (fewer candidates to score) and more correct (stale or unauthorized documents never reach the ranker).
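A minimal sketch of the first and third patterns, assuming an in-memory store. The `Document` fields, the `Store` class, and the choice to hash metadata together with the text are illustrative assumptions, not the schema of any of the three deployments. Layered chunking is left out for brevity.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str                      # stable ID from the source system (assumed field)
    text: str
    acl_tags: frozenset = frozenset()
    last_edited: float = 0.0         # Unix timestamp


def content_hash(doc: Document) -> str:
    """Hash text and metadata together so an ACL change also triggers a rewrite."""
    payload = json.dumps(
        [doc.text, sorted(doc.acl_tags), doc.last_edited], ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class Store:
    """In-memory stand-in for a real document store."""
    docs: dict = field(default_factory=dict)     # doc_id -> Document
    hashes: dict = field(default_factory=dict)   # doc_id -> content hash
    writes: int = 0                              # counts writes that changed the store

    def upsert(self, doc: Document) -> bool:
        """Write the document only if it changed. Returns True on a write."""
        digest = content_hash(doc)
        if self.hashes.get(doc.doc_id) == digest:
            return False  # unchanged: a real pipeline would skip re-embedding here
        self.docs[doc.doc_id] = doc
        self.hashes[doc.doc_id] = digest
        self.writes += 1
        return True

    def candidates(self, user_tags: set, edited_after: float = 0.0) -> list:
        """Metadata pre-filter, applied before any similarity scoring."""
        return [
            d for d in self.docs.values()
            if d.acl_tags & user_tags and d.last_edited >= edited_after
        ]
```

Re-running the same batch through `upsert` leaves `writes` unchanged, which is what makes backfills safe to repeat. Only the output of `candidates` would be handed to the embedding ranker.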
Eval is not a launch-day ritual
By the day you ship, you should already have an eval set running on every pull request. If you’re hand-checking outputs in a Notion doc, you’ve already lost.
Two evals that pay rent:
- A golden set of 50–200 queries with known-good answers. Run on every model/prompt change. Nothing fancy — exact-match for citation IDs, LLM-as-judge for quality.
- A regression set built from real production failures. Every time a user flags a bad answer, that query becomes a permanent fixture.
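One way the two sets could be wired up, as a sketch. It covers only the exact-match check on citation IDs; the LLM-as-judge half is omitted. `answer_fn` is a hypothetical callable standing in for the full pipeline, and `GoldenCase` is an assumed shape for a test case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    query: str
    expected_citations: frozenset  # IDs of the documents a good answer must cite


def run_golden_set(cases, answer_fn):
    """Return (pass_rate, failures) using exact match on citation IDs.

    answer_fn is a hypothetical callable: query -> (answer_text, cited_ids).
    """
    failures = []
    for case in cases:
        _, cited = answer_fn(case.query)
        if frozenset(cited) != case.expected_citations:
            failures.append((case.query, sorted(cited)))
    pass_rate = 1.0 - len(failures) / len(cases) if cases else 1.0
    return pass_rate, failures


def add_regression_case(regression_set, query, expected_citations):
    """Promote a user-flagged query to a permanent fixture, without duplicates."""
    case = GoldenCase(query, frozenset(expected_citations))
    if case not in regression_set:
        regression_set.append(case)
    return regression_set
```

In CI, the same function can run over both sets on every pull request, failing the build when `pass_rate` drops below whatever bar the team has agreed on.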
Observability beats prompt engineering
You will spend more time staring at the trace of a single bad query than rewriting prompts. Tools that helped:
- Per-query span: prompt → retrieval results → final output, with the retrieval scores visible.
- A replay affordance: pin a bad query, change one variable (k, prompt, model), re-run, diff.
- Alarms on the boring metrics: median retrieval latency, cache hit rate, and the fraction of queries whose top result has a similarity score below a set threshold.
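A rough sketch of what the trace record, two of the alarm metrics, and the replay step might look like. All names here (`QueryTrace`, `run_fn`, the default `k` and `model` values) are assumptions for illustration; `run_fn` stands in for whatever re-executes the pipeline from a pinned configuration.

```python
from dataclasses import dataclass, replace
from statistics import median


@dataclass(frozen=True)
class QueryTrace:
    query: str
    prompt: str
    retrieved: tuple         # ((doc_id, similarity_score), ...), best first
    output: str
    retrieval_ms: float
    k: int = 5
    model: str = "model-a"   # placeholder name


def low_score_fraction(traces, threshold):
    """Fraction of queries whose top result scores below the threshold.

    A query with no results at all counts as low.
    """
    if not traces:
        return 0.0
    low = sum(
        1 for t in traces
        if not t.retrieved or t.retrieved[0][1] < threshold
    )
    return low / len(traces)


def median_retrieval_ms(traces):
    """Median retrieval latency over a non-empty window of traces."""
    return median(t.retrieval_ms for t in traces)


def replay(trace, run_fn, **change):
    """Re-run a pinned query with exactly one variable changed, then diff.

    run_fn is a hypothetical callable: QueryTrace (used as config) -> QueryTrace.
    """
    if len(change) != 1:
        raise ValueError("change exactly one variable per replay")
    new_trace = run_fn(replace(trace, **change))
    diff = {
        name: (getattr(trace, name), getattr(new_trace, name))
        for name in ("retrieved", "output")
        if getattr(trace, name) != getattr(new_trace, name)
    }
    return new_trace, diff
```

Restricting `replay` to a single changed variable is deliberate: when the diff is non-empty, there is only one thing that could have caused it.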
The takeaways
If you’re building this for the second time:
- Treat ingestion as a real pipeline, with backfills and idempotency, not a one-shot script.
- Build the eval harness before you ship.
- Spend on observability before you spend on a fancier model.
The unsexy half is where the production value lives. The model is, by 2026, a commodity.