Open for Q3 — Q4 2026

Architecting AI-native systems, end to end.

Senior software engineer working at the seam between cloud-native infrastructure and generative AI. I take agentic systems from prototype to the on-call rotation — and write down what I learned along the way.

Selected work · Consulting · enquiry@subbusainath.com
Case study · 01

Voyager — agentic research

Anthropic · LangGraph · Modal

Essay · 03

On retrieval that survives prod.

9 min read

Case study · 02

ERP → event-driven serverless

AWS · EventBridge · Step Functions

Talk · 2025

The boring parts of MLOps

DevConf'25, Bangalore

Status

Open for work — Q3 / Q4

Agentic AI · cloud · MLOps. Reach me.

§ 01

A short note, pinned.

About · stack · this week

I'm Subbusainath — a senior software engineer based in Chennai, Tamil Nadu, India.

Over the past 7+ years I've designed, shipped, and operated production systems — most of them at the intersection of cloud-native infrastructure and machine learning. The past two years have been almost entirely about agentic AI: building systems where models are first-class runtimes, not chat widgets bolted onto a backend.

Day-to-day I write Python, TypeScript, and Go, draw architecture diagrams that people actually use, and care a great deal about the boring parts — observability, retries, schemas, cost. I write here because writing is how I think; I take consulting work because production is where systems prove themselves.

Off-screen: long-form blogging, slow coffee, and an unreasonable interest in old typography.

This week · now

What I'm shipping right now

  • Mon · Eval harness for a multi-agent research tool, stress-testing its memory contracts.
  • Wed · Drafting an essay on the four agent failure modes I see most often.
  • Fri · Discovery call with a fintech team migrating off a Lambda monolith.
  • Sat · Re-reading Designing Data-Intensive Applications, ch. 11.
Currently

Q3 — Q4 2026 open

  • Independent consulting
  • 2 active engagements
  • 1 retainer slot left
Stack & comfort
  • Languages — Python, TypeScript, Go, Rust, Java
  • Data & ML — PySpark, LangChain, LangGraph
  • Frontend — Next.js, Astro, React + TanStack
  • Mobile — Flutter, React Native
  • IaC — AWS CDK, Pulumi, Terraform
  • Cloud — AWS, GCP (strong on both)
  • Platform — Kubernetes, Docker, Podman
Me · in 3D

Pixar-rendered from my actual photo with FLUX.1-Kontext.

Recognition · Year 3

AWS Community Builder · DevTools

Three years running in the DevTools cohort — recognized by AWS for deep-dive writing & talks on serverless, IaC, and the developer tooling that makes cloud-native work less painful.

Certification · archive

AWS Solutions Architect · Associate

Status: expired — kept here for the record.

View on Credly ↗
§ 02

What I help teams build.

Six service lines
№ 01 / Service

Agentic AI systems

From single-model prototypes to multi-agent production systems with eval, observability, and graceful failure modes.

LangGraph · Anthropic · Modal · Postgres
№ 02 / Featured

Cloud-native & serverless migrations

Lifting legacy monoliths into event-driven, function-first architectures — without freezing the business for six months.

AWS · GCP · EventBridge · Step Functions
№ 03 / Service

End-to-end MLOps pipelines

Training to inference to retraining. CI/CD for models, drift detection, and the artifact discipline that lets juniors ship safely.

MLflow · Argo · Vertex · SageMaker
№ 04 / Service

Data pipelines

Streaming and batch, designed around schema evolution and replay rather than dashboards. Cost-aware by default.

Kafka · Flink · dbt · BigQuery
№ 05 / Service

Harness & delivery engineering

CI/CD that engineers don't fight on Mondays. GitOps, deployment safety, and the unglamorous templating layer.

Harness · ArgoCD · GitHub Actions
№ 06 / Service

Production-grade builds

End-to-end product delivery for AI-first teams who need a senior engineer who can also run the architecture conversation.

Python · Go · TypeScript · Postgres
§ 03

Selected work.

6 of 18
Diagram · Voyager runtime graph: analyst query, planner (task graph, memory keys), retrieval (corpus + rerank), critic (source, uncertainty), drafter (memo, cites), replan loop. Backing services: Postgres (pgvector, audit log), Anthropic API (Claude, streaming), Modal (serverless GPUs, per-step trace).
Case study · 01 · Agentic AI

Voyager — an agentic research assistant for analysts

Designed and shipped a multi-agent research system that drafts memos, validates sources, and surfaces uncertainties — operated by 40+ analysts daily.

Anthropic · LangGraph · Modal · Postgres · pgvector

Brief

An asset manager wanted internal analysts to draft research memos with model assistance — but their first attempt was a chat bolted onto a backend, and the analysts didn't trust it.

What we built

A four-agent system with explicit memory contracts: a planner, a retrieval agent over their internal corpus, a critic, and a drafter. Every agent step is logged with the inputs it actually saw; eval is continuous, not a launch-day ritual.
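
A minimal sketch of what an explicit memory contract could look like. This is an illustration, not Voyager's actual code: `AgentStep`, `Runtime`, and the key names are hypothetical, and the "agent" here is a plain function standing in for a model call. Each step declares the keys it may read and write, only sees the declared keys, and is logged with the inputs it actually saw.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentStep:
    """One agent plus its memory contract (hypothetical shape)."""
    name: str
    reads: frozenset[str]          # keys the agent is allowed to see
    writes: frozenset[str]         # keys the agent is allowed to produce
    run: Callable[[dict], dict]    # stand-in for the real model call


@dataclass
class Runtime:
    memory: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)

    def execute(self, step: AgentStep) -> None:
        # The agent only receives the keys its contract declares.
        visible = {k: self.memory[k] for k in step.reads if k in self.memory}
        output = step.run(dict(visible))
        undeclared = set(output) - step.writes
        if undeclared:
            raise ValueError(f"{step.name} wrote undeclared keys: {sorted(undeclared)}")
        # Log the inputs the step actually saw, not the whole memory.
        self.trace.append({"step": step.name, "inputs": visible, "outputs": output})
        self.memory.update(output)


# Usage: the planner reads "query" only, so "notes" never reaches it.
planner = AgentStep(
    name="planner",
    reads=frozenset({"query"}),
    writes=frozenset({"plan"}),
    run=lambda m: {"plan": [f"search: {m['query']}"]},
)
rt = Runtime(memory={"query": "rates outlook", "notes": "unrelated"})
rt.execute(planner)
```

The point of the contract is that a step failing loudly on an undeclared write is easier to debug than one that silently reshapes shared memory.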

Outcome

Adoption hit 40+ daily users in eight weeks. Memo first-draft time dropped from a half-day to roughly 25 minutes; analysts kept editorial control end-to-end.

Role Architect & lead engineer
Team 3 eng · 1 PM · 2 analysts
Duration 14 weeks
Stack Anthropic, LangGraph, Modal, Postgres, pgvector
Outcome 40+ DAU · 25-min drafts
Diagram · ERP strangler-fig cutover: legacy ERP (14-year-old on-prem monolith; orders, billing, stock), shadow reader (dual-read + diff), EventBridge (domain events), Lambda order-svc and stock-svc (short tasks), Step Functions (billing run, EOD reconciliation), Aurora (multi-AZ, PITR, per-domain schemas). Strangler fig across 6 modules · −62% run cost · zero-downtime cutover · P95 in seconds.
Case study · 02 · Cloud Native

14-year-old ERP → event-driven serverless

Rebuilt a legacy ERP backbone as an event-driven, function-first system. Zero-downtime cutover, 62% lower run cost.

AWS · EventBridge · Step Functions · Aurora

Brief

A 14-year-old monolithic ERP was the backbone of a manufacturing operation. The vendor had stopped updating it; the on-call burden had not.

What we built

An event-driven core on AWS — EventBridge plus orchestrated Step Functions for the long-running flows. Strangler-fig migration over six months, with a shadowing layer reading from both systems before each cutover.
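
The shadowing idea (read from both systems, serve the legacy answer, record any difference) can be sketched roughly as below. This is an assumption-laden illustration rather than the project's code: `shadow_read`, `diff_records`, and the in-memory "databases" are hypothetical stand-ins for the real readers.

```python
import logging
from typing import Callable

log = logging.getLogger("shadow_reader")


def diff_records(legacy: dict, candidate: dict) -> dict:
    """Return {field: (legacy_value, candidate_value)} for every mismatch.

    Note: a missing field and an explicit None compare as equal here.
    """
    keys = set(legacy) | set(candidate)
    return {
        k: (legacy.get(k), candidate.get(k))
        for k in keys
        if legacy.get(k) != candidate.get(k)
    }


def shadow_read(
    key: str,
    read_legacy: Callable[[str], dict],
    read_new: Callable[[str], dict],
) -> tuple[dict, dict]:
    """Serve the legacy answer; compare the new system's answer on the side."""
    legacy = read_legacy(key)
    try:
        mismatches = diff_records(legacy, read_new(key))
    except Exception as exc:  # the new path must never break the caller
        log.warning("shadow read failed for %s: %s", key, exc)
        return legacy, {"__error__": (None, repr(exc))}
    if mismatches:
        log.warning("shadow diff for %s: %s", key, mismatches)
    return legacy, mismatches


# Usage with two in-memory stand-ins for the legacy and new stores.
legacy_db = {"o-1": {"status": "billed", "total": 120}}
new_db = {"o-1": {"status": "billed", "total": 125}}
record, mismatches = shadow_read("o-1", legacy_db.__getitem__, new_db.__getitem__)
```

A module would only be a candidate for cutover once the diff stays empty over a representative window; how long that window should be is a judgment call the sketch does not encode.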

Outcome

Zero-downtime cutover across six modules. Run-cost dropped 62%. P95 order-processing latency improved from minutes to single-digit seconds.

Role Migration architect
Team 5 eng · 2 ops · 1 PM
Duration 6 months
Stack AWS Lambda, EventBridge, Step Functions, Aurora
Outcome −62% cost · zero downtime
Diagram · MLOps stack for a 5-person team: feature store (offline + online), MLflow (runs, params, metrics, model registry), Argo Workflows (train, eval, package; artifact discipline), Harness CD (templated pipeline, approvals, rollback), shadow (parity check, no user impact), production inference (canary 10 → 100%), drift-triggered retrain. 2 prod models · 12 months · no pipeline pages.
Case study · 03 · MLOps

A pragmatic MLOps stack for small teams

End-to-end training to inference for a 5-person ML team without the hyperscaler overhead.

MLflow · Argo · Harness

Brief

A five-person ML team needed real CI/CD for models, but didn't have the headcount to maintain a hyperscaler-grade platform.

What we built

An MLflow + Argo + Harness stack tuned for small teams: artifact discipline by default, shadow deployments before promotion, and a retraining trigger driven by drift detection rather than a calendar.
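
One way a drift-driven retraining trigger could work, sketched with the Population Stability Index (PSI). The source does not say which drift metric the stack uses, so PSI, the 10-bin layout, and the 0.2 threshold (a common rule of thumb) are all assumptions made for illustration.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(reference: Sequence[float], live: Sequence[float],
                   threshold: float = 0.2) -> bool:
    """Trigger retraining when the live feature distribution has drifted."""
    return psi(reference, live) > threshold


# Usage: an unchanged distribution does not trigger; a shifted one does.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
unchanged_triggers = should_retrain(reference, reference)
shifted_triggers = should_retrain(reference, shifted)
```

In a pipeline like the one described, a check of this kind would presumably run on a schedule and kick off the training workflow only when it returns true, which is what separates it from a calendar-based retrain.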

Outcome

Two production models retrained autonomously over twelve months with no on-call escalations attributable to the pipeline itself.

Role Tech lead
Team 5 ML eng
Duration 10 weeks
Stack MLflow, Argo Workflows, Harness
Diagram · Replay-safe streaming topology: producers for bid, imp, and clk events (edge SDK), schema registry (versioned, compat-gate), Kafka (topic-partitioned; each record carries a schema-id), Flink (stateful, checkpointed, evolves forward), consumers (attrib, billing, ods), deterministic replay (offset rewind, same schema-id), DLQ + replayer (incompatible records quarantined). MTTR 3.5 h → 12 min · schema changes become PRs.
Case study · 04 · Data

Replay-safe streaming for an ad-tech platform

Schema-aware streaming with deterministic replay, cutting incident recovery from hours to minutes.

Kafka · Flink · Schema Registry

Brief

An ad-tech platform was losing thousands of dollars per minute during stream incidents because replay required guesswork on schema versions.

What we built

A schema-registry-backed streaming pipeline where every record carries the version it was written with, and consumers know how to evolve it forward. Deterministic replay built on top.
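
"Consumers know how to evolve it forward" can be pictured as a chain of per-version upgraders keyed by the schema id on each record. The sketch below is hypothetical: the field names, the three versions, and the `UPGRADERS` table are invented for illustration, and it uses plain dicts rather than any Schema Registry client.

```python
from typing import Callable


def _v2_to_v3(payload: dict) -> dict:
    """v2 -> v3: replace float `price` with integer `price_micros`."""
    rest = {k: v for k, v in payload.items() if k != "price"}
    return {**rest, "price_micros": int(round(payload["price"] * 1_000_000))}


# Hypothetical upgraders: each one lifts a payload from version N to N + 1.
UPGRADERS: dict[int, Callable[[dict], dict]] = {
    1: lambda p: {**p, "currency": "USD"},  # v1 -> v2: new field with a default
    2: _v2_to_v3,
}
LATEST = 3


def evolve(record: dict) -> dict:
    """Upgrade a record to LATEST using the schema id it was written with."""
    version, payload = record["schema_id"], dict(record["payload"])
    if version > LATEST:
        # Unknown future schema: a real pipeline might route this to a DLQ.
        raise ValueError(f"schema_id {version} is newer than this consumer")
    while version < LATEST:
        payload = UPGRADERS[version](payload)
        version += 1
    return {"schema_id": version, "payload": payload}


# Usage: replaying an old record yields the same upgraded result every time.
old = {"schema_id": 1, "payload": {"bid_id": "b-7", "price": 1.25}}
first = evolve(old)
second = evolve(old)
```

Because the upgraders are pure functions of the payload and the version is read from the record itself, a replay from an earlier offset does not depend on guessing which schema was live at the time.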

Outcome

Mean-time-to-recovery dropped from ~3.5 hours to ~12 minutes. The schema review process became a normal PR rather than a launch-day fire drill.

Role Streaming architect
Team 4 eng
Duration 9 weeks
Stack Kafka, Flink, Confluent Schema Registry
Diagram · harness-templates GitOps flow: Git repo harness-templates (/python-svc.yaml, /ml-shadow.yaml, /secrets, /approvals, /rollback) imported by the Harness pipeline resolver (templates → stages, approvals, gates, artifact rollback). Python service: build, scan, canary, health-gated promote, rollback hook. ML model: shadow deploy, parity gate, promote on approval. Ops: secrets, approvals, audit log. ★ 240 · 9 contributors · ~30 teams · MIT.
Side project · Open source

harness-templates — opinionated CD pipelines

A small library of Harness templates for ML and Python services. ★ 240, 9 contributors.

Harness · YAML · Go

Brief

I kept rewriting the same Harness pipelines from scratch on every engagement. So I extracted the patterns I trusted into a small, opinionated library.

What's in it

Templates for Python services, ML models with shadow deployments, and the boring-but-essential parts: secrets, approvals, rollbacks. Documented on the assumption you've never used Harness before.

Adoption

★ 240 on GitHub, 9 outside contributors, used by ~30 teams I know about.

Type Open source library
Stars ★ 240
Contribs 9
Stack Harness YAML, Go
Diagram · Voice proctoring pipeline (serverless, GPU container, Vertex AI RAG): interview video → Cloud Run (demux, extract audio) → preprocess (denoise, 16 kHz mono) → Pyannote diarization (speaker labels, word-level timestamps) → diarized segments (SPK_A, SPK_B, timestamps) → Vertex AI RAG (policy context, grounded verdict) → PDF report (flagged clips attached as proof).
Case study · 06 · AI Pipeline

AI voice proctoring — diarized interviews with grounded reports

Hybrid serverless + GPU pipeline that diarizes interview audio with Pyannote, then RAGs the transcript into a PDF report with timestamped proof of malpractice.

GCP · Cloud Run · Vertex AI · Pyannote · RAG

Brief

An interview platform needed evidence-grade detection of cheating during live remote interviews: multiple voices, prompts being read off-screen, third-party assistance. A raw transcript was not enough; reviewers needed who said what, when, and why it counted as malpractice.

What we built

A hybrid pipeline: Cloud Run handles ingest, audio extraction from video, and pre-processing on serverless. The diarization step — Pyannote — runs on a containerized GPU instance, since the model is too heavy for serverless cold-starts. Speaker-labeled segments with timestamps flow into a RAG pipeline on Vertex AI Studio, which retrieves policy context and generates a structured report. A PDF render step attaches the offending audio clips as proof inside the report.
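
To make the "speaker-labeled segments with timestamps" hand-off concrete, here is a rough sketch of how diarized segments might be reduced to candidate clips before the report step. It is not the pipeline's code and does not use Pyannote's API: the `Segment` shape, the 0.5-second merge gap, and the one-second minimum are assumptions, and in the system described the verdict itself comes from the RAG step, not from a rule like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str   # diarization label, e.g. "SPK_A"
    start: float   # seconds from the start of the recording
    end: float


def flag_unexpected(segments: list[Segment], expected: set[str],
                    min_duration: float = 1.0,
                    max_gap: float = 0.5) -> list[Segment]:
    """Merge nearby segments from speakers outside `expected` into clips."""
    clips: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker in expected:
            continue
        last = clips[-1] if clips else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Extend the previous clip instead of emitting a fragment.
            clips[-1] = Segment(last.speaker, last.start, max(last.end, seg.end))
        else:
            clips.append(seg)
    # Drop very short clips, which are more likely noise than a third voice.
    return [c for c in clips if c.end - c.start >= min_duration]


# Usage: a third voice (SPK_C) appears between the two expected speakers.
segments = [
    Segment("SPK_A", 0.0, 5.0),
    Segment("SPK_C", 5.0, 5.8),
    Segment("SPK_C", 6.0, 7.0),
    Segment("SPK_B", 7.0, 9.0),
]
clips = flag_unexpected(segments, expected={"SPK_A", "SPK_B"})
```

Clips in this shape carry exactly what a report renderer needs to cut the audio: a speaker label plus start and end offsets.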

Why hybrid

Cloud Run keeps ingest cheap and elastic for spiky interview volume. The GPU container stays warm only for the inference window — diarization is the expensive step and the only one that needs accelerators. Splitting the two cut idle cost while keeping P95 turnaround under three minutes per interview.

Outcome

Reviewers stopped scrubbing full recordings — they jump straight to the flagged timestamps with grounded citations. False-flag rate dropped because the RAG step grounds verdicts in the platform’s own malpractice policy rather than free-form LLM judgement.

Role Pipeline architect
Stack GCP, Cloud Run, Vertex AI Studio, Pyannote, GPU containers
Pattern Serverless + containerized GPU
Output PDF report · timestamped proof
§ 04

From the journal.

Essays · long-form notes · all posts →
Essay · featured · 9 min

On retrieval that survives the on-call rotation.

Why 'RAG' is a research toy until you treat ingestion, eval, and observability as first-class concerns. Field-tested patterns from three production deployments.

2026.04.18  ·  Read →
Essay · featured · 12 min

Why your agent loops — and the structural fixes.

Most agent loops are a symptom of bad memory contracts, not bad prompts. A short taxonomy and the four interventions that actually help.

2026.03.02  ·  Read →
Essay · 6 min

Harness, in plain English.

What it actually does, when it's worth the friction, and the three places small teams should not adopt it.

2026.01.17  ·  Read →
Essay · 14 min

The boring parts of MLOps.

Schemas, artifacts, and the audit trail. The unsexy infrastructure that lets a model retraining pipeline survive a quarterly review.

2025.11.04  ·  Read →
Essay · 4 min

Three diagrams that pay rent.

The architecture, sequence, and topology drawings I redraw on every engagement. Templates included.

2025.09.21  ·  Read →
Cross-posts · also live on Medium · DEV · AntStack
DEV · AWS Builders

Building production-ready Lambda Extensions

Best practices for Lambda Extensions that survive prod: lifecycle, telemetry, failure modes — from six years of serverless.

2025.03.05  ·  Read here →
Medium

Building a production clinical AI pipeline on AWS — raw data to deployed model

End-to-end clinical AI pipeline on AWS: ingest, label, train, deploy. The boring infra decisions that decide whether the model ships.

2025.01.12  ·  Read here →
Medium

L1, L2 and L3 CDK constructs — and when to use each

A practical mental model for the three CDK construct levels and when each one is the right tool.

2024.02.11  ·  Read here →
Medium

Pick the latest S3 prefix and unzip it inside Databricks

Short ops recipe — find the most recent S3 prefix and decompress on the fly inside a Databricks notebook.

2023.07.01  ·  Read here →
AntStack

SNS · SQS · Step Functions — when to use what

Decision guide for the three AWS messaging/workflow primitives, with realistic failure scenarios.

2023.03.14  ·  Read here →
AWS in Plain English

AWS AppSync with a Lambda Authorizer via CDK v2 nested stacks

Stand up AppSync with a Lambda authorizer using a nested-stack CDK pattern that scales cleanly.

2022.11.14  ·  Read here →
AWS in Plain English

A least-privilege S3 bucket in CDK TypeScript

No wildcards, no surprises — a tight bucket policy recipe in CDK TypeScript.

2021.10.02  ·  Read here →
DEV

Server-side pagination with Node.js, Prisma and Postgres

A no-magic walkthrough of cursor-based server-side pagination using Prisma against Postgres.

2021.07.19  ·  Read here →
§ 05

Talks & appearances.

★ Pinned · YouTube · @subbucodingvideos

How to Setup Cloudfront Distribution with S3 bucket using Terraform or AWS CDK | AWS

40 min Watch ↗ Channel ↗
§ 06

Get in touch.

Currently open

Have an AI-native system to ship?
Let's talk.

I'm taking on a small number of consulting engagements through Q3 — Q4 2026. The best fit is a team that has a working prototype and needs a senior engineer to help take it the last mile into production.

Open for work
GitHub @subbusainath
Instagram @_valandhavaney_
DEV.to @subbusainath
Location Chennai, Tamil Nadu, IN
Time zone IST · UTC+5:30
↑ all of these are real ways
to actually reach me.
Newsletter · low-volume

A note when something new lands.

Long-form essays on agentic AI, MLOps, and production systems. No drips, no funnels — one mail when there is something worth reading.

No spam. Unsubscribe anytime.