Subbusainath Rengasamy — Architecting AI-native systems, end to end.

§ 01

A short note, pinned.

About · stack · this week

I'm Subbusainath — a lead software engineer based in Chennai, Tamil Nadu, India.

Over the past 7+ years I've designed, shipped, and operated production systems — most of them at the intersection of cloud-native infrastructure and machine learning. The past two years have been almost entirely about agentic AI: building systems where models are first-class runtimes, not chat widgets bolted onto a backend.

Day-to-day I write Python, TypeScript, and Go, draw architecture diagrams that people actually use, and care a great deal about the boring parts — observability, retries, schemas, cost. I write here because writing is how I think; I take consulting work because production is where systems prove themselves.

Off-screen: long-form blogging, slow coffee, and an unreasonable interest in old typography.

This week · now

What I'm shipping right now

Mon · Eval harness for a multi-agent research tool, stress-testing its memory contracts.
Wed · Drafting an essay on the four agent failure modes I see most often.
Fri · Discovery call with a fintech team migrating off a Lambda monolith.
Sat · Re-reading Designing Data-Intensive Applications, ch. 11.

Currently

Q3 — Q4 2026 open

Independent consulting
2 active engagements
1 retainer slot left

Stack & comfort

Languages — Python, TypeScript, Go, Rust, Java
Data & ML — PySpark, LangChain, LangGraph
Frontend — Next.js, Astro, React + TanStack
Mobile — Flutter, React Native
IaC — AWS CDK, Pulumi, Terraform
Cloud — AWS, GCP (strong on both)
Platform — Kubernetes, Docker, Podman

Me · in 3D

Pixar-rendered from my actual photo with FLUX.1-Kontext.

Recognition · Year 4

AWS Community Builder · DevTools

Four years running in the DevTools cohort — recognized by AWS for deep-dive writing & talks on serverless, IaC, and the developer tooling that makes cloud-native work less painful.

Certification · archive

AWS Solutions Architect · Associate

Status: expired — kept here for the record.

View on Credly ↗

On YouTube · @subbucodingvideos

From my channel

How to Setup Cloudfront Distribution with S3 bucket using Terraform or AWS CDK | AWS Dec 2024

↑ the yellow note
changes weekly.

§ 02

What I help teams build.

Six service lines

№ 01 / Service

Agentic AI systems

From single-model prototypes to multi-agent production systems with eval, observability, and graceful failure modes.

LangGraph · Anthropic · Modal · Postgres

№ 02 / Featured

Cloud-native & serverless migrations

Lifting legacy monoliths into event-driven, function-first architectures — without freezing the business for six months.

AWS · GCP · EventBridge · Step Functions

№ 03 / Service

End-to-end MLOps pipelines

Training to inference to retraining. CI/CD for models, drift detection, and the artifact discipline that lets juniors ship safely.

MLflow · Argo · Vertex · SageMaker

№ 04 / Service

Data pipelines

Streaming and batch, designed around schema evolution and replay rather than dashboards. Cost-aware by default.

Kafka · Flink · dbt · BigQuery

№ 05 / Service

Harness & delivery engineering

CI/CD that engineers don't fight on Mondays. GitOps, deployment safety, and the unglamorous templating layer.

Harness · ArgoCD · GitHub Actions

№ 06 / Service

Production-grade builds

End-to-end product delivery for AI-first teams who need a senior engineer who can also run the architecture conversation.

Python · Go · TypeScript · Postgres

§ 03

Selected work.

6 of 18

Case study · 01 · Agentic AI

Voyager — an agentic research assistant for analysts

Designed and shipped a multi-agent research system that drafts memos, validates sources, and surfaces uncertainties — operated by 40+ analysts daily.

Anthropic · LangGraph · Modal · Postgres · pgvector

Brief

An asset manager wanted internal analysts to draft research memos with model assistance — but their first attempt was a chat bolted onto a backend, and the analysts didn't trust it.

What we built

A four-agent system with explicit memory contracts: a planner, a retrieval agent over their internal corpus, a critic, and a drafter. Every agent step is logged with the inputs it actually saw; eval is continuous, not a launch-day ritual.
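As a rough illustration of what an explicit memory contract with logged inputs could look like (a minimal sketch, not the production code): each agent declares the keys it may read and write, it only ever receives that slice of state, and the log stores a snapshot of exactly what it saw. The names `MemoryContract`, `StepRecord`, `run_step`, and the toy `planner` and `retriever` functions are all hypothetical.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


@dataclass(frozen=True)
class MemoryContract:
    agent: str
    reads: Tuple[str, ...]   # keys the agent is allowed to see
    writes: Tuple[str, ...]  # keys the agent is allowed to produce


@dataclass
class StepRecord:
    agent: str
    inputs: State   # snapshot of what the agent actually received
    outputs: State


def run_step(contract: MemoryContract,
             agent_fn: Callable[[State], State],
             state: State,
             log: List[StepRecord]) -> State:
    # Hand the agent only the keys its contract declares.
    visible = {k: copy.deepcopy(state[k]) for k in contract.reads if k in state}
    # Snapshot before the call so later mutation cannot rewrite history.
    seen = copy.deepcopy(visible)
    outputs = agent_fn(visible)
    undeclared = set(outputs) - set(contract.writes)
    if undeclared:
        raise ValueError(f"{contract.agent} wrote undeclared keys: {sorted(undeclared)}")
    log.append(StepRecord(contract.agent, seen, copy.deepcopy(outputs)))
    return {**state, **outputs}


# Toy stand-ins for real model-backed agents (assumptions for illustration).
def planner(view: State) -> State:
    return {"plan": ["find filings for " + view["question"]]}


def retriever(view: State) -> State:
    return {"sources": [f"doc for: {step}" for step in view["plan"]]}


PLANNER = MemoryContract("planner", reads=("question",), writes=("plan",))
RETRIEVER = MemoryContract("retriever", reads=("plan",), writes=("sources",))

log: List[StepRecord] = []
state: State = {"question": "ACME Q2 margins", "internal_notes": "not for agents"}
state = run_step(PLANNER, planner, state, log)
state = run_step(RETRIEVER, retriever, state, log)
```

The useful property is that the log answers "what did this agent see?" without replaying the run, and an agent that writes outside its contract fails loudly instead of silently polluting shared state.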

Outcome

Adoption hit 40+ daily users in eight weeks. Memo first-draft time dropped from a half-day to roughly 25 minutes; analysts kept editorial control end-to-end.

Role Architect & lead engineer

Team 3 eng · 1 PM · 2 analysts

Duration 14 weeks

Stack Anthropic, LangGraph, Modal, Postgres, pgvector

Outcome 40+ DAU · 25-min drafts

Case study · 02 · Cloud Native

14-year-old ERP → event-driven serverless

Rebuilt a legacy ERP backbone as an event-driven, function-first system. Zero-downtime cutover, 62% lower run cost.

AWS · EventBridge · Step Functions · Aurora

Brief

A 14-year-old monolithic ERP was the backbone of a manufacturing operation. The vendor had stopped updating it; the on-call burden had not.

What we built

An event-driven core on AWS — EventBridge plus orchestrated Step Functions for the long-running flows. Strangler-fig migration over six months, with a shadowing layer reading from both systems before each cutover.
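The shadowing idea can be sketched in a few lines, assuming both systems expose a read function for the same key. This is an illustrative outline only: `ShadowReader` and its fields are hypothetical names, and the legacy system stays the source of truth until the mismatch rate is low enough to justify a cutover.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ShadowReader:
    legacy_read: Callable[[str], Any]  # current source of truth
    new_read: Callable[[str], Any]     # candidate system under test
    total: int = 0
    mismatches: List[str] = field(default_factory=list)

    def read(self, key: str) -> Any:
        """Serve the legacy answer; compare the new system's answer on the side."""
        self.total += 1
        expected = self.legacy_read(key)
        try:
            actual = self.new_read(key)
        except Exception:
            # A failure in the new system counts as a mismatch, never an outage.
            self.mismatches.append(key)
            return expected
        if actual != expected:
            self.mismatches.append(key)
        return expected

    def mismatch_rate(self) -> float:
        return len(self.mismatches) / self.total if self.total else 0.0


# Toy data standing in for the two systems.
legacy = {"order-1": "SHIPPED", "order-2": "PENDING"}
rebuilt = {"order-1": "SHIPPED", "order-2": "CANCELLED"}

shadow = ShadowReader(legacy_read=legacy.__getitem__, new_read=rebuilt.__getitem__)
first = shadow.read("order-1")
second = shadow.read("order-2")
```

Callers always get the legacy value, so a bug in the new module shows up as a logged mismatch rather than a customer-facing error.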

Outcome

Zero-downtime cutover across six modules. Run-cost dropped 62%. P95 order-processing latency improved from minutes to single-digit seconds.

Role Migration architect

Team 5 eng · 2 ops · 1 PM

Duration 6 months

Stack AWS Lambda, EventBridge, Step Functions, Aurora

Outcome −62% cost · zero downtime

Case study · 03 · MLOps

A pragmatic MLOps stack for small teams

End-to-end training to inference for a 5-person ML team without the hyperscaler overhead.

MLflow · Argo · Harness

Brief

A five-person ML team needed real CI/CD for models, but didn't have the headcount to maintain a hyperscaler-grade platform.

What we built

An MLflow + Argo + Harness stack tuned for small teams: artifact discipline by default, shadow deployments before promotion, and a retraining trigger driven by drift detection rather than a calendar.
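One way a drift-driven retraining trigger might look is sketched below. The source does not say which drift metric was used; the Population Stability Index (PSI) and the 0.2 threshold here are assumptions chosen because they are a common rule of thumb, and `psi` and `should_retrain` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def _bin_fractions(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Fraction of values per equal-width bin, clamped into [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1
    eps = 1e-6  # floor so empty bins do not produce log(0)
    return [max(c / len(values), eps) for c in counts]


def psi(reference: Sequence[float], live: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `live` against `reference`."""
    lo, hi = min(reference), max(reference)
    ref = _bin_fractions(reference, lo, hi, bins)
    cur = _bin_fractions(live, lo, hi, bins)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_retrain(reference: Sequence[float], live: Sequence[float],
                   threshold: float = 0.2) -> bool:
    # Threshold is an assumed default; tune it per feature in practice.
    return psi(reference, live) > threshold


training_feature = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
shifted_feature = [0.9 + i / 1000 for i in range(100)]  # mass piled into the top bin
```

In a pipeline, a scheduled job would compute this per feature and start the retraining workflow only when `should_retrain` returns true, instead of retraining on a fixed calendar.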

Outcome

Two production models retrained autonomously over twelve months with no on-call escalations attributable to the pipeline itself.

Role Tech lead

Team 5 ML eng

Duration 10 weeks

Stack MLflow, Argo Workflows, Harness

Case study · 04 · Data

Replay-safe streaming for an ad-tech platform

Schema-aware streaming with deterministic replay, cutting incident recovery from hours to minutes.

Kafka · Flink · Schema Registry

Brief

An ad-tech platform was losing thousands of dollars per minute during stream incidents because replay required guesswork on schema versions.

What we built

A schema-registry-backed streaming pipeline where every record carries the version it was written with, and consumers know how to evolve it forward. Deterministic replay built on top.
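To make the "every record carries its version" idea concrete, here is a small sketch that leaves out the registry and the Kafka and Flink plumbing entirely. The field names (`bid`, `bid_micros`, `currency`), the upgrade steps, and the `evolve` helper are all hypothetical; the point is only that a consumer can walk an old record forward one version at a time, which is what makes replay deterministic.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]
Payload = Dict[str, Any]


def _v1_to_v2(payload: Payload) -> Payload:
    # v2 added a currency field; old records are assumed to be USD.
    out = dict(payload)
    out["currency"] = "USD"
    return out


def _v2_to_v3(payload: Payload) -> Payload:
    # v3 replaced the float bid with integer micros.
    out = dict(payload)
    out["bid_micros"] = int(round(out.pop("bid") * 1_000_000))
    return out


# One pure upgrade function per version step.
UPGRADERS: Dict[int, Callable[[Payload], Payload]] = {1: _v1_to_v2, 2: _v2_to_v3}
LATEST = 3


def evolve(record: Record, target: int = LATEST) -> Record:
    """Upgrade a record step by step from its written version to `target`."""
    version, payload = record["schema_version"], record["payload"]
    if version > target:
        raise ValueError(f"record version {version} is newer than consumer ({target})")
    while version < target:
        payload = UPGRADERS[version](payload)
        version += 1
    return {"schema_version": version, "payload": payload}


old_record = {"schema_version": 1, "payload": {"bid": 1.5}}
replayed = evolve(old_record)
```

Because each step is a pure function of the payload, replaying the same stored record always produces the same upgraded result, with no guessing about which schema it was written under.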

Outcome

Mean-time-to-recovery dropped from ~3.5 hours to ~12 minutes. The schema review process became a normal PR rather than a launch-day fire drill.

Role Streaming architect

Team 4 eng

Duration 9 weeks

Stack Kafka, Flink, Confluent Schema Registry

Side project · Open source

harness-templates — opinionated CD pipelines

A small library of Harness templates for ML and Python services. ★ 240, 9 contributors.

Harness · YAML · Go

Brief

I kept rewriting the same Harness pipelines from scratch on every engagement. So I extracted the patterns I trusted into a small, opinionated library.

What's in it

Templates for Python services, ML models with shadow deployments, and the boring-but-essential parts: secrets, approvals, rollbacks. Documented on the assumption you've never used Harness before.

Adoption

★ 240 on GitHub, 9 outside contributors, used by ~30 teams I know about.

Type Open source library

Stars ★ 240

Contribs 9

Stack Harness YAML, Go

Case study · 06 · AI Pipeline

AI voice proctoring — diarized interviews with grounded reports

Hybrid serverless + GPU pipeline that diarizes interview audio with Pyannote, then RAGs the transcript into a PDF report with timestamped proof of malpractice.

GCP · Cloud Run · Vertex AI · Pyannote · RAG

Brief

An interview platform needed evidence-grade detection of cheating during live remote interviews: multiple voices, prompts being read off-screen, third-party assistance. A raw transcript was not enough; reviewers needed who said what, when, and why it counted as malpractice.

What we built

A hybrid pipeline: Cloud Run handles ingest, audio extraction from video, and pre-processing on serverless. The diarization step — Pyannote — runs on a containerized GPU instance, since the model is too heavy for serverless cold-starts. Speaker-labeled segments with timestamps flow into a RAG pipeline on Vertex AI Studio, which retrieves policy context and generates a structured report. A PDF render step attaches the offending audio clips as proof inside the report.
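The hand-off between diarization and reporting can be pictured with a small sketch. It assumes the diarization step has already produced speaker-labeled segments with start and end times; it does not call Pyannote or Vertex AI, and `Segment`, `unexpected_speech`, `fmt_timestamp`, and the one-second minimum duration are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Segment:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float


def unexpected_speech(segments: List[Segment], expected: Set[str],
                      min_duration: float = 1.0) -> List[Segment]:
    """Segments spoken by anyone outside the expected participants."""
    return [
        s for s in segments
        if s.speaker not in expected and (s.end - s.start) >= min_duration
    ]


def fmt_timestamp(seconds: float) -> str:
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


segments = [
    Segment("interviewer", 0.0, 12.5),
    Segment("candidate", 12.5, 70.0),
    Segment("speaker_3", 75.4, 79.0),   # a third voice, long enough to flag
    Segment("speaker_3", 90.0, 90.4),   # too short, likely noise
]

flags = unexpected_speech(segments, expected={"interviewer", "candidate"})
evidence = [f"{s.speaker} at {fmt_timestamp(s.start)}-{fmt_timestamp(s.end)}" for s in flags]
```

Flags like these are the kind of timestamped evidence a downstream report step could cite, so a reviewer can jump to the exact clip rather than scrub the full recording.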

Why hybrid

Cloud Run keeps ingest cheap and elastic for spiky interview volume. The GPU container stays warm only for the inference window — diarization is the expensive step and the only one that needs accelerators. Splitting the two cut idle cost while keeping P95 turnaround under three minutes per interview.

Outcome

Reviewers stopped scrubbing full recordings — they jump straight to the flagged timestamps with grounded citations. False-flag rate dropped because the RAG step grounds verdicts in the platform’s own malpractice policy rather than free-form LLM judgement.

Role Pipeline architect

Stack GCP, Cloud Run, Vertex AI Studio, Pyannote, GPU containers

Pattern Serverless + containerized GPU

Output PDF report · timestamped proof

§ 04

From the journal.

Essays · long-form notes · all posts →

Essay · featured · 9 min

On retrieval that survives the on-call rotation.

Why 'RAG' is a research toy until you treat ingestion, eval, and observability as first-class concerns. Field-tested patterns from three production deployments.

2026.04.18 · Read →

Essay · featured · 12 min

Why your agent loops — and the structural fixes.

Most agent loops are a symptom of bad memory contracts, not bad prompts. A short taxonomy and the four interventions that actually help.

2026.03.02 · Read →

Essay · 6 min

Harness, in plain English.

What it actually does, when it's worth the friction, and the three places small teams should not adopt it.

2026.01.17 · Read →

Essay · 14 min

The boring parts of MLOps.

Schemas, artifacts, and the audit trail. The unsexy infrastructure that lets a model retraining pipeline survive a quarterly review.

2025.11.04 · Read →

Essay · 4 min

Three diagrams that pay rent.

The architecture, sequence, and topology drawings I redraw on every engagement. Templates included.

2025.09.21 · Read →

Cross-posts · also live on Medium · DEV · AntStack

DEV · AWS Builders

Building production-ready Lambda Extensions

Best practices for Lambda Extensions that survive prod: lifecycle, telemetry, failure modes — from six years of serverless.

2025.03.05 · Read here →

Medium

Building a production clinical AI pipeline on AWS — raw data to deployed model

End-to-end clinical AI pipeline on AWS: ingest, label, train, deploy. The boring infra decisions that decide whether the model ships.

2025.01.12 · Read here →

Medium

L1, L2 and L3 CDK constructs — and when to use each

A practical mental model for the three CDK construct levels and when each one is the right tool.

2024.02.11 · Read here →

Medium

Pick the latest S3 prefix and unzip it inside Databricks

Short ops recipe — find the most recent S3 prefix and decompress on the fly inside a Databricks notebook.

2023.07.01 · Read here →

AntStack

SNS · SQS · Step Functions — when to use what

Decision guide for the three AWS messaging/workflow primitives, with realistic failure scenarios.

2023.03.14 · Read here →

AWS in Plain English

AWS AppSync with a Lambda Authorizer via CDK v2 nested stacks

Stand up AppSync with a Lambda authorizer using a nested-stack CDK pattern that scales cleanly.

2022.11.14 · Read here →

AWS in Plain English

A least-privilege S3 bucket in CDK TypeScript

No wildcards, no surprises — a tight bucket policy recipe in CDK TypeScript.

2021.10.02 · Read here →

DEV

Server-side pagination with Node.js, Prisma and Postgres

A no-magic walkthrough of cursor-based server-side pagination using Prisma against Postgres.

2021.07.19 · Read here →

§ 05

Talks & appearances.

Pinned · @subbucodingvideos

★ Pinned YouTube · @subbucodingvideos

How to Setup Cloudfront Distribution with S3 bucket using Terraform or AWS CDK | AWS

40 min · Dec 2024 · Watch ↗ · Channel ↗

§ 06

Get in touch.

Currently open

Have an AI-native system to ship?
Let's talk.

I'm taking on a small number of consulting engagements through Q3 — Q4 2026. The best fit is a team that has a working prototype and needs a senior engineer to help take it the last mile into production.

Email me

Open for work

Email enquiry@subbusainath.com

LinkedIn /in/subbusainath-rengasamy-02609b188

GitHub @subbusainath

X @SubbuSainath

Instagram @_valandhavaney_

Medium @subbusainathr

DEV.to @subbusainath

Location Chennai, Tamil Nadu, IN

Time zone IST · UTC+5:30

↑ all of these are real ways
to actually reach me.
