
Building an AI Pipeline with LLM Quality Gates

AI · Architecture · AWS · Pipeline

I built a system that takes raw, messy business events — feedback submissions, call transcripts, compliance reports — and turns them into structured insights, recommendations, and policy documents. Automatically. With quality guarantees on every generated output.

This post covers the architecture: how intelligent batching worked, how LLM-as-judge quality gates prevented bad output from reaching users, how a data flywheel kept the system compounding over time, and why the original SNS/SQS design eventually forced a migration to AWS Step Functions.

For the full case study, see Predictive HR.


The Problem

The client’s platform ingested thousands of events across hundreds of organizations. A single event — a piece of employee feedback, a call transcript, a compliance filing — wasn’t interesting on its own. The value was in the patterns: what’s this organization struggling with? What does this individual need? What needs to be escalated right now?

The system needed to:

  • Ingest events from multiple sources continuously
  • Group them intelligently before running expensive LLM analysis
  • Run a sequential multi-stage generation pipeline (insight → recommendation → solution → notice)
  • Guarantee quality on every AI-generated output
  • Deliver targeted content to different audiences from the same batch of events

The last point turned out to be the hardest. “Quality guarantee” is easy to say and expensive to implement. This is what the LLM-as-judge pattern solved.


Intelligent Batching

Running an LLM generation pipeline on every incoming event would be wasteful and expensive. The system batched events before processing — but batching strategy had to account for urgency.

Dual-Level Batches

Every event was simultaneously added to two batch pools: a company-level batch (organizational patterns) and a user-level batch (individual patterns). Priority cascaded downward — a HIGH priority event counted toward MEDIUM and LOW thresholds too, so urgent signals weren’t buried behind slower accumulation requirements.

| Priority | Company Threshold | User Threshold | Pipeline Route | Use Cases |
|---|---|---|---|---|
| HIGH | 1 event | 2 events | Main pipeline | Critical compliance, urgent issues, call transcripts needing immediate analysis |
| MEDIUM | 100 events | 100 events | Main pipeline | Regular feedback, routine surveys, performance reviews |
| LOW | 2,000 events | 2,000 events | Drift analysis | Historical imports, background analytics, long-term trend detection |

A HIGH event triggered a company-level batch immediately; LOW events accumulated into the thousands before a batch fired. Same infrastructure, different urgency response.
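One way the cascading threshold logic could be implemented, as an in-memory sketch (the class name and the reset-on-fire behavior are assumptions; the real system presumably tracked counts in persistent storage):

```python
from collections import defaultdict

# Company-level thresholds from the table above. Priority cascades downward:
# a HIGH event also counts toward the MEDIUM and LOW accumulation pools.
COMPANY_THRESHOLDS = {"HIGH": 1, "MEDIUM": 100, "LOW": 2000}
CASCADE = {
    "HIGH": ["HIGH", "MEDIUM", "LOW"],
    "MEDIUM": ["MEDIUM", "LOW"],
    "LOW": ["LOW"],
}

class CompanyBatcher:
    """Accumulates events per organization and reports which batches fired."""

    def __init__(self):
        # org_id -> priority level -> accumulated event count
        self.counts = defaultdict(lambda: defaultdict(int))

    def add_event(self, org_id, priority):
        fired = []
        for level in CASCADE[priority]:
            self.counts[org_id][level] += 1
            if self.counts[org_id][level] >= COMPANY_THRESHOLDS[level]:
                fired.append(level)
                self.counts[org_id][level] = 0  # reset the pool that fired
        return fired  # priority levels whose batch should now be dispatched

batcher = CompanyBatcher()
assert batcher.add_event("org-1", "HIGH") == ["HIGH"]  # HIGH fires at the first event
```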

Cost Controls

To prevent runaway LLM costs during event spikes, the system enforced a hard cap: a maximum of 100 solutions generated per 24-hour window per organization. Batches that would exceed the cap were queued as PENDING until the window reset. This was a simple throttle, but it made cost behavior predictable at scale.
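A minimal sketch of such a throttle (class and method names are illustrative, and the in-memory bookkeeping stands in for whatever persistent store the real system used):

```python
import time

SOLUTION_CAP = 100            # max solutions per organization per rolling window
WINDOW_SECONDS = 24 * 3600    # 24-hour window

class CostThrottle:
    """Tracks solution generation per org; requests over the cap wait as PENDING."""

    def __init__(self, now=time.time):
        self.now = now        # injectable clock, which makes the throttle testable
        self.events = {}      # org_id -> list of generation timestamps

    def try_acquire(self, org_id, count=1):
        cutoff = self.now() - WINDOW_SECONDS
        # Drop timestamps that have aged out of the rolling window.
        recent = [t for t in self.events.get(org_id, []) if t > cutoff]
        if len(recent) + count > SOLUTION_CAP:
            self.events[org_id] = recent
            return "PENDING"  # batch is re-queued until the window frees capacity
        self.events[org_id] = recent + [self.now()] * count
        return "ALLOWED"
```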


Pipeline Stages

Once a batch was ready, it flowed through up to ten sequential stages:

  1. Data Lake — persistent event storage
  2. Vectorization — embeddings for semantic search and clustering
  3. Batch Collection — smart grouping by priority, organization, and individual
  4. Gap Detection — identify missing context, pause and collect if needed
  5. Insight Generation — analyze batched events, produce structured insights
  6. Recommendation — actionable next steps derived from insights
  7. Solution Generation — full documents: policies, corrective action plans, job descriptions
  8. Notice Creation — audience-targeted communications
  9. Editorial Review — AI-driven content refinement and quality pass
  10. Broadcasting — delivery to stakeholders

Every stage from Insight Generation onward was an LLM call. And every one of those LLM calls passed through a quality gate before the pipeline advanced.


LLM-as-Judge Quality Gates

This was the core quality control mechanism. After each generation stage, a second LLM call evaluated the output against five criteria:

| Criterion | What It Checks |
|---|---|
| Relevance | Is this content relevant and actionable for the recipient? |
| Brand Alignment | Does this match the expected voice and tone? |
| No Internals Exposed | Does it avoid leaking internal system details? |
| No Meta-Requests | Does it avoid asking users to improve the platform itself? |
| Clarity | Is the content clear, specific, and usable? |

Each criterion was scored 0–1. The threshold was 0.7 across all criteria. Below that, the stage retried with backoff. If the retry budget was exhausted, the batch failed with a structured error — no degraded output silently reaching users.

The pattern in pseudocode:

for each generation stage:
  output = generate(batch, prompt)
  scores = evaluate(output, criteria)
  if all(scores >= 0.7):
    advance to next stage
  else if retries_remaining:
    retry with backoff
  else:
    fail batch, run AI debugger

The evaluation call used a separate model — typically a faster, cheaper option than the generation model. The cost overhead was real but predictable, and it bought a meaningful quality floor: every piece of AI-generated content that reached a user had passed automated review.
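The pseudocode above can be fleshed out into a runnable sketch. Here `generate` and `evaluate` are stand-ins for the two LLM calls, and the criterion names, retry budget, and exception shape are assumptions for illustration:

```python
import time

CRITERIA = ["relevance", "brand_alignment", "no_internals", "no_meta_requests", "clarity"]
THRESHOLD = 0.7
MAX_RETRIES = 3

class QualityGateError(Exception):
    """Raised when a stage exhausts its retry budget; carries the failing scores."""
    def __init__(self, stage, scores):
        super().__init__(f"{stage} failed quality gate: {scores}")
        self.scores = scores

def run_stage(stage, generate, evaluate, batch, backoff=0.0):
    """Generate, judge against every criterion, and retry with backoff.
    `generate` and `evaluate` would be separate model calls in practice."""
    scores = {}
    for attempt in range(MAX_RETRIES + 1):
        output = generate(batch)
        scores = evaluate(output)            # {criterion: score in 0.0-1.0}
        if all(scores.get(c, 0.0) >= THRESHOLD for c in CRITERIA):
            return output                    # gate passed; the pipeline advances
        time.sleep(backoff * 2 ** attempt)   # exponential backoff before retrying
    raise QualityGateError(stage, scores)    # structured failure; debugger runs next
```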

AI Debugger

When any stage failed — whether from an LLM eval failure, an exception, or a timeout — an AI debugger ran automatically. It extracted the relevant source from the stack trace, sent it with context to a fast model at low temperature, and produced structured output:

  • summary — 1-2 sentence explanation of what went wrong
  • rootCause — detailed analysis
  • suggestedFix — step-by-step for a developer
  • claudePrompt — a copy-paste prompt ready for debugging in Claude Code

This transformed failure triage from “grep CloudWatch for the batch ID” into “read the structured analysis that’s already been done for you.”
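A sketch of how that debugger invocation might be assembled (the field names come from the list above; the prompt wording and `call_model` stand-in are assumptions):

```python
import traceback

def build_debug_prompt(exc, batch_id, stage):
    """Assemble context for the debugger model: stack trace plus pipeline state.
    The model is asked, at low temperature, for the four structured fields."""
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return (
        f"Pipeline stage '{stage}' failed for batch {batch_id}.\n"
        f"Stack trace:\n{trace}\n"
        "Respond with JSON containing: summary (1-2 sentences), rootCause, "
        "suggestedFix (step-by-step), claudePrompt (copy-paste ready)."
    )

def run_ai_debugger(exc, batch_id, stage, call_model):
    """`call_model` is a stand-in for the fast, low-temperature LLM call."""
    report = call_model(build_debug_prompt(exc, batch_id, stage))
    # Enforce the structured shape before persisting the failure report.
    assert {"summary", "rootCause", "suggestedFix", "claudePrompt"} <= report.keys()
    return report
```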


Data Flywheel

The pipeline wasn’t a one-shot processor. It was designed to compound.

Gap Detection and Wizards

The Gap Detection stage analyzed incoming events and asked: is there missing information that would make downstream generation significantly better? If yes, the batch paused, and the system generated a dynamic wizard surfaced to the user — specific questions targeting exactly what was missing.

The wizard responses weren’t discarded. They re-entered the system as new events. More events → richer future batches → better gap detection → more targeted wizard questions. Each cycle improved the quality of the next one.
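A minimal sketch of the pause-and-ask decision (the batch fields, status value, and question format are invented for illustration):

```python
def detect_gaps(events, required_fields):
    """Return required context fields missing from every event in the batch."""
    present = {key for event in events for key in event}
    return sorted(set(required_fields) - present)

def run_gap_detection(batch, required_fields):
    """Pause the batch and build a targeted wizard when context is missing.
    Wizard answers re-enter the pipeline as ordinary new events."""
    gaps = detect_gaps(batch["events"], required_fields)
    if not gaps:
        return None  # nothing missing; the pipeline proceeds to generation
    batch["status"] = "PAUSED_FOR_WIZARD"
    return {"questions": [f"Please provide: {field}" for field in gaps]}
```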

The Flywheel Loop

The mechanism worked like this:

  1. Events enter the system (feedback, transcripts, wizard responses)
  2. Batching and AI pipeline produce insights, solutions, and notices
  3. Users act on recommendations and provide feedback
  4. Gap detection identifies what context is still missing
  5. Dynamic wizards collect that targeted data
  6. Wizard responses re-enter as new events — go to step 1

The practical result: onboarding a new organization immediately seeded the flywheel. A structured intake wizard at signup was classified as HIGH priority, triggering the pipeline immediately. The organization got their first outputs within minutes of onboarding, not after waiting for events to accumulate. That initial engagement generated more events, which improved future pipeline runs.


Why v1 (SNS/SQS) Broke Down

The first implementation was fully event-driven: each pipeline stage was a Lambda function consuming from its own SQS queue, publishing completion events to SNS topics that fanned out to the next queue.

What worked well:

  • Each stage was independently deployable
  • SQS gave natural backpressure during spikes
  • Simple mental model: message in, message out

Where it failed at scale:

During HIGH event spikes, all SQS consumers competed for Lambda concurrency. Work for different batches was interleaved across shared worker pools. A single batch’s end-to-end time became unpredictable — sometimes minutes, sometimes hours.

More fundamentally: there was no unit of work. A “batch” was a concept spread across 12 separate queue/handler pairs. You couldn’t look at a batch and know “this took 47 seconds” or “it failed at the Solution stage.” Debugging meant reconstructing a timeline from CloudWatch logs across a dozen log groups.

Other pain points:

  • Notice generation was sequential through a queue — one notice at a time, a hard throughput ceiling for large batches
  • Each handler implemented its own retry and backoff logic, with no consistent view of retry state across the pipeline
  • No way to enforce that batch A completes before batch B starts — ordering was emergent, not guaranteed

The core issue: SNS/SQS gives you decoupled stages but no process isolation per unit of work.


Step Functions Migration

Once a batch was ready, the remaining work had a clear shape: a defined sequence of stages, mostly sequential, with one parallelizable fan-out at notice generation. This is exactly what a state machine models.

The Hybrid Boundary

Not everything moved. The ingestion path — receiving events, generating embeddings, collecting them into batches — stayed on SNS/SQS. That work is per-event: high-throughput, stateless, fire-and-forget. Message queues are the right tool.

Everything downstream of batch collection moved to Step Functions. That work is per-batch: stateful, sequential, needs to be observable as a single unit.

The boundary was intentional. Ingestion handled thousands of events per second with loose per-event latency requirements. Batch processing needed isolation, observability, and coordination. Different requirements, different tools.

What Changed

| Concern | v1 (SNS/SQS) | v2 (Step Functions) |
|---|---|---|
| Process isolation | None — all batches share Lambda pools | One execution per batch |
| Observability | Trace across 12 queues in CloudWatch | Single execution graph, visual in console |
| Duration tracking | Impossible to measure per batch | Built-in start/end per batch and per stage |
| Error location | Grep logs across handlers | Failed state visible at a glance |
| Retries | Custom code per handler | Declarative retry policies on each state |
| Notice generation | Sequential queue, one at a time | Map state fan-out, parallel per batch |
| Throughput under load | Batches time-slice and slow each other | Isolated — batch A doesn’t affect batch B |

The retry pattern in particular got much cleaner. In v1, each handler had its own retry logic. In v2, the state machine definition owned the retry policy — max attempts, backoff intervals, which error types were retryable. The Lambda handlers became pure functions again.
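As an illustration, a Task state's declarative retry policy in the Amazon States Language might look like the following, expressed here as a Python dict mirroring the ASL JSON. The state names, Lambda ARN, and the custom QualityGateFailed error are hypothetical, not the original definitions:

```python
# A Task state with declarative Retry/Catch policies (Amazon States Language
# shape). The state machine, not the handler, owns retry behavior.
insight_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:generate-insight",
    "Retry": [
        {
            # Retry throttling and timeouts with exponential backoff.
            "ErrorEquals": ["States.Timeout", "Lambda.TooManyRequestsException"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        },
        {
            # A custom error raised when the LLM judge scores below threshold.
            "ErrorEquals": ["QualityGateFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 2,
            "BackoffRate": 1.5,
        },
    ],
    "Catch": [
        # Anything unrecoverable routes to the AI debugger state.
        {"ErrorEquals": ["States.ALL"], "Next": "RunAiDebugger"}
    ],
    "Next": "RecommendationStage",
}
```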

Notice Generation: From Bottleneck to Fan-Out

In v1, each notice was generated one at a time through a queue. A batch producing 20 notices waited for 20 sequential Lambda invocations to complete.

In v2, a Step Functions Map state parallelized this. The Solution stage produced an array of notice targets; the Map state spun up a parallel execution for each one — each with its own generate/evaluate retry loop. MaxConcurrency was configurable per track.

For large batches, this was the single biggest throughput win.
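A sketch of what such a Map state might look like, again as a Python dict mirroring the ASL JSON. The `noticeTargets` path, state names, and ARN are assumptions, not the original definition:

```python
# A Map state fans out notice generation: one parallel branch per item in the
# array the Solution stage produced, bounded by MaxConcurrency.
notice_fanout = {
    "Type": "Map",
    "ItemsPath": "$.noticeTargets",   # array of notice targets from Solution
    "MaxConcurrency": 5,              # tunable per track
    "Iterator": {                     # each branch runs its own generate/judge loop
        "StartAt": "GenerateNotice",
        "States": {
            "GenerateNotice": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:generate-notice",
                "Retry": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "IntervalSeconds": 3,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "End": True,
            }
        },
    },
    "Next": "EditorialReview",
}
```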


Multi-Track Architecture

v1 had a single pipeline producing company-wide content for the business owner. v2 introduced tracks — parallel state machines targeting different audiences from the same batch.

| Track | Audience | Scope |
|---|---|---|
| Owner | Business owner | Company-wide insights, full pipeline with gap detection, all solution types |
| Manager | Managers | Role-scoped recommendations, team-level notices |
| Employee | Individual employees | Personalized insights, delivered to specific users |

Each track was its own Step Function definition. Tracks ran independently — a failure in the Manager track didn’t block the Owner track. A single batch could trigger one or more tracks depending on event types and priority.

LOW-priority batches bypassed the main tracks entirely and went to a separate drift analysis pipeline, which analyzed long-term emotional and behavioral trends. Significant detected drift auto-generated a HIGH-priority event that re-entered the main pipeline — the system escalated its own findings.
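The self-escalation step might look like this sketch (the threshold value and event shape are assumptions):

```python
DRIFT_THRESHOLD = 0.8  # illustrative; the real cutoff is not specified

def escalate_drift(org_id, drift_score, enqueue_event):
    """When drift analysis detects a significant long-term trend, feed a
    HIGH-priority event back into the main pipeline so it fires immediately."""
    if drift_score < DRIFT_THRESHOLD:
        return False  # below threshold; stays in background analytics
    enqueue_event({
        "org_id": org_id,
        "type": "drift_alert",
        "priority": "HIGH",  # HIGH triggers a company-level batch at once
        "detail": f"drift score {drift_score:.2f} exceeded threshold",
    })
    return True
```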


Performance Results

The migration produced a clear before-and-after:

Before (v1, SNS/SQS): During HIGH event spikes, batch processing times stretched to hours. No way to measure individual batch progress. Debugging required manually correlating log entries across a dozen log groups.

After (v2, Step Functions): Same spike produced more concurrent executions, not slower individual batches. End-to-end batch time: minutes. Every execution had exact timing, per-stage duration, and a visual error location when something failed. Debugging went from hours of log archaeology to “open the execution, see the red state, click it.”


Patterns Worth Stealing

If you’re building a multi-stage AI generation pipeline, these are the patterns worth taking:

LLM-as-judge at every stage. Don’t let bad output propagate. A second LLM call to score the first is cheap compared to a user receiving garbage. Score against explicit criteria, set a numeric threshold, retry on failure.

Separate ingestion from batch processing. High-throughput per-event work and sequential per-batch work have different requirements. Use the right infrastructure for each. SNS/SQS for the former, Step Functions (or any orchestrator) for the latter.

One execution per unit of work. Whatever your “unit” is, give it an isolated execution. This makes duration measurable, failures locatable, and load predictable under spikes.

Flywheel over one-shot. Design your pipeline outputs to generate new inputs. Every insight delivered creates an opportunity for user action. Every gap detected is a chance to collect better data. The system gets smarter with use instead of staying static.

Make errors useful. When a stage fails, automated analysis of why is worth building. A structured failure record with root cause and a suggested fix is faster to act on than a stack trace and a CloudWatch query.


If you’re designing something similar — or evaluating whether your current AI pipeline architecture will hold up at scale — let’s talk. I’ve built this pattern from scratch and I can help you avoid the expensive parts of getting there.


Originally adapted from content on nateross.dev.