
Building an AI Pipeline with LLM Quality Gates

AI · Architecture · AWS · Pipeline

I built a system that takes raw, messy business events — feedback submissions, call transcripts, compliance reports — and turns them into structured insights, recommendations, and policy documents. Automatically. With quality guarantees on every generated output.

This post covers the architecture: how intelligent batching worked, how LLM-as-judge quality gates prevented bad output from reaching users, how a data flywheel kept the system compounding over time, and why the original SNS/SQS design eventually forced a migration to AWS Step Functions.

For the full case study, see Predictive HR.


The Problem

The client’s platform ingested thousands of events across hundreds of organizations. A single event — a piece of employee feedback, a call transcript, a compliance filing — wasn’t interesting on its own. The value was in the patterns: what’s this organization struggling with? What does this individual need? What needs to be escalated right now?

The system needed to:

  • Ingest events from multiple sources continuously
  • Group them intelligently before running expensive LLM analysis
  • Run a sequential multi-stage generation pipeline (insight → recommendation → solution → notice)
  • Guarantee quality on every AI-generated output
  • Deliver targeted content to different audiences from the same batch of events

The last point turned out to be the hardest. “Quality guarantee” is easy to say and expensive to implement. This is what the LLM-as-judge pattern solved.


Intelligent Batching

Running an LLM generation pipeline on every incoming event would be wasteful and expensive. The system batched events before processing — but batching strategy had to account for urgency.

Dual-Level Batches

Every event was simultaneously added to two batch pools: a company-level batch (organizational patterns) and a user-level batch (individual patterns). Priority cascaded downward — a HIGH priority event counted toward MEDIUM and LOW thresholds too, so urgent signals weren’t buried behind slower accumulation requirements.

| Priority | Company Threshold | User Threshold | Pipeline Route | Use Cases |
|---|---|---|---|---|
| HIGH | 1 event | 2 events | Main pipeline | Critical compliance, urgent issues, call transcripts needing immediate analysis |
| MEDIUM | 100 events | 100 events | Main pipeline | Regular feedback, routine surveys, performance reviews |
| LOW | 2,000 events | 2,000 events | Drift analysis | Historical imports, background analytics, long-term trend detection |

A HIGH event triggered a company-level batch immediately; LOW events accumulated into the thousands before a batch fired. Same infrastructure, different urgency response.
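One way the cascading threshold logic could be implemented, as an in-memory sketch (the class name and the reset-on-fire behavior are assumptions; the real system presumably tracked counts in persistent storage):

```python
from collections import defaultdict

# Company-level thresholds from the table above. Priority cascades downward:
# a HIGH event also counts toward the MEDIUM and LOW accumulation pools.
COMPANY_THRESHOLDS = {"HIGH": 1, "MEDIUM": 100, "LOW": 2000}
CASCADE = {
    "HIGH": ["HIGH", "MEDIUM", "LOW"],
    "MEDIUM": ["MEDIUM", "LOW"],
    "LOW": ["LOW"],
}

class CompanyBatcher:
    """Accumulates events per organization and reports which batches fired."""

    def __init__(self):
        # org_id -> priority level -> accumulated event count
        self.counts = defaultdict(lambda: defaultdict(int))

    def add_event(self, org_id, priority):
        fired = []
        for level in CASCADE[priority]:
            self.counts[org_id][level] += 1
            if self.counts[org_id][level] >= COMPANY_THRESHOLDS[level]:
                fired.append(level)
                self.counts[org_id][level] = 0  # reset the pool that fired
        return fired  # priority levels whose batch should now be dispatched

batcher = CompanyBatcher()
assert batcher.add_event("org-1", "HIGH") == ["HIGH"]  # HIGH fires at the first event
```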

Cost Controls

To prevent runaway LLM costs during event spikes, the system enforced a hard cap: a maximum of 100 solutions generated per 24-hour window per organization. Batches that would exceed the cap were queued as PENDING until the window reset. This was a simple throttle, but it made cost behavior predictable at scale.
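A minimal sketch of such a throttle (class and method names are illustrative, and the in-memory bookkeeping stands in for whatever persistent store the real system used):

```python
import time

SOLUTION_CAP = 100            # max solutions per organization per rolling window
WINDOW_SECONDS = 24 * 3600    # 24-hour window

class CostThrottle:
    """Tracks solution generation per org; requests over the cap wait as PENDING."""

    def __init__(self, now=time.time):
        self.now = now        # injectable clock, which makes the throttle testable
        self.events = {}      # org_id -> list of generation timestamps

    def try_acquire(self, org_id, count=1):
        cutoff = self.now() - WINDOW_SECONDS
        # Drop timestamps that have aged out of the rolling window.
        recent = [t for t in self.events.get(org_id, []) if t > cutoff]
        if len(recent) + count > SOLUTION_CAP:
            self.events[org_id] = recent
            return "PENDING"  # batch is re-queued until the window frees capacity
        self.events[org_id] = recent + [self.now()] * count
        return "ALLOWED"
```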


Pipeline Stages

Once a batch was ready, it flowed through up to ten sequential stages:

  1. Data Lake — persistent event storage
  2. Vectorization — embeddings for semantic search and clustering
  3. Batch Collection — smart grouping by priority, organization, and individual
  4. Gap Detection — identify missing context, pause and collect if needed
  5. Insight Generation — analyze batched events, produce structured insights
  6. Recommendation — actionable next steps derived from insights
  7. Solution Generation — full documents: policies, corrective action plans, job descriptions
  8. Notice Creation — audience-targeted communications
  9. Editorial Review — AI-driven content refinement and quality pass
  10. Broadcasting — delivery to stakeholders

Every stage from Insight Generation onward was an LLM call. And every one of those LLM calls passed through a quality gate before the pipeline advanced.


LLM-as-Judge Quality Gates

This was the core quality control mechanism. After each generation stage, a second LLM call evaluated the output against five criteria:

| Criterion | What It Checks |
|---|---|
| Relevance | Is this content relevant and actionable for the recipient? |
| Brand Alignment | Does this match the expected voice and tone? |
| No Internals Exposed | Does it avoid leaking internal system details? |
| No Meta-Requests | Does it avoid asking users to improve the platform itself? |
| Clarity | Is the content clear, specific, and usable? |

Each criterion was scored 0–1. The threshold was 0.7 across all criteria. Below that, the stage retried with backoff. If the retry budget was exhausted, the batch failed with a structured error — no degraded output silently reaching users.

The pattern in pseudocode:

for each generation stage:
  output = generate(batch, prompt)
  scores = evaluate(output, criteria)
  if all(scores >= 0.7):
    advance to next stage
  else if retries_remaining:
    retry with backoff
  else:
    fail batch, run AI debugger

The evaluation call used a separate model — typically a faster, cheaper option than the generation model. The cost overhead was real but predictable, and it bought a meaningful quality floor: every piece of AI-generated content that reached a user had passed automated review.
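The pseudocode above can be fleshed out into a runnable sketch. Here `generate` and `evaluate` are stand-ins for the two LLM calls, and the criterion names, retry budget, and exception shape are assumptions for illustration:

```python
import time

CRITERIA = ["relevance", "brand_alignment", "no_internals", "no_meta_requests", "clarity"]
THRESHOLD = 0.7
MAX_RETRIES = 3

class QualityGateError(Exception):
    """Raised when a stage exhausts its retry budget; carries the failing scores."""
    def __init__(self, stage, scores):
        super().__init__(f"{stage} failed quality gate: {scores}")
        self.scores = scores

def run_stage(stage, generate, evaluate, batch, backoff=0.0):
    """Generate, judge against every criterion, and retry with backoff.
    `generate` and `evaluate` would be separate model calls in practice."""
    scores = {}
    for attempt in range(MAX_RETRIES + 1):
        output = generate(batch)
        scores = evaluate(output)            # {criterion: score in 0.0-1.0}
        if all(scores.get(c, 0.0) >= THRESHOLD for c in CRITERIA):
            return output                    # gate passed; the pipeline advances
        time.sleep(backoff * 2 ** attempt)   # exponential backoff before retrying
    raise QualityGateError(stage, scores)    # structured failure; debugger runs next
```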

AI Debugger

When any stage failed — whether from an LLM eval failure, an exception, or a timeout — an AI debugger ran automatically. It extracted the relevant source from the stack trace, sent it with context to a fast model at low temperature, and produced structured output:

  • summary — 1-2 sentence explanation of what went wrong
  • rootCause — detailed analysis
  • suggestedFix — step-by-step for a developer
  • claudePrompt — a copy-paste prompt ready for debugging in Claude Code

This transformed failure triage from “grep CloudWatch for the batch ID” into “read the structured analysis that’s already been done for you.”
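A sketch of how that debugger invocation might be assembled (the field names come from the list above; the prompt wording and `call_model` stand-in are assumptions):

```python
import traceback

def build_debug_prompt(exc, batch_id, stage):
    """Assemble context for the debugger model: stack trace plus pipeline state.
    The model is asked, at low temperature, for the four structured fields."""
    trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return (
        f"Pipeline stage '{stage}' failed for batch {batch_id}.\n"
        f"Stack trace:\n{trace}\n"
        "Respond with JSON containing: summary (1-2 sentences), rootCause, "
        "suggestedFix (step-by-step), claudePrompt (copy-paste ready)."
    )

def run_ai_debugger(exc, batch_id, stage, call_model):
    """`call_model` is a stand-in for the fast, low-temperature LLM call."""
    report = call_model(build_debug_prompt(exc, batch_id, stage))
    # Enforce the structured shape before persisting the failure report.
    assert {"summary", "rootCause", "suggestedFix", "claudePrompt"} <= report.keys()
    return report
```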


Data Flywheel

The pipeline wasn’t a one-shot processor. It was designed to compound.

Gap Detection and Wizards

The Gap Detection stage analyzed incoming events and asked: is there missing information that would make downstream generation significantly better? If yes, the batch paused, and the system generated a dynamic wizard surfaced to the user — specific questions targeting exactly what was missing.

The wizard responses weren’t discarded. They re-entered the system as new events. More events → richer future batches → better gap detection → more targeted wizard questions. Each cycle improved the quality of the next one.
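A minimal sketch of the pause-and-ask decision (the batch fields, status value, and question format are invented for illustration):

```python
def detect_gaps(events, required_fields):
    """Return required context fields missing from every event in the batch."""
    present = {key for event in events for key in event}
    return sorted(set(required_fields) - present)

def run_gap_detection(batch, required_fields):
    """Pause the batch and build a targeted wizard when context is missing.
    Wizard answers re-enter the pipeline as ordinary new events."""
    gaps = detect_gaps(batch["events"], required_fields)
    if not gaps:
        return None  # nothing missing; the pipeline proceeds to generation
    batch["status"] = "PAUSED_FOR_WIZARD"
    return {"questions": [f"Please provide: {field}" for field in gaps]}
```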

The Flywheel Loop

The mechanism worked like this:

  1. Events enter the system (feedback, transcripts, wizard responses)
  2. Batching and AI pipeline produce insights, solutions, and notices
  3. Users act on recommendations and provide feedback
  4. Gap detection identifies what context is still missing
  5. Dynamic wizards collect that targeted data
  6. Wizard responses re-enter as new events — go to step 1

The practical result: onboarding a new organization immediately seeded the flywheel. A structured intake wizard at signup was classified as HIGH priority, triggering the pipeline immediately. The organization got their first outputs within minutes of onboarding, not after waiting for events to accumulate. That initial engagement generated more events, which improved future pipeline runs.


Why v1 (SNS/SQS) Broke Down

The first implementation was fully event-driven: each pipeline stage was a Lambda function consuming from its own SQS queue, publishing completion events to SNS topics that fanned out to the next queue.

What worked well:

  • Each stage was independently deployable
  • SQS gave natural backpressure during spikes
  • Simple mental model: message in, message out

Where it failed at scale:

During HIGH event spikes, all SQS consumers competed for Lambda concurrency. Work for different batches was interleaved across shared worker pools. A single batch’s end-to-end time became unpredictable — sometimes minutes, sometimes hours.

More fundamentally: there was no unit of work. A “batch” was a concept spread across 12 separate queue/handler pairs. You couldn’t look at a batch and know “this took 47 seconds” or “it failed at the Solution stage.” Debugging meant reconstructing a timeline from CloudWatch logs across a dozen log groups.

Other pain points:

  • Notice generation was sequential through a queue — one notice at a time, a hard throughput ceiling for large batches
  • Each handler implemented its own retry and backoff logic, with no consistent view of retry state across the pipeline
  • No way to enforce that batch A completes before batch B starts — ordering was emergent, not guaranteed

The core issue: SNS/SQS gives you decoupled stages but no process isolation per unit of work.


Step Functions Migration

Once a batch was ready, the remaining work had a clear shape: a defined sequence of stages, mostly sequential, with one parallelizable fan-out at notice generation. This is exactly what a state machine models.

The Hybrid Boundary

Not everything moved. The ingestion path — receiving events, generating embeddings, collecting them into batches — stayed on SNS/SQS. That work is per-event: high-throughput, stateless, fire-and-forget. Message queues are the right tool.

Everything downstream of batch collection moved to Step Functions. That work is per-batch: stateful, sequential, needs to be observable as a single unit.

The boundary was intentional. Ingestion handled thousands of events per second with loose per-event latency requirements. Batch processing needed isolation, observability, and coordination. Different requirements, different tools.

What Changed

| Concern | v1 (SNS/SQS) | v2 (Step Functions) |
|---|---|---|
| Process isolation | None — all batches share Lambda pools | One execution per batch |
| Observability | Trace across 12 queues in CloudWatch | Single execution graph, visual in console |
| Duration tracking | Impossible to measure per batch | Built-in start/end per batch and per stage |
| Error location | Grep logs across handlers | Failed state visible at a glance |
| Retries | Custom code per handler | Declarative retry policies on each state |
| Notice generation | Sequential queue, one at a time | Map state fan-out, parallel per batch |
| Throughput under load | Batches time-slice and slow each other | Isolated — batch A doesn’t affect batch B |

The retry pattern in particular got much cleaner. In v1, each handler had its own retry logic. In v2, the state machine definition owned the retry policy — max attempts, backoff intervals, which error types were retryable. The Lambda handlers became pure functions again.
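As an illustration, a Task state's declarative retry policy in the Amazon States Language might look like the following, expressed here as a Python dict mirroring the ASL JSON. The state names, Lambda ARN, and the custom QualityGateFailed error are hypothetical, not the original definitions:

```python
# A Task state with declarative Retry/Catch policies (Amazon States Language
# shape). The state machine, not the handler, owns retry behavior.
insight_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:generate-insight",
    "Retry": [
        {
            # Retry throttling and timeouts with exponential backoff.
            "ErrorEquals": ["States.Timeout", "Lambda.TooManyRequestsException"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        },
        {
            # A custom error raised when the LLM judge scores below threshold.
            "ErrorEquals": ["QualityGateFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 2,
            "BackoffRate": 1.5,
        },
    ],
    "Catch": [
        # Anything unrecoverable routes to the AI debugger state.
        {"ErrorEquals": ["States.ALL"], "Next": "RunAiDebugger"}
    ],
    "Next": "RecommendationStage",
}
```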

Notice Generation: From Bottleneck to Fan-Out

In v1, each notice was generated one at a time through a queue. A batch producing 20 notices waited for 20 sequential Lambda invocations to complete.

In v2, a Step Functions Map state parallelized this. The Solution stage produced an array of notice targets; the Map state spun up a parallel execution for each one — each with its own generate/evaluate retry loop. MaxConcurrency was configurable per track.

For large batches, this was the single biggest throughput win.
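A sketch of what such a Map state might look like, again as a Python dict mirroring the ASL JSON. The `noticeTargets` path, state names, and ARN are assumptions, not the original definition:

```python
# A Map state fans out notice generation: one parallel branch per item in the
# array the Solution stage produced, bounded by MaxConcurrency.
notice_fanout = {
    "Type": "Map",
    "ItemsPath": "$.noticeTargets",   # array of notice targets from Solution
    "MaxConcurrency": 5,              # tunable per track
    "Iterator": {                     # each branch runs its own generate/judge loop
        "StartAt": "GenerateNotice",
        "States": {
            "GenerateNotice": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:generate-notice",
                "Retry": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "IntervalSeconds": 3,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "End": True,
            }
        },
    },
    "Next": "EditorialReview",
}
```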


Multi-Track Architecture

v1 had a single pipeline producing company-wide content for the business owner. v2 introduced tracks — parallel state machines targeting different audiences from the same batch.

| Track | Audience | Scope |
|---|---|---|
| Owner | Business owner | Company-wide insights, full pipeline with gap detection, all solution types |
| Manager | Managers | Role-scoped recommendations, team-level notices |
| Employee | Individual employees | Personalized insights, delivered to specific users |

Each track was its own Step Function definition. Tracks ran independently — a failure in the Manager track didn’t block the Owner track. A single batch could trigger one or more tracks depending on event types and priority.

LOW-priority batches bypassed the main tracks entirely and went to a separate drift analysis pipeline, which analyzed long-term emotional and behavioral trends. Significant detected drift auto-generated a HIGH-priority event that re-entered the main pipeline — the system escalated its own findings.
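The self-escalation step might look like this sketch (the threshold value and event shape are assumptions):

```python
DRIFT_THRESHOLD = 0.8  # illustrative; the real cutoff is not specified

def escalate_drift(org_id, drift_score, enqueue_event):
    """When drift analysis detects a significant long-term trend, feed a
    HIGH-priority event back into the main pipeline so it fires immediately."""
    if drift_score < DRIFT_THRESHOLD:
        return False  # below threshold; stays in background analytics
    enqueue_event({
        "org_id": org_id,
        "type": "drift_alert",
        "priority": "HIGH",  # HIGH triggers a company-level batch at once
        "detail": f"drift score {drift_score:.2f} exceeded threshold",
    })
    return True
```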


Performance Results

The migration produced a clear before-and-after:

Before (v1, SNS/SQS): During HIGH event spikes, batch processing times stretched to hours. No way to measure individual batch progress. Debugging required manually correlating log entries across a dozen log groups.

After (v2, Step Functions): Same spike produced more concurrent executions, not slower individual batches. End-to-end batch time: minutes. Every execution had exact timing, per-stage duration, and a visual error location when something failed. Debugging went from hours of log archaeology to “open the execution, see the red state, click it.”


Patterns Worth Stealing

If you’re building a multi-stage AI generation pipeline, these are the patterns worth taking:

LLM-as-judge at every stage. Don’t let bad output propagate. A second LLM call to score the first is cheap compared to a user receiving garbage. Score against explicit criteria, set a numeric threshold, retry on failure.

Separate ingestion from batch processing. High-throughput per-event work and sequential per-batch work have different requirements. Use the right infrastructure for each. SNS/SQS for the former, Step Functions (or any orchestrator) for the latter.

One execution per unit of work. Whatever your “unit” is, give it an isolated execution. This makes duration measurable, failures locatable, and load predictable under spikes.

Flywheel over one-shot. Design your pipeline outputs to generate new inputs. Every insight delivered creates an opportunity for user action. Every gap detected is a chance to collect better data. The system gets smarter with use instead of staying static.

Make errors useful. When a stage fails, automated analysis of why is worth building. A structured failure record with root cause and a suggested fix is faster to act on than a stack trace and a CloudWatch query.


If you’re designing something similar — or evaluating whether your current AI pipeline architecture will hold up at scale — let’s talk. I’ve built this pattern from scratch and I can help you avoid the expensive parts of getting there.


Originally adapted from content on nateross.dev.