Stop Operating Blind: Why AI Teams Need a ‘Diagnostician’ Mindset

A practical guide for AI and product teams to move beyond reactive fixes—learn why recurring model and retrieval tweaks don’t solve user complaints, and discover a four-step diagnostic approach to systematically identify, categorize, and address the real causes of error in agentic RAG systems.
Categories: AI, product-management, AI Evals
Published: November 27, 2025

You’re reviewing user feedback on your agentic RAG system. The complaints are piling up:

  • “The answers are inaccurate.”
  • “The information is outdated.”

As an AI or QE engineer, your “Builder” instinct kicks in. Maybe we need a better model. Maybe we need to add more documents. Maybe we need a multi-agent system. These are tangible actions, things we can control.

But here’s the problem: these user complaints are symptoms, not the disease.

“Inaccurate” could mean retrieval pulled the wrong document, or the model hallucinated, or the prompt was bad. “Outdated” could mean stale content, or the user is asking about a product that isn’t even in your system.

This is the “Builder’s Trap”: jumping to solutions before we understand the actual problem. This natural instinct often leads teams further away from fixing what’s really broken.


The Two Paths: Why We’re “Flying Blind”

This reactive approach stems from a deeper issue. Most AI teams are “flying blind” by focusing on the wrong signals. It’s easy to fall into the tool-first trap (debating vector databases) or rely on generic evals (like RAGAS) for a false sense of security.

These generic model evaluations are essential health checks, but a dashboard showing your hallucination score improved from 0.85 to 0.88 doesn’t tell you if you’ve solved the user’s problem.

This creates two very different paths for an engineering team:

| Aspect | Path 1: The Builder’s Trap (Operating Blind) | Path 2: The Diagnostician’s Path (Gaining Clarity) |
| --- | --- | --- |
| Symptom | Vague user complaint (“Inaccurate”) | Vague user complaint (“Inaccurate”) |
| Action | Jump to a tangible solution (“Swap Model!”) | Systematic 4-Step Error Analysis |
| Measurement | Rely on generic evals (“RAGAS score is 0.88”) | A specific, quantified diagnosis (“40% of failures are ‘Incorrect Product Focus’”) |
| Outcome | We’re still operating blind, and users are still unhappy. | A targeted fix that measurably improves the real user-facing problem. |

The Solution: Trade the Hard-Hat for a Lab Coat

The solution is to adopt a new identity. We must shift from being just “Builders” to also being “Diagnosticians.”

Instead of operating blind, we start by running analysis and tests, just as a physician would order labs before making a diagnosis. This diagnostic approach follows a four-step process borrowed from qualitative research.

Step 1: Annotate Failures (Open Coding)

Like a doctor reviewing symptoms, we start by examining individual user interactions. A domain expert reviews a trace and writes a free-form critique: What went wrong and why?

Example Annotation:

| Field | Value |
| --- | --- |
| Trace ID | #47 |
| User Query | “How do I configure SSO for Product A?” |
| System Response | Retrieved documentation for Product B deployment on Azure |
| Pass/Fail | ❌ Fail |
| Annotation (detailed enough for a new teammate to understand) | “User asked about Product A (SSO), but retrieval returned docs for Product B. The retrieved content is technically accurate but completely irrelevant to the user’s actual question.” |
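
If you want these annotations in a form you can count later, a minimal sketch in Python is enough. The FailureAnnotation dataclass, its field names, and the annotations.jsonl file below are illustrative choices, not a prescribed schema:

```python
from dataclasses import asdict, dataclass
from typing import Optional
import json

@dataclass
class FailureAnnotation:
    """One reviewed trace: the raw interaction plus the expert's free-form critique."""
    trace_id: str
    user_query: str
    system_response: str
    passed: bool                        # the expert's pass/fail judgment
    annotation: str                     # free-form critique, detailed enough for a new teammate
    failure_mode: Optional[str] = None  # filled in later, during Step 2

# The record behind the table above
record = FailureAnnotation(
    trace_id="#47",
    user_query="How do I configure SSO for Product A?",
    system_response="Retrieved documentation for Product B deployment on Azure",
    passed=False,
    annotation=(
        "User asked about Product A (SSO), but retrieval returned docs for Product B. "
        "The retrieved content is technically accurate but completely irrelevant "
        "to the user's actual question."
    ),
)

# Append annotations as JSON lines so they stay easy to diff, share, and count later
with open("annotations.jsonl", "a") as f:
    f.write(json.dumps(asdict(record)) + "\n")
```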

Step 2: Categorize into Failure Modes (Axial Coding)

Once you have dozens of annotations, patterns emerge. You group similar failures, transforming scattered observations into a clear taxonomy of failure modes. This is where you move from “users say it’s inaccurate” to discovering and naming the specific types of failures.

Example — From Annotations to Taxonomy:

After reviewing 70+ traces, you notice patterns in your annotations:

| Raw Annotations (Step 1) | Failure Mode Category (Step 2) |
| --- | --- |
| “Returned docs for Product B instead of Product A” | Incorrect Product Focus |
| “User asked about Ansible, got Keycloak docs” | Incorrect Product Focus |
| “Relevant doc exists but system said ‘No Sources Found’” | Retrieval Recall Failure |
| “Found right doc but only used intro section, missed solution in Chapter 3” | Incomplete Documentation Context |

You’ve now transformed vague “inaccurate” complaints into a named taxonomy you can measure and fix.
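
In code, this categorization step is nothing more than attaching a label to each annotated trace. A small sketch, assuming the annotations.jsonl file from the Step 1 sketch; the trace IDs and labels here are illustrative:

```python
import json

# Category decided by the human reviewer for each failed trace (illustrative values)
REVIEW_DECISIONS = {
    "#47": "Incorrect Product Focus",
    "#51": "Retrieval Recall Failure",
    "#63": "Incomplete Documentation Context",
}

with open("annotations.jsonl") as src, open("labeled_annotations.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        # Attach the failure-mode category, if the reviewer has labeled this trace yet
        record["failure_mode"] = REVIEW_DECISIONS.get(record["trace_id"])
        dst.write(json.dumps(record) + "\n")
```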

Step 3: Quantify and Validate (Selective Coding)

After identifying your failure modes, you go back and count them. How prevalent is each problem? This confirms your diagnosis and reveals which diseases are most critical.

Example — Quantifying Your Failure Modes:

After labeling all 70 traces against your taxonomy:

| Failure Mode | Count | % of Failures |
| --- | --- | --- |
| Missing Tool Call | 15 | 21% |
| Incorrect Product Focus | 12 | 17% |
| Retrieval Recall Failure | 10 | 14% |
| Context Misprocessing | 8 | 11% |
| Outdated Information | 6 | 9% |
| Other (5 categories) | 19 | 28% |

Now you know: “Missing Tool Call” and “Incorrect Product Focus” account for nearly 40% of all failures. These are your highest-leverage targets.
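
Once every trace carries a label, the counting itself is a few lines. A sketch, again assuming the labeled_annotations.jsonl file from the previous sketch:

```python
import json
from collections import Counter

counts = Counter()
with open("labeled_annotations.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if not record["passed"]:              # only failed traces feed the taxonomy counts
            counts[record["failure_mode"]] += 1

total = sum(counts.values()) or 1             # guard against an all-pass run
for mode, count in counts.most_common():
    print(f"{mode:35s} {count:3d}  {count / total:4.0%}")
```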

Step 4: Prioritize and Fix (Targeted Interventions)

With a confirmed, quantified diagnosis, you can prescribe the right treatment. Instead of generic solutions, you create targeted experiments.

Example — From Diagnosis to Action Plan:

| Failure Mode | % of Failures | Targeted Fix |
| --- | --- | --- |
| Missing Tool Call | 21% | Refine tool-triggering logic; add query patterns that should always invoke search |
| Incorrect Product Focus | 17% | Improve retrieval filtering; add product-name extraction to query preprocessing |
| Retrieval Recall Failure | 14% | Audit content coverage gaps; expand indexing to include missing documentation |

Each fix now directly addresses a quantified, user-facing problem — not a hunch.


Our “Aha!” Moment: The Disease We Couldn’t See

This process gives you surprising, critical insights. When our team ran through “Categorize into Failure Modes,” we discovered a failure mode we hadn’t even considered: “Incorrect Product Focus.”

Definition: The assistant retrieves a solution for a different product than the one the user asked about.

  • Query: “How do I configure SSO for Product A?”
  • Response: A setup guide for Product B’s authentication system.

The system wasn’t technically broken. It called the tool, retrieved a source, and generated a fluent answer. But the source was for the wrong product entirely.

A generic, automated eval would have missed this. It might have even scored the answer as “high quality” because it was factually grounded in the retrieved (but irrelevant) source.

This single, human-driven discovery proved the value of the entire process. It gave us a high-priority, user-facing problem that we could never have found by just looking at generic dashboards.


How to Do This: Overcoming the “It’s Too Manual” Objection

The most common objection is: “This sounds time-consuming and manual. How do we scale it?”

This is where we must be pragmatic. Don’t let the lack of perfect tooling prevent you from beginning the diagnostic process.

How to Start Today (With No Tooling)

Start with a spreadsheet. This is how we began. Each row was a trace, with columns for the user query, AI response, pass/fail judgment, and our free-form annotations (detailed enough for a new teammate to understand).

Was it elegant? No. Did it work? Absolutely. We reviewed 70+ traces, built our initial failure mode taxonomy, and had a prioritized backlog in a matter of days. Starting simple with a spreadsheet beats operating blind.
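
If your spreadsheet lives as a CSV, the entire “tool” fits in a few lines of Python. The file name and column names below are simply the ones described above, not a required schema:

```python
import csv

COLUMNS = ["trace_id", "user_query", "ai_response", "pass_fail", "annotation"]

# One row per trace; the annotation column holds the free-form critique from Step 1
with open("trace_review.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow({
        "trace_id": "#47",
        "user_query": "How do I configure SSO for Product A?",
        "ai_response": "Retrieved documentation for Product B deployment on Azure",
        "pass_fail": "fail",
        "annotation": "Asked about Product A (SSO); retrieval returned Product B docs.",
    })
```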

How We Evolved: Custom Tooling for Multi-Turn Conversations

As we scaled, two pain points emerged:

  1. Multi-turn conversations are hard to review in spreadsheets. Long back-and-forth dialogues require constant scrolling and context-switching.
  2. Annotation is time-consuming. Even with a clear taxonomy, writing detailed annotations for every trace takes effort.

We built a custom annotation interface that addressed both:

  • Turn-level review: Each turn in a multi-turn conversation can be annotated individually, making it easier to pinpoint exactly where the system failed.
  • LLM-suggested annotations: The tool prompts an LLM with our 11-mode failure taxonomy and the conversation trace. The LLM suggests an annotation as a starting point.

The Human + LLM Partnership (Not Replacement)

Here’s the critical part: the LLM suggests, the human decides.

The workflow looks like this:

  1. LLM generates a suggested annotation based on the trace and the failure mode taxonomy
  2. Human reviewer reads the suggestion — it might be 80% correct, or it might miss a nuance
  3. Human accepts, edits, or overwrites — adding domain-specific insight the LLM can’t provide

Why this works: The LLM reduces the “blank page” problem and speeds up annotation, but the human stays in control. You’re not blindly trusting the LLM’s judgment — you’re using it to reduce context-switching and cognitive load.
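
A minimal sketch of that suggest-then-review loop: call_llm is a placeholder for whatever LLM client your stack already uses, and the taxonomy list and prompt wording are illustrative:

```python
FAILURE_TAXONOMY = [
    "Missing Tool Call",
    "Incorrect Product Focus",
    "Retrieval Recall Failure",
    # ... the rest of your taxonomy
]

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your own LLM client (OpenAI, vLLM, an internal gateway, ...)."""
    raise NotImplementedError

def suggest_annotation(trace_text: str) -> str:
    """Ask the LLM for a draft critique grounded in the team's failure taxonomy."""
    prompt = (
        "You are helping review a conversation trace from an agentic RAG system.\n"
        f"Known failure modes: {', '.join(FAILURE_TAXONOMY)}\n\n"
        f"Trace:\n{trace_text}\n\n"
        "Draft a short annotation: what went wrong, why, and which failure mode fits."
    )
    return call_llm(prompt)

def review(trace_text: str) -> str:
    """The LLM suggests, the human decides: accept, edit, or overwrite."""
    draft = suggest_annotation(trace_text)
    print(f"LLM suggestion:\n{draft}\n")
    final = input("Press Enter to accept, or type your own annotation: ").strip()
    return final or draft
```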

This keeps the process scalable while preserving the quality and insight that only human domain experts can provide.


The Mindset That Fixes AI

You can keep cranking out features, but if you don’t shift your team’s focus from tools to measurement, you’ll be operating blind. The danger is real: you’ll prioritize new capabilities while unaddressed failure modes compound, making your product progressively worse until users abandon it.

The mindset shift is simple but profound: move from Builder mode to Diagnostician mode.

Start by understanding the failures your users are actually facing. Establish a regular cadence to turn those complaints into a failure mode taxonomy. This diagnostic process is the foundation for everything. Once you have it, you can build custom evaluations that matter, monitor them in production, and catch regressions before they become user complaints.

The best AI engineers aren’t the ones with the fanciest architectures. They’re the ones who know how to systematically diagnose what’s broken and prove they’ve fixed it.