
AI Hallucinations in Clinical Documentation: A 2026 QA Workflow Clinics Can Actually Follow

A lightweight QA checklist for catching AI scribe hallucinations: 3 things to verify every time, red-flag categories, and a practical sampling and audit cadence for clinics.

Published on February 1, 2026 · 19 min read

Written by

Dya Clinical Team

Clinical Documentation Experts

You sign an AI-generated note. It looks right. The format is clean, the sections are complete, and the language is professional. Two weeks later, a colleague references that note during a follow-up—and discovers it documents a physical exam finding that never happened.

This isn't a theoretical risk. A 2025 study in npj Digital Medicine analysing 12,999 clinician-annotated sentences across 450 AI-generated clinical notes found a 1.47% hallucination rate—and 44% of those hallucinated sentences were classified as "major," meaning they could directly impact diagnosis or treatment if left uncorrected. The same study observed a 3.45% omission rate, with omissions being far more frequent though individually less dangerous.

One-point-four-seven percent sounds small. Multiply it across every note, every day, every clinician in your practice, and the numbers stop looking small. A solo practitioner seeing 25 patients per day generates roughly 250 documentable sentences (about ten per note). Statistically, three to four of those sentences will contain hallucinated content every single day.
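The back-of-envelope arithmetic above can be reproduced in a few lines. The rates come from the cited study; the patients-per-day and sentences-per-note figures are illustrative assumptions, not study data.

```python
# Estimate of daily hallucinated sentences for one clinician.
# Rates are from the cited npj Digital Medicine study; the volume
# figures below are illustrative assumptions.
HALLUCINATION_RATE = 0.0147   # 1.47% of sentences
MAJOR_FRACTION = 0.44         # 44% of hallucinations classed as "major"

patients_per_day = 25         # assumed solo-practice volume
sentences_per_note = 10       # assumed average note length

sentences = patients_per_day * sentences_per_note     # 250 sentences/day
hallucinated = sentences * HALLUCINATION_RATE         # ~3.7 per day
major = hallucinated * MAJOR_FRACTION                 # ~1.6 per day

print(f"{hallucinated:.1f} hallucinated sentences/day, "
      f"of which ~{major:.1f} are major")
```

Swap in your own volumes to see what your practice's daily exposure looks like.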

The problem isn't that AI scribes are unreliable—they save real time and reduce burnout. The problem is that most clinics have no systematic process for catching the errors these tools introduce. This guide provides one.

What "Hallucination" Actually Means in Clinical Documentation

In AI research, "hallucination" refers to generated content that appears plausible but has no basis in the source material. In clinical documentation, this translates to specific failure modes that differ from traditional transcription errors.

The Four Error Categories

Research from multiple studies converges on four distinct categories of AI scribe errors:

1. Fabrications

The AI invents content that was never discussed or observed. This is the most dangerous category. Documented examples include:

  • Physical exam findings that were never performed (the AI "fills in" expected findings based on the chief complaint)
  • Medications the patient never mentioned—in one reported case, an AI scribe replaced "Aveli for cellulite" with "Qwo for cellulite," a product no longer on the market, because Qwo was more common in its training data
  • Diagnoses inferred from context rather than stated by the clinician
  • Lab values or imaging results that were not discussed

2. Omissions

Critical information discussed during the encounter is absent from the note. While individually less dangerous than fabrications, omissions erode the note's clinical utility over time:

  • Patient-reported symptoms mentioned in conversation but missing from the HPI
  • Medication changes discussed but not reflected in the plan
  • Social history details relevant to treatment that the AI deemed non-essential
  • Contraindications or allergies mentioned verbally but not documented

3. Misinterpretations

The AI captures something that was said but assigns it the wrong clinical meaning:

  • A patient reports discontinuing a medication, and the note records a new prescription
  • A differential diagnosis discussed as unlikely gets documented as a confirmed finding
  • Dosage changes are captured with incorrect values
  • Temporal relationships are inverted ("improving" becomes "worsening" or vice versa)

4. Misattribution

The system confuses who said what. This matters because clinical reasoning depends on whether a statement is a patient report, a clinician assessment, or a referenced finding:

  • Patient concerns documented as clinician assessments
  • Clinician-initiated counselling recorded as patient-initiated complaints
  • Family history attributed to the patient's own history
  • Third-party information (from a referring provider or family member) attributed to the wrong source

Where Hallucinations Cluster

Not all sections of a clinical note carry equal risk. Research consistently identifies certain note sections as more prone to AI hallucination:

| Note Section | Hallucination Risk | Why |
| --- | --- | --- |
| Plan | Highest (21% of major hallucinations) | Requires clinical reasoning the AI can only approximate |
| Physical Exam | Very high | AI tends to "fill in" expected findings based on chief complaint |
| Assessment | High (10.5% of major hallucinations) | Synthesising information requires judgment, not just transcription |
| Symptoms / HPI | Moderate (5.2%) | AI may infer symptoms from context rather than from what was stated |
| Medications | Moderate–high (18.5% of safety feedback) | Drug names, dosages, and instructions are frequently garbled |
| Subjective / History | Lower but present | Generally more faithful to spoken content |

Understanding this distribution is the foundation of an efficient QA process. You don't need to verify every sentence with equal scrutiny—you need to know where to look hardest.
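One way to operationalise that distribution is to walk the note in risk order rather than reading top to bottom. This sketch is purely illustrative: the section names follow the table above, but the numeric weights and the function itself are assumptions, not a vendor feature.

```python
# Order note sections for review, highest hallucination risk first.
# Weights loosely follow the risk table above; exact values are illustrative.
SECTION_RISK = {
    "Plan": 5,                  # highest: requires clinical reasoning
    "Physical Exam": 4,         # AI "fills in" expected findings
    "Medications": 4,           # drug names and doses frequently garbled
    "Assessment": 3,
    "Symptoms / HPI": 2,
    "Subjective / History": 1,
}

def review_order(sections=SECTION_RISK):
    """Return section names sorted from highest to lowest risk."""
    return sorted(sections, key=sections.get, reverse=True)

print(review_order())
```

Reading the Plan first, while your memory of the encounter is freshest, concentrates attention where fabrications are most likely.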

The 3-Item Verification Checklist: What to Check Every Time

Before signing any AI-generated note, run through these three verification steps. They're ordered by clinical severity and designed to catch the categories of error that matter most.

Check 1: Did I Actually Do, Say, or Order This?

Target: Fabrications in the Physical Exam, Assessment, and Plan sections.

Read the physical exam section and ask one question: did I actually perform and document each of these findings? AI scribes are particularly prone to generating "template" exam findings that match the chief complaint but were never actually assessed. If the patient came in for knee pain, the AI may generate a full musculoskeletal exam even if you only palpated the affected joint.

Then check the Plan. Every order, referral, prescription, and follow-up instruction should match what you actually discussed. Watch specifically for:

  • Medications you didn't prescribe — the AI may infer a prescription from a discussion about medication options
  • Diagnoses you didn't confirm — differential diagnoses discussed as possibilities may appear as confirmed assessments
  • Follow-up timelines you didn't set — the AI may insert "standard" follow-up intervals based on the diagnosis

Red-flag signal: Any finding, order, or diagnosis that you don't specifically remember discussing or performing.

Check 2: Is Anything Clinically Backward?

Target: Misinterpretations, especially in Medications, Symptoms, and the temporal narrative.

This check catches errors where the AI captured the right topic but got the direction wrong. Scan for:

  • Medication direction: Was a medication started, stopped, increased, or decreased? Verify each change matches what was discussed. The most dangerous misinterpretation is documenting a discontinuation as a continuation (or vice versa).
  • Symptom trajectory: Does the note reflect whether symptoms are improving, stable, or worsening? AI can flip these, especially when the conversation includes both historical and current status.
  • Negations: "Patient denies chest pain" vs. "Patient reports chest pain" — a single missed negation reverses the clinical picture. Negation errors account for roughly 30% of hallucinated sentences.
  • Laterality and anatomy: Left vs. right, upper vs. lower, proximal vs. distal. These errors are easy to make and hard to catch on a quick skim.
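Because negation and laterality flips are exactly the errors a quick skim misses, some clinics pre-highlight sentences containing those terms before the sign-off read. The sketch below is a minimal illustration of that idea, not a clinical NLP tool; the keyword list is an assumption and no substitute for your own recall of the encounter.

```python
import re

# Crude pre-signing scan: surface sentences containing negation or
# laterality terms so the clinician re-reads them deliberately.
FLAG_TERMS = re.compile(
    r"\b(denies|denied|no |not |without|negative|left|right|bilateral)\b",
    re.IGNORECASE,
)

def sentences_to_recheck(note: str) -> list[str]:
    """Split a draft note into sentences and keep those with flag terms."""
    sentences = re.split(r"(?<=[.!?])\s+", note)
    return [s for s in sentences if FLAG_TERMS.search(s)]

note = ("Patient denies chest pain. Reports left knee swelling. "
        "Lungs clear to auscultation.")
for s in sentences_to_recheck(note):
    print("RECHECK:", s)
```

The point is not automation but attention: a flagged sentence gets read word by word instead of pattern-matched by eye.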

Red-flag signal: Any medication change, symptom description, or finding that feels directionally "off" from what you recall.

Check 3: Is Anything Important Missing?

Target: Omissions across all sections.

This is the hardest check because you're looking for what isn't there. Focus on:

  • The chief complaint and any secondary concerns the patient raised — did all of them make it into the note?
  • Medication changes — were all discussed adjustments captured, including the rationale?
  • Patient-reported allergies, contraindications, or adverse reactions discussed during the visit
  • Counselling and shared decision-making — if you discussed risks, alternatives, or obtained verbal consent for a procedure, is it documented?
  • Social determinants mentioned by the patient that affect the care plan (housing instability, transportation barriers, caregiver status)

Red-flag signal: A conversation topic you distinctly remember that doesn't appear anywhere in the note.

Putting It Into Practice

This checklist should take 60–90 seconds per note once it becomes habit. For context, research shows clinicians spend 5–10 minutes reviewing and editing AI-generated notes versus 30–45 minutes writing from scratch. Adding structured verification does not eliminate the time savings—it protects them.

A practical approach:

  1. Read the Plan section first (highest hallucination risk)
  2. Scan the Physical Exam for any finding you didn't perform
  3. Check each medication entry for correct drug, dose, direction, and instructions
  4. Verify symptom trajectory and negations in the HPI
  5. Mentally replay the encounter and look for missing topics

Red-Flag Categories: Patterns That Demand Extra Scrutiny

Beyond the per-note checklist, certain encounter types and clinical scenarios carry elevated hallucination risk. When you recognise one of these patterns, slow down.

1. Encounters With Multiple Medication Changes

The more medications discussed, the more opportunities for the AI to confuse names, doses, or directions. Polypharmacy discussions and medication reconciliation visits deserve line-by-line verification of every drug mentioned.

Why it's risky: AI models can substitute one drug name for another if the discussed drug is uncommon in its training data. The Aveli/Qwo substitution mentioned above is one example, but the same pattern applies to generic/brand name confusion, similar-sounding drugs, and off-label uses the model hasn't encountered frequently.

2. Complex Differential Diagnoses

When you discuss multiple possible diagnoses and then narrow to one, the AI may document one of the ruled-out conditions as confirmed. This is especially dangerous for conditions with significantly different treatment pathways.

Why it's risky: The Assessment and Plan sections require clinical reasoning that LLMs approximate through pattern matching. The model can't distinguish "we discussed X as a possibility" from "the diagnosis is X" with the same reliability that it transcribes factual statements.

3. Conversations With Significant Non-Verbal Context

If a key clinical decision was informed by something you observed (gait abnormality, affect, skin appearance, wound characteristics) rather than something spoken aloud, the AI has no source material to work from. It may either omit the finding entirely or—worse—fabricate a finding based on what it expects given the diagnosis.

Why it's risky: AI scribes are fundamentally limited to audio input. Research confirms they cannot capture nonverbal communication, visual signs of distress, or physical findings observed but not verbalised.

4. Encounters Involving Sensitive Topics

Discussions about mental health, substance use, domestic violence, or sexual health require precise language. The AI may generalise, euphemise, or misattribute statements in ways that misrepresent what the patient disclosed.

Why it's risky: These topics often involve nuanced conversational dynamics—pauses, indirect disclosures, careful phrasing by the clinician—that are difficult for AI to interpret correctly.

5. Multi-Speaker Encounters

When a family member, interpreter, carer, or other provider is present, the AI may struggle with speaker identification. Clinical information attributed to the wrong person can distort the record significantly.

Why it's risky: Speaker diarisation (identifying who said what) is a known limitation of current audio AI. Misattribution rates increase with each additional speaker.

6. Visits Where the Patient Contradicts Prior Records

If a patient provides history that differs from their existing chart—correcting a previous diagnosis, updating medication lists, or clarifying an allergy—the AI may default to the "expected" information rather than the correction.

Why it's risky: LLMs are trained on patterns. When patient-stated information contradicts common medical knowledge or typical patterns, the model may subtly override the patient's actual words with what it considers more probable.

The Audit Cadence: A Sampling and Review Protocol

The per-note checklist catches errors in real time. But you also need a systematic process to monitor whether the AI is drifting—introducing new error patterns, performing worse in certain contexts, or developing blind spots you haven't noticed because they're consistent across notes.

Why Sampling Matters

You can't deeply audit every note. What you can do is periodically pull a random sample and review it with fresh eyes—or better, have a colleague review it. This catches the errors that become invisible when you're reviewing your own notes in real time, particularly omissions and subtle misinterpretations that confirm your expectations.

There's no universal standard for clinical documentation audit frequency. AHIMA's CDI Toolkit acknowledges this and recommends each organisation define its own cadence based on volume and risk. Based on existing healthcare QA literature and the specific risks of AI-generated documentation, here's a practical framework:

For Solo Practitioners and Small Practices (1–3 Clinicians)

| Activity | Frequency | Volume |
| --- | --- | --- |
| Full note re-read against audio (if available) | Weekly | 2–3 notes per clinician |
| Peer cross-review | Monthly | 5 notes per clinician, reviewed by a colleague |
| Error log review | Monthly | Review all corrections made during the month |
| Vendor accuracy check | Quarterly | Compare 10 notes against raw transcripts or audio |

For Medium Practices (4–15 Clinicians)

| Activity | Frequency | Volume |
| --- | --- | --- |
| Full note re-read against audio | Weekly | 1–2 notes per clinician |
| Structured peer review | Bi-weekly | 3 notes per clinician, using a standardised rubric |
| Error pattern analysis | Monthly | Aggregate corrections across all clinicians to identify trends |
| New clinician onboarding audit | First 30 days | 100% review of AI-generated notes for any new user |
| Vendor accuracy check | Quarterly | Compare 20 notes against source audio or transcripts |

For Larger Clinics and Networks (15+ Clinicians)

| Activity | Frequency | Volume |
| --- | --- | --- |
| Randomised sampling | Weekly | 3% of total notes, randomly selected |
| Specialty-stratified review | Monthly | At least 5 notes per specialty, reviewed by a specialty peer |
| Error pattern dashboard | Monthly | Automated tracking of correction rates and types |
| Focused audits on high-risk encounters | Ongoing | All encounters flagged as high-risk (see red-flag categories above) |
| External audit | Annually | Independent review of a representative sample |
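Drawing the weekly randomised sample doesn't need special tooling; the standard library is enough. In this sketch, `note_ids` stands in for whatever identifiers your EHR exports, and the 3% default mirrors the cadence table for larger clinics; both are parameters to adapt, not prescriptions.

```python
import random

def weekly_audit_sample(note_ids, fraction=0.03, seed=None):
    """Randomly select a fraction of the week's notes for audit.

    `note_ids` is any sequence of note identifiers; `fraction=0.03`
    follows the 3% weekly sampling rate suggested above.
    """
    rng = random.Random(seed)
    k = max(1, round(len(note_ids) * fraction))  # always audit at least one
    return rng.sample(list(note_ids), k)

# Example: 1,200 notes in a week at 3% -> 36 notes to audit
sample = weekly_audit_sample(range(1200), fraction=0.03, seed=42)
print(len(sample))  # 36
```

Passing a fixed `seed` makes the draw reproducible, which is useful if you need to show an auditor how the sample was selected.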

What to Look for in an Audit

When reviewing a note outside of the immediate clinical context, use this structured evaluation:

Accuracy dimensions:

  • □ All documented findings correspond to what was discussed/performed
  • □ No fabricated exam findings, diagnoses, or orders
  • □ Medication names, doses, and instructions are correct
  • □ Negations are accurate (denied vs. reported)
  • □ Temporal relationships are correct (improving, worsening, stable)

Completeness dimensions:

  • □ All chief complaints and secondary concerns are documented
  • □ Medication changes and rationale are captured
  • □ Counselling and shared decision-making are reflected
  • □ Patient-stated preferences and concerns appear in the note

Attribution dimensions:

  • □ Patient statements are attributed to the patient
  • □ Clinician assessments are clearly the clinician's
  • □ Third-party information is correctly sourced

The Error Log: Your Most Valuable QA Asset

Every correction you make to an AI-generated note is a data point. Track them. A simple shared spreadsheet works:

| Date | Clinician | Error Type | Note Section | Description | Severity |
| --- | --- | --- | --- | --- | --- |
| 2026-02-01 | Dr. M | Fabrication | Physical Exam | AI added "clear lung auscultation"; not performed | Major |
| 2026-02-01 | Dr. M | Omission | Plan | Referral to physiotherapy discussed but not documented | Moderate |
| 2026-02-01 | Dr. L | Misinterpretation | Medications | Dosage recorded as 20 mg, discussed as 10 mg | Major |

Over time, this log reveals:

  • Which error types are most common in your practice
  • Which note sections are least reliable
  • Which encounter types produce the most corrections
  • Whether error rates are trending up or down after software updates

Review the log monthly. If a pattern emerges (e.g., the AI consistently mishandles medication tapering instructions), you can add a targeted check to your per-note workflow and raise the issue with your vendor.
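The monthly roll-up of that log takes only a few lines once the spreadsheet is exported. The rows below are illustrative samples shaped like the table above; in practice you would load them from a CSV export.

```python
from collections import Counter

# Monthly roll-up of the error log. Rows mirror the spreadsheet columns
# above; these entries are illustrative examples.
log = [
    {"error_type": "Fabrication", "section": "Physical Exam", "severity": "Major"},
    {"error_type": "Omission", "section": "Plan", "severity": "Moderate"},
    {"error_type": "Misinterpretation", "section": "Medications", "severity": "Major"},
    {"error_type": "Omission", "section": "Plan", "severity": "Moderate"},
]

by_type = Counter(row["error_type"] for row in log)
by_section = Counter(row["section"] for row in log)
major = sum(1 for row in log if row["severity"] == "Major")

print(by_type.most_common(1))      # most frequent error type
print(by_section.most_common(1))   # least reliable note section
print(f"{major}/{len(log)} corrections were major")
```

Even this crude count answers the monthly questions: which error type dominates, which section needs extra scrutiny, and how many corrections were clinically significant.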

Practical Implementation: Rolling This Out in Your Clinic

Week 1: Baseline Assessment

  1. Select 10 recent AI-generated notes per clinician
  2. Have each clinician re-review them using the 3-item checklist
  3. Log every error found (use the spreadsheet format above)
  4. Calculate a baseline error rate per section and per type

This gives you a snapshot of where your AI scribe stands right now—before you've implemented any systematic QA.
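The baseline in step 4 is simple division: errors found per note reviewed, broken out by section. A minimal sketch, with assumed counts standing in for your own week-1 log:

```python
# Step 4: baseline error rate per section from the week-1 review.
# Counts are illustrative; substitute the totals from your own error log.
errors_by_section = {"Plan": 4, "Physical Exam": 3, "Medications": 2, "HPI": 1}
notes_reviewed = 30  # e.g. 10 notes x 3 clinicians

baseline = {
    section: count / notes_reviewed
    for section, count in errors_by_section.items()
}
for section, rate in sorted(baseline.items(), key=lambda kv: -kv[1]):
    print(f"{section}: {rate:.2f} errors per note")
```

Recomputing the same figures after a month of the checklist shows whether the QA workflow is actually moving the numbers.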

Week 2–4: Embed the Per-Note Checklist

  1. Distribute the 3-item checklist to all clinicians
  2. Encourage clinicians to spend the extra 60–90 seconds before signing each note
  3. Keep the error log running
  4. Hold a brief (15-minute) team meeting at the end of week 4 to discuss patterns

Month 2 Onward: Establish the Audit Cadence

  1. Choose the appropriate audit tier from the tables above based on your practice size
  2. Assign audit responsibilities (who reviews, when, how results are tracked)
  3. Schedule the first peer cross-review
  4. Set a calendar reminder for monthly error log review

Ongoing: Adapt and Refine

  • After vendor updates: Increase audit frequency for two weeks. Software updates can change error patterns.
  • When onboarding new clinicians: 100% note review for the first 30 days. Clinicians new to AI scribes produce different error patterns than experienced users—not because they make more mistakes, but because their review habits aren't yet calibrated.
  • When error rates spike: Investigate the cause before continuing at normal cadence. Common triggers include software updates, changes in encounter types (seasonal patterns), or new clinical workflows.

What to Expect From Your AI Scribe Vendor

A responsible vendor should be transparent about their system's limitations. When evaluating or re-evaluating your AI scribe, ask:

  1. What is your measured hallucination rate? If they can't provide a number, or claim zero, that's a red flag. Published research shows rates of 1–3% across current systems.

  2. Do you provide confidence flags or uncertainty indicators? Some systems flag sections where the AI had low confidence. This is valuable for targeted review.

  3. How do you test across diverse populations? Speech recognition systems exhibit systematic performance disparities—significantly higher error rates for certain accents and dialects. Ask whether accuracy data is stratified by patient demographics.

  4. What happens when you push a model update? Software updates can shift error patterns. Ask whether the vendor provides changelogs, re-validates accuracy, and notifies clinics of changes that might affect documentation quality.

  5. Can I access the raw transcript alongside the generated note? This is the single most useful QA feature. If you can compare the AI's source material against its output, you can catch hallucinations that no checklist would reveal.

  6. Does your system support audit trails? You need to know what the AI generated, what the clinician edited, and what was ultimately signed. This matters for both QA and liability.

The Liability Dimension

This isn't just about quality—it's about legal exposure. The clinician who signs an AI-generated note is legally responsible for its contents. Current regulatory frameworks in the EU, US, and Switzerland place the duty of review squarely on the signing clinician.

Research from medical liability insurers is direct on this point: documentation errors weaken a clinician's defence in malpractice cases. Jurors and investigators may interpret documentation riddled with errors as evidence of inattention. Cases involving good clinical care have been settled because the documentation was unreliable.

A systematic QA process isn't just clinical best practice—it's risk management. An error log, a documented audit cadence, and a consistent review workflow demonstrate due diligence in a way that "I always glance at the note before signing" does not.

For clinics operating in the EU, the EU AI Act's evolving requirements add another layer. Even if your AI scribe is classified as non-high-risk, you're still expected to understand its limitations and maintain appropriate oversight. For Swiss practices, the FADP imposes its own data protection obligations on how AI-processed patient data is handled.

The Bigger Picture

AI scribes are not going away. They reduce documentation time by measurable margins—studies show a median reduction of 2.6 minutes per appointment and a 29.3% decrease in after-hours EHR work. For clinicians drowning in administrative load, that's meaningful.

But the shift from "I wrote this note" to "I approved this note" demands a corresponding shift in how clinics think about quality assurance. The verification step is not optional overhead—it's the price of the time savings.

The workflow described in this guide is deliberately lightweight. Three checks per note. A structured audit at a cadence that matches your practice size. An error log that turns individual corrections into systemic insight. None of this requires new software, new hires, or a committee. It requires a decision that AI-generated documentation deserves the same scrutiny you'd give a note from a junior colleague—because functionally, that's exactly what it is.


Looking for an AI scribe that's built with clinician review in mind? Try Dya free for 14 days — designed for European clinical workflows with built-in quality safeguards.
