
AI Hallucinations in Clinical Documentation: A 2026 QA Workflow Clinics Can Actually Follow

A lightweight QA checklist for catching AI scribe hallucinations: 3 things to verify every time, red-flag categories, and a practical sampling and audit cadence for clinics.

Published on February 1, 2026 · 19 min read

Written by

Dya Clinical Team

Clinical Documentation Experts

You sign an AI-generated note. It looks right. The format is clean, the sections are complete, and the language is professional. Two weeks later, a colleague references that note during a follow-up—and discovers it documents a physical exam finding that never happened.

This isn't a theoretical risk. A 2025 study in npj Digital Medicine analysing 12,999 clinician-annotated sentences across 450 AI-generated clinical notes found a 1.47% hallucination rate—and 44% of those hallucinated sentences were classified as "major," meaning they could directly impact diagnosis or treatment if left uncorrected. The same study observed a 3.45% omission rate, with omissions being far more frequent though individually less dangerous.

One-point-four-seven percent sounds small. Multiply it across every note, every day, every clinician in your practice, and the numbers stop looking small. A solo practitioner seeing 25 patients per day generates roughly 250 documentable sentences (about ten per note). Statistically, three to four of those sentences will contain hallucinated content every single day.
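The back-of-envelope arithmetic above can be reproduced in a few lines. The rates come from the cited study; the patients-per-day and sentences-per-note figures are illustrative assumptions, not study data.

```python
# Estimate of daily hallucinated sentences for one clinician.
# Rates are from the cited npj Digital Medicine study; the volume
# figures below are illustrative assumptions.
HALLUCINATION_RATE = 0.0147   # 1.47% of sentences
MAJOR_FRACTION = 0.44         # 44% of hallucinations classed as "major"

patients_per_day = 25         # assumed solo-practice volume
sentences_per_note = 10       # assumed average note length

sentences = patients_per_day * sentences_per_note     # 250 sentences/day
hallucinated = sentences * HALLUCINATION_RATE         # ~3.7 per day
major = hallucinated * MAJOR_FRACTION                 # ~1.6 per day

print(f"{hallucinated:.1f} hallucinated sentences/day, "
      f"of which ~{major:.1f} are major")
```

Swap in your own volumes to see what your practice's daily exposure looks like.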

The problem isn't that AI scribes are unreliable—they save real time and reduce burnout. The problem is that most clinics have no systematic process for catching the errors these tools introduce. This guide provides one.

What "Hallucination" Actually Means in Clinical Documentation

In AI research, "hallucination" refers to generated content that appears plausible but has no basis in the source material. In clinical documentation, this translates to specific failure modes that differ from traditional transcription errors.

The Four Error Categories

Research from multiple studies converges on four distinct categories of AI scribe errors:

1. Fabrications

The AI invents content that was never discussed or observed. This is the most dangerous category. Documented examples include:

  • Physical exam findings that were never performed (the AI "fills in" expected findings based on the chief complaint)
  • Medications the patient never mentioned—in one reported case, an AI scribe replaced "Aveli for cellulite" with "Qwo for cellulite," a product no longer on the market, because Qwo was more common in its training data
  • Diagnoses inferred from context rather than stated by the clinician
  • Lab values or imaging results that were not discussed

2. Omissions

Critical information discussed during the encounter is absent from the note. While individually less dangerous than fabrications, omissions erode the note's clinical utility over time:

  • Patient-reported symptoms mentioned in conversation but missing from the HPI
  • Medication changes discussed but not reflected in the plan
  • Social history details relevant to treatment that the AI deemed non-essential
  • Contraindications or allergies mentioned verbally but not documented

3. Misinterpretations

The AI captures something that was said but assigns it the wrong clinical meaning:

  • A patient reports discontinuing a medication, and the note records a new prescription
  • A differential diagnosis discussed as unlikely gets documented as a confirmed finding
  • Dosage changes are captured with incorrect values
  • Temporal relationships are inverted ("improving" becomes "worsening" or vice versa)

4. Misattribution

The system confuses who said what. This matters because clinical reasoning depends on whether a statement is a patient report, a clinician assessment, or a referenced finding:

  • Patient concerns documented as clinician assessments
  • Clinician-initiated counselling recorded as patient-initiated complaints
  • Family history attributed to the patient's own history
  • Third-party information (from a referring provider or family member) attributed to the wrong source

Where Hallucinations Cluster

Not all sections of a clinical note carry equal risk. Research consistently identifies certain note sections as more prone to AI hallucination:

| Note Section | Hallucination Risk | Why |
| --- | --- | --- |
| Plan | Highest (21% of major hallucinations) | Requires clinical reasoning the AI can only approximate |
| Physical Exam | Very high | AI tends to "fill in" expected findings based on chief complaint |
| Assessment | High (10.5% of major hallucinations) | Synthesising information requires judgment, not just transcription |
| Symptoms / HPI | Moderate (5.2%) | AI may infer symptoms from context rather than from what was stated |
| Medications | Moderate–high (18.5% of safety feedback) | Drug names, dosages, and instructions are frequently garbled |
| Subjective / History | Lower but present | Generally more faithful to spoken content |

Understanding this distribution is the foundation of an efficient QA process. You don't need to verify every sentence with equal scrutiny—you need to know where to look hardest.
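One way to operationalise that distribution is to walk the note in risk order rather than reading top to bottom. This sketch is purely illustrative: the section names follow the table above, but the numeric weights and the function itself are assumptions, not a vendor feature.

```python
# Order note sections for review, highest hallucination risk first.
# Weights loosely follow the risk table above; exact values are illustrative.
SECTION_RISK = {
    "Plan": 5,                  # highest: requires clinical reasoning
    "Physical Exam": 4,         # AI "fills in" expected findings
    "Medications": 4,           # drug names and doses frequently garbled
    "Assessment": 3,
    "Symptoms / HPI": 2,
    "Subjective / History": 1,
}

def review_order(sections=SECTION_RISK):
    """Return section names sorted from highest to lowest risk."""
    return sorted(sections, key=sections.get, reverse=True)

print(review_order())
```

Reading the Plan first, while your memory of the encounter is freshest, concentrates attention where fabrications are most likely.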

The 3-Item Verification Checklist: What to Check Every Time

Before signing any AI-generated note, run through these three verification steps. They're ordered by clinical severity and designed to catch the categories of error that matter most.

Check 1: Did I Actually Do, Say, or Order This?

Target: Fabrications in the Physical Exam, Assessment, and Plan sections.

Read the physical exam section and ask one question: did I actually perform and document each of these findings? AI scribes are particularly prone to generating "template" exam findings that match the chief complaint but were never actually assessed. If the patient came in for knee pain, the AI may generate a full musculoskeletal exam even if you only palpated the affected joint.

Then check the Plan. Every order, referral, prescription, and follow-up instruction should match what you actually discussed. Watch specifically for:

  • Medications you didn't prescribe — the AI may infer a prescription from a discussion about medication options
  • Diagnoses you didn't confirm — differential diagnoses discussed as possibilities may appear as confirmed assessments
  • Follow-up timelines you didn't set — the AI may insert "standard" follow-up intervals based on the diagnosis

Red-flag signal: Any finding, order, or diagnosis that you don't specifically remember discussing or performing.

Check 2: Is Anything Clinically Backward?

Target: Misinterpretations, especially in Medications, Symptoms, and the temporal narrative.

This check catches errors where the AI captured the right topic but got the direction wrong. Scan for:

  • Medication direction: Was a medication started, stopped, increased, or decreased? Verify each change matches what was discussed. The most dangerous misinterpretation is documenting a discontinuation as a continuation (or vice versa).
  • Symptom trajectory: Does the note reflect whether symptoms are improving, stable, or worsening? AI can flip these, especially when the conversation includes both historical and current status.
  • Negations: "Patient denies chest pain" vs. "Patient reports chest pain" — a single missed negation reverses the clinical picture. Negation errors account for roughly 30% of hallucinated sentences.
  • Laterality and anatomy: Left vs. right, upper vs. lower, proximal vs. distal. These errors are easy to make and hard to catch on a quick skim.
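Because negation and laterality flips are exactly the errors a quick skim misses, some clinics pre-highlight sentences containing those terms before the sign-off read. The sketch below is a minimal illustration of that idea, not a clinical NLP tool; the keyword list is an assumption and no substitute for your own recall of the encounter.

```python
import re

# Crude pre-signing scan: surface sentences containing negation or
# laterality terms so the clinician re-reads them deliberately.
FLAG_TERMS = re.compile(
    r"\b(denies|denied|no |not |without|negative|left|right|bilateral)\b",
    re.IGNORECASE,
)

def sentences_to_recheck(note: str) -> list[str]:
    """Split a draft note into sentences and keep those with flag terms."""
    sentences = re.split(r"(?<=[.!?])\s+", note)
    return [s for s in sentences if FLAG_TERMS.search(s)]

note = ("Patient denies chest pain. Reports left knee swelling. "
        "Lungs clear to auscultation.")
for s in sentences_to_recheck(note):
    print("RECHECK:", s)
```

The point is not automation but attention: a flagged sentence gets read word by word instead of pattern-matched by eye.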

Red-flag signal: Any medication change, symptom description, or finding that feels directionally "off" from what you recall.

Check 3: Is Anything Important Missing?

Target: Omissions across all sections.

This is the hardest check because you're looking for what isn't there. Focus on:

  • The chief complaint and any secondary concerns the patient raised — did all of them make it into the note?
  • Medication changes — were all discussed adjustments captured, including the rationale?
  • Patient-reported allergies, contraindications, or adverse reactions discussed during the visit
  • Counselling and shared decision-making — if you discussed risks, alternatives, or obtained verbal consent for a procedure, is it documented?
  • Social determinants mentioned by the patient that affect the care plan (housing instability, transportation barriers, caregiver status)

Red-flag signal: A conversation topic you distinctly remember that doesn't appear anywhere in the note.

Putting It Into Practice

This checklist should take 60–90 seconds per note once it becomes habit. For context, research shows clinicians spend 5–10 minutes reviewing and editing AI-generated notes versus 30–45 minutes writing from scratch. Adding structured verification does not eliminate the time savings—it protects them.

A practical approach:

  1. Read the Plan section first (highest hallucination risk)
  2. Scan the Physical Exam for any finding you didn't perform
  3. Check each medication entry for correct drug, dose, direction, and instructions
  4. Verify symptom trajectory and negations in the HPI
  5. Mentally replay the encounter and look for missing topics

Red-Flag Categories: Patterns That Demand Extra Scrutiny

Beyond the per-note checklist, certain encounter types and clinical scenarios carry elevated hallucination risk. When you recognise one of these patterns, slow down.

1. Encounters With Multiple Medication Changes

The more medications discussed, the more opportunities for the AI to confuse names, doses, or directions. Polypharmacy discussions and medication reconciliation visits deserve line-by-line verification of every drug mentioned.

Why it's risky: AI models can substitute one drug name for another if the discussed drug is uncommon in its training data. The Aveli/Qwo substitution mentioned above is one example, but the same pattern applies to generic/brand name confusion, similar-sounding drugs, and off-label uses the model hasn't encountered frequently.

2. Complex Differential Diagnoses

When you discuss multiple possible diagnoses and then narrow to one, the AI may document one of the ruled-out conditions as confirmed. This is especially dangerous for conditions with significantly different treatment pathways.

Why it's risky: The Assessment and Plan sections require clinical reasoning that LLMs approximate through pattern matching. The model can't distinguish "we discussed X as a possibility" from "the diagnosis is X" with the same reliability that it transcribes factual statements.

3. Conversations With Significant Non-Verbal Context

If a key clinical decision was informed by something you observed (gait abnormality, affect, skin appearance, wound characteristics) rather than something spoken aloud, the AI has no source material to work from. It may either omit the finding entirely or—worse—fabricate a finding based on what it expects given the diagnosis.

Why it's risky: AI scribes are fundamentally limited to audio input. Research confirms they cannot capture nonverbal communication, visual signs of distress, or physical findings observed but not verbalised.

4. Encounters Involving Sensitive Topics

Discussions about mental health, substance use, domestic violence, or sexual health require precise language. The AI may generalise, euphemise, or misattribute statements in ways that misrepresent what the patient disclosed.

Why it's risky: These topics often involve nuanced conversational dynamics—pauses, indirect disclosures, careful phrasing by the clinician—that are difficult for AI to interpret correctly.

5. Multi-Speaker Encounters

When a family member, interpreter, carer, or other provider is present, the AI may struggle with speaker identification. Clinical information attributed to the wrong person can distort the record significantly.

Why it's risky: Speaker diarisation (identifying who said what) is a known limitation of current audio AI. Misattribution rates increase with each additional speaker.

6. Visits Where the Patient Contradicts Prior Records

If a patient provides history that differs from their existing chart—correcting a previous diagnosis, updating medication lists, or clarifying an allergy—the AI may default to the "expected" information rather than the correction.

Why it's risky: LLMs are trained on patterns. When patient-stated information contradicts common medical knowledge or typical patterns, the model may subtly override the patient's actual words with what it considers more probable.

The Audit Cadence: A Sampling and Review Protocol

The per-note checklist catches errors in real time. But you also need a systematic process to monitor whether the AI is drifting—introducing new error patterns, performing worse in certain contexts, or developing blind spots you haven't noticed because they're consistent across notes.

Why Sampling Matters

You can't deeply audit every note. What you can do is periodically pull a random sample and review it with fresh eyes—or better, have a colleague review it. This catches the errors that become invisible when you're reviewing your own notes in real time, particularly omissions and subtle misinterpretations that confirm your expectations.

There's no universal standard for clinical documentation audit frequency. AHIMA's CDI Toolkit acknowledges this and recommends each organisation define its own cadence based on volume and risk. Based on existing healthcare QA literature and the specific risks of AI-generated documentation, here's a practical framework:

For Solo Practitioners and Small Practices (1–3 Clinicians)

| Activity | Frequency | Volume |
| --- | --- | --- |
| Full note re-read against audio (if available) | Weekly | 2–3 notes per clinician |
| Peer cross-review | Monthly | 5 notes per clinician, reviewed by a colleague |
| Error log review | Monthly | Review all corrections made during the month |
| Vendor accuracy check | Quarterly | Compare 10 notes against raw transcripts or audio |

For Medium Practices (4–15 Clinicians)

| Activity | Frequency | Volume |
| --- | --- | --- |
| Full note re-read against audio | Weekly | 1–2 notes per clinician |
| Structured peer review | Bi-weekly | 3 notes per clinician, using a standardised rubric |
| Error pattern analysis | Monthly | Aggregate corrections across all clinicians to identify trends |
| New clinician onboarding audit | First 30 days | 100% review of AI-generated notes for any new user |
| Vendor accuracy check | Quarterly | Compare 20 notes against source audio or transcripts |

For Larger Clinics and Networks (15+ Clinicians)

| Activity | Frequency | Volume |
| --- | --- | --- |
| Randomised sampling | Weekly | 3% of total notes, randomly selected |
| Specialty-stratified review | Monthly | At least 5 notes per specialty, reviewed by a specialty peer |
| Error pattern dashboard | Monthly | Automated tracking of correction rates and types |
| Focused audits on high-risk encounters | Ongoing | All encounters flagged as high-risk (see red-flag categories above) |
| External audit | Annually | Independent review of a representative sample |
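Drawing the weekly randomised sample doesn't need special tooling; the standard library is enough. In this sketch, `note_ids` stands in for whatever identifiers your EHR exports, and the 3% default mirrors the cadence table for larger clinics; both are parameters to adapt, not prescriptions.

```python
import random

def weekly_audit_sample(note_ids, fraction=0.03, seed=None):
    """Randomly select a fraction of the week's notes for audit.

    `note_ids` is any sequence of note identifiers; `fraction=0.03`
    follows the 3% weekly sampling rate suggested above.
    """
    rng = random.Random(seed)
    k = max(1, round(len(note_ids) * fraction))  # always audit at least one
    return rng.sample(list(note_ids), k)

# Example: 1,200 notes in a week at 3% -> 36 notes to audit
sample = weekly_audit_sample(range(1200), fraction=0.03, seed=42)
print(len(sample))  # 36
```

Passing a fixed `seed` makes the draw reproducible, which is useful if you need to show an auditor how the sample was selected.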

What to Look for in an Audit

When reviewing a note outside of the immediate clinical context, use this structured evaluation:

Accuracy dimensions:

  • □ All documented findings correspond to what was discussed/performed
  • □ No fabricated exam findings, diagnoses, or orders
  • □ Medication names, doses, and instructions are correct
  • □ Negations are accurate (denied vs. reported)
  • □ Temporal relationships are correct (improving, worsening, stable)

Completeness dimensions:

  • □ All chief complaints and secondary concerns are documented
  • □ Medication changes and rationale are captured
  • □ Counselling and shared decision-making are reflected
  • □ Patient-stated preferences and concerns appear in the note

Attribution dimensions:

  • □ Patient statements are attributed to the patient
  • □ Clinician assessments are clearly the clinician's
  • □ Third-party information is correctly sourced

The Error Log: Your Most Valuable QA Asset

Every correction you make to an AI-generated note is a data point. Track them. A simple shared spreadsheet works:

| Date | Clinician | Error Type | Note Section | Description | Severity |
| --- | --- | --- | --- | --- | --- |
| 2026-02-01 | Dr. M | Fabrication | Physical Exam | AI added "clear lung auscultation"; not performed | Major |
| 2026-02-01 | Dr. M | Omission | Plan | Referral to physiotherapy discussed but not documented | Moderate |
| 2026-02-01 | Dr. L | Misinterpretation | Medications | Dosage recorded as 20 mg, discussed as 10 mg | Major |

Over time, this log reveals:

  • Which error types are most common in your practice
  • Which note sections are least reliable
  • Which encounter types produce the most corrections
  • Whether error rates are trending up or down after software updates

Review the log monthly. If a pattern emerges (e.g., the AI consistently mishandles medication tapering instructions), you can add a targeted check to your per-note workflow and raise the issue with your vendor.
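The monthly roll-up of that log takes only a few lines once the spreadsheet is exported. The rows below are illustrative samples shaped like the table above; in practice you would load them from a CSV export.

```python
from collections import Counter

# Monthly roll-up of the error log. Rows mirror the spreadsheet columns
# above; these entries are illustrative examples.
log = [
    {"error_type": "Fabrication", "section": "Physical Exam", "severity": "Major"},
    {"error_type": "Omission", "section": "Plan", "severity": "Moderate"},
    {"error_type": "Misinterpretation", "section": "Medications", "severity": "Major"},
    {"error_type": "Omission", "section": "Plan", "severity": "Moderate"},
]

by_type = Counter(row["error_type"] for row in log)
by_section = Counter(row["section"] for row in log)
major = sum(1 for row in log if row["severity"] == "Major")

print(by_type.most_common(1))      # most frequent error type
print(by_section.most_common(1))   # least reliable note section
print(f"{major}/{len(log)} corrections were major")
```

Even this crude count answers the monthly questions: which error type dominates, which section needs extra scrutiny, and how many corrections were clinically significant.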

Practical Implementation: Rolling This Out in Your Clinic

Week 1: Baseline Assessment

  1. Select 10 recent AI-generated notes per clinician
  2. Have each clinician re-review them using the 3-item checklist
  3. Log every error found (use the spreadsheet format above)
  4. Calculate a baseline error rate per section and per type

This gives you a snapshot of where your AI scribe stands right now—before you've implemented any systematic QA.
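The baseline in step 4 is simple division: errors found per note reviewed, broken out by section. A minimal sketch, with assumed counts standing in for your own week-1 log:

```python
# Step 4: baseline error rate per section from the week-1 review.
# Counts are illustrative; substitute the totals from your own error log.
errors_by_section = {"Plan": 4, "Physical Exam": 3, "Medications": 2, "HPI": 1}
notes_reviewed = 30  # e.g. 10 notes x 3 clinicians

baseline = {
    section: count / notes_reviewed
    for section, count in errors_by_section.items()
}
for section, rate in sorted(baseline.items(), key=lambda kv: -kv[1]):
    print(f"{section}: {rate:.2f} errors per note")
```

Recomputing the same figures after a month of the checklist shows whether the QA workflow is actually moving the numbers.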

Week 2–4: Embed the Per-Note Checklist

  1. Distribute the 3-item checklist to all clinicians
  2. Encourage clinicians to spend the extra 60–90 seconds before signing each note
  3. Keep the error log running
  4. Hold a brief (15-minute) team meeting at the end of week 4 to discuss patterns

Month 2 Onward: Establish the Audit Cadence

  1. Choose the appropriate audit tier from the tables above based on your practice size
  2. Assign audit responsibilities (who reviews, when, how results are tracked)
  3. Schedule the first peer cross-review
  4. Set a calendar reminder for monthly error log review

Ongoing: Adapt and Refine

  • After vendor updates: Increase audit frequency for two weeks. Software updates can change error patterns.
  • When onboarding new clinicians: 100% note review for the first 30 days. Clinicians new to AI scribes produce different error patterns than experienced users—not because they make more mistakes, but because their review habits aren't yet calibrated.
  • When error rates spike: Investigate the cause before continuing at normal cadence. Common triggers include software updates, changes in encounter types (seasonal patterns), or new clinical workflows.

What to Expect From Your AI Scribe Vendor

A responsible vendor should be transparent about their system's limitations. When evaluating or re-evaluating your AI scribe, ask:

  1. What is your measured hallucination rate? If they can't provide a number, or claim zero, that's a red flag. Published research shows rates of 1–3% across current systems.

  2. Do you provide confidence flags or uncertainty indicators? Some systems flag sections where the AI had low confidence. This is valuable for targeted review.

  3. How do you test across diverse populations? Speech recognition systems exhibit systematic performance disparities—significantly higher error rates for certain accents and dialects. Ask whether accuracy data is stratified by patient demographics.

  4. What happens when you push a model update? Software updates can shift error patterns. Ask whether the vendor provides changelogs, re-validates accuracy, and notifies clinics of changes that might affect documentation quality.

  5. Can I access the raw transcript alongside the generated note? This is the single most useful QA feature. If you can compare the AI's source material against its output, you can catch hallucinations that no checklist would reveal.

  6. Does your system support audit trails? You need to know what the AI generated, what the clinician edited, and what was ultimately signed. This matters for both QA and liability.

The Liability Dimension

This isn't just about quality—it's about legal exposure. The clinician who signs an AI-generated note is legally responsible for its contents. Current regulatory frameworks in the EU, US, and Switzerland place the duty of review squarely on the signing clinician.

Research from medical liability insurers is direct on this point: documentation errors weaken a clinician's defence in malpractice cases. Jurors and investigators may interpret documentation riddled with errors as evidence of inattention. Cases involving good clinical care have been settled because the documentation was unreliable.

A systematic QA process isn't just clinical best practice—it's risk management. An error log, a documented audit cadence, and a consistent review workflow demonstrate due diligence in a way that "I always glance at the note before signing" does not.

For clinics operating in the EU, the EU AI Act's evolving requirements add another layer. Even if your AI scribe is classified as non-high-risk, you're still expected to understand its limitations and maintain appropriate oversight. For Swiss practices, the FADP imposes its own data protection obligations on how AI-processed patient data is handled.

The Bigger Picture

AI scribes are not going away. They reduce documentation time by measurable margins—studies show a median reduction of 2.6 minutes per appointment and a 29.3% decrease in after-hours EHR work. For clinicians drowning in administrative load, that's meaningful.

But the shift from "I wrote this note" to "I approved this note" demands a corresponding shift in how clinics think about quality assurance. The verification step is not optional overhead—it's the price of the time savings.

The workflow described in this guide is deliberately lightweight. Three checks per note. A structured audit at a cadence that matches your practice size. An error log that turns individual corrections into systemic insight. None of this requires new software, new hires, or a committee. It requires a decision that AI-generated documentation deserves the same scrutiny you'd give a note from a junior colleague—because functionally, that's exactly what it is.


Looking for an AI scribe that's built with clinician review in mind? Try Dya free for 14 days — designed for European clinical workflows with built-in quality safeguards.
