
AI Scribe Defensibility After the 2026 Auditor General Report

A reference document for Ontario clinicians, hospital CIOs, OntarioMD-aware procurement officers, and the privacy counsel who advise them. Substantive treatment of the Auditor General of Ontario's findings on AI Scribe systems, turned into positive design requirements an architecture must meet to be defensible.


In May 2026, the Office of the Auditor General of Ontario published its Performance Audit on the Use of Artificial Intelligence in the Ontario Government. Section 4.3 of that report — Safe Use of AI Scribe Systems — and Recommendations 5 through 9 examined how Supply Ontario, in consultation with OntarioMD, Ontario Health, and the Ministry of Health, procured the 20 AI Scribe vendors currently approved for use across Ontario's broader health sector.

The findings were specific. All 20 approved vendors produced notes with inaccuracies. Forty-five percent hallucinated content not present in the recordings — fabricated treatment plans, asserted findings the physician never made. Sixty percent transcribed the wrong drug name. Eighty-five percent missed key mental-health details. Eleven of the twenty did not submit third-party security audit reports. Five did not submit threat risk assessments or privacy impact assessments. Bias controls were not evaluated against evidence. No vendor was required to demonstrate their system live.

Each of those findings points to a category of risk that any next-generation AI scribe architecture should design against. Each can be turned into a positive design requirement: what an AI scribe architecture should provide, in named technical terms, to meet the bar the Auditor General's report implies the sector is moving toward. This document does that translation.

It is reference methodology, not legal counsel, and not a vendor pitch. The 20 approved vendors are not named or compared here. Readers can run the framework themselves against any vendor, including ArcaKey AI, which builds an encrypted private-AI workspace and designed ArcaVoice against exactly these failure modes.


What the Auditor General found, finding by finding

Section 4.3 of the Auditor General's report identifies five categories of failure across the 20 approved AI Scribe vendors. Each finding cited below is verbatim from the report. The 'design requirement' entry under each finding translates it into the specific architectural property that prevents the failure mode.

  1. Section 4.3.1 — Scoring weights
    What the AG found

    Critical criteria — system security controls (2%), SOC 2 Type 2 (4%), bias controls (2%), and accuracy of medical notes (4%) — accounted for less than 15% of the maximum possible score combined. A vendor could score zero on every one of these and still meet the minimum aggregate to be approved.

    Design requirement that prevents this failure mode

    Security, accuracy, and bias controls cannot be evaluated as line items inside a procurement scorecard. They are foundational properties; an architecture either has them or it does not. The defensible position is that these are prerequisites, not weighted criteria — a vendor must pass each independently before the procurement evaluates anything else.

  2. Section 4.3.2 — Inaccuracies in generated notes
    What the AG found

    All 20 approved vendors produced inaccuracies. Nine of 20 (45%) hallucinated content not present in the recordings, including fabricated treatment plans, fabricated 'no masses found' findings, and fabricated assertions of patient anxiety. Twelve of 20 (60%) transcribed the wrong drug name. Seventeen of 20 (85%) missed key mental-health details. Six of 20 (30%) missed mental-health details on both test recordings.

    Design requirement that prevents this failure mode

    Accuracy must be a documented merge gate — a published threshold the system is measured against on every release, with bake-off numbers in every PR description that touches the speech-to-text path. Drug-name fidelity must be cross-checked against an authoritative formulary (the Health Canada Drug Product Database, for Ontario use). Hallucination must be guarded against at the inference layer (a word-rate-vs-audio-duration check that refuses output exceeding plausible human speech). Mental-health phrasing must be a named test metric, not an emergent property of general transcription accuracy.

  3. Section 4.3.3 — Failed security documentation
    What the AG found

    Eleven of 20 approved vendors did not submit SOC 2 Type 2 reports, HITRUST certification, or ISO 27001 certification. Five did not submit threat risk assessments (TRAs) or privacy impact assessments (PIAs). Vendors provided declarations of compliance instead. The evaluation process did not assess whether submitted SOC reports included the expected AI-specific controls. The evaluation relied on vendors' confirmation that data is processed and stored in Canada, without supporting evidence or independent verification.

    Design requirement that prevents this failure mode

    Self-attestation is not security documentation. A defensible architecture publishes its security artifacts — sub-processor inventory, BAA chain, TEE attestation digests, audit chain integrity reports — and makes each verifiable independently. Cryptographic signing of every output (ML-DSA-65 / FIPS 204 post-quantum signature) provides verification the customer can run without trusting the vendor.

  4. Section 4.3.4 — Bias controls not evaluated against evidence
    What the AG found

    Vendors described their organizational processes for mitigating bias but were not required to provide evidence of bias testing results. The Auditor General observed that bias in AI Scribe systems can arise from misinterpretation of accents or speech patterns, producing inaccuracies in medical records that may lead to adverse health outcomes.

    Design requirement that prevents this failure mode

    Bias mitigation must be measured, not described. A defensible architecture includes an out-of-distribution (OOD) test corpus: non-clinical audio (silence, casual conversation, counting) that the system must refuse rather than fabricate content for. Refusal correctness and fabrication count on OOD samples are testable metrics, not procedural claims.

  5. Section 4.3.5 — No live demonstrations required
    What the AG found

    Vendors received the two simulated test recordings and processed them offline before submitting generated notes. The evaluation relied on a vendor attestation that the submitted notes were unedited. The Auditor General observed that the absence of a live demonstration created the risk that vendors could process recordings multiple times or alter the generated notes, compromising evaluation integrity.

    Design requirement that prevents this failure mode

    Defensibility requires that any procurement evaluation be conducted on the live, running system, with the evaluator providing the recordings and observing the output in real time. The architecture supports this when it can be demonstrated end-to-end on a live URL, with each generated note ML-DSA-65 signed at generation time and verifiable against a published public key.
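
The scoring-weight finding in Section 4.3.1 can be made concrete with a little arithmetic. The sketch below is illustrative only: the four weights are the ones quoted above, while the 70% pass mark, the 80% per-criterion floor, the single "other criteria" bucket, and all function names are assumptions chosen for the example, not values from the report.

```python
# Weights for the four critical criteria, as quoted from the report above.
CRITICAL_WEIGHTS = {
    "system_security_controls": 0.02,
    "soc2_type2": 0.04,
    "bias_controls": 0.02,
    "accuracy_of_medical_notes": 0.04,
}
# Assumption: every other scorecard item is lumped into one bucket.
OTHER_WEIGHT = 1.0 - sum(CRITICAL_WEIGHTS.values())
PASS_MARK = 0.70    # hypothetical minimum aggregate, not from the report
PREREQ_MIN = 0.80   # hypothetical per-criterion floor for the prerequisite model


def weighted_score(critical: dict[str, float], other: float) -> float:
    """Aggregate score in [0, 1]; each input is a fraction of full marks."""
    total = sum(w * critical[name] for name, w in CRITICAL_WEIGHTS.items())
    return total + OTHER_WEIGHT * other


def passes_weighted(critical: dict[str, float], other: float) -> bool:
    """Scorecard model: only the aggregate matters."""
    return weighted_score(critical, other) >= PASS_MARK


def passes_prerequisites(critical: dict[str, float], other: float) -> bool:
    """Prerequisite model: each critical criterion must pass on its own first."""
    if any(critical[name] < PREREQ_MIN for name in CRITICAL_WEIGHTS):
        return False
    return weighted_score(critical, other) >= PASS_MARK


zero_on_critical = {name: 0.0 for name in CRITICAL_WEIGHTS}
print(passes_weighted(zero_on_critical, other=1.0))
print(passes_prerequisites(zero_on_critical, other=1.0))
```

Under these assumed numbers, a vendor scoring zero on all four critical criteria but full marks elsewhere reaches 88% and clears the weighted scorecard, while the prerequisite check rejects it before any aggregate is considered.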

Read together, the five findings define a forward direction for the AI scribe sector — what 'defensible' will need to mean as the technology matures and as the bar for procurement, evaluation, and clinical adoption rises. The findings emerged from a cooperative process between the Auditor General and the procurement bodies that designed the Vendor of Record arrangement — Supply Ontario, OntarioMD, Ontario Health, and the Ministry of Health — and that transparency itself is what allows the sector to respond. The rest of this document treats each finding as a positive design requirement: what an AI scribe architecture should provide so that the next iteration of procurement, evaluation, and adoption builds on what the report surfaced.


Seven architecture patterns that satisfy the Auditor General's findings

Each pattern below is a concrete technical decision a defensible AI scribe architecture makes. Each maps to one or more of the Auditor General's findings. Each is either currently shipping in production at ArcaKey AI, in active build, or named here as the standard the platform was designed against.

  1. Bare-transcription with post-text deterministic correction

    Audio is sent to the speech-to-text model with no system prompt — no medical-context priming, no vocabulary seed, no clinical framing. The raw transcription is then corrected by a deterministic regex layer that handles known phonetic mistranscriptions (drug-name corrections, dictation commands, vitals formatting). The model cannot fabricate clinical content from priming because the priming is not there.

    AG finding addressed

    Section 4.3.2 — hallucinations. Most generative AI scribes use a chat-completions endpoint with a clinical-context system prompt; that prompt becomes the source of fabricated clinical content when the audio is thin or non-clinical. Removing the prompt closes that failure mode at the architecture level.

  2. Documented Accuracy Gate as a merge requirement

    A published gate (`docs/voice-accuracy/ACCURACY-GATE.md` in the ArcaKey codebase) requires that every PR touching the STT path includes before/after bake-off numbers across nine metrics: drug-name accuracy, word error rate, mental-health-phrase capture, fabrication count, negation preservation, wrong-language output, OOD fabrications, OOD refusal correctness, and latency. Any regression past threshold blocks merge. The merge gate is enforced in code, not in marketing copy.

    AG finding addressed

    Section 4.3.2 — inaccuracy across all categories. Section 4.3.4 — bias measurement requires evidence. A documented merge gate is the evidence; the bake-off corpus is testable against any vendor's claims.

  3. Multi-pass drug verifier against authoritative formulary

    Every potential drug token in the generated transcript is checked against the Health Canada Drug Product Database (DPD) via three parallel passes: brand-name embedding similarity, generic-name exact match, and structural-anomaly detection (any token immediately preceding a dose unit that is not in DPD or known salt-forms). Each token receives a verified / verify / unknown status that surfaces inline in the transcript.

    AG finding addressed

    Section 4.3.2 — 60% of vendors transcribed the wrong drug name. A drug verifier with DPD cross-check turns this from an emergent failure mode into a positive feature: every drug name in the record is either verified against the formulary or flagged for physician review before sign-off.

  4. Hallucination guard at the inference layer

    After transcription, the system computes the word rate relative to the audio duration. Any output exceeding six words per second is refused with a structured error code — the upper bound of plausible human speech is below this threshold, and exceeding it indicates priming-overrun or fabrication. The refusal is logged in the audit chain.

    AG finding addressed

    Section 4.3.2 — fabricated content. A 15-second 'testing 1, 2, 3' audio file should not produce a 250-word fabricated clinical note. The hallucination guard prevents this category of output from ever reaching the physician's sign-off surface.

  5. Per-encounter ML-DSA-65 signing + audit chain

    Every generated encounter is signed with ML-DSA-65 (NIST FIPS 204 post-quantum digital signature) at generation time. The signature is verifiable against a public key published at the vendor's `/docs/...attestation-pubkeys.json`. Every signed encounter is also recorded as a chained audit log entry (Ed25519 + ML-DSA-65), making tampering with the record detectable after the fact.

    AG finding addressed

    Section 4.3.3 — self-attestation cannot replace third-party documentation. Cryptographic signing replaces self-attestation with independent verifiability. The customer can verify any encounter's integrity without trusting the vendor.

  6. TEE-attested inference (Intel TDX + NVIDIA Confidential Computing)

    Speech-to-text and downstream LLM inference run inside a Trusted Execution Environment — Intel TDX + NVIDIA Confidential Computing, via the Phala CVM relay. The processing environment is cryptographically attested: the relay publishes a measurement digest verifiable against the expected measurement. The model provider's operators cannot access PHI during inference because the runtime memory is encrypted. And because the vendor never holds the plaintext, customer content is excluded from model-training and human-review pipelines by construction — not opted out by a policy toggle the clinician must trust, but mathematically excluded. The practical consequence is that a physician can use AI on real clinical material — dictating an encounter, or working a question through the private AI workspace — without those dictations or queries becoming anyone's training data.

    AG finding addressed

    Section 4.3.3 — protection of personal health data. Encryption in transit and at rest are baseline; encryption in use (the third layer of the triad) is what closes the data-processing exposure surface. PHIPA section 12(1)'s reasonable-steps standard is increasingly being read to include encryption in use where the technology is generally available, which it now is.

  7. OOD refusal corpus + zero-fabrication on out-of-distribution input

    The bake-off corpus includes deliberately non-clinical audio: silence, casual conversation, counting, dictated 'testing 1, 2, 3'. The architecture's refusal correctness on these samples is measured and published. Current ArcaVoice production: 0 OOD fabrications, 100% OOD refusal correctness.

    AG finding addressed

    Section 4.3.4 — bias mitigation requires evidence. OOD refusal correctness is concrete evidence of fabrication-resistance. A vendor that cannot demonstrate this metric against the test corpus has not actually tested for the failure mode the Auditor General identified.
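
The word-rate check described in pattern 4 is simple enough to sketch directly. This is a minimal illustration, not ArcaKey's implementation: the six-words-per-second limit comes from the text above, while the `GuardResult` type, the function name, and the error-code strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MAX_WORDS_PER_SECOND = 6.0  # threshold stated in pattern 4 above


@dataclass
class GuardResult:
    accepted: bool
    words_per_second: float
    error_code: Optional[str] = None  # hypothetical structured error code


def word_rate_guard(transcript: str, audio_seconds: float) -> GuardResult:
    """Refuse transcripts whose word rate exceeds plausible human speech."""
    if audio_seconds <= 0:
        return GuardResult(False, float("inf"), "INVALID_AUDIO_DURATION")
    rate = len(transcript.split()) / audio_seconds
    if rate > MAX_WORDS_PER_SECOND:
        return GuardResult(False, rate, "WORD_RATE_EXCEEDED")
    return GuardResult(True, rate)
```

Applied to the example in the text, a 250-word note attributed to 15 seconds of audio works out to roughly 16.7 words per second and is refused. A check of this kind only catches fabrication that inflates the word count; fabricated content delivered at a plausible rate needs the other controls in this list.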

The seven patterns above are the practical answer to the Auditor General's five findings. They are not the only valid architecture — a different vendor could build a different combination of controls that satisfies the same requirements. But each requirement, in the form the Auditor General named, has a concrete technical answer. A defensible AI scribe procurement asks for those answers, not vendor self-attestations.
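
As one illustration of such a technical answer, the verified / verify / unknown statuses from pattern 3 can be sketched with the standard library alone. Several things here are assumptions: the formulary is a toy set standing in for the Health Canada DPD, `difflib` string similarity stands in for the embedding-similarity pass, and the regular expression collapses the structural pass into "any word immediately before a dose". The names `verify_drug_tokens` and `DOSE_PATTERN` are hypothetical.

```python
import difflib
import re

# Hypothetical toy formulary; a real system would load the Health Canada DPD.
GENERIC_NAMES = {"metformin", "metoprolol", "atorvastatin"}
BRAND_NAMES = {"lipitor", "glucophage"}
KNOWN_NAMES = sorted(GENERIC_NAMES | BRAND_NAMES)

# A word immediately followed by a number and a dose unit, e.g. "metformin 500 mg".
DOSE_PATTERN = re.compile(
    r"\b([A-Za-z][A-Za-z-]+)\s+\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml|units)\b",
    re.IGNORECASE,
)


def verify_drug_tokens(transcript: str) -> list[tuple[str, str]]:
    """Return (token, status) for each word that precedes a dose."""
    results = []
    for match in DOSE_PATTERN.finditer(transcript):
        token = match.group(1).lower()
        if token in KNOWN_NAMES:
            status = "verified"   # exact formulary match
        elif difflib.get_close_matches(token, KNOWN_NAMES, n=1, cutoff=0.8):
            status = "verify"     # near miss: flag for physician review
        else:
            status = "unknown"    # dose-bearing token not in the formulary
        results.append((token, status))
    return results
```

For a transcript such as "metformin 500 mg and metformen 850 mg", the first token matches the toy formulary exactly and the second is a near miss that would be surfaced for review. The design point is that every dose-bearing token receives an explicit status instead of passing through silently.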


The Accuracy Gate pattern in detail

The single most consequential pattern in the Auditor General's findings is the absence of a documented accuracy gate. The 20 approved vendors did not, as a procurement requirement, publish accuracy thresholds against a fixed corpus. Each vendor's accuracy was evaluated on two simulated recordings by OntarioMD and Ontario Health medical evaluators after the fact. Every one of the 20 produced inaccuracies. The procurement had no objective floor to test against.

A documented accuracy gate is the answer. It looks like this: a published threshold per metric, a fixed test corpus, a bake-off harness that runs every change, and a merge gate in code that blocks any release that regresses past the threshold. The gate is not a policy document; it is an enforced engineering control.

The ArcaKey codebase publishes its accuracy gate at `docs/voice-accuracy/ACCURACY-GATE.md`. The gate enforces nine metrics. Each metric has a threshold relative to the prior production baseline. Any PR that touches the STT path is required to include before/after numbers in the PR description; any regression past threshold blocks merge.

| Metric | Threshold | Rationale |
| --- | --- | --- |
| Drug-name accuracy | ≥ baseline (zero drop) | The Auditor General found 60% of vendors transcribed the wrong drug name. The gate prevents any regression: a release that lowers drug-name accuracy at all cannot merge. |
| Word error rate (WER) | ≤ baseline + 0.01 | General transcription accuracy. The 0.01 (one percentage point) slack absorbs run-to-run noise. |
| Mental-health phrase capture | ≥ baseline | The Auditor General found 85% of vendors missed key mental-health details. The gate makes this a measured, threshold-protected metric in its own right, not an emergent property of general transcription quality. |
| Fabrication count | ≤ baseline | Hallucinated content. The Auditor General identified this as the most safety-relevant failure mode. |
| Negation preservation | ≥ baseline | The 'no masses found' / 'patient denies anxiety' / 'no allergies' category. Losing a 'no' in a clinical record inverts the clinical meaning entirely. |
| Wrong-language output | = 0 | A clinical English dictation that produces French output (or vice versa) is a record-integrity failure. Zero tolerance. |
| OOD fabrications | = 0 | Non-clinical audio (silence, casual conversation) must not produce fabricated clinical content. Zero tolerance. |
| OOD refusal correctness | = 100% | Every non-clinical audio sample must produce an explicit refusal rather than any clinical content. The bake-off corpus is the test. |
| Latency p50 | ≤ baseline × 1.25 | Secondary metric. Speed may not regress by more than 25%, which keeps the clinical workflow viable. |
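
The thresholds above translate almost directly into a merge check. The following is a minimal sketch under stated assumptions: the rules mirror the nine thresholds listed above, but the metric keys, the `gate_failures` function, and the dictionary-based input format are hypothetical and are not taken from `ACCURACY-GATE.md`.

```python
from typing import Callable

# Each rule takes (candidate_value, baseline_value) and returns True on pass.
# Metric keys are hypothetical names for the nine metrics listed above.
GATE_RULES: dict[str, Callable[[float, float], bool]] = {
    "drug_name_accuracy": lambda new, base: new >= base,
    "wer": lambda new, base: new <= base + 0.01,
    "mental_health_phrase_capture": lambda new, base: new >= base,
    "fabrication_count": lambda new, base: new <= base,
    "negation_preservation": lambda new, base: new >= base,
    "wrong_language_output": lambda new, base: new == 0,
    "ood_fabrications": lambda new, base: new == 0,
    "ood_refusal_correctness": lambda new, base: new == 1.0,
    "latency_p50_ms": lambda new, base: new <= base * 1.25,
}


def gate_failures(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Return the metrics that block merge; an empty list means the gate passes.

    The baseline must contain every metric; a metric missing from the
    candidate run is treated as a failure, not skipped.
    """
    failures = []
    for metric, rule in GATE_RULES.items():
        if metric not in candidate or not rule(candidate[metric], baseline[metric]):
            failures.append(metric)
    return failures
```

In a CI job, a non-empty result would be turned into a non-zero exit status so the merge is blocked. Treating a missing metric as a failure is a deliberate choice in this sketch: a bake-off run that silently omits a metric should not be able to pass the gate.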

The Accuracy Gate is the procurement requirement that Supply Ontario's request for bids (RFB) lacked. A defensible AI scribe procurement asks the vendor for their published accuracy gate, their fixed test corpus, and their last 12 months of bake-off run logs. A vendor that cannot produce these documents has not been engineering against the failure modes the Auditor General identified.


The cryptographic chain — what replaces self-attestation

The Auditor General's Section 4.3.3 finding is structural: self-attestation cannot be the basis for security trust. Eleven of twenty vendors declared compliance with SOC 2 / HITRUST / ISO 27001 without submitting the reports themselves. The procurement evaluation did not verify the declarations. The customer (Ontario clinicians, ultimately) has no way to independently confirm any of those declarations after the fact.

The architecture pattern that replaces self-attestation is cryptographic signing. Every encounter the AI scribe produces is signed at generation time with ML-DSA-65 (NIST FIPS 204 post-quantum digital signature). The signature is verifiable against a public key the vendor publishes at a fixed URL on their domain. Every encounter is also recorded as a chained audit log entry — modification or deletion of any entry breaks the chain in a way verifiable against the chain head's signature.
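
The chaining half of this pattern can be sketched with the standard library. This is an illustration under assumptions, not the ArcaKey audit-chain format: it uses SHA-256 hash links only, the entry fields and function names are hypothetical, and the ML-DSA-65 signature over the chain head is omitted because it requires a post-quantum library outside the standard library.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for an empty chain


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], payload: dict) -> None:
    """Append a payload, linking it to the current chain head."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; an edited or removed interior entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Hash links alone do not detect truncation: dropping the most recent entries leaves a shorter chain that still verifies. That is why the head needs a signature the customer can check independently, so that the last hash can be compared against a signed value.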

What this gives the customer is independent verifiability. A custodian asked by the Information and Privacy Commissioner to demonstrate that their AI scribe records have not been tampered with can verify the signature on every encounter against the public key, without involving the vendor at all. A custodian responding to a sponsor audit can produce the chain head's signature as proof that no entries have been altered since the encounter occurred. The vendor's role is the architecture; the verification belongs to the customer.

This is what the Auditor General's third-party-report requirement was reaching for. Third-party reports (SOC 2, HITRUST, ISO 27001) attest to a vendor's processes; they do not attest to any individual encounter's integrity. Cryptographic signing attests to each encounter directly. The defensible position is to publish both: third-party reports for the vendor's organizational controls, ML-DSA-65 signatures for each encounter's integrity. Neither alone is sufficient. Both together are.


What a defensible AI scribe RFP would actually require

This is the section a hospital or clinic CIO copies into their own RFP when evaluating any AI scribe vendor. Each criterion translates an Auditor General finding into a procurement requirement. The criteria are independent — a vendor must satisfy each, not aggregate across them.

  1. Criterion 1

    Documented accuracy gate with fixed test corpus

    Vendor publishes a thresholds document. Bake-off harness runnable on demand. Every release ships with bake-off numbers. Any regression past threshold blocks merge in the vendor's CI/CD. Vendor produces the last 12 months of run logs on request.

  2. Criterion 2

    Drug-name verification against named authoritative formulary

    Vendor names the formulary used (Health Canada DPD for Ontario use). Vendor describes the verification mechanism (embedding similarity, exact match, structural anomaly). Drug-name flags appear inline in the transcript. The customer can audit which drugs were verified, which were flagged for review, and which were unknown.

  3. Criterion 3

    Hallucination guard at the inference layer

    Vendor publishes the word-rate threshold (or equivalent fabrication-resistance control). The control runs server-side, not as a UI affordance. Refusals are logged in the audit chain. The customer can verify that the control is operating.

  4. Criterion 4

    Per-encounter cryptographic signing (post-quantum)

    Vendor publishes an attestation public key at a fixed URL on their domain. Each encounter ships with an ML-DSA-65 (or equivalent FIPS 204) signature over the encounter contents. The customer can verify the signature independently. Tampering with the encounter content breaks the signature.

  5. Criterion 5

    TEE-attested inference

    Vendor specifies the TEE technology used (Intel TDX, AMD SEV-SNP, NVIDIA Confidential Computing). Vendor publishes an attestation endpoint (e.g., `/attestation`) that returns the live measurement digest. Customer can compare the live digest against the expected digest. Customer can verify the runtime environment cryptographically.

  6. Criterion 6

    Out-of-distribution refusal correctness corpus

    Vendor's bake-off corpus includes deliberately non-clinical samples. Vendor publishes refusal-correctness numbers on each release. Customer can submit OOD samples and observe refusal behavior in real time during procurement evaluation.

  7. Criterion 7

    Live demonstration of end-to-end workflow

    Vendor demonstrates on a live URL with customer-supplied audio recordings provided in real time. The customer provides a non-clinical OOD sample as part of the demonstration to observe refusal behavior. The generated notes are signed at the time of generation and verifiable in real time.

  8. Criterion 8

    Sub-processor inventory and BAA chain

    Vendor publishes a sub-processor list with current BAA / DPA status per sub-processor. The list is updated within seven days of any sub-processor change. The customer's own BAA chain has a current upstream view.

  9. Criterion 9

    Zero Data Retention contractual posture

    Vendor's contracts with downstream AI providers (model provider, cloud LLM) include Zero Data Retention provisions where applicable. The contracts are produced for review. Audio is not persisted by the vendor — transcript ciphertext is the only retained artifact.

  10. Criterion 10

    Exclusion from model training — architectural, not a policy toggle

    Vendor demonstrates that clinician dictations, prompts, and completions are not used to train models and are not routed to human-review pipelines. The strongest form is architectural: where inference runs inside a TEE and the vendor never holds the plaintext, the content is mathematically excluded from training rather than opted out by a setting the customer must trust. Where a downstream model provider is used, a Zero Data Retention contract that explicitly bars training and human review is produced for review. A clinician should be able to use AI on real clinical material without their content becoming someone else's training data.

  11. Criterion 11

    Independent technical review

    Vendor names a credentialed external expert who has reviewed the architecture. Customer can verify the expert's credentials and engagement letter. The expert signs releases on a documented cadence (quarterly methodology review minimum).
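
Criterion 5 asks that the customer be able to compare the live measurement digest against the expected one. A minimal sketch of just that comparison step follows, assuming both digests arrive as hex strings. The function names are hypothetical, and fetching the digest from the vendor's endpoint is left out because the response format is vendor-specific.

```python
import hmac


def normalize_digest(digest: str) -> bytes:
    """Convert a hex digest (optionally 0x-prefixed, any case) to raw bytes."""
    cleaned = digest.strip().lower()
    if cleaned.startswith("0x"):
        cleaned = cleaned[2:]
    return bytes.fromhex(cleaned)  # raises ValueError on malformed hex


def digest_matches(live: str, expected: str) -> bool:
    """Constant-time comparison of the live digest against the expected one."""
    try:
        return hmac.compare_digest(normalize_digest(live), normalize_digest(expected))
    except ValueError:
        return False  # malformed input is treated as a mismatch
```

Comparing digests is only the last step of attestation. A complete check would also validate that the digest comes from a quote signed by the hardware vendor's attestation keys; otherwise the endpoint could return any value it likes.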


Self-audit checklist for Ontario clinics evaluating AI scribes today

An Ontario clinic that has already adopted an AI scribe — or is evaluating one — can run this checklist against their current or candidate vendor without engaging any external party. The output is a defensible internal record. Items that cannot be answered affirmatively are the remediation backlog, and become the questions to ask the vendor directly.

  1. Can your AI scribe vendor produce their published accuracy gate?

    If yes — what metrics does it cover, and what are the current production baselines? If no — they are not engineering against the failure modes the Auditor General identified in Section 4.3.2.

  2. Can your vendor produce bake-off run logs for the last 12 months?

    If yes — review the trend. Are drug-name accuracy and mental-health-phrase capture rising or falling release-over-release? If no — there is no auditable record that accuracy is being managed at all.

  3. Are drug names cross-checked against the Health Canada DPD?

    If yes — what visible mechanism flags drugs that are unverified, mis-spelled, or off-formulary in the transcript? If no — the 60% wrong-drug failure mode the Auditor General found is uncontrolled in your records.

  4. Is hallucinated content prevented at the inference layer, not the UI?

    A UI affordance ('please review the note before signing') is not a hallucination control. A server-side check that refuses output exceeding a published threshold is. Ask the vendor which one their system uses.

  5. Is each encounter cryptographically signed?

    If yes, what algorithm (ML-DSA-65 is the NIST-standardized post-quantum option cited here), and where is the public key published? If no, you cannot prove an encounter has not been altered since generation, which weakens any showing under PHIPA section 12(1) that records were protected against unauthorized modification.

  6. Does the vendor publish a TEE attestation digest?

    If yes — what TEE technology (Intel TDX, AMD SEV-SNP, NVIDIA Confidential Computing), and how do you verify the live digest matches expected? If no — PHI is being processed in plaintext memory that the model provider's operators can in principle access.

  7. Does the vendor publish their out-of-distribution refusal corpus?

    If yes — can you observe refusal behavior on a non-clinical sample (silence, counting, casual conversation) during procurement evaluation? If no — the bias mitigation the Auditor General required in Section 4.3.4 is not measurable.

  8. Is the vendor's sub-processor chain published and current?

    If yes — review it quarterly. Are there changes you were not notified about? If no — your own BAA chain is built on undocumented foundations.

  9. Will the vendor demonstrate the system live, end-to-end, with audio you provide?

    If yes — schedule the demonstration. Bring an OOD sample. If no — you are evaluating offline outputs, not the running system. Section 4.3.5 of the Auditor General's report is uncomfortably specific about why this matters.


Sources and citations

Every claim in this methodology is grounded in one of the named external standards below. Each is publicly available; each is the starting point for a custodian's own analysis.

  1. Office of the Auditor General of Ontario — Performance Audit 2026: Use of Artificial Intelligence in the Ontario Government

    Special Report, May 2026. Section 4.3 (Safe Use of AI Scribe Systems) and Recommendations 5-9. Available at auditor.on.ca.

  2. Personal Health Information Protection Act, 2004 (PHIPA)

    Ontario S.O. 2004, c. 3, Sched. A. Sections 12(1), 12(2), and 17 are particularly relevant to AI Scribe defensibility. Current consolidation at ontario.ca/laws.

  3. Enhancing Digital Security and Trust Act, 2024

    Ontario, in force January 2025. The legislative basis for the Ontario AI Directive and the OPS AI Framework cited by the Auditor General.

  4. OPS AI Directive and AI Framework

    Ministry of Public and Business Service Delivery and Procurement, published September 2023, governs AI use across the Ontario Public Service.

  5. NIST FIPS 203 (Module-Lattice-Based Key-Encapsulation Mechanism)

    Post-quantum key encapsulation (ML-KEM-768). The encryption-at-rest and key-management standard cited in the architecture patterns.

  6. NIST FIPS 204 (Module-Lattice-Based Digital Signature)

    Post-quantum signature scheme (ML-DSA). The standard cited for per-encounter signing.

  7. NIST AI Risk Management Framework (AI RMF 1.0) and Generative AI Profile (NIST AI 600-1)

    The AI governance and workflow risk standard, including the generative-AI-specific addendum.

  8. NHS England — AI Scribe Guidance, April 2025

    Cited in Section 4.3.2 of the Ontario AG report as the comparable UK regulatory posture, requiring registration with the UK Medicines and Healthcare products Regulatory Agency as Class I medical devices.

  9. Information and Privacy Commissioner of Ontario — orders and guidance

    Recent orders relevant to AI-assisted workflows, including the December 2024 breach order referenced in Section 4.3.3 of the Auditor General's report. Available at ipc.on.ca.

  10. ArcaKey AI — published security artifacts

    arcakey.ai/security publishes the live attestation endpoint, sub-processor inventory, threat model, and cryptographic whitepaper referenced in the architecture patterns above. The Accuracy Gate document lives in the ArcaKey codebase at docs/voice-accuracy/ACCURACY-GATE.md.

This document is reference methodology, not legal counsel. It is intended for use by Ontario clinicians, hospital CIOs, procurement officers, and the privacy counsel who advise them, as a starting framework for evaluating AI scribe vendors against the failure modes identified by the Office of the Auditor General of Ontario in its 2026 Performance Audit. ArcaKey AI is not a law firm and does not practice law. Where a finding requires legal interpretation, the document explicitly recommends engaging counsel qualified in the relevant jurisdiction. The 20 vendors currently approved by Supply Ontario through the AI Scribe Vendor of Record arrangement are not named in this document and no comparison to specific approved vendors is intended; the framework is presented for the reader to apply independently.
