The Checklist Came Before the Model
Reliability on top of an unreliable component is treated as AI's newest engineering problem. Anesthesia solved it forty years ago and wrote down what worked.
AI engineering treats reliability on top of unreliable components as its newest problem. One deployment firm's whitepaper asks the question plainly: generative AI is transformative and unpredictable, so can dependable systems be built from an undependable ingredient? They can. Medicine has been building them for a century. The undependable ingredient was the clinician.
An operating room is a machine for getting survivable outcomes out of tired, distractible, overconfident, brilliant people. Almost nothing in that room assumes the people are reliable. The room assumes the opposite. The checklist on the wall, the second signature on the blood bag, the pharmacist reviewing the order, the conference where the team takes apart last month's death: each one is engineering around a component that fails. The component was never fixed. The system around it was.
The Anesthesia Record
Start with the cleanest case in the patient safety literature. In the early 1980s, anesthesia killed roughly one patient in ten thousand. The profession knew it, the malpractice carriers knew it, and in 1982 ABC told everyone else in a special called The Deep Sleep: six thousand Americans a year would die or suffer brain damage under anesthesia. Anesthesiologists were about three percent of physicians and eleven percent of malpractice payouts.
What happened next was not a smarter anesthesiologist. Jeffrey Cooper had already shown in 1978 that most anesthesia mishaps were preventable human error with patterns that repeat, which made them a systems problem, not a talent problem. The specialty founded the Anesthesia Patient Safety Foundation in 1985, the first organization anywhere with patient safety in its name, and began reading closed malpractice claims the way crash investigators read wreckage. The files showed that respiratory events, the largest single class of death and brain damage claims, were judged preventable with better monitoring in seventy-two percent of cases. Harvard wrote mandatory minimal monitoring standards for its operating rooms in 1986. Pulse oximetry became an ASA standard for every anesthetic in 1990, capnography for confirming every intubation in 1991.
By the 2000s, anesthesia mortality in healthy patients was being cited on the order of one in one hundred thousand or better. That is an order of magnitude, and it did not depend on making any single human more reliable. The premiums followed: anesthesia malpractice rates fell by roughly two-thirds in the five years after the monitoring standards arrived. The people improved too, with training and simulation, and that work is real. But the mortality curve bent when the system changed.
Hold the honest caveat next to the number. No randomized trial ever proved the pulse oximeter reduced mortality, and Cochrane reviews say so. The monitors arrived together with standards, training, better drugs, and machines designed to fail safe, so no single component can claim the credit. That is not a hole in the story. That is the story. The system improved, not the component.
exhibit 01
The unreliable component was never fixed
- 1 · 1978 · Cooper publishes the critical incident study. Most anesthesia mishaps are preventable human error with repeating patterns.
- 2 · 1982 · ABC airs The Deep Sleep: six thousand Americans a year will die or suffer brain damage under anesthesia.
- 3 · 1985 · The specialty founds the Anesthesia Patient Safety Foundation and starts the closed claims project.
- 4 · 1986 · Harvard makes minimal monitoring standards mandatory in its operating rooms.
- 5 · 1990 · Pulse oximetry becomes an ASA standard for every anesthetic.
- 6 · 1991 · Capnography becomes the ASA standard for confirming every intubation.
What the Checklist Proved
The surgical checklist made the same point with a prospective before-and-after study. In 2009, Haynes and colleagues published the WHO checklist study in the New England Journal of Medicine: nineteen items, three pause points, eight hospitals in eight cities from Seattle to Ifakara, 3,733 operations before and 3,955 after. Deaths fell from 1.5 percent to 0.8 percent. Major complications fell from 11 percent to 7 percent. The checklist taught nobody surgery. It made the room say out loud what the system already knew: who the patient is, what the operation is, what could go wrong, whether the antibiotic is in.
Then hold the second caveat, because it carries the operating lesson. When Ontario mandated the checklist province-wide, the published result showed no significant improvement in deaths or complications. The fight over why is its own literature, but the short version is enough here: a checklist signed without being performed is paper. Mandates buy signatures. They do not buy the behavior.
exhibit 02
Nineteen items, eight cities, one wrapper
- Deaths: 1.5 percent before the checklist, 0.8 percent after (47% lower)
- Major complications: 11 percent before the checklist, 7 percent after (36% lower)
Ontario, 2014. Province-wide mandate, no significant outcome change. A checklist signed without being performed is paper.
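The "47% lower" and "36% lower" figures are relative reductions computed from the before and after rates reported for the Haynes study. A minimal sketch of that arithmetic follows; the function name `relative_reduction` is chosen here for illustration and does not come from the study.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.47 means 47% lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Rates as reported in the text, in percent of operations.
death_drop = relative_reduction(1.5, 0.8)
complication_drop = relative_reduction(11.0, 7.0)

print(f"deaths: {death_drop:.0%} lower")
print(f"major complications: {complication_drop:.0%} lower")
```

The absolute change is smaller than the relative one: deaths moved by 0.7 percentage points, which is why the two numbers should be read together.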
A First-Day Resident, Forever
Now place the model. A large language model drafting clinic notes or coding claims is a first-day resident, forever. Fast, widely read, eager, and wrong in confident ways. A resident improves with every rotation. This one does not. It never gets tired, and it never learns your patients unless someone builds the learning around it. Medicine has employed exactly this person for a hundred years, and what it does with that person is not hope harder. It builds the room.
The attending co-signs: the resident's note becomes the chart only after a named physician reviews and signs, and a model's draft should reach the chart, or the claim, the same way. The machine gets checked before the case, not during: anesthesia machines pass a pre-use check while the patient is still outside the room, and a model should pass its eval gate before deployment and after every update, not after the first bad note reaches a payer. The monitor watches the live case: a pulse oximeter does not grade the anesthesiologist in general, it tells the room whether this patient is hypoxic right now. Monitoring on live output is the oximeter; the launch benchmark is only the diploma on the wall. The conference reviews the failure: morbidity and mortality conferences take apart the case rather than the person, and the findings feed standards, so logged model failures need the same standing meeting with the same rule: improve the system before blaming the component. And the registry compounds: the closed claims project turned scattered lawsuits into the specialty's shared memory, which is exactly what an incident log that feeds the next eval set does for a deployment.
One economic fact makes the whole wrapper viable, in residency programs and in software: review costs less than authorship. The attending does not rewrite the note. The clinician who reviews a drafted chart entry spends minutes where writing it would have taken the evening. If review were as expensive as the work itself, supervision would be a tax nobody pays. It is cheap enough that medicine has run on it for a century, which is the quiet answer to the question every practice owner asks first: who has time to review all of this?
Medicine never made the human reliable. It made the room reliable.
exhibit 03
The wrapper, translated
- The clinic built: Attending co-signature. The resident's note becomes the chart only after a named physician reviews and signs.
  The deployment needs: A named clinician reviews every draft before it exports to the chart or the claim.
  The failure it catches: The confident wrong answer.
- The clinic built: Anesthesia machine check before every case, not during.
  The deployment needs: An eval gate the system must pass before deployment and after every update.
  The failure it catches: The broken instrument.
- The clinic built: Pulse oximeter on this patient, this minute.
  The deployment needs: Monitoring on live output, not the launch benchmark.
  The failure it catches: The failure in progress.
- The clinic built: Morbidity and mortality conference. The case gets taken apart, not the person.
  The deployment needs: A standing review of the failure log that changes the system, on a schedule.
  The failure it catches: The repeating error.
- The clinic built: Closed claims registry across the whole specialty.
  The deployment needs: An incident log that feeds the next eval set.
  The failure it catches: The pattern no single case shows.
- The clinic built: Pharmacist double check on high-risk orders.
  The deployment needs: An independent second check on high-risk output: doses, codes, dollar amounts.
  The failure it catches: The expensive miss.
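For readers who think in code, the wrapper can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a reference implementation: the names `Draft`, `Wrapper`, `toy_model`, and every method are hypothetical, the eval gate is reduced to a substring match, and the live check is a single toy rule. The sketch only shows where each gate sits: no drafting before the gate passes, no export without a named reviewer, and every logged incident becomes a new eval case.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Draft:
    text: str
    flags: List[str] = field(default_factory=list)  # live checks that failed
    reviewer: Optional[str] = None                   # named clinician, once signed


@dataclass
class Wrapper:
    model: Callable[[str], str]        # hypothetical: prompt -> draft text
    eval_cases: List[Tuple[str, str]]  # (prompt, text the draft must contain)
    live_checks: Dict[str, Callable[[str], bool]] = field(default_factory=dict)
    incident_log: List[Tuple[str, str]] = field(default_factory=list)
    gate_passed: bool = False

    def run_eval_gate(self) -> bool:
        """Pre-use check: every known case must pass before drafting is allowed."""
        self.gate_passed = all(
            expected in self.model(prompt) for prompt, expected in self.eval_cases
        )
        return self.gate_passed

    def draft(self, prompt: str) -> Draft:
        """Produce a draft and run the live checks on this output (the oximeter)."""
        if not self.gate_passed:
            raise RuntimeError("eval gate not passed")
        text = self.model(prompt)
        flags = [name for name, ok in self.live_checks.items() if not ok(text)]
        return Draft(text=text, flags=flags)

    def cosign(self, draft: Draft, reviewer: str) -> Draft:
        """Co-signature: record the named reviewer who read the draft."""
        draft.reviewer = reviewer
        return draft

    def export(self, draft: Draft) -> str:
        """Only a co-signed draft reaches the chart or the claim."""
        if draft.reviewer is None:
            raise PermissionError("unsigned draft cannot be exported")
        return draft.text

    def log_incident(self, prompt: str, expected: str) -> None:
        """Registry: a logged failure becomes a new eval case; the gate must be rerun."""
        self.incident_log.append((prompt, expected))
        self.eval_cases.append((prompt, expected))
        self.gate_passed = False


def toy_model(prompt: str) -> str:
    # Stand-in for a real model call; it just echoes the prompt.
    return "Draft note: " + prompt


wrapper = Wrapper(
    model=toy_model,
    eval_cases=[("follow-up for knee pain", "knee pain")],
    live_checks={"not_empty": lambda text: bool(text.strip())},
)
wrapper.run_eval_gate()
note = wrapper.cosign(wrapper.draft("follow-up for knee pain"), reviewer="Dr. Example")
print(wrapper.export(note))
```

The design choice worth noticing is that `log_incident` resets `gate_passed`. A failure found in production does not just get written down; it blocks further drafting until the system passes a gate that now includes that failure.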
Where the Analogy Stops
The resident comparison has a hard edge, and the edge is load-bearing. A resident carries a license, malpractice exposure, and a career that ends if the board acts. The model carries nothing. Every consequence lands on the reviewing clinician and the practice that deployed it. That asymmetry is not a reason to avoid the tools. It is the reason the wrapper is not optional: supervision is what you build when the component cannot be accountable. And there is precedent for who eventually enforces it. Malpractice carriers priced the anesthesia wrapper within five years of the standards. Carriers price wrappers. They will price these.
What a Practice Should Ask
The history collapses into four questions a practice can ask any vendor, or itself, before an AI product touches a chart or a claim. Where is the co-signature: which named person reviews the draft before it exports, and is the review real or only a signature? Where is the pre-use check: what known cases must the system pass before going live, and after every update? Where is the oximeter: what watches live output, and what trips it? Where is the conference: who reads the failure log, on what schedule, and what changed last month because of it? A vendor with answers is selling a system. A vendor with a benchmark score is selling a diploma.
The signals are ordinary operating numbers. Notes amended after review. Denials per hundred claims. Minutes of after-hours charting, counted net of the minutes now spent reviewing drafts. If the wrapper works, those numbers move within a quarter, and the practice can watch them move.
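The three signals reduce to simple arithmetic on numbers a practice already has. The sketch below is one reading of them, with hypothetical function names and made-up inputs. In particular, it assumes "counted net of" means the after-hours minutes saved against a baseline, minus the minutes now spent reviewing drafts; a practice may reasonably define it differently.

```python
def amend_rate(notes_amended: int, notes_reviewed: int) -> float:
    """Share of reviewed notes the reviewer changed before signing."""
    return notes_amended / notes_reviewed


def denials_per_hundred(claims_denied: int, claims_submitted: int) -> float:
    """Denied claims per one hundred submitted."""
    return 100 * claims_denied / claims_submitted


def net_minutes_saved(
    after_hours_before: float, after_hours_now: float, review_minutes_now: float
) -> float:
    """After-hours charting saved against baseline, minus time spent reviewing drafts.

    Assumption: this is one interpretation of "net"; a negative result means
    review is costing more time than drafting is saving.
    """
    return (after_hours_before - after_hours_now) - review_minutes_now


# Illustrative numbers only; none of these come from a real practice.
print(amend_rate(notes_amended=30, notes_reviewed=200))
print(denials_per_hundred(claims_denied=12, claims_submitted=400))
print(net_minutes_saved(after_hours_before=600, after_hours_now=200, review_minutes_now=150))
```

Tracking the same three functions quarter over quarter is enough to tell whether the numbers are moving.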
The patient safety shelf never mentions a model. Cooper on critical incidents, the Harvard standards, the closed claims files, Haynes on the checklist. Every one of them describes how to get good outcomes from a brilliant component that fails without warning, which is a precise description of the thing AI teams are deploying now. The checklist came before the model. Read it that way.