Run Controls Before Patient Samples

Every clinical lab proves its analyzer still tells the truth each morning before the first patient sample. AI engineering calls that eval-driven development and considers it advanced.

Before a clinical laboratory reports a single patient result, a technologist runs samples with known answers through the analyzer and confirms the machine still tells the truth. The method goes back to 1950, when Stanley Levey and E.R. Jennings took Walter Shewhart's industrial control charts from Bell Labs and put them on the lab bench. AI engineering is reinventing the discipline right now under the name eval-driven development. The idea is right. It is also seventy-five years old, and the lab's version is still the more rigorous one.

The morning routine is concrete. Control material is a manufactured sample with a known target value. The technologist runs at least two levels before patient testing: one normal, one abnormal. The easy case and the hard case. The results land on a Levey-Jennings chart, a plot of control values over time with bands at one, two, and three standard deviations from the mean. If the controls land where they should, patient samples run. If they do not, the instrument stops, and the morning becomes troubleshooting instead of testing. Daily controls were not universal in 1950; the ritual hardened over decades, and federal law eventually wrote it down. But the logic has not changed since Shewhart: do not trust an instrument you have not challenged today.
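The arithmetic behind the chart is small enough to sketch. The Python below is a minimal illustration, not a laboratory's actual procedure: the baseline scores, the function names, and the 0.84 reading are all invented for the example. It assumes the mean and standard deviation are taken from a run of in-control results.

```python
from statistics import mean, stdev

def control_limits(baseline):
    """Return (mean, sd) estimated from a baseline of in-control scores."""
    return mean(baseline), stdev(baseline)

def z_score(value, center, sd):
    """Distance of one control result from the mean, in standard deviations."""
    return (value - center) / sd

def band(z):
    """Name the Levey-Jennings band that a z-score falls into."""
    distance = abs(z)
    if distance <= 1:
        return "within 1 SD"
    if distance <= 2:
        return "within 2 SD"
    if distance <= 3:
        return "within 3 SD"
    return "beyond 3 SD"

# Hypothetical baseline: daily accuracy on a fixed known-answer set.
baseline = [0.90, 0.92, 0.91, 0.89, 0.93, 0.90, 0.91, 0.92]
center, sd = control_limits(baseline)

# Hypothetical reading from this morning's control run.
today = 0.84
print(band(z_score(today, center, sd)))
```

Each morning's point is plotted as its z-score, so the bands stay at the same height on the chart regardless of the unit the score is measured in.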

Error Has Shapes

The chart got sharper in 1981, when James Westgard published the multirule that now carries his name. The insight is that error has shapes, and each shape leaves a signature. One point past three standard deviations is a spike: random error, the badly mixed reagent, the bubble in the line. Two consecutive points past two standard deviations on the same side are a shift: something systematic changed, a recalibration, a new lot, an aging lamp. Ten consecutive points on one side of the mean are drift: the slow failure, the one a glance never catches. A technologist reading the chart is not asking whether the machine is good. The technologist is asking which way it is failing, because each shape has a different cause and a different fix.
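The three shapes can be written as three short checks over a series of z-scores. This is a simplified sketch of only the rules named above, not the full Westgard multirule, and the function names are invented for the illustration. It assumes the series is ordered oldest to newest and that each check looks at the most recent points.

```python
def rule_1_3s(z):
    """Spike: the latest point is more than 3 SD from the mean."""
    return len(z) >= 1 and abs(z[-1]) > 3

def rule_2_2s(z):
    """Shift: the last two points are both past 2 SD on the same side."""
    if len(z) < 2:
        return False
    a, b = z[-2], z[-1]
    return (a > 2 and b > 2) or (a < -2 and b < -2)

def rule_10_x(z):
    """Drift: the last ten points all fall on one side of the mean."""
    if len(z) < 10:
        return False
    last = z[-10:]
    return all(v > 0 for v in last) or all(v < 0 for v in last)

def classify(z):
    """Return the shapes triggered by the most recent control points."""
    found = []
    if rule_1_3s(z):
        found.append("spike")
    if rule_2_2s(z):
        found.append("shift")
    if rule_10_x(z):
        found.append("drift")
    return found

# Hypothetical z-score histories, oldest first.
print(classify([0.2, -0.5, 3.4]))   # one point past 3 SD
print(classify([0.1, 2.3, 2.6]))    # two in a row past +2 SD
print(classify([0.4, -0.3, -0.6, -0.2, -0.8, -0.5,
                -0.4, -0.7, -0.3, -0.9, -0.6]))  # ten in a row below the mean
```

Note that the drift case never leaves the one-SD band. No single point in it would draw attention, which is why the rule counts consecutive points instead of measuring distance.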

exhibit 01

A Levey-Jennings chart for a model

Legend: in control; 1:3s violation, one point past three SD; 10:x violation, ten consecutive points on one side of the mean.

An illustrative month for a claims-coding model scored each morning against a fixed known-answer set, plotted in standard deviations from its established mean. The pattern is the point, not the data. Morning 11 is the spike a glance catches. Mornings 21 through 30 are the drift a glance misses, which is exactly why the 10:x rule exists. Control rules from Westgard et al., Clinical Chemistry, 1981.

The Analyzer You Do Not Own

Now swap the analyzer for a model. A practice that codes claims with a hosted model is running an instrument it does not own, cannot inspect, and did not calibrate, and the vendor can change it overnight. This is measured, not hypothetical. In 2023, Stanford and Berkeley researchers tested the same model on the same task three months apart: GPT-4's accuracy at identifying prime numbers fell from 84 percent in March to 51 percent in June. Same product name, same API, same price. Different instrument. Same label on the reagent bottle, different chemistry inside.

To be fair to the vendors, models change on purpose, and a careful team can pin a version or run its own weights. That does not weaken the lab's lesson. It is the lab's lesson. The chart is how you know which world you are in. A laboratory running morning controls would have caught the June change on the first Tuesday, before the first patient sample. A team without controls finds out from its users, or never.

The lab does not trust the analyzer because it worked yesterday.

The mapping is direct. Morning controls become a fixed set of known-answer cases that runs against the live system on a schedule, before the day's work. Two control levels become the easy set and the hard set: the routine office visit, and the messy multi-problem encounter with the drug interaction. The Levey-Jennings chart becomes the score plotted over time, per workflow, somewhere a person actually looks. The Westgard rules become pull rules with shapes: a spike means inspect the run, a shift means something shipped, a drift means the case mix or the model is moving and the system comes out of service until someone knows which. Proficiency testing, the federal requirement that an outside program send blind samples and grade the lab against its peers, becomes the external eval: cases the team did not write, scored by someone with no stake in the answer. And recalibration after maintenance becomes the rule that every change, a prompt edit, a model swap, a vendor update, reruns the controls before the system touches live work.
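The first item in that mapping, a fixed known-answer set with two levels, might look like the sketch below. Everything named here is hypothetical: the case texts, the placeholder labels such as CODE_A (deliberately not real billing codes), and the stub_predict function standing in for whatever call reaches the live system. Exact-match scoring is an assumption too; a real workflow may need a more forgiving comparison.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (input text, expected label)

def score(cases: List[Case], predict: Callable[[str], str]) -> float:
    """Fraction of known-answer cases the system gets exactly right."""
    hits = sum(1 for text, expected in cases if predict(text) == expected)
    return hits / len(cases)

def run_controls(levels: Dict[str, List[Case]],
                 predict: Callable[[str], str]) -> Dict[str, float]:
    """Score every control level separately, like normal and abnormal controls."""
    return {name: score(cases, predict) for name, cases in levels.items()}

# Hypothetical control set with placeholder labels.
levels = {
    "easy": [
        ("routine office visit", "CODE_A"),
        ("annual physical", "CODE_B"),
    ],
    "hard": [
        ("multi-problem encounter with drug interaction", "CODE_C"),
        ("follow-up with two chronic conditions", "CODE_D"),
    ],
}

# Stub standing in for the live system; it misses one hard case on purpose.
_answers = {
    "routine office visit": "CODE_A",
    "annual physical": "CODE_B",
    "multi-problem encounter with drug interaction": "CODE_C",
    "follow-up with two chronic conditions": "CODE_A",
}

def stub_predict(text: str) -> str:
    return _answers[text]

scores = run_controls(levels, stub_predict)
print(scores)
```

Keeping the two levels as separate scores matters for the same reason the lab runs two control levels: a system can hold steady on routine cases while quietly losing the hard ones.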

exhibit 02

Lab discipline, model discipline

| The lab runs | The model deployment runs | The failure it names |
| --- | --- | --- |
| Two control levels each morning, one normal and one abnormal, before any patient sample runs. | A known-answer case set, easy and hard, runs against the live system before the day's work. | The broken instrument. |
| The Levey-Jennings chart on the wall, one dot per run, bands at one, two, and three SD. | Scores plotted over time, per workflow, somewhere a person actually looks. | The trend a glance misses. |
| Westgard 1:3s. One point past three SD stops the run. | Spike rule. One failed morning pulls the workflow for inspection before live work. | Random error. The bad day. |
| Westgard 2:2s and 4:1s. Consecutive points past the same limit. | Shift rule. A vendor update or a prompt edit changed the instrument. | The systematic change. |
| Westgard 10:x. Ten consecutive points on one side of the mean. | Drift rule. The case mix or the model is moving. Out of service until someone knows which. | The slow failure. |
| CLIA proficiency testing. Blind samples from an outside program, graded against peers. | An external eval the team did not write, scored by someone with no stake in the result. | Grading your own homework. |
| Recalibration and controls after every maintenance event. | Controls rerun after every prompt edit, model swap, or vendor update, before live work resumes. | The silent update. |

The left column has run in clinical laboratories for decades and is written into federal regulation. The right column is the same discipline pointed at a model. None of it judges the individual case. The clinician still reviews the note, the coder still owns the exception. Controls catch the instrument, not the judgment.

A License, Not a Best Practice

Here is the difference worth sitting with. In software, evals are a best practice, which in practice means a thing deadlines defer. In the laboratory, controls are regulation. CLIA, the 1988 federal law that governs every clinical lab in the country, was written after investigations found Pap smear screening mills missing cancers at volume. It requires quality control and grades every lab on blind proficiency samples sent by an outside program, and a lab that keeps failing loses the right to run that test. The discipline did not survive because laboratorians are more virtuous than software engineers. It survived because the alternative was patient deaths made public, and the law made stopping cheaper than being wrong.

Nobody licenses a claims-coding model, but the artifacts are buildable in a week: a case set, a nightly run, a chart, a pull rule. The chart is the cheap part. The expensive part is the rule that stops the instrument when the chart says stop, on a Tuesday, with claims waiting and a vendor on the phone saying it looks fine on our end. That rule is the whole discipline. Everything else is decoration around it.
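The pull rule itself can be stated in a few lines, which is part of the point: the hard part is organizational, not technical. The sketch below is one possible shape for it, with invented names (ControlRun, may_serve) and an assumed 24-hour freshness window. It refuses live work unless the most recent control run passed, is recent, and finished after the last change to the system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ControlRun:
    finished_at: datetime
    passed: bool  # True only if no pull rule tripped on this run

def may_serve(now: datetime,
              last_run: Optional[ControlRun],
              last_change: Optional[datetime],
              max_age: timedelta = timedelta(hours=24)) -> bool:
    """Allow live work only on fresh, passing controls run after the last change."""
    if last_run is None or not last_run.passed:
        return False  # never challenged, or challenged and failed
    if now - last_run.finished_at > max_age:
        return False  # it worked yesterday is not evidence about today
    if last_change is not None and last_change > last_run.finished_at:
        return False  # a prompt edit, model swap, or vendor update came after the run
    return True

# Hypothetical morning: controls finished at 6:40, work starts at 7:00.
now = datetime(2025, 1, 7, 7, 0)
morning_run = ControlRun(finished_at=datetime(2025, 1, 7, 6, 40), passed=True)

print(may_serve(now, morning_run, last_change=None))
print(may_serve(now, morning_run, last_change=datetime(2025, 1, 7, 6, 50)))
```

The function has no override parameter. That omission is the design choice: if someone can wave the instrument back into service without a passing run, the rule is a suggestion.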

Show Me Your Control Chart

This gives a practice, or a fund looking at a vendor, one precise question: show me your control chart. Not the benchmark from the launch post. The chart. What known-answer cases run against the production system every morning, where do the scores live, what rule pulls the system from service, when did it last trip, who got paged, and what was the rollback? A vendor that can produce the chart is running an instrument. A vendor that cannot is selling an analyzer that has never been calibrated, and the patient samples are already running through it.

The boundary is the same one the lab draws. Controls catch instrument failure. They do not judge the individual case, which is why the clinician still reviews the note and the coder still owns the exception. And the control set itself ages: case mix changes, codes change, January's hard set is June's easy set. A lab reviews its control ranges on a schedule, with a name attached. The model's case set needs the same named owner and the same calendar date, because a control that no longer challenges the instrument is a control in name only.
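A review schedule for the case set can be checked mechanically as well. The sketch below is an assumption-laden illustration: the needs_review name, the 0.98 ceiling, and the idea of treating a consistently near-perfect hard-set score as a sign the set has gone stale are all choices made for the example, not an established standard.

```python
from datetime import date
from typing import List

def needs_review(today: date,
                 review_due: date,
                 recent_hard_scores: List[float],
                 ceiling: float = 0.98) -> List[str]:
    """Return the reasons, if any, that the control set should be reviewed now."""
    reasons = []
    if today >= review_due:
        reasons.append("scheduled review date reached")
    # If every recent hard-set score sits at the ceiling, the hard set has
    # stopped challenging the instrument.
    if recent_hard_scores and min(recent_hard_scores) >= ceiling:
        reasons.append("hard set no longer challenging")
    return reasons

# Hypothetical checks.
print(needs_review(date(2025, 6, 1), date(2025, 6, 1), [0.99, 1.0, 0.98]))
print(needs_review(date(2025, 3, 1), date(2025, 6, 1), [0.70, 0.80]))
```

The output is a list of reasons, not an action. Deciding what replaces the aging cases still belongs to the named owner.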

Eval-driven development is the right idea wearing a new name. To see the mature version, do not study the largest software company. Stand in a hospital lab at 6:40 in the morning and watch a technologist refuse to run patient samples because one dot landed outside a line.