Feb 19, 2026
Medical documentation and medical coding sit at the center of healthcare’s administrative burden. Healthcare spending represents ~20% of U.S. GDP, and roughly 25% of that spend is administrative. Two of the biggest drivers are clinical documentation (turning an encounter into a usable clinical note) and medical coding (translating care into billable codes that determine reimbursement and compliance).
And the stakes are real. Denial rates are substantial, reaching roughly 40–50% in Medicaid compared with single digits in Medicare. Denials create downstream impact across provider revenue cycles, administrative staff time, and patient financial liability. In parallel, clinical documentation is one of the biggest drivers of clinician time and burnout, which is why AI scribes are being deployed at scale.
Since late 2023, we have also seen a surge of AI tools attempting to automate or assist both documentation and coding. The natural next question is: do these systems actually work in the workflows where they will be used?
This is why we partnered with Vals AI on new healthcare benchmarks: MedScribe (clinical documentation) and MedCode (medical coding). The key point from Protege’s perspective is simple: benchmarks only matter if the data reflects real workflows and is clean enough to measure generalization, not memorization.
We see a world where evaluation-ready, contamination-resistant datasets help drive the AI frontier in healthcare forward. We start from real-world clinical data, de-identify and curate it into benchmark-grade inputs, design holdout strategies that prevent leakage at the patient level, and work with experts to ensure the evaluation reflects actual care scenarios.
Ultimately, we want to drive beyond general “medical knowledge” testing and instead measure performance on the tasks that drive outcomes in practice: administrative burden, reimbursement integrity, compliance risk, and the day-to-day reliability needed for safe deployment.
MedScribe: benchmarking clinical documentation quality
Why clinical documentation is hard to benchmark
Clinical documentation is one of the biggest drivers of administrative load in healthcare. In many settings, clinicians spend more time documenting care than delivering it. That reality is why AI scribes are being deployed at scale.
But documentation quality is notoriously hard to evaluate. “Does the note read well?” is not a sufficient benchmark. A strong clinical note must be:
Structurally correct (for example, SOAP formatting)
Clinically faithful to the encounter (no invented symptoms, meds, or diagnoses)
Complete in the right places, especially where downstream decisions and follow-up depend on it
Consistent with real documentation workflows, not idealized or synthetic examples
This makes rigorous benchmarking essential.
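To make the structural requirement concrete, here is a minimal sketch (in Python, and emphatically not the MedScribe grader) that checks whether a generated note contains the four SOAP sections. The header names and regex are illustrative assumptions; real notes and graders vary by institution.

```python
import re

# Assumed section headers for illustration; real SOAP conventions vary.
SOAP_SECTIONS = ["Subjective", "Objective", "Assessment", "Plan"]

def missing_soap_sections(note_text: str) -> list[str]:
    """Return any SOAP sections that never appear as a line-leading header."""
    missing = []
    for section in SOAP_SECTIONS:
        # Case-insensitive match for the header at the start of a line, e.g. "Plan:".
        if not re.search(rf"(?mi)^\s*{section}\b", note_text):
            missing.append(section)
    return missing

note = """Subjective: 54-year-old with three days of productive cough.
Objective: T 38.1 C, crackles at the right base.
Assessment: Community-acquired pneumonia.
Plan: Start antibiotics, follow up in 48 hours."""

print(missing_soap_sections(note))  # [] -> all four sections present
```

Structure is the easy part. Clinical fidelity and completeness are much harder to check, which is where rubric-based grading (discussed below) comes in.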
The limitations of most public scribing benchmarks
Most public benchmarks for clinical documentation run into a few recurring problems:
Synthetic or unrealistic inputs: Many evaluations rely on short prompts or simplified cases that do not match real conversations.
Weak or subjective grading: If a benchmark cannot score the output in a consistent way, it becomes hard to compare models or track progress over time.
Contamination risk: As with coding, broadly distributed datasets can end up in training corpora. That can inflate scores through memorization, rather than testing whether a model can generalize to new encounters.
This is why benchmark-grade scribing evaluation requires both (1) realistic inputs and (2) objective grading scaffolding.
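One way to reason about the contamination point above is to check how much of a candidate benchmark item already appears verbatim in publicly distributed text. The sketch below is a toy overlap check, not Protege’s actual pipeline; the n-gram size and cutoff are assumptions.

```python
def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, public_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the public document."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & word_ngrams(public_doc, n)) / len(item_grams)

# Items above an (illustrative) threshold would be flagged for review, since a
# model may have memorized them rather than generalized to them.
FLAG_THRESHOLD = 0.3
```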
What Protege built: the data engine behind MedScribe
Protege’s role in MedScribe was to prepare de-identified, evaluation-ready clinical documentation data that reflects real workflows and supports clean measurement.
At a high level, Protege curated and prepared:
De-identified doctor-patient conversation transcripts that reflect real clinical encounters
Dataset splits and controls designed to reduce leakage and ensure evaluation measures generalization, not recall
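For a sense of what "de-identified" involves at the simplest level, here is a toy, pattern-based redaction pass. Production de-identification is far more involved (NER models, curated dictionaries, expert review); the patterns and placeholder tokens below are assumptions for illustration only.

```python
import re

# Toy identifier patterns; real pipelines handle names, addresses, ages over 89,
# facility names, and much more.
REDACTION_PATTERNS = {
    "[DATE]": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "[PHONE]": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    "[MRN]": r"\bMRN[:#]?\s*\d+\b",
}

def redact(transcript: str) -> str:
    """Replace obvious identifiers with placeholder tokens."""
    for placeholder, pattern in REDACTION_PATTERNS.items():
        transcript = re.sub(pattern, placeholder, transcript, flags=re.IGNORECASE)
    return transcript

print(redact("Seen on 03/14/2025, MRN: 492817, callback at 555-201-3344."))
# Seen on [DATE], [MRN], callback at [PHONE].
```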
Vals then paired this data with an evaluation framework designed to score documentation quality consistently across models.
Benchmark construction with Vals: rubric-based scoring for clinical notes
Vals constructed MedScribe to evaluate whether an AI scribe can produce reliable SOAP notes from encounter transcripts, using expert-developed rubrics that score documentation quality.
In the MedScribe benchmark:
Documentation is scored against a rubric set (rather than vibe-based ratings)
The evaluation is designed to reflect real documentation requirements, including clinical fidelity and completeness
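To make the rubric idea concrete, here is a minimal sketch of rubric-style scoring. The criteria, checks, and weights are illustrative assumptions, not Vals’ expert-developed rubrics.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    """One gradable criterion with a deterministic pass/fail check.

    The examples below are illustrative assumptions only.
    """
    name: str
    weight: float
    check: Callable[[str, str], bool]  # (transcript, note) -> passed?

RUBRIC = [
    RubricItem("captures_chief_complaint", 2.0,
               lambda transcript, note: "cough" in note.lower()),
    RubricItem("no_hallucinated_medication", 3.0,
               lambda transcript, note: "warfarin" not in note.lower()
               or "warfarin" in transcript.lower()),
    RubricItem("plan_includes_follow_up", 1.0,
               lambda transcript, note: "follow up" in note.lower()),
]

def score_note(transcript: str, note: str) -> float:
    """Weighted fraction of rubric items the note satisfies."""
    earned = sum(item.weight for item in RUBRIC if item.check(transcript, note))
    return earned / sum(item.weight for item in RUBRIC)

transcript = "Patient reports a dry cough for two weeks. No current medications."
note = "Subjective: two-week dry cough. Plan: supportive care, follow up in one week."
print(score_note(transcript, note))  # 1.0 -> every illustrative criterion satisfied
```

The value of the real benchmark lies in expert-developed criteria applied consistently across models, as described above, rather than in any particular keyword check.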
MedCode: benchmarking medical coding
Why medical coding is hard to benchmark
Medical coding is not a single-label classification task. Coders must often select the maximum valid set of billable codes, frequently up to 25 per case, while staying compliant.
In practice, coding is simultaneously:
An evidence extraction problem: what diseases and procedures are actually documented?
An optimization problem: what can be billed compliantly, given coding rules and institutional SOPs?
Coders must reason over clinical documentation, disease severity, co-morbidities, and procedural context. This clinical and institutional nuance is exactly why strong benchmarks are necessary, and why generic public datasets tend to fail.
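As a toy illustration of the evidence-extraction-plus-optimization framing above, the sketch below first collects candidate ICD-10 codes from documented evidence, then selects a compliant subset under a per-case code cap and a simple exclusion rule. The keyword map and rules are assumptions, not a real coding engine.

```python
# Toy keyword -> ICD-10 map for illustration; real coding follows the full
# ICD-10-CM guidelines, payer rules, and institutional SOPs.
EVIDENCE_TO_CODE = {
    "type 2 diabetes": "E11.9",
    "chronic kidney disease": "N18.30",
    "hypertension": "I10",
    "pneumonia": "J18.9",
}

# Hypothetical edit: a combination code precludes billing its components separately.
EXCLUDES = {"E11.22": {"E11.9", "N18.30"}}

MAX_CODES = 25  # cases can carry up to 25 billable codes

def extract_candidates(note: str) -> list[str]:
    """Step 1: evidence extraction -- which documented findings map to codes?"""
    text = note.lower()
    return [code for phrase, code in EVIDENCE_TO_CODE.items() if phrase in text]

def select_compliant(candidates: list[str]) -> list[str]:
    """Step 2: optimization -- keep a compliant set within the code cap."""
    selected: list[str] = []
    for code in candidates:
        if any(code in EXCLUDES.get(kept, set()) for kept in selected):
            continue  # blocked by an already-selected combination code
        selected.append(code)
        if len(selected) == MAX_CODES:
            break
    return selected

note = "Assessment: community-acquired pneumonia in a patient with type 2 diabetes and hypertension."
print(select_compliant(extract_candidates(note)))  # ['E11.9', 'I10', 'J18.9']
```

Real coding additionally involves choosing the correct primary diagnosis and appropriate secondary codes, which MedCode evaluates directly (see below).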
The two core failures of most public coding benchmarks
Most public coding benchmarks run into the same structural limitations:
Contamination risk: Public datasets are often scraped or replicated into training corpora. This can inflate apparent model performance through memorization, rather than testing genuine medical and billing reasoning.
Billing is not ground truth: Many EMR datasets include submitted bills, but submission does not equal approval. The real signal is which bills passed payer review and avoided denial.
This is also why simply hiring expert coders to annotate an EMR dataset is often insufficient without connection to real-world validation signals and outcomes.
What “gold-standard” coding data requires
To benchmark coding systems credibly, you need data that is tightly connected to real billing workflows and outcomes:
EMR linked to billing outcomes.
Visibility into submitted codes, approved codes, and denial patterns.
A way to identify high-performing coders (for example, those associated with low denial rates).
Coverage across diverse diagnoses, complex co-morbid cases, and multi-day inpatient stays.
This is what makes coding benchmarks reflect real financial and compliance stakes, rather than toy tasks.
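A hypothetical record shape for such a dataset might look like the sketch below. The field names are assumptions for illustration, not Protege’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CodingBenchmarkCase:
    """Hypothetical shape of one case linking EMR documentation to billing outcomes."""
    patient_id: str                  # enables patient-level holdout
    clinical_notes: list[str]        # raw documentation across the encounter or stay
    submitted_codes: list[str]       # ICD codes on the submitted bill
    approved_codes: list[str]        # codes that passed payer review
    denied_codes: list[str] = field(default_factory=list)
    ancillary_codes: list[str] = field(default_factory=list)  # in the chart, not billed
    coder_denial_rate: float = 0.0   # proxy for identifying high-performing coders

case = CodingBenchmarkCase(
    patient_id="pt-001",
    clinical_notes=["Day 1 admission note ...", "Day 3 progress note ..."],
    submitted_codes=["J18.9", "E11.9", "I10"],
    approved_codes=["J18.9", "E11.9", "I10"],
)
```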
What Protege built: the data engine behind MedCode
Protege’s role in the MedCode benchmark was to curate de-identified EMR datasets linked to bills that were submitted and passed payer review, and to prepare them so they could be used as evaluation-ready benchmark data, not training data.
For each case, we provided raw clinical notes, submitted billing codes, and ancillary codes present in the chart but not billed. This creates visibility into both the evidence available in the record and the coding decisions made in practice.
That distinction is crucial for understanding where a model fails: did it miss evidence in the notes, or did it fail to apply billing logic and coding rules? Models were evaluated on their ability to assign the correct primary ICD, capture appropriate secondary ICDs, and maximize compliant code sets.
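Here is a minimal sketch of what per-case scoring along those axes could look like; the metric choices are assumptions rather than the published MedCode methodology.

```python
def score_case(pred_primary: str, pred_secondary: set[str],
               gold_primary: str, gold_secondary: set[str]) -> dict[str, float]:
    """Toy per-case metrics: primary ICD exact match plus secondary precision/recall."""
    true_pos = len(pred_secondary & gold_secondary)
    return {
        "primary_correct": float(pred_primary == gold_primary),
        "secondary_precision": true_pos / len(pred_secondary) if pred_secondary else 0.0,
        "secondary_recall": true_pos / len(gold_secondary) if gold_secondary else 1.0,
    }

print(score_case(
    pred_primary="J18.9", pred_secondary={"E11.9", "I10", "N18.30"},
    gold_primary="J18.9", gold_secondary={"E11.9", "I10"},
))
# primary_correct 1.0, secondary_precision ~0.67, secondary_recall 1.0
```

Low secondary precision with high recall would point toward over-coding (a compliance risk), while the reverse would point toward missed reimbursement.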
A core design constraint for the Protege-provided data was preventing leakage. These datasets were intentionally held out of training corpora, with holdout occurring at the patient level (not just the record level). This ensures the benchmark tests true generalization, not recall from near-duplicate encounters.
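For intuition on why patient-level holdout matters, here is a sketch of the grouping idea (not the actual split procedure): every record belonging to a held-out patient stays out of training, so near-duplicate encounters from the same patient cannot leak across the boundary.

```python
import random

def patient_level_holdout(records: list[dict], holdout_frac: float = 0.2, seed: int = 0):
    """Split records so that all records for a given patient land on the same side."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_holdout = max(1, int(len(patient_ids) * holdout_frac))
    holdout_patients = set(patient_ids[:n_holdout])
    eval_records = [r for r in records if r["patient_id"] in holdout_patients]
    remaining = [r for r in records if r["patient_id"] not in holdout_patients]
    return remaining, eval_records
```

A record-level split, by contrast, could place one admission from a patient in training and a near-duplicate readmission in evaluation, letting a model score well through recall alone.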
Why these benchmarks matter
MedScribe helps healthcare organizations and developers answer a practical question:
If we deploy an AI scribe at scale, can it produce notes that meet documentation standards in real workflows?
And it gives the industry a way to measure progress on a task directly tied to:
Clinician administrative burden
Documentation quality and reliability
Workflow adoption risk (trust is the gate for real deployment)
MedCode tests model performance on work that is directly tied to healthcare financial operations, compliance risk, and administrative burden. It evaluates both clinical reasoning and revenue-cycle optimization. This is a step toward AI systems that can operate in real billing environments, not just academic exercises.
If you're interested in the work that DataLab at Protege is doing with benchmarks and evaluations, please reach out at contact@withprotege.ai!

