PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black, John L. Havlik, Isaac Ogunmola, Stephen P. Ma, Roopa Dhatt, Jonathan H. Chen
Stanford University
Leaderboard
Pass@1 and reliability metrics averaged over 3 independent runs.
| # | Model | Provider | Pass@1 ±SD | Pass@3 | Pass³ | #Turns |
|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 46.3±1.2 | 57.4 | 28.0 | 41.9 |
| 2 | Claude Opus 4.6 | Anthropic | 31.7±2.3 | 41.5 | 18.0 | 25.2 |
| 3 | Claude Opus 4.7 | Anthropic | 29.3±2.5 | 37.9 | 18.0 | 16.2 |
| 4 | GPT-5.4 | OpenAI | 27.7±1.5 | 37.7 | 13.0 | 39.8 |
| 5 | Claude Sonnet 4.6 | Anthropic | 23.0±2.6 | 33.2 | 9.0 | 22.3 |
| 6 | DeepSeek V4-Pro | DeepSeek | 18.7±2.9 | 27.9 | 6.0 | 35.3 |
| 7 | Kimi-K2.6 | Moonshot | 17.0±2.6 | 26.3 | 5.0 | 42.4 |
| 8 | MiMo-v2.5-Pro | Xiaomi | 16.7±4.0 | 23.6 | 6.0 | 29.5 |
| 9 | Qwen3.6-Plus | Alibaba | 13.7±4.0 | 22.6 | 2.0 | 28.0 |
| 10 | MiniMax M2.7 | MiniMax | 8.7±1.2 | 15.9 | 1.0 | 29.7 |
| 11 | Gemini Pro 3.1 | Google | 6.0±1.0 | 9.3 | 3.0 | 30.4 |
| 12 | Grok-4.20 | xAI | 5.3±3.2 | 9.7 | 1.0 | 16.7 |
Performance by Subgroup
Pass@1 (%) averaged over 3 runs.
| Model | Cardiology n=6 | Endocrinology n=13 | GI & Hepatology n=14 | Immunol & ID n=12 | Psych / Neuro n=16 | Hem / Onc n=13 | Neph / Urol n=8 | Pulm & Other n=18 | Overall |
|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | 55.6 | 59.0 | 57.1 | 38.9 | 33.3 | 48.7 | 29.2 | 48.1 | 46.3 |
| Claude Opus 4.6 | 27.8 | 35.9 | 35.7 | 38.9 | 27.1 | 30.8 | 33.3 | 25.9 | 31.7 |
| Claude Opus 4.7 | 38.9 | 28.2 | 28.6 | 22.2 | 18.8 | 30.8 | 33.3 | 38.9 | 29.3 |
| GPT-5.4 | 27.8 | 30.8 | 21.4 | 27.8 | 22.9 | 38.5 | 20.8 | 29.6 | 27.7 |
| Claude Sonnet 4.6 | 33.3 | 10.3 | 26.2 | 27.8 | 25.0 | 25.6 | 33.3 | 14.8 | 23.0 |
| DeepSeek V4-Pro | 16.7 | 7.7 | 33.3 | 19.4 | 14.6 | 15.4 | 12.5 | 24.1 | 18.7 |
| Kimi-K2.6 | 27.8 | 12.8 | 21.4 | 16.7 | 14.6 | 17.9 | 12.5 | 16.7 | 17.0 |
| MiMo-v2.5-Pro | 11.1 | 10.3 | 23.8 | 16.7 | 27.1 | 17.9 | 4.2 | 13.0 | 16.7 |
| Qwen3.6-Plus | 5.6 | 12.8 | 9.5 | 16.7 | 20.8 | 10.3 | 12.5 | 14.8 | 13.7 |
| MiniMax M2.7 | 0.0 | 5.1 | 11.9 | 11.1 | 8.3 | 7.7 | 4.2 | 13.0 | 8.7 |
| Gemini Pro 3.1 | 5.6 | 10.3 | 7.1 | 0.0 | 8.3 | 5.1 | 0.0 | 7.4 | 6.0 |
| Grok-4.20 | 5.6 | 5.1 | 9.5 | 2.8 | 10.4 | 2.6 | 4.2 | 1.9 | 5.3 |
Where Do Failures Come From?
Each failed checkpoint is classified by its evaluation type. Across all models, clinical reasoning accounts for ~52% of failures, making it the core bottleneck. Failures fall into four categories:
- Data retrieval: agent fails to query or surface required EHR data
- Clinical reasoning: incorrect clinical interpretation, diagnosis, or decision
- Orders: correct decision but wrong or missing FHIR order
- Documentation: assessment note missing required clinical elements
Explore Sample Tasks
Each task is a clinician-validated composite workflow with FHIR-grounded evaluation checkpoints. Two examples shown; the full benchmark has 100 tasks.
Adrenal Insufficiency Management with Symptom Evaluation
An endocrinology patient with known adrenal insufficiency has sent a portal message reporting worsening fatigue, blood pressure instability, elevated heart rate, and decreased appetite, asking whether their current hydrocortisone dosing is adequate. Retrieve demographics, the etiology of adrenal insufficiency (primary vs. secondary), current hydrocortisone regimen (AM/PM doses), recent morning cortisol/renin/aldosterone, and recent BP/HR trends. Decide whether the current replacement is adequate, propose a specific dose adjustment if warranted, address blood pressure with appropriate specialty referral, and document a contingency plan and follow-up timeline. Save the plan to workspace/output/management_plan.txt.
Evaluation checkpoints:
1. Retrieve demographics, adrenal insufficiency etiology, current hydrocortisone regimen, recent cortisol/renin/aldosterone labs, and BP/HR trends.
2. Recognize that 15 mg/day is at the low end of physiologic replacement (typical 15–25 mg/day) and link symptoms to under-replacement; correctly identify primary vs. secondary adrenal insufficiency.
3. Propose a specific revised AM/PM regimen with patient-specific rationale.
4. Create a ServiceRequest for cardiology referral to address blood pressure instability.
5. Document an explicit contingency if symptoms do not improve on the adjusted regimen.
6. Plan covers all required elements (etiology, current regimen, assessment, dose adjustment, BP plan, contingency, follow-up) and avoids unsafe recommendations such as fludrocortisone in secondary adrenal insufficiency.
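The cardiology referral checkpoint expects a FHIR ServiceRequest. As a rough illustration only, the sketch below builds a minimal one as a Python dictionary. It assumes FHIR R4 field names (`status`, `intent`, `subject`, `code`, `reasonCode`); the helper `build_referral_request`, the patient id, and the free-text values are hypothetical and not taken from the benchmark.

```python
import json


def build_referral_request(patient_id: str, specialty: str, reason: str) -> dict:
    """Build a minimal ServiceRequest-shaped dict (hypothetical helper, FHIR R4 assumed)."""
    return {
        "resourceType": "ServiceRequest",
        "status": "active",   # the order is live
        "intent": "order",    # an actual order rather than a proposal
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {"text": f"Referral to {specialty}"},
        "reasonCode": [{"text": reason}],
    }


# Illustrative values only; a real agent would use the patient id from the EHR.
request = build_referral_request(
    "example-patient", "cardiology", "Blood pressure instability"
)
print(json.dumps(request, indent=2))
```

An agent would typically send a payload like this to the server's `ServiceRequest` endpoint; the exact fields the benchmark's grader inspects are not stated in the text above.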
Methodology
Every design choice maximizes clinical realism while keeping evaluation reproducible and hermetic.
Clinician-validated tasks
Each task was written or reviewed by a practicing physician. Task scope mirrors real e-consult workflows: retrieve relevant EHR data, reason about it, place appropriate orders, and document the plan.
FHIR-compliant EHR environment
Agents interact with an isolated HAPI FHIR JPA server loaded with a realistic synthetic patient record. Every task runs in a fresh Docker container — no state leaks between evaluations.
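To give a feel for what a data-retrieval call against such a server might look like, here is a small sketch that only builds a FHIR search URL and does not send it. The base URL `http://localhost:8080/fhir`, the patient id, and the helper `build_search_url` are assumptions for illustration; `patient`, `category`, `_sort`, and `_count` are standard FHIR search parameters.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, resource_type: str, params: dict) -> str:
    """Compose a FHIR REST search URL: [base]/[type]?[parameters]."""
    return f"{base_url.rstrip('/')}/{resource_type}?{urlencode(params)}"


# Hypothetical query: the five most recent laboratory observations for one patient.
url = build_search_url(
    "http://localhost:8080/fhir",  # assumed local server address
    "Observation",
    {
        "patient": "example-patient",  # placeholder id
        "category": "laboratory",
        "_sort": "-date",              # newest first
        "_count": 5,
    },
)
print(url)
```

Keeping URL construction separate from the HTTP call makes each query easy to log, which matters if retrieval is later judged from the agent's trajectory.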
Checkpoint-level grading
Each task has 5–9 checkpoints evaluated by one of three graders: (a) deterministic FHIR validation for orders, (b) LLM-judge against a clinician-written rubric for reasoning, (c) trajectory analysis for data retrieval.
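A deterministic order check of type (a) could look roughly like the sketch below. The matching rule (same patient, keyword found in the order's `code.text` or a coding `display`) is an assumption made for illustration and is not the benchmark's actual grader; `find_matching_referral` and the sample resources are hypothetical.

```python
from typing import Optional


def find_matching_referral(
    resources: list, patient_id: str, keyword: str
) -> Optional[dict]:
    """Return the first ServiceRequest for the patient whose code mentions keyword."""
    wanted_subject = f"Patient/{patient_id}"
    for resource in resources:
        if resource.get("resourceType") != "ServiceRequest":
            continue
        if resource.get("subject", {}).get("reference") != wanted_subject:
            continue
        code = resource.get("code", {})
        # Collect every human-readable label attached to the order.
        labels = [code.get("text", "")]
        labels += [coding.get("display", "") for coding in code.get("coding", [])]
        if any(keyword.lower() in label.lower() for label in labels):
            return resource
    return None


# Hypothetical resources an agent might have written during a task.
orders = [
    {"resourceType": "MedicationRequest",
     "subject": {"reference": "Patient/example-patient"}},
    {"resourceType": "ServiceRequest", "status": "active", "intent": "order",
     "subject": {"reference": "Patient/example-patient"},
     "code": {"text": "Referral to Cardiology"}},
]
checkpoint_passed = find_matching_referral(orders, "example-patient", "cardiology") is not None
print(checkpoint_passed)
```

Because the check uses only the resources stored on the server, it gives the same verdict on every run, unlike an LLM judge.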
Multi-run reliability
Every model is evaluated 3 times. We report Pass@1 (mean ± SD across runs), Pass@3 (probability of at least one success in 3 attempts), and Pass³ (fraction of tasks passed in all 3 runs).
“End-to-end completion, not isolated atomic skills. A task only passes when every checkpoint — from data retrieval to final documentation — passes.”
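The all-checkpoints rule and the three reliability metrics can be sketched as follows. This is an illustration under stated assumptions, not the authors' scoring code: Pass@1 SD is taken as the sample standard deviation across runs, Pass@3 is estimated per task as 1 - (1 - p)^3 from the empirical pass rate p, and Pass³ counts tasks that pass in every run. The function names and toy data are hypothetical.

```python
from statistics import mean, stdev


def task_passed(checkpoint_results: list) -> bool:
    """A task passes only when every one of its checkpoints passes."""
    return all(checkpoint_results)


def summarize(runs: list) -> dict:
    """runs[r][t] is True when task t passed in run r. Values are percentages."""
    n_runs, n_tasks = len(runs), len(runs[0])
    per_run_pass1 = [100 * sum(run) / n_tasks for run in runs]
    successes = [sum(run[t] for run in runs) for t in range(n_tasks)]
    rates = [s / n_runs for s in successes]
    return {
        "pass@1_mean": mean(per_run_pass1),
        "pass@1_sd": stdev(per_run_pass1),                      # sample SD (assumed)
        "pass@3": 100 * mean(1 - (1 - p) ** 3 for p in rates),  # assumed estimator
        "pass^3": 100 * mean(1.0 if s == n_runs else 0.0 for s in successes),
    }


# Toy data: 3 runs over 4 tasks (not benchmark results).
runs = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
summary = summarize(runs)
print(summary)
```

In the toy data only the first task passes in all three runs, so Pass³ is far below Pass@3, the same kind of gap between best-case and consistent performance that the leaderboard reports.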
Citation
@article{physicianbench2026,
title = {PhysicianBench: Evaluating LLM Agents on Physician Tasks in Real-World EHR Environments},
author = {Ruoqi Liu and Imran Q. Mohiuddin and Austin J. Schoeffler and Kavita Renduchintala and Ashwin Nayak and Prasantha L. Vemu and Shivam C. Vedak and Kameron C. Black and John L. Havlik and Isaac Ogunmola and Stephen P. Ma and Roopa Dhatt and Jonathan H. Chen},
year = {2026},
eprint = {2605.02240},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2605.02240}
}