PhysicianBench:
Evaluating LLM Agents in
Real-World EHR Environments

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black, John L. Havlik, Isaac Ogunmola, Stephen P. Ma, Roopa Dhatt, Jonathan H. Chen

Stanford University

View Leaderboard →Explore Tasks Read the Paper →

Results

Leaderboard

Pass@1 and reliability metrics averaged over 3 independent runs.

sort by

#	Model	Provider	Pass@1 ±SD	Pass@3	Pass³	#Turns
1	GPT-5.5	OpenAI	46.3±1.2	57.4	28.0	41.9
2	Claude Opus 4.6	Anthropic	31.7±2.3	41.5	18.0	25.2
3	Claude Opus 4.7	Anthropic	29.3±2.5	37.9	18.0	16.2
4	GPT-5.4	OpenAI	27.7±1.5	37.7	13.0	39.8
5	Claude Sonnet 4.6	Anthropic	23.0±2.6	33.2	9.0	22.3
6	DeepSeek V4-Proopen	DeepSeek	18.7±2.9	27.9	6.0	35.3
7	Kimi-K2.6open	Moonshot	17.0±2.6	26.3	5.0	42.4
8	MiMo-v2.5-Pro	Xiaomi	16.7±4.0	23.6	6.0	29.5
9	Qwen3.6-Plus	Alibaba	13.7±4.0	22.6	2.0	28.0
10	MiniMax M2.7	MiniMax	8.7±1.2	15.9	1.0	29.7
11	Gemini Pro 3.1	Google	6.0±1.0	9.3	3.0	30.4
12	Grok-4.20	xAI	5.3±3.2	9.7	1.0	16.7

Pass@1 fraction of tasks fully completed in a single attempt. Pass@3 probability that at least 1 of 3 runs succeeds. Pass³ fraction of tasks where all 3 runs succeed (consistency). #Turns mean tool calls per task.

Where agents succeed and fail

Performance by Subgroup

Pass@1 averaged over 3 runs. Darker teal cells indicate higher success rate within that subgroup.

Model	Cardiology n=6	Endocrinology n=13	GI & Hepatology n=14	Immunol & ID n=12	Psych / Neuro n=16	Hem / Onc n=13	Neph / Urol n=8	Pulm & Other n=18	Overall
GPT-5.5	55.6	59.0	57.1	38.9	33.3	48.7	29.2	48.1	46.3
Claude Opus 4.6	27.8	35.9	35.7	38.9	27.1	30.8	33.3	25.9	31.7
Claude Opus 4.7	38.9	28.2	28.6	22.2	18.8	30.8	33.3	38.9	29.3
GPT-5.4	27.8	30.8	21.4	27.8	22.9	38.5	20.8	29.6	27.7
Claude Sonnet 4.6	33.3	10.3	26.2	27.8	25.0	25.6	33.3	14.8	23.0
DeepSeek V4-Pro	16.7	7.7	33.3	19.4	14.6	15.4	12.5	24.1	18.7
Kimi-K2.6	27.8	12.8	21.4	16.7	14.6	17.9	12.5	16.7	17.0
MiMo-v2.5-Pro	11.1	10.3	23.8	16.7	27.1	17.9	4.2	13.0	16.7
Qwen3.6-Plus	5.6	12.8	9.5	16.7	20.8	10.3	12.5	14.8	13.7
MiniMax M2.7	0.0	5.1	11.9	11.1	8.3	7.7	4.2	13.0	8.7
Gemini Pro 3.1	5.6	10.3	7.1	0.0	8.3	5.1	0.0	7.4	6.0
Grok-4.20	5.6	5.1	9.5	2.8	10.4	2.6	4.2	1.9	5.3

Error analysis

Where Do Failures Come From?

Each failed checkpoint is classified by its evaluation type. Across all models, clinical reasoning accounts for ~52% of failures — the core bottleneck.

Data Retrieval

Agent fails to query or surface required EHR data

Clinical Reasoning

Incorrect clinical interpretation, diagnosis, or decision

Action Execution

Correct decision but wrong or missing FHIR order

Documentation

Assessment note missing required clinical elements

% of failed checkpoints per model

GPT-5.5

Claude Opus 4.6

Claude Opus 4.7

GPT-5.4

Claude Sonnet 4.6

DeepSeek V4-Pro

Kimi-K2.6

MiMo-v2.5-Pro

Qwen3.6-Plus

MiniMax M2.7

Gemini Pro 3.1

Grok-4.20

Clinical reasoning dominates across all models (48–59%). Stronger models show a lower share of reasoning failures because they resolve more reasoning checkpoints, shifting failures toward other categories like action execution and documentation.

Task pool

Explore Sample Tasks

Each task is a clinician-validated composite workflow with FHIR-grounded evaluation checkpoints. Two examples shown; the full benchmark has 100 tasks.

Task instruction

Adrenal Insufficiency Management with Symptom Evaluation

An endocrinology patient with known adrenal insufficiency has sent a portal message reporting worsening fatigue, blood pressure instability, elevated heart rate, and decreased appetite, asking whether their current hydrocortisone dosing is adequate. Retrieve demographics, the etiology of adrenal insufficiency (primary vs. secondary), current hydrocortisone regimen (AM/PM doses), recent morning cortisol/renin/aldosterone, and recent BP/HR trends. Decide whether the current replacement is adequate, propose a specific dose adjustment if warranted, address blood pressure with appropriate specialty referral, and document a contingency plan and follow-up timeline. Save the plan to workspace/output/management_plan.txt.

Evaluation checkpoints

CP1Data retrievalData Retrievalhybrid

Retrieve demographics, AI etiology, current hydrocortisone regimen, recent cortisol/renin/aldosterone labs, and BP/HR trends.

CP2Replacement adequacy assessmentClinical Reasoningllm-judge

Recognize 15 mg/day is at the low end of physiologic replacement (typical 15–25 mg/day) and link symptoms to under-replacement; correctly identify primary vs. secondary AI.

CP3Hydrocortisone dose adjustmentClinical Reasoningllm-judge

Propose a specific revised AM/PM regimen with patient-specific rationale.

CP4Cardiology referralAction Executioncode

Create a ServiceRequest for cardiology referral to address blood pressure instability.

CP5Contingency planClinical Reasoningllm-judge

Document an explicit contingency if symptoms do not improve on the adjusted regimen.

CP6DocumentationDocumentationllm-judge

Plan covers all required elements (etiology, current regimen, assessment, dose adjustment, BP plan, contingency, follow-up) and avoids unsafe recommendations such as fludrocortisone in secondary AI.

Want to see how an agent solves this task? Watch a full trajectory replay:

Watch trajectory →

Watch agents work

Trajectory Viewer

Step through a real agent session. Each tool call shows the FHIR query and response. Green cards are tool results you can click for full detail.

Adrenal Insufficiency Management

adrenal_insufficiency_symptoms · Claude Opus 4.6 · 2 / 6 checkpoints passed

Open in new tab ↗

How it works

Methodology

Every design choice maximizes clinical realism while keeping evaluation reproducible and hermetic.

Clinician-validated tasks

Each task was written or reviewed by a practicing physician. Task scope mirrors real e-consult workflows: retrieve relevant EHR data, reason about it, place appropriate orders, and document the plan.

FHIR-compliant EHR environment

Agents interact with an isolated HAPI FHIR JPA server loaded with a realistic synthetic patient record. Every task runs in a fresh Docker container — no state leaks between evaluations.

Checkpoint-level grading

Each task has 5–9 checkpoints evaluated by one of three graders: (a) deterministic FHIR validation for orders, (b) LLM-judge against a clinician-written rubric for reasoning, (c) trajectory analysis for data retrieval.

Multi-run reliability

Every model is evaluated 3 times. We report Pass@1 (mean ± SD), Pass@3 (probability of success in 3 attempts), and Pass³ (consistency across all 3 runs).

Key design principle

“End-to-end completion, not isolated atomic skills. A task only passes when every checkpoint — from data retrieval to final documentation — passes.”

Cite

Citation

@article{physicianbench2026,
  title         = {PhysicianBench: Evaluating LLM Agents on Physician Tasks in Real-World EHR Environments},
  author        = {Ruoqi Liu and Imran Q. Mohiuddin and Austin J. Schoeffler and Kavita Renduchintala and Ashwin Nayak and Prasantha L. Vemu and Shivam C. Vedak and Kameron C. Black and John L. Havlik and Isaac Ogunmola and Stephen P. Ma and Roopa Dhatt and Jonathan H. Chen},
  year          = {2026},
  eprint        = {2605.02240},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2605.02240}
}

PhysicianBench:Evaluating LLM Agents inReal-World EHR Environments