Table 1 Error patterns in LLM-generated hematologic pharmacotherapy simulations

From: Evaluating large language models for pharmacotherapy simulations: a mixed-methods study

Assessment domain	Subdomain	AML (N = 50)	CML (N = 53)	Error patterns and example session IDs
Clinical accuracy and safety	Overall domain failure	25 (50.0%)	18 (34.0%)	Lowest pass rate (60/103, 58.3%); required all 4 subdomains ≥4.0 with no exceptions permitted
	Guideline alignment	22 (44.0%)	13 (24.5%)	AML: FLT3-ITD+/NPM1- risk classification inconsistencies; TLS high-risk criteria misapplied (WBC >25k vs >100k); IDSA/ESMO guideline deviations (Sessions 2, 4, 5, 6, 7, 11, 18, 19, 20, 23, 24, 25, 26, 38, 39, 47). CML: BCR-ABL1 response thresholds misapplied (8.5% at 3 months as “warning” vs optimal); ELTS score calculation errors; TFR timing violations; mutation analysis omissions (Sessions 56, 58, 79, 84, 88, 89, 97, 98, 99, 100, 102)
	Pharmacotherapeutic accuracy	24 (48.0%)	17 (32.1%)	AML: Concurrent allopurinol with rasburicase; blinatumomab (B-ALL therapy) for AML; venetoclax maintenance as first-line for favorable-risk AML (HiDAC indicated); rasburicase for low-risk TLS; inappropriate consolidation for refractory disease; pegfilgrastim during induction (Sessions 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 20, 21, 22, 30, 40). CML: Premature TFR after 1 year MMR (requires ≥2 years MR4.5); dasatinib switch for nilotinib-induced pleural effusion; suboptimal imatinib for intermediate-risk patients (Sessions 58, 79, 84, 88, 89, 97, 98, 99, 100, 102)
	Domain specificity	4 (8.0%)	0 (0%)	AML: Session 1 (blinatumomab, a CD19-targeted B-ALL therapy, for NPM1-mutated AML); Session 2 (sorafenib for KIT-mutated AML based on FLT3-ITD evidence); Session 21 (differentiation syndrome for standard 7+3 chemotherapy); Session 35 (CML content in AML session). CML: No domain entanglement
	Clinical plausibility	3 (6.0%)	0 (0%)	AML: Session 1 (hydroxyurea “not indicated” then “WBC decreased after hydroxyurea”); Session 12 (missing blast percentage); Session 22 (severe thrombocytopenia with “normal coagulation”). CML: No plausibility failures
	Fabricated evidence	7 (14.0%)	2 (3.8%)	AML: Sessions 3, 4, 5, 8, 9, 10, 13 (“MORPHO trial NEJM 2023” with false gilteritinib outcomes - actual trial negative; “RELMAZA trial” with invented azacitidine data). CML: Sessions 89, 100, possibly 79 (fabricated ELTS formulas with non-existent coefficients; mathematically impossible calculations)
Clinical reasoning fidelity	Overall domain failure^a	13 (26.0%)	6 (11.3%)	Required all 4 subdomains ≥4.0; pass rate 84/103 (81.6%); multiple concurrent reasoning errors per session common
	Response classification	8 (16.0%)	4 (7.5%)	AML: 31.5% blasts as “partial response” vs refractory disease; CR vs CRi misapplication (Sessions 14, 40, 41, 44). CML: BCR-ABL1 15% as “optimal” vs failure; 8.5% as “warning” vs optimal (Sessions 58, 78, 79, 98)
	Treatment sequencing	11 (22.0%)	4 (7.5%)	AML: Consolidation for refractory disease; Day 14 BM 5% blasts triggering second-line; venetoclax first-line vs HiDAC (Sessions 9, 11, 12, 16, 21, 28, 35, 38, 40, 42, 49). CML: TKI switching without mutation analysis; escalation at “warning”; premature TFR (Sessions 88, 99, 102, plus 1 additional)
	Answer-question mismatch	4 (8.0%)	1 (1.9%)	AML: TLS question receiving antifungal rationale; febrile neutropenia answered with consolidation (Sessions 34, 35, 39, 41). CML: Question-answer disconnection (Session 92)
	Question validity	1 (2.0%)	2 (3.8%)	AML: No correct answer among options (Session 51). CML: Sokal question with no valid components; unclear “none of options” (Sessions 56, 73)
Instructional design quality	Overall domain failure^b	9 (18.0%)	9 (17.0%)	Highest pass rate (85/103, 82.5%); flexible criteria (subdomains >3.0, mean >4.0) permitted compensation
	Learning objectives	50 (100%)	50 (94.3%)	All AML sessions and all CML sessions except 78, 82, and 91 lacked explicit objectives; goals implied through structure; noted as “minor weakness” in passing sessions
	Embedded feedback^b	3 (6.0%)	8 (15.1%)	AML: Only “Correct!” without rationale (Sessions 14, 19, 34). CML: No explanatory feedback (Sessions 56, 73, 85, 94, 100)
	Scaffolding quality	2 (4.0%)	0 (0%)	Strongest subdomain (101/103, 98.1% pass). Examples: Sessions 24, 35 (format collapse; lost progression)
	Instructional framing	3 (6.0%)	4 (7.5%)	Second-strongest (100/103, 97.1% pass). AML: Sessions 4, 11, 42. CML: Sessions 69, 70, 76, 80 lacking proper introduction
	Absence of final scoring	48 (96.0%)	50 (94.3%)	Only 5 sessions provided summaries (7, 52, 66, 80, 84); 98 sessions (95.1%) terminated without learning closure
	Case demographic uniformity	50 (100%)	3 (5.7%)	AML: Ages 42–47 years; identical symptoms (early satiety, weight loss). CML: Sessions 57, 61, 63 with duplicates/multiple cases
	Disease stage uniformity	N/A	53 (100%)	All CML cases were chronic phase only; no accelerated-phase or blast-crisis presentations, limiting disease spectrum exposure
	Answer revelation	0 (0%)	2 (3.8%)	Sessions 71, 96: Complete answers on hint request; listing answers before student attempts
	Language switching	0 (0%)	1 (1.9%)	Session 103: English to Arabic mid-session without justification

AML acute myeloid leukemia, CML chronic myeloid leukemia, CR complete remission, CRi complete remission with incomplete hematologic recovery, ELN European LeukemiaNet, ELTS EUTOS long-term survival, HiDAC high-dose cytarabine, MR4.5 molecular response with 4.5-log reduction, MMR major molecular response, MRD measurable residual disease, N/A not applicable, TFR treatment-free remission, TKI tyrosine kinase inhibitor, TLS tumor lysis syndrome, WBC white blood cell count.
^aIndividual sessions may exhibit multiple reasoning errors across categories.
^bInstructional design error frequencies reflect observed quality limitations; domain pass/fail was determined by combined performance across all subdomains (see Methods, Expert evaluation procedures).
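
The domain pass rules and pass rates reported in the table can be sketched as follows. This is a minimal illustration, not the study's scoring procedure: the function names, the example subdomain scores, and the assumption of a numeric rating scale topping out at 5 are hypothetical; only the thresholds (all subdomains ≥4.0 for the strict domains; subdomains >3.0 with mean >4.0 for instructional design) and the failure counts come from the table.

```python
from statistics import mean


def passes_strict(scores, threshold=4.0):
    """Strict rule from the table: every subdomain must score >= threshold."""
    return all(s >= threshold for s in scores)


def passes_flexible(scores, floor=3.0, mean_cutoff=4.0):
    """Flexible rule from the table: every subdomain > floor and mean > mean_cutoff."""
    return all(s > floor for s in scores) and mean(scores) > mean_cutoff


def pass_rate(failures_aml, failures_cml, n_aml=50, n_cml=53):
    """Return (passed, total, percent) from per-disease domain failure counts."""
    total = n_aml + n_cml
    passed = total - failures_aml - failures_cml
    return passed, total, round(100 * passed / total, 1)


# Hypothetical subdomain scores: one weak subdomain fails the strict rule
# but can be compensated under the flexible rule.
example = [3.5, 4.5, 4.5, 4.5]
strict_result = passes_strict(example)      # False: 3.5 is below 4.0
flexible_result = passes_flexible(example)  # True: all > 3.0, mean 4.25 > 4.0

# Domain failure counts taken from the "Overall domain failure" rows.
accuracy = pass_rate(25, 18)    # Clinical accuracy and safety
reasoning = pass_rate(13, 6)    # Clinical reasoning fidelity
design = pass_rate(9, 9)        # Instructional design quality
```

Under these assumptions the recomputed pass rates match the reported 60/103, 84/103, and 85/103, and the example shows how a single subdomain score of 3.5 fails a strict domain while still passing under the compensatory rule.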
