Table 1 Error patterns in LLM-generated hematologic pharmacotherapy simulations

From: Evaluating large language models for pharmacotherapy simulations: a mixed-methods study

Assessment domain

Subdomain

AML (N = 50)

CML (N = 53)

Error patterns and example session IDs

Clinical accuracy and safety

Overall domain failure

25 (50.0%)

18 (34.0%)

Lowest pass rate (60/103, 58.3%); required all 4 subdomains ≥4.0 with no exceptions permitted

Guideline alignment

22 (44.0%)

13 (24.5%)

AML: FLT3-ITD+/NPM1- risk classification inconsistencies; TLS high-risk criteria misapplied (WBC >25k vs >100k); IDSA/ESMO guideline deviations (Sessions 2, 4, 5, 6, 7, 11, 18, 19, 20, 23, 24, 25, 26, 38, 39, 47). CML: BCR-ABL1 response thresholds misapplied (8.5% at 3 months as “warning” vs optimal); ELTS score calculation errors; TFR timing violations; mutation analysis omissions (Sessions 56, 58, 79, 84, 88, 89, 97, 98, 99, 100, 102)

Pharmacotherapeutic accuracy

24 (48.0%)

17 (32.1%)

AML: Concurrent allopurinol with rasburicase; blinatumomab (B-ALL therapy) for AML; venetoclax maintenance as first-line for favorable-risk AML (HiDAC indicated); rasburicase for low-risk TLS; inappropriate consolidation for refractory disease; pegfilgrastim during induction (Sessions 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 20, 21, 22, 30, 40). CML: Premature TFR after 1 year MMR (requires ≥2 years MR4.5); dasatinib switch for nilotinib-induced pleural effusion; suboptimal imatinib for intermediate-risk patients (Sessions 58, 79, 84, 88, 89, 97, 98, 99, 100, 102)

Domain specificity

4 (8.0%)

0 (0%)

AML: Session 1 (blinatumomab - CD19-targeted B-ALL therapy for NPM1-mutated AML); Session 2 (sorafenib for KIT-mutated AML based on FLT3-ITD evidence); Session 21 (differentiation syndrome for standard 7 + 3 chemotherapy); Session 35 (CML content in AML session). CML: No domain entanglement

Clinical plausibility

3 (6.0%)

0 (0%)

AML: Session 1 (hydroxyurea “not indicated” then “WBC decreased after hydroxyurea”); Session 12 (missing blast percentage); Session 22 (severe thrombocytopenia with “normal coagulation”). CML: No plausibility failures

Fabricated evidence

7 (14.0%)

2 (3.8%)

AML: Sessions 3, 4, 5, 8, 9, 10, 13 (“MORPHO trial NEJM 2023” with false gilteritinib outcomes - actual trial negative; “RELMAZA trial” with invented azacitidine data). CML: Sessions 89, 100, possibly 79 (fabricated ELTS formulas with non-existent coefficients; mathematically impossible calculations)

Clinical reasoning fidelity

Overall domain failure^a

13 (26.0%)

6 (11.3%)

Required all 4 subdomains ≥4.0; pass rate 84/103 (81.6%); multiple concurrent reasoning errors per session common

Response classification

8 (16.0%)

4 (7.5%)

AML: 31.5% blasts as “partial response” vs refractory disease; CR vs CRi misapplication (Sessions 14, 40, 41, 44). CML: BCR-ABL1 15% as “optimal” vs failure; 8.5% as “warning” vs optimal (Sessions 58, 78, 79, 98)

Treatment sequencing

11 (22.0%)

4 (7.5%)

AML: Consolidation for refractory disease; Day 14 BM 5% blasts triggering second-line; venetoclax first-line vs HiDAC (Sessions 9, 11, 12, 16, 21, 28, 35, 38, 40, 42, 49). CML: TKI switching without mutation analysis; escalation at “warning”; premature TFR (Sessions 88, 99, 102, plus 1 additional)

Answer-question mismatch

4 (8.0%)

1 (1.9%)

AML: TLS question receiving antifungal rationale; febrile neutropenia answered with consolidation (Sessions 34, 35, 39, 41). CML: Question-answer disconnection (Session 92)

Question validity

1 (2.0%)

2 (3.8%)

AML: No correct answer among options (Session 51). CML: Sokal question with no valid components; unclear “none of options” (Sessions 56, 73)

Instructional design quality

Overall domain failure^b

9 (18.0%)

9 (17.0%)

Highest pass rate (85/103, 82.5%); flexible criteria (subdomains >3.0, mean >4.0) permitted compensation

Learning objectives

50 (100%)

50 (94.3%)

All AML; all CML except Sessions 78, 82, 91 lacked explicit objectives; goals implied through structure; noted as “minor weakness” in passing sessions

Embedded feedback^b

3 (6.0%)

8 (15.1%)

AML: Only “Correct!” without rationale (Sessions 14, 19, 34). CML: No explanatory feedback (Sessions 56, 73, 85, 94, 100)

Scaffolding quality

2 (4.0%)

0 (0%)

Strongest subdomain (101/103, 98.1% pass). Examples: Sessions 24, 35 (format collapse; lost progression)

Instructional framing

3 (6.0%)

4 (7.5%)

Second-strongest (100/103, 97.1% pass). AML: Sessions 4, 11, 42. CML: Sessions 69, 70, 76, 80 lacking proper introduction

Absence of final scoring

48 (96.0%)

50 (94.3%)

Only 5 sessions provided summaries (7, 52, 66, 80, 84); 98 sessions (95.1%) terminated without learning closure

Case demographic uniformity

50 (100%)

3 (5.7%)

AML: Ages 42–47 years; identical symptoms (early satiety, weight loss). CML: Sessions 57, 61, 63 with duplicates/multiple cases

Disease stage uniformity

N/A

53 (100%)

All CML chronic phase only; zero accelerated phase or blast crisis presentations limiting disease spectrum exposure

Answer revelation

0 (0%)

2 (3.8%)

Sessions 71, 96: Complete answers on hint request; listing answers before student attempts

Language switching

0 (0%)

1 (1.9%)

Session 103: English to Arabic mid-session without justification

  1. AML acute myeloid leukemia, CML chronic myeloid leukemia, CR complete remission, CRi complete remission with incomplete hematologic recovery, ELN European LeukemiaNet, ELTS EUTOS long-term survival, HiDAC high-dose cytarabine, MR4.5 4.5-log reduction, MMR major molecular response, MRD measurable residual disease, N/A not applicable, TFR treatment-free remission, TKI tyrosine kinase inhibitor, TLS tumor lysis syndrome, WBC white blood cell count.
  2. ^a Individual sessions may exhibit multiple reasoning errors across categories.
  3. ^b Instructional Design error frequencies reflect observed quality limitations; domain pass/fail was determined by combined performance across all subdomains (Methods - Expert evaluation procedures).
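
The domain-level pass rules and pass rates quoted in the table can be checked with a short sketch. This is an illustration only, not the study's scoring code. The function names and the example scores are hypothetical, and the thresholds are taken as worded in the table: all subdomains ≥4.0 for the strict rule, and every subdomain >3.0 with a mean >4.0 for the flexible rule.

```python
from statistics import mean

N_AML, N_CML = 50, 53  # session counts from the table header


def strict_pass(scores, threshold=4.0):
    """Strict rule (assumed from the table): every subdomain score >= threshold."""
    return all(s >= threshold for s in scores)


def flexible_pass(scores, floor=3.0, mean_floor=4.0):
    """Flexible rule (assumed from the table): every subdomain > floor and mean > mean_floor."""
    return all(s > floor for s in scores) and mean(scores) > mean_floor


def pass_rate(failures_aml, failures_cml, n_aml=N_AML, n_cml=N_CML):
    """Return (passed, total, percent) from per-disease domain failure counts."""
    total = n_aml + n_cml
    passed = total - failures_aml - failures_cml
    return passed, total, round(100 * passed / total, 1)


# Domain failure counts as reported in the "Overall domain failure" rows.
clinical_accuracy = pass_rate(25, 18)
clinical_reasoning = pass_rate(13, 6)
instructional_design = pass_rate(9, 9)

# Hypothetical subdomain scores: one weak subdomain fails the strict rule
# but is compensated under the flexible rule.
example_scores = [4.5, 4.0, 3.5, 5.0]
strict_result = strict_pass(example_scores)
flexible_result = flexible_pass(example_scores)
```

Under these assumptions the failure counts reproduce the reported pass rates (60/103, 84/103, and 85/103), and the hypothetical score set shows how the flexible criteria permit compensation that the strict criteria do not.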