Table 1 Error patterns in LLM-generated hematologic pharmacotherapy simulations
From: Evaluating large language models for pharmacotherapy simulations: a mixed-methods study
Assessment domain | Subdomain | AML (N = 50) | CML (N = 53) | Error patterns and example session IDs |
|---|---|---|---|---|
Clinical accuracy and safety | Overall domain failure | 25 (50.0%) | 18 (34.0%) | Lowest pass rate (60/103, 58.3%); required all 4 subdomains ≥4.0 with no exceptions permitted |
|  | Guideline alignment | 22 (44.0%) | 13 (24.5%) | AML: FLT3-ITD+/NPM1- risk classification inconsistencies; TLS high-risk criteria misapplied (WBC >25k vs >100k); IDSA/ESMO guideline deviations (Sessions 2, 4, 5, 6, 7, 11, 18, 19, 20, 23, 24, 25, 26, 38, 39, 47). CML: BCR-ABL1 response thresholds misapplied (8.5% at 3 months as “warning” vs optimal); ELTS score calculation errors; TFR timing violations; mutation analysis omissions (Sessions 56, 58, 79, 84, 88, 89, 97, 98, 99, 100, 102) |
|  | Pharmacotherapeutic accuracy | 24 (48.0%) | 17 (32.1%) | AML: Concurrent allopurinol with rasburicase; blinatumomab (B-ALL therapy) for AML; venetoclax maintenance as first-line for favorable-risk AML (HiDAC indicated); rasburicase for low-risk TLS; inappropriate consolidation for refractory disease; pegfilgrastim during induction (Sessions 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 20, 21, 22, 30, 40). CML: Premature TFR after 1 year MMR (requires ≥2 years MR4.5); dasatinib switch for nilotinib-induced pleural effusion; suboptimal imatinib for intermediate-risk patients (Sessions 58, 79, 84, 88, 89, 97, 98, 99, 100, 102) |
|  | Domain specificity | 4 (8.0%) | 0 (0%) | AML: Session 1 (blinatumomab - CD19-targeted B-ALL therapy for NPM1-mutated AML); Session 2 (sorafenib for KIT-mutated AML based on FLT3-ITD evidence); Session 21 (differentiation syndrome for standard 7+3 chemotherapy); Session 35 (CML content in AML session). CML: No domain entanglement |
|  | Clinical plausibility | 3 (6.0%) | 0 (0%) | AML: Session 1 (hydroxyurea “not indicated” then “WBC decreased after hydroxyurea”); Session 12 (missing blast percentage); Session 22 (severe thrombocytopenia with “normal coagulation”). CML: No plausibility failures |
|  | Fabricated evidence | 7 (14.0%) | 2 (3.8%) | AML: Sessions 3, 4, 5, 8, 9, 10, 13 (“MORPHO trial NEJM 2023” with false gilteritinib outcomes - actual trial negative; “RELMAZA trial” with invented azacitidine data). CML: Sessions 89, 100, possibly 79 (fabricated ELTS formulas with non-existent coefficients; mathematically impossible calculations) |
Clinical reasoning fidelity | Overall domain failure<sup>a</sup> | 13 (26.0%) | 6 (11.3%) | Required all 4 subdomains ≥4.0; pass rate 84/103 (81.6%); multiple concurrent reasoning errors per session common |
|  | Response classification | 8 (16.0%) | 4 (7.5%) | AML: 31.5% blasts as “partial response” vs refractory disease; CR vs CRi misapplication (Sessions 14, 40, 41, 44). CML: BCR-ABL1 15% as “optimal” vs failure; 8.5% as “warning” vs optimal (Sessions 58, 78, 79, 98) |
|  | Treatment sequencing | 11 (22.0%) | 4 (7.5%) | AML: Consolidation for refractory disease; Day 14 BM 5% blasts triggering second-line; venetoclax first-line vs HiDAC (Sessions 9, 11, 12, 16, 21, 28, 35, 38, 40, 42, 49). CML: TKI switching without mutation analysis; escalation at “warning”; premature TFR (Sessions 88, 99, 102, plus 1 additional) |
|  | Answer-question mismatch | 4 (8.0%) | 1 (1.9%) | AML: TLS question receiving antifungal rationale; febrile neutropenia answered with consolidation (Sessions 34, 35, 39, 41). CML: Question-answer disconnection (Session 92) |
|  | Question validity | 1 (2.0%) | 2 (3.8%) | AML: No correct answer among options (Session 51). CML: Sokal question with no valid components; unclear “none of options” (Sessions 56, 73) |
Instructional design quality | Overall domain failure<sup>b</sup> | 9 (18.0%) | 9 (17.0%) | Highest pass rate (85/103, 82.5%); flexible criteria (subdomains >3.0, mean >4.0) permitted compensation |
|  | Learning objectives | 50 (100%) | 50 (94.3%) | All AML sessions and all CML sessions except 78, 82, and 91 lacked explicit objectives; goals were implied through structure; noted as a “minor weakness” in passing sessions |
|  | Embedded feedback<sup>b</sup> | 3 (6.0%) | 8 (15.1%) | AML: Only “Correct!” without rationale (Sessions 14, 19, 34). CML: No explanatory feedback (Sessions 56, 73, 85, 94, 100) |
|  | Scaffolding quality | 2 (4.0%) | 0 (0%) | Strongest subdomain (101/103, 98.1% pass). Examples: Sessions 24, 35 (format collapse; lost progression) |
|  | Instructional framing | 3 (6.0%) | 4 (7.5%) | Second-strongest (100/103, 97.1% pass). AML: Sessions 4, 11, 42. CML: Sessions 69, 70, 76, 80 lacking proper introduction |
|  | Absence of final scoring | 48 (96.0%) | 50 (94.3%) | Only 5 sessions provided summaries (7, 52, 66, 80, 84); 98 sessions (95.1%) terminated without learning closure |
|  | Case demographic uniformity | 50 (100%) | 3 (5.7%) | AML: Ages 42–47 years; identical symptoms (early satiety, weight loss). CML: Sessions 57, 61, 63 with duplicates/multiple cases |
|  | Disease stage uniformity | NA | 53 (100%) | All CML cases were chronic phase only; zero accelerated phase or blast crisis presentations, limiting disease spectrum exposure |
|  | Answer revelation | 0 (0%) | 2 (3.8%) | Sessions 71, 96: Complete answers on hint request; listing answers before student attempts |
|  | Language switching | 0 (0%) | 1 (1.9%) | Session 103: English to Arabic mid-session without justification |
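
The two pass rules and the pass-rate arithmetic reported in the table can be sketched in a few lines. This is only an illustration under stated assumptions: subdomain scores are taken to be plain numbers, and the function names `passes_strict`, `passes_flexible`, and `pass_rate` are hypothetical and do not come from the study. The strict rule (every subdomain at least 4.0) is the one the table gives for the clinical accuracy and clinical reasoning domains; the flexible rule (every subdomain above 3.0 and a mean above 4.0) is the one given for instructional design quality.

```python
from statistics import mean


def passes_strict(scores, threshold=4.0):
    """Strict rule: every subdomain score must reach the threshold."""
    return all(s >= threshold for s in scores)


def passes_flexible(scores, floor=3.0, mean_threshold=4.0):
    """Flexible rule: every score above the floor AND the mean above the threshold.

    A weaker subdomain can be compensated by stronger ones.
    """
    return all(s > floor for s in scores) and mean(scores) > mean_threshold


def pass_rate(aml_failures, cml_failures, n_aml=50, n_cml=53):
    """Return (passed, total, percent) from per-arm failure counts.

    Default arm sizes are the N values in the table header.
    """
    total = n_aml + n_cml
    passed = total - (aml_failures + cml_failures)
    return passed, total, round(100 * passed / total, 1)


# Hypothetical scores: one subdomain at 3.5 fails the strict rule
# but is compensated under the flexible rule (mean = 4.375).
example = [3.5, 5.0, 4.5, 4.5]
print(passes_strict(example))    # False
print(passes_flexible(example))  # True

# Domain-level failure counts from the table reproduce the stated pass rates.
print(pass_rate(25, 18))  # (60, 103, 58.3)
print(pass_rate(13, 6))   # (84, 103, 81.6)
print(pass_rate(9, 9))    # (85, 103, 82.5)
```

The example scores are invented to show how compensation works; only the failure counts passed to `pass_rate` are taken from the table.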