Abstract
Simulation-based learning is essential in clinical pharmacy education but requires substantial faculty resources that limit scalability. Large language models (LLMs) offer promise for generating scalable simulations, yet their pedagogical rigor and clinical reliability remain unclear. In a mixed-methods, counterbalanced evaluation study, PharmD students (n = 104) engaged with acute myeloid leukemia (AML) or chronic myeloid leukemia (CML) cases generated by four LLMs using expert-guided meta-prompts; these two conditions require complex longitudinal management yet share substantial semantic similarity. Expert panels evaluated sessions across clinical authenticity, instructional design, and clinical reasoning; students completed satisfaction surveys. Of 103 completed sessions, 53 (51.5%) met passing criteria across all domains. Clinical accuracy and safety emerged as the limiting domain (58.3% passing), compared with clinical reasoning (81.6%) and instructional design (82.5%). CML sessions outperformed AML sessions (62.3% vs 40.0%; p = 0.031). Platform success rates ranged from 34.5% to 62.1%. Error analysis revealed guideline misalignment, pharmacotherapeutic inaccuracies, fabricated evidence, and cross-condition therapeutic recommendations, the last occurring exclusively in AML sessions. Students favored LLMs over traditional methods (49.8% vs 30.0%); however, we did not detect statistically significant alignment between student satisfaction and expert-assessed quality. Sessions more frequently met criteria for instructional design and clinical reasoning than for pharmacotherapeutic accuracy and guideline alignment. Expert oversight with platform-specific and disease-specific validation remains essential for safe educational deployment, and effectiveness trials assessing objective learning outcomes represent necessary subsequent work.
Introduction
Simulation-based learning plays an important role in clinical pharmacy education because it allows students to practice clinical reasoning and therapeutic decision-making in safe, controlled settings1,2,3. Well-designed simulations provide structured scenarios and timely feedback that support cognitive development and progressive skill acquisition4,5,6. However, traditional simulation methods are difficult to scale, requiring substantial faculty time, specialized expertise, and institutional resources that limit broad curricular integration7,8. Large language models (LLMs) offer a promising solution by generating sophisticated, interactive clinical simulations at scale9,10,11.
Despite their potential, LLMs’ clinical accuracy and educational validity remain uncertain in specialized therapeutic areas11,12. Most existing evaluations focus on short, discrete knowledge questions rather than extended clinical scenarios that require sustained reasoning, longitudinal decision-making, and contextual adaptation13,14,15. This gap is critical because LLMs rely on statistical learning processes that can introduce systematic errors, which may go undetected without rigorous domain-specific testing16.
One notable error pattern arises from how LLMs process semantically related information. Because these models learn statistical associations from training data, conditions sharing clinical features or terminology may be inappropriately conflated in generated content17,18. Recent work has shown that word embeddings can misattribute symptoms between distinct diseases when conditions share semantic similarities, with errors stemming from tangential associations rather than direct clinical relevance19,20. In medical contexts, such conflation can inadvertently merge management strategies for conditions requiring fundamentally different therapeutic approaches, a pattern we refer to as domain entanglement19. These errors are particularly concerning in educational settings, as they often appear coherent and authoritative, potentially reinforcing inaccurate or unsafe knowledge among learners who lack sufficient clinical experience to recognize the inaccuracies21,22.
To systematically assess these vulnerabilities, rigorous evaluation requires a strategic selection of test domains that combine authentic educational applications with conditions likely to expose systematic errors23,24. Hematologic malignancies offer this combination, involving complex, evidence-based treatment algorithms, frequent protocol updates, and challenging clinical decisions that students must master before practice25,26,27,28. Within this domain, acute myeloid leukemia (AML) and chronic myeloid leukemia (CML) provide a strategically designed stress test. These conditions share myeloid cell lineage and present with overlapping clinical and laboratory features, creating semantic similarity that may challenge LLMs’ ability to maintain appropriate therapeutic boundaries. However, their management approaches differ fundamentally: AML requires time-sensitive intensive chemotherapy with consolidation decisions guided by molecular features and remission status29, while CML requires chronic oral tyrosine kinase inhibitor therapy with ongoing molecular monitoring and specific criteria for adjusting treatment30.
This pairing enables direct evaluation of whether semantic overlap causes domain entanglement. If models inappropriately recommend CML-specific tyrosine kinase inhibitors for AML patients or apply AML induction regimens to CML cases, it would demonstrate a safety-critical failure mode. Conversely, successful boundary preservation despite semantic similarity would suggest that structured prompting can mitigate this vulnerability. Additionally, the complexity gradient between these conditions (CML following relatively linear therapeutic pathways while AML requires multi-variable conditional reasoning) enables assessment of whether therapeutic complexity independently affects LLM performance, with implications for predicting performance in other therapeutic areas.
In this study, we evaluated how well LLMs generate pharmacotherapy simulations requiring accurate reasoning, safe therapeutic recommendations, and sound instructional design. Our primary aims were to (1) characterize LLM performance across instructional design quality, clinical accuracy and safety, and clinical reasoning fidelity, and (2) compare performance between AML and CML to test whether semantic similarity challenges boundary preservation while assessing whether therapeutic complexity independently affects accuracy. Our secondary aims were to (1) compare performance across four major platforms to distinguish general model capabilities from platform-specific characteristics, and (2) examine whether student satisfaction aligns with expert-rated quality to inform oversight requirements for safe educational deployment.
Results
Session characteristics and inter-rater reliability
A total of 103 sessions were evaluated (one student did not complete the study), comprising 50 AML and 53 CML simulations distributed across four platforms: Gemini (n = 29, 28.2%), GPT-4o (n = 29, 28.2%), DeepSeek (n = 23, 22.3%), and Claude (n = 22, 21.4%). Inter-rater reliability was excellent, with an overall Krippendorff’s alpha of 0.83 (95% CI: 0.724–0.875), exceeding the prespecified threshold of 0.80. Pairwise agreement varied across rater pairs (Supplementary D), with near-perfect concordance between two reviewers (κ = 0.955) and moderate-to-substantial concordance for pairs involving the educator reviewer (κ = 0.633–0.656)31. Full scoring outputs are provided in Supplementary D, and complete session transcripts are available in Supplementary E.
Overall session success rate
Of 103 sessions evaluated, 53 (51.5%; 95% CI: 41.7–61.2%) met passing criteria across all three domains simultaneously (Fig. 1). Domain-specific success rates were 60/103 (58.3%) for clinical accuracy and safety, 84/103 (81.6%) for clinical reasoning fidelity, and 85/103 (82.5%) for instructional design quality, with clinical accuracy and safety emerging as the limiting domain.
Proportion of sessions meeting domain-specific pass/fail criteria: Clinical Accuracy & Safety and Clinical Reasoning Fidelity required all subdomains ≥4.0; Instructional Design Quality required all subdomains >3.0 with mean >4.0. A Overall success across the three domains. B Comparison across LLM platforms and overall success. C Comparison by disease type. Error bars = standard error.
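The mixed gating rule in the caption (non-compensatory thresholds for the two clinical domains, compensatory scoring with a floor for instructional design) can be sketched in code. This is an illustrative sketch, not the authors' scoring software; the score lists are hypothetical inputs on the study's 1–5 rubric scale.

```python
def passes_noncompensatory(subdomain_scores, threshold=4.0):
    """Clinical accuracy & safety and clinical reasoning fidelity:
    every subdomain must reach the threshold (no averaging)."""
    return all(s >= threshold for s in subdomain_scores)

def passes_compensatory(subdomain_scores, floor=3.0, mean_cut=4.0):
    """Instructional design quality: each subdomain must exceed a
    minimum floor (> 3.0), and the subdomain mean must exceed 4.0."""
    return (all(s > floor for s in subdomain_scores)
            and sum(subdomain_scores) / len(subdomain_scores) > mean_cut)

def session_passes(accuracy_safety, reasoning, design):
    """A session passes only if all three domains pass simultaneously."""
    return (passes_noncompensatory(accuracy_safety)
            and passes_noncompensatory(reasoning)
            and passes_compensatory(design))
```

Under this rule a strong instructional subdomain can offset a middling one (within the floor), whereas a single clinical subdomain below 4.0 fails the session regardless of the other scores.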
Performance by disease type
CML sessions demonstrated higher overall success rates than AML sessions: 33/53 (62.3%) versus 20/50 (40.0%; OR (CML vs AML) = 2.50, 95% CI: 1.12–5.56; RR (CML/AML) = 1.56, 95% CI: 1.04–2.32; Cohen’s h = 0.449; p = 0.031) (Fig. 1). Domain-specific comparisons showed consistent trends favoring CML: clinical accuracy and safety CML 35/53 (66.0%) versus AML 25/50 (50.0%); clinical reasoning fidelity CML 47/53 (88.7%) versus AML 37/50 (74.0%); instructional design quality remained nearly equivalent at CML 44/53 (83.0%) versus AML 41/50 (82.0%).
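The effect sizes above can be reproduced from the reported pass counts using the standard formulas. This is a verification sketch only; small rounding differences against the paper's unrounded data are possible (e.g., the sample odds ratio works out to ≈2.48 versus the reported 2.50, presumably reflecting the estimator or rounding used).

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

cml_pass, cml_n = 33, 53   # CML sessions passing all domains
aml_pass, aml_n = 20, 50   # AML sessions passing all domains

p_cml = cml_pass / cml_n   # 0.623
p_aml = aml_pass / aml_n   # 0.400
rr = p_cml / p_aml         # risk ratio, ~1.56
odds_ratio = (cml_pass / (cml_n - cml_pass)) / (aml_pass / (aml_n - aml_pass))
h = cohens_h(p_cml, p_aml) # ~0.449
```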
At the subdomain level, the largest performance gaps occurred within clinical accuracy and safety (Fig. 2): guideline alignment, CML 40/53 (75.5%) versus AML 28/50 (56.0%); pharmacotherapeutic accuracy, CML 36/53 (67.9%) versus AML 26/50 (52.0%); and domain specificity, CML 53/53 (100.0%) versus AML 46/50 (92.0%), with domain entanglement occurring exclusively in AML sessions (4/50, 8.0% vs 0/53, 0%). This relatively low event frequency limits robust mechanistic interpretation. Clinical reasoning fidelity and instructional design quality subdomains showed minimal disease-type variation.
A Overall success rates by domain for AML (top) and CML (bottom) cases. B Subdomain success rates by domain (rows) and disease type (columns: AML left, CML right). Radial axes apply a three-tier transformation to enhance separation at high-performance levels. Platforms are ranked by overall session success. AML acute myeloid leukemia, CML chronic myeloid leukemia.
Performance by platform
Platform-level overall success rates ranged from 34.5% to 62.1%: Gemini 2.0 Pro 18/29 (62.1%, 95% CI: 44.0–77.3%), Claude 3.7 Sonnet 13/22 (59.1%, 95% CI: 38.7–76.7%), DeepSeek V2 12/23 (52.2%, 95% CI: 33.0–70.8%), and GPT-4o 10/29 (34.5%, 95% CI: 19.9–52.7%) (Fig. 1). Chi-square analysis revealed no significant platform differences (p = 0.160, Cramér’s V = 0.224). Post hoc power analysis confirmed inadequate power for all platform comparisons (5.5–39.7%).
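The four-platform comparison corresponds to a standard chi-square test of independence on the pass/fail counts. A verification sketch (requires SciPy; for a 4 × 2 table, `chi2_contingency` applies no continuity correction):

```python
import math
from scipy.stats import chi2_contingency

# Rows: Gemini, Claude, DeepSeek, GPT-4o; columns: pass, fail
table = [[18, 11],
         [13,  9],
         [12, 11],
         [10, 19]]

chi2, p, dof, expected = chi2_contingency(table)
n = sum(sum(row) for row in table)  # 103 sessions
# Cramér's V with min(rows, cols) - 1 = 1 degree of association
cramers_v = math.sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))
```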
Domain-specific analysis revealed platform strengths and weaknesses (Fig. 1). For clinical accuracy and safety, performance spanned 39.4 percentage points: Claude 17/22 (77.3%), Gemini 20/29 (69.0%), DeepSeek 12/23 (52.2%), and GPT-4o 11/29 (37.9%). For instructional design quality: DeepSeek 22/23 (95.7%), Gemini 27/29 (93.1%), GPT-4o 25/29 (86.2%), and Claude 17/22 (77.3%). For clinical reasoning fidelity: DeepSeek 20/23 (87.0%), Claude 19/22 (86.4%), Gemini 24/29 (82.8%), and GPT-4o 21/29 (72.4%).
DeepSeek demonstrated marked disease-specific performance variation: CML sessions 11/13 (84.6%) versus AML sessions 1/10 (10.0%; OR (CML vs AML) = 50.0, 95% CI: 3.85–∞; RR (CML/AML) = 8.46; p < 0.001).
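With a cell count of one, Fisher's exact test is the natural check for this sparse comparison. A sketch (SciPy's `fisher_exact` reports the sample odds ratio, 49.5 here; the paper's 50.0 presumably reflects its estimator or rounding):

```python
from scipy.stats import fisher_exact

# DeepSeek sessions: rows CML, AML; columns pass, fail
table = [[11, 2],
         [ 1, 9]]

sample_or, p = fisher_exact(table, alternative="two-sided")
rr = (11 / 13) / (1 / 10)  # risk ratio, ~8.46
```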
Subdomain-level performance patterns
Success rates varied across twelve subdomains (Fig. 2), ranging from 60.2% to 99.0%. The weakest performers were pharmacotherapeutic accuracy (62/103, 60.2%) and guideline alignment (68/103, 66.0%), both within clinical accuracy and safety. The strongest performers were problem identification (102/103, 99.0%), scaffolding quality (101/103, 98.1%), instructional framing (100/103, 97.1%), and clinical narrative plausibility (100/103, 97.1%).
Platform-specific patterns (Fig. 2) revealed GPT-4o with consistent weaknesses in pharmacotherapeutic accuracy (12/29, 41.4%) and guideline alignment (17/29, 58.6%). Claude achieved the highest clinical accuracy and safety performance (17/22, 77.3%) with balanced subdomain scores. DeepSeek demonstrated near-perfect instructional design quality (22/23, 95.7%) but moderate clinical accuracy and safety performance. Gemini showed consistent performance across all subdomain categories (Fig. 3).
A Overall success rates with standard error bars; pharmacotherapeutic accuracy and guideline alignment were the lowest-performing domains. B Platform-level heatmap showing variation in success rates, with text optimized for contrast. C Disease-type heatmap comparing CML and AML performance. Asterisks indicate significance: ***p < 0.001, **p < 0.01, *p < 0.05. AML acute myeloid leukemia, CML chronic myeloid leukemia.
Clinical error analysis
Error analysis revealed three prominent failure patterns (Table 1). Domain entanglement occurred exclusively in AML sessions (4/50, 8.0%), where therapies from related hematologic conditions were inappropriately applied—including blinatumomab (a B-ALL-specific agent) recommended for AML and differentiation syndrome incorrectly attributed to standard chemotherapy32. Fabricated evidence emerged in 9 sessions, presenting invented clinical trials with specific statistical outcomes (e.g., “MORPHO trial33,34 NEJM 2023” with false gilteritinib data) and mathematically impossible risk scoring formulas. The most frequent errors involved guideline misalignment (22 AML, 13 CML) and pharmacotherapeutic inaccuracies (24 AML, 17 CML), including concurrent allopurinol with rasburicase, premature treatment-free remission attempts, and inappropriate therapy escalations at warning response milestones.
Student satisfaction and preference-safety alignment
Student satisfaction data were obtained from 102 participants (one completed the simulation but not the survey) with excellent internal consistency (Cronbach’s α = 0.939; Supplementary D). Overall mean satisfaction score was 3.41 (SD = 1.44), significantly above the neutral midpoint of 3.0 (p < 0.001, Cohen’s d = 0.282) (Fig. 4).
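The one-sample effect size against the neutral midpoint follows the usual standardized-difference formula. A sketch using the rounded summary statistics (which give ≈0.285; the paper's d = 0.282 presumably reflects unrounded data):

```python
def cohens_d_one_sample(mean, sd, mu0=3.0):
    """Standardized distance of the sample mean from a reference value."""
    return (mean - mu0) / sd

d = cohens_d_one_sample(3.41, 1.44)  # ~0.285
```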
A Overall 5-point Likert preferences (red = traditional, gray = neutral, green-blue = LLM). B Diverging bars for eight assessment dimensions (1–2 left, 4–5 right; neutral excluded). C Satisfaction by expert session quality (PASS vs FAIL; SEM; Mann–Whitney U, FDR-corrected). D Expert pass rates (green) vs student satisfaction (blue) across four LLM platforms. SE standard error, SEM standard error of the mean, FDR false discovery rate.
Preference distribution showed 406 responses (49.8%) favoring LLMs, 165 (20.2%) neutral, and 245 (30.0%) favoring traditional methods, differing significantly from uniform expectations (p < 0.001, Cohen’s w = 0.368).
Students significantly favored LLMs for ease of use (65/102, 63.7%, 95% CI: 54.1–72.4%, Cohen’s h = 0.278, p = 0.007) and time saving (63/102, 61.8%, 95% CI: 52.1–70.6%, Cohen’s h = 0.238, p = 0.022) but significantly favored traditional methods for clinical practice realism (38/102, 37.3%, 95% CI: 28.5–46.9%, Cohen’s h = −0.258, p = 0.013) (Fig. 4). No directional preferences emerged for the other dimensions, including diagnostic skills, clinical confidence, learning enjoyment, exam preparation, and future use intent. Mean satisfaction scores differed little across platforms, with no significant platform differences (p = 0.442, ε² ≈ 0); DeepSeek received the highest satisfaction (M = 3.68, SD = 1.14) and Claude the lowest (M = 3.11, SD = 1.03).
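For the per-dimension preference tests, Cohen's h compares the observed LLM-favoring proportion against the 0.5 null of no directional preference. A verification sketch for the ease-of-use and time-saving dimensions (the rounded counts give ≈0.237 for time saving versus the reported 0.238, presumably a rounding artifact):

```python
import math

def cohens_h_vs_null(p1, p0=0.5):
    """Cohen's h of an observed proportion against a null proportion."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p0))

h_ease = cohens_h_vs_null(65 / 102)  # ease of use, ~0.278
h_time = cohens_h_vs_null(63 / 102)  # time saving, ~0.237
```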
Stratified analyses did not detect statistically significant alignment between student satisfaction and expert-assessed content quality in this sample (Fig. 4), and these comparisons were not powered to exclude moderate associations. Effect sizes and confidence intervals for these stratified comparisons are provided in Supplementary D.
Students who experienced sessions meeting expert clinical accuracy and safety standards reported similar satisfaction on clinical practice realism compared with those who experienced failing sessions (3.15 vs 2.86, q = 0.626). For clinical confidence and diagnostic skills improvement, students who experienced sessions failing expert clinical reasoning fidelity criteria reported numerically higher satisfaction than those who experienced passing sessions (3.47 vs 3.11 and 3.68 vs 3.24, respectively), though the differences were not statistically significant (q = 0.653 and q = 0.259). With only four platforms, rank correlation analysis is underpowered to yield interpretable results; we therefore present platform-level satisfaction and safety pass rates descriptively in Fig. 4, and we report the observed rank correlation (ρ = −0.80) without a confidence interval because bootstrap resampling is unreliable at n = 4. DeepSeek received the highest student preference (62.5%) but demonstrated only a 52.2% safety pass rate, while Claude achieved the highest safety pass rate (77.3%) but received only 52.4% student preference.
Discussion
We systematically evaluated four major LLMs in generating pharmacotherapy simulations for AML and CML across instructional design quality, clinical accuracy and safety, and clinical reasoning fidelity. Overall success reached 51.5%, with notable domain heterogeneity. Models demonstrated relatively strong performance in instructional design (82.5%) and clinical reasoning fidelity (81.6%), while clinical accuracy and safety emerged as the primary limitation (58.3%). Current models appear capable of structuring learning experiences and modeling reasoning processes but demonstrate challenges with precise clinical content generation, indicating a need for targeted improvements and rigorous oversight before high-stakes educational deployment.
The clinical accuracy and safety domain showed uneven performance across its four subdomains. Clinical narrative plausibility and domain specificity both exceeded 90% success, indicating models can generate coherent scenarios and often maintain disease-specific boundaries when guided by structured prompts. However, pharmacotherapeutic accuracy and guideline alignment showed substantially lower success. Models appear capable of constructing believable clinical narratives but demonstrate difficulties translating evidence-based guidelines into accurate therapeutic recommendations. High narrative plausibility may explain why students engaged positively with content despite its pharmacotherapeutic inaccuracies, which highlights the need for expert review before educational deployment. In contrast, performance in instructional design quality and clinical reasoning fidelity remained relatively consistent across platforms and disease types. Scaffolding quality, instructional framing, learning objective clarity, problem identification, and knowledge integration all achieved relatively strong success with minimal variation. This consistency may indicate that pedagogical structure generation and reasoning process modeling represent more stable capabilities, potentially less sensitive to clinical complexity than content accuracy. The contrast between relatively stable pedagogical performance and variable clinical accuracy reinforces that content precision represents a primary technical challenge.
The disease-specific performance differences we observed provide insight into factors that influence model capabilities and suggest how findings may generalize to other conditions. Acute and chronic myeloid leukemia share myeloid cell lineage with overlapping clinical features, creating semantic similarity that prior literature suggested might challenge boundary preservation in language models16,19,35,36. Their management approaches differ fundamentally. CML follows relatively linear therapeutic pathways with tyrosine kinase inhibitor selection based on comorbidities and molecular monitoring35, while AML requires complex decision trees dependent on molecular subtype, induction response, and consolidation eligibility36.
This pairing allowed us to test whether semantic overlap causes domain entanglement and whether therapeutic complexity affects performance. Our findings are consistent with both concerns; however, the low absolute frequency of domain entanglement events limits mechanistic inference, and we treat entanglement as a hypothesis-generating failure mode warranting targeted experimental testing (e.g., prompt ablation studies and retrieval-augmented generation comparisons). Domain entanglement occurred exclusively in AML sessions, where models recommended blinatumomab, a B-cell acute lymphoblastic leukemia-specific agent, and incorrectly attributed differentiation syndrome to standard chemotherapy, demonstrating that semantic similarity may override explicit prompt constraints32. CML sessions succeeded more often, particularly in guideline alignment and pharmacotherapeutic accuracy.
These findings generate hypotheses for other domains. Conditions requiring multi-variable conditional reasoning may pose a higher risk for clinically important errors than conditions governed by more linear algorithms; however, generalizability beyond AML/CML requires direct empirical testing across additional disease states. Conceptually, conditions requiring complex multi-variable conditional reasoning (e.g., sepsis management, advanced heart failure therapy, or complex anticoagulation strategies) may present greater susceptibility to prompt override due to their non-linear decision pathways. In contrast, semantically related conditions governed by relatively structured or linear therapeutic algorithms (e.g., Crohn’s disease vs ulcerative colitis, type 1 vs type 2 diabetes, bacterial vs viral meningitis) may theoretically demonstrate higher discrimination accuracy. Establishing cross-domain robustness would require dedicated studies across additional disease pairs.
Platform patterns revealed meaningful heterogeneity. Gemini achieved the highest overall success with consistent cross-disease performance. Claude demonstrated the highest clinical accuracy with balanced disease-type performance, which may reflect conservative, guideline-adherent generation. DeepSeek demonstrated substantial disease-dependent variation, succeeding far more often with CML than AML, potentially reflecting training data imbalances or differential sensitivity to complexity. GPT-4o demonstrated lower clinical accuracy despite adequate instructional design. This variation is consistent with prior findings of domain-dependent LLM performance, indicating the need for disease-specific validation before implementation37,38,39.
Error analysis revealed patterns that may inform oversight and development priorities. Guideline misalignment reflected apparent temporal bias toward older training data, which may suggest models revert to statistically dominant training patterns when generating extended content40. Fabricated evidence manifested as confident citations of non-existent trials, which may indicate architectures lack robust mechanisms to verify claims or signal uncertainty. Pharmacotherapeutic inaccuracies included clinically concerning recommendations such as concurrent allopurinol-rasburicase administration, potentially reflecting failures in multi-step conditional reasoning39. Domain entanglement occurred despite explicit negative constraints, suggesting possible architectural limitations in maintaining categorical boundaries. These patterns indicate that while prompt engineering can elicit pedagogical structure, clinical safety requires model-level improvements, including citation verification, uncertainty quantification, temporal weighting, and enhanced boundary preservation.
Our meta-prompt framework showed promise in maintaining domain specificity and enabled systematic evaluation by standardizing content generation across platforms and disease types41. The framework incorporated five boundary-preservation mechanisms, including disease-specific guideline anchoring, negative constraints prohibiting cross-referencing related conditions, and consistency verification requirements. High domain specificity success (99/103, 96.1%) with relatively limited entanglement (3.9% of sessions) suggests these mechanisms helped maintain therapeutic boundaries in most cases. However, the persistence of entanglement errors in complex acute myeloid leukemia cases indicates that prompt engineering alone may be insufficient to overcome architectural limitations when semantic similarity is high. Beyond enabling systematic evaluation, the template-based design allows educators to adapt the framework through variable substitution while preserving safety constraints, thereby democratizing LLM-based simulation development and offering a potential approach for research on optimal prompt architecture in medical education.
The comprehensive expert validation conducted in this study may initially appear to contradict claims of scalability for LLM-generated simulations. However, in safety-critical clinical education, the appropriate goal is augmentation of expert judgment rather than its replacement. Inaccurate pharmacotherapy recommendations embedded in training simulations risk propagating incorrect clinical reasoning patterns to learners, creating downstream patient-safety concerns. Our findings empirically support this concern: 48.5% of unmoderated sessions failed to meet minimum expert safety and accuracy criteria, yet students experiencing these sessions reported satisfaction levels statistically indistinguishable from those experiencing passing sessions, suggesting learners may not reliably detect content quality deficiencies without expert guidance. Beyond expert evaluation, established frameworks recommend incorporating learner perspectives when assessing educational technologies42. We included student satisfaction assessment to capture authentic end-user experience, allowing natural interaction without experimental constraints to identify real-world engagement patterns and implementation challenges43,44. The stratified analysis examined whether student satisfaction on specific dimensions aligned with expert evaluation of corresponding session quality. We interpret satisfaction as feasibility evidence (acceptability and usability) rather than evidence of educational effectiveness, which requires objective learning outcome measures in subsequent studies.
Stratified analyses did not detect statistically significant alignment between student satisfaction and expert-assessed content quality in this sample. Students experiencing sessions meeting expert clinical accuracy and safety standards reported similar clinical practice realism satisfaction as those experiencing failing sessions. For clinical confidence and diagnostic skills improvement, students experiencing sessions failing expert clinical reasoning fidelity criteria reported numerically higher satisfaction than those experiencing passing sessions, though differences were not statistically significant. Although these comparisons were underpowered to exclude moderate effects, the observed dissociation—positive learner experience even in sessions with expert-identified safety or guideline violations—supports the governance interpretation that student satisfaction alone is insufficient for validating clinical content quality, consistent with cognitive theory on fluency-driven illusions of competence45,46,47,48,49,50,51.
Platform-level patterns reinforced this disconnect. Gemini achieved the highest expert pass rate but moderate student satisfaction, while GPT-4o demonstrated the lowest expert pass rate with similar student satisfaction52,53. Combined with stratified analysis showing a lack of alignment between individual satisfaction and session quality, this may suggest that students base preferences on factors unrelated to clinical accuracy, such as conversational style or response elaboration, rather than expert-identified educational value or safety. Students significantly favored models for ease of use but significantly favored traditional methods for clinical practice realism. The ease of use advantage likely reflects immediate accessibility and conversational interaction, while preference for traditional methods regarding clinical realism suggests students recognized that model-generated scenarios might lack the authenticity of faculty-developed cases. This dimensional variation indicates students may discern certain experiential aspects but appear unable to evaluate clinical accuracy.
This potential inability to discriminate content quality has implications for unsupervised deployment. Students may find content appealing based on polished presentation, confident tone, and engaging narratives while potentially unable to identify underlying clinical inaccuracies, fabricated evidence, or guideline violations. High narrative plausibility combined with low pharmacotherapeutic accuracy creates conditions where learners may encounter content appearing authentic while containing clinically inappropriate recommendations. Advanced pharmacy students, despite substantial clinical knowledge, showed no significant ability to identify quality differences that experts readily detected, which may suggest clinical background alone does not protect learners and that structured oversight may be necessary even for experienced students54.
For safe implementation and future development, educators and developers face different but complementary priorities. Developers need to improve models by integrating real-time guideline updates with temporal weighting to reduce reliance on outdated patterns, implementing citation verification to prevent fabricated evidence, strengthening boundary-preservation mechanisms to limit domain entanglement, and enhancing logical consistency checks for multi-step therapeutic reasoning55,56. Educators must recognize current limitations and ensure that all LLM-generated content undergoes expert review, with particular attention to pharmacotherapeutic recommendations and guideline adherence in complex therapeutic areas.
The patterns of errors observed suggest several directions for research and development. Validation of the meta-prompt framework across diverse therapeutic domains is necessary to determine whether it can generalize beyond hematologic malignancies. Comparative studies of retrieval-augmented generation with real-time guideline access may reveal whether architectural modifications can reduce gaps in accuracy. Development of automated error detection targeting guideline violations, fabricated citations, domain entanglement, and reasoning failures could support scalable quality assurance. Multi-institutional studies with larger samples would help clarify whether observed platform differences and complexity-dependent performance patterns are robust, providing guidance for both model refinement and educational deployment. Future work should also include error clustering stratified by AML molecular subtype/risk category (e.g., FLT3/NPM1/IDH-driven decision branches) to determine whether failures concentrate in specific therapeutic pathways.
We applied non-compensatory thresholds for clinical accuracy and safety and clinical reasoning fidelity, and compensatory scoring for instructional design quality. This approach is consequence-based: in clinically oriented domains, compensatory averaging can mask safety-critical deficiencies, a concern reflected in patient safety–oriented standard setting and mastery learning frameworks that treat critical actions as non-compensatory requirements57,58. In contrast, instructional design elements may compensate for one another within a minimum-quality floor without introducing direct patient-safety risk5,59. These thresholds should be interpreted as conservative deployment criteria for this Phase 1 validation stage rather than as a claim that all clinical errors have equivalent severity. Accordingly, we report a structured error taxonomy (Table 1), distinguishing failure modes with potentially different consequences. Future work should develop severity-weighted scoring and explicit “never-event” catalogs through formal expert consensus (e.g., modified Delphi) and psychometric calibration (e.g., Rasch modeling or item response theory), while retaining non-compensatory gating for errors judged unacceptable for learner exposure60.
This study has several limitations. Sample sizes per platform were sufficient for overall characterization but offered limited power to detect small differences, making platform comparisons exploratory. Single-institution implementation and evaluation limited to two hematologic malignancies constrain external generalizability. Although differences in therapeutic complexity may conceptually influence model performance—particularly in conditions requiring multi-variable conditional reasoning—such extrapolation remains hypothetical and requires empirical validation across additional disease domains. While our design ensures a focused evaluation of each model, it precludes direct learner-based comparative analysis. Expert reviewers were not blinded to platform identity, but structured rubrics and high inter-rater reliability may have reduced bias. Conservative pass-fail thresholds prioritized safety but require further empirical evaluation. The evaluation framework and meta-prompt require formal validation, and stratified analyses had limited power, particularly for clinical reasoning domains, which may have contributed to non-significant findings despite meaningful numerical trends.
Moreover, the review intensity used in this study reflects research-grade benchmarking intended to characterize failure modes and support early validity evidence for the rubric. In this dataset, 48.5% of sessions failed to meet minimum criteria for clinical accuracy and safety, while student satisfaction ratings were statistically similar between sessions that passed and failed expert criteria. Scalability should therefore be judged against the appropriate comparator workflow: LLM-assisted draft generation with structured expert verification versus traditional simulation development; the study did not quantify time or cost differences between the two. Drawing on patient-safety approaches to standard setting and critical-action gating, severity-stratified medication safety systems, and implementation governance considerations, we propose a risk-stratified quality assurance approach in which review intensity is calibrated to clinical consequence50,58,60,61,62,63; operational deployment would likely require such risk-stratified review rather than uniform research-grade scrutiny.
Additionally, because model behavior may change over time in proprietary web interfaces, replication using the same prompts and rubric at future time points is required to assess the temporal stability of observed failure modes. Unfortunately, this type of version drift is now typical for continuously deployed proprietary LLMs across both web and API access routes, so transparent reporting and longitudinal re-testing are currently the most practical mitigation strategies when version-pinned snapshots are unavailable64. Future work should also consider benchmarking smaller locally deployable models (or institution-hosted open-weight models) that can be maintained as frozen snapshots, enabling stronger reproducibility and implementation governance than continuously updated proprietary endpoints64.
A further key limitation is the expert panel composition (three co-authors from one institution involved in rubric and meta-prompt development), which may introduce confirmation bias. The reported reliability reflects internal scoring consistency rather than independent external validation; stronger support for independence requires replication with external and/or blinded raters. Phase 2 work should incorporate independent external clinical experts, ideally blinded to platform identity and study hypotheses, to strengthen extrapolation and generalizability inferences65,66. In addition, we did not perform prompt ablation experiments; therefore, we cannot attribute observed performance to any specific safeguard mechanism within the meta-prompt. DeepSeek’s marked disease-dependent variation suggests platform-specific sensitivity to therapeutic complexity or training data distribution; explaining these effects would require model-level access not available in web-interface evaluations.
Finally, because each student used only one platform, within-subject platform comparisons were not feasible in this Phase 1 content validation study, which prioritized independent expert evaluation of diverse platform outputs over learner-centered comparative usability testing. Future studies should evaluate multiple platforms per learner using counterbalanced designs with temporal spacing, disease context rotation, and burden management to strengthen evidence about individual preference patterns while controlling for carryover learning effects and fatigue. Additionally, evaluating LLM simulations across multiple professional years may clarify prerequisite knowledge thresholds and the optimal curricular timing for LLM-assisted simulation within vertically and horizontally integrated curricula.
In conclusion, using controlled meta-prompting, sessions more frequently met criteria for instructional design and clinical reasoning than for pharmacotherapeutic accuracy and guideline alignment, with performance varying by platform and disease context. Expert oversight with platform-specific and disease-specific validation remains essential for safe educational deployment, and effectiveness trials assessing objective learning outcomes represent necessary subsequent work.
Methods
Study design
This was a Phase 1 mixed-methods study evaluating educational materials generated during routine curricular activities from March 15 to April 10, 2025, at a single institution67. The study aimed to characterize LLM performance in generating pharmacotherapy simulations and to assess student perceptions of this learning modality, as outlined in the five-phase evaluation framework (Fig. 5). All clinical evaluations were benchmarked against 2022 National Comprehensive Cancer Network/European LeukemiaNet (NCCN/ELN)36 guidelines for AML and 2017 European Society for Medical Oncology/European LeukemiaNet (ESMO/ELN) guidelines for CML35, which served as the reference standards for assessing clinical accuracy and guideline adherence. This study was approved by the Research Ethics Committee of the Faculty of Pharmacy, Cairo University (Approval ID: CL 3931). All participants provided informed consent.
Five-phase systematic assessment of LLM-generated AML/CML simulations: (1–2) Expert panel established integrated pedagogical and clinical guidelines. (3) Guidelines encoded into a standardized meta-prompt with AI role, case structure, interaction, and assessment modules. (4) Deployment across four LLM platforms (ChatGPT, Claude, DeepSeek, Gemini). (5) Evaluation using HSSOBP, SDS, CRER, and SET-M. LLM large language model, AML acute myeloid leukemia, CML chronic myeloid leukemia, HSSOBP Healthcare Simulation Standards of Best Practice, SDS simulation design scale, CRER clinical reasoning evaluation rubric, SET-M simulation effectiveness tool–modified.
Framework development and meta-prompt engineering
LLMs demonstrate substantial sensitivity to prompt design, with well-structured prompts shown to reduce hallucination and improve performance16. However, effective prompt engineering typically requires specialized technical expertise that may limit accessibility for educators. To address this barrier while maintaining rigor, we developed a systematic meta-prompt framework that bridges clinical domain expertise with prompt engineering principles41. This approach enables educators to generate high-quality pharmacotherapy simulations through structured variable substitution rather than de novo prompt development, providing a transparent and replicable method adaptable across educational settings.
Development proceeded in three stages. In the first stage, a multidisciplinary team consisting of a clinical pharmacy educator (AA), board-certified oncology pharmacist (AZ), and clinical pharmacy researcher (AF) defined foundational principles for simulation design informed by scaffolding theory and cognitive load theory (Supplementary A)4,5,6,68,69,70. In the second stage, these principles were translated into a universal meta-prompt template with variable placeholders for leukemia type, specialty, and guideline source (Supplementary B). The template defined the model’s role as clinical pharmacy professor, structured learner engagement through scaffolded progression aligned with Bloom’s Taxonomy41, and required strict guideline adherence across diagnostic, therapeutic, and follow-up scenarios. Five safeguard mechanisms were embedded to minimize domain entanglement and other clinical errors: (1) disease-specific guideline anchoring, (2) negative constraints prohibiting reference to related conditions, (3) boundary reinforcement directing limitation statements when uncertain, (4) citation requirements linking recommendations to specific guideline sections, and (5) consistency verification before each recommendation.
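The variable-substitution step described above can be sketched in a few lines. The placeholder names and prompt wording below are illustrative inventions, not the exact fields of the published meta-prompt (see Supplementary B for the full syntax).

```python
from string import Template

# Illustrative sketch of structured variable substitution; the field names
# (leukemia_type, specialty, guideline_source) and wording are hypothetical,
# not the published meta-prompt text.
META_PROMPT = Template(
    "You are a clinical pharmacy professor running an interactive "
    "$leukemia_type simulation for $specialty learners. Anchor every "
    "diagnostic, therapeutic, and follow-up recommendation strictly to "
    "$guideline_source, cite the relevant guideline section for each "
    "recommendation, and state your limitations rather than speculate "
    "when the guideline is silent."
)

cml_prompt = META_PROMPT.substitute(
    leukemia_type="chronic myeloid leukemia (CML)",
    specialty="clinical pharmacy",
    guideline_source="the 2017 ESMO/ELN CML guidelines",
)
```

Educators can then generate a new disease context by changing only the three substituted values, leaving the pedagogical scaffolding untouched.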
In the third stage, final prompts were generated using Google AI Studio (Gemini 2.0 Pro), selected for its extended context window, stable output, and academic licensing (Supplementary B)71,72. Subject-matter experts reviewed and standardized these prompts to remove platform-specific phrasing before deployment across all four platforms.
Participants and randomization
We recruited 104 fourth-year Doctor of Pharmacy students after obtaining informed consent. Participants were initially randomized to one of two leukemia contexts (AML or CML) for their LLM-based simulation. To minimize sequence effects and ensure balanced academic preparedness across study arms, students were further randomized in a counterbalanced manner to receive either the LLM simulation first, followed by a traditional case-based session, or vice versa. Within each leukemia context, students assigned to the LLM condition were randomized to one of four platforms (ChatGPT GPT-4o73, Claude 3.7 Sonnet74, Gemini 2.0 Pro71, or DeepSeek V275), yielding approximately 13 participants per platform–disease combination. Each student completed both leukemia contexts overall (one via LLM simulation and one via traditional case study); however, only one LLM transcript per student was included in the final analysis to minimize carryover learning effects.
The study was conducted within a traditional (discipline-based) curriculum where clinical therapeutics is taught as the capstone of a longitudinal science sequence. To ensure adequate academic preparedness, all participating fourth-year PharmD students had completed a prerequisite series of at least two pharmacology courses. This sequential scaffolding provides vertical integration, bridging foundational drug science with the advanced clinical reasoning required for complex hematologic oncology management.
Deployment procedures
Students received 30-min training covering study protocol, procedures for initiating new sessions without prior conversation history to avoid context carryover, and documentation instructions. They deployed provided prompts on personal devices using publicly accessible web interfaces. ChatGPT GPT-4o was accessed via chat.openai.com76, Claude 3.7 Sonnet via claude.ai77, Gemini 2.0 Pro Experimental 0125 via aistudio.google.com72, and DeepSeek V2 via chat.deepseek.com78. All platforms used default web interface settings, including standard temperature parameters, no retrieval augmentation, and no custom instructions.
Students interacted naturally with platforms without experimental constraints, prioritizing external validity to enable identification of authentic implementation failure modes43,44. Participants exported sessions using native platform functions, while DeepSeek users saved full webpages. All session transcripts were pasted into structured forms with timestamped screenshots and session metadata to ensure complete documentation79.
Traditional case-based sessions mirrored the clinical complexity and modular design of the LLM simulations. Students received unique patient cases followed by self-learning questions and complete guideline documents35,36. They submitted answers with highlighted screenshots of cited references documenting evidence-based reasoning. Cases covered diverse AML and CML contexts, including varied patient demographics, disease subtypes, and molecular profiles. Essay-style questions guided students to interpret diagnostics, apply guideline reasoning, and propose management strategies. After expert review of all submitted materials, a reconciliation session was held with participants to identify and correct any clinical inaccuracies, preventing the propagation of misinformation. AML and CML were selected as a focused proof-of-concept pair because of their shared hematologic classification yet distinct therapeutic algorithms; this pairing was intended to probe potential domain entanglement within closely related malignancies rather than to establish generalizable conclusions across disease categories.
For transparency despite web-interface version drift, we report platform access routes, default interface settings, and full prompt syntax (Supplementary B), consistent with MI-CLEAR-LLM reporting guidance80. While backend updates cannot be controlled in web interfaces, this study provides benchmarking under ecologically valid access conditions and enables future longitudinal replication using identical prompts and the same rubric.
Performance assessment framework
We adapted four established instruments for LLM evaluation contexts: the Healthcare Simulation Standards of Best Practice (HSSOBP)81 for simulation design quality, the Simulation Design Scale (SDS)82 for design element assessment, the Clinical Reasoning Evaluation Rubric (CRER)83 for reasoning process evaluation, and the Simulation Effectiveness Tool–Modified (SET-M)84 for learner-reported effectiveness. The expert panel reviewed all items for face and content validity, calculating Content Validity Indices using established methods85. Items achieving item-level Content Validity Index (CVI) ≥ 0.78 were retained, while those below were revised or excluded. Final instruments achieved scale-level CVI values ≥ 0.90, indicating excellent content validity. The adaptation process addressed fundamental challenges in evaluating LLM-generated content, including the absence of real-time facilitation and the need to assess algorithmic rather than human reasoning processes; specific adaptations for each instrument are detailed in Table 2.
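For readers reproducing the item-retention step, the Polit and Beck indices85 reduce to simple proportions: an item-level CVI is the share of experts rating the item relevant (3 or 4 on a 4-point scale), and the scale-level average is the mean of the item-level indices. The ratings below are invented for illustration.

```python
# Content Validity Index sketch (per Polit & Beck); ratings are
# invented examples from a hypothetical three-member panel.
def item_cvi(ratings):
    """Proportion of experts rating the item 3 or 4 on a 4-point scale."""
    return sum(r >= 3 for r in ratings) / len(ratings)

def scale_cvi_ave(item_ratings):
    """Scale-level CVI as the average of item-level CVIs (S-CVI/Ave)."""
    return sum(item_cvi(r) for r in item_ratings) / len(item_ratings)

items = [
    [4, 4, 3],  # I-CVI = 1.00 -> retained (>= 0.78)
    [4, 3, 2],  # I-CVI = 0.67 -> revised or excluded (< 0.78)
    [3, 4, 4],  # I-CVI = 1.00 -> retained
]
i_cvis = [item_cvi(r) for r in items]
retained = [r for r, cvi in zip(items, i_cvis) if cvi >= 0.78]
```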
The adapted instruments created a three-domain framework encompassing twelve subdomains: instructional design quality (learning objective clarity, instructional framing, scaffolding quality, embedded feedback quality); clinical accuracy and safety (clinical narrative plausibility, guideline alignment, pharmacotherapeutic accuracy, domain specificity); and clinical reasoning fidelity (clinical prioritization, knowledge integration, evidence-based decision-making, outcomes modeling).
Student satisfaction was assessed using eight 5-point Likert items adapted from validated instruments84. Students completed satisfaction questionnaires immediately after both traditional and LLM-based learning modalities, with comparative items addressing clinical confidence building, diagnostic skills improvement, time efficiency, and future use intentions. Internal consistency was evaluated using Cronbach’s alpha86. Complete evaluation rubrics, validation indices, and error taxonomy are provided in Supplementary C.
Expert evaluation procedures
The three-member expert panel conducted independent evaluations of each session transcript using the adapted instruments described in the “Performance assessment framework” section. Each subdomain was rated on a 1-to-5 scale. Inter-rater reliability was assessed using Krippendorff’s alpha for ordinal ratings, with α ≥ 0.80 interpreted as strong agreement. Krippendorff’s α provides evidence of rating consistency but does not establish independence from shared systematic bias; stronger support for independence requires replication using external and/or blinded raters31. We therefore frame this work as Phase 1 content validation, consistent with staged approaches in instrument development in health professions education, where internal expert panels commonly define content boundaries and stabilize scoring prior to external replication87,88,89,90. Subdomain scores were calculated as the mean of independent ratings across the three reviewers. To minimize potential bias from non-blinding to platform identity, the rubrics emphasized objective, verifiable clinical criteria.
Given the absence of established standards for evaluating AI-generated medical education simulations, we developed domain-specific pass-fail criteria informed by our rubric’s scoring definitions and simulation assessment literature. We employed a categorical approach requiring each subdomain to meet minimum performance thresholds rather than averaging scores across subdomains, thereby preventing high performance in one area from masking critical deficiencies in another.
We applied domain-specific pass-fail criteria. Clinical accuracy and safety required all four subdomains to achieve mean scores ≥4.0 with no exceptions permitted (non-compensatory scoring), consistent with the Healthcare Simulation Standards of Best Practice (HSSOBP)81. Clinical reasoning fidelity similarly required all four subdomains ≥4.0 without compensation, reflecting the expectation that simulations model expert-level reasoning processes91,92. In contrast, instructional design quality used compensatory scoring based on Chen et al.’s criteria59 (all subdomains >3.0 with an overall domain mean >4.0), allowing strengths in some instructional elements to offset modest weaknesses in others while maintaining a minimum-quality floor59,93. Sessions were classified as successful only when all three domains met criteria simultaneously94. We report a structured error taxonomy (Table 1) distinguishing among guideline misalignment, pharmacotherapeutic inaccuracies, fabricated evidence citations, and domain entanglement.
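This gating logic can be stated compactly in code; the subdomain means below are hypothetical examples, not study data.

```python
# Domain-level pass/fail gating as described above; example means are invented.
def passes_noncompensatory(subdomain_means, floor=4.0):
    """Clinical accuracy/safety and reasoning fidelity: every subdomain >= 4.0."""
    return all(m >= floor for m in subdomain_means)

def passes_compensatory(subdomain_means, floor=3.0, domain_mean=4.0):
    """Instructional design: all subdomains > 3.0 AND overall mean > 4.0."""
    means = list(subdomain_means)
    return all(m > floor for m in means) and sum(means) / len(means) > domain_mean

def session_successful(accuracy, reasoning, design):
    """A session passes only when all three domains meet criteria simultaneously."""
    return (passes_noncompensatory(accuracy)
            and passes_noncompensatory(reasoning)
            and passes_compensatory(design))

ok = session_successful(
    accuracy=[4.3, 4.1, 4.0, 4.5],
    reasoning=[4.2, 4.0, 4.1, 4.4],
    design=[4.6, 3.9, 4.2, 4.1],   # one subdomain below 4.0 is tolerated here
)
```

Note that a design subdomain of 3.9 is tolerated because the domain mean (4.2) clears the floor, whereas the same 3.9 in the accuracy domain would fail the whole session.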
Statistical analysis
Our sample size provided 80% power to detect at least a 30-percentage-point difference in LLM performance between leukemia types at a two-sided α = 0.0595. The primary endpoint was the overall success rate, defined as the proportion of sessions meeting passing criteria across all three domains, reported with exact binomial 95% confidence intervals95. Domain-specific success rates were estimated overall and stratified by platform and leukemia type using Wilson score confidence intervals. Error frequencies were quantified by platform and disease type to identify systematic failure patterns.
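The Wilson score interval used for stratified estimates has a closed form computable with the standard library alone. The sketch below applies it to the overall counts reported above (53 of 103 sessions); the exact (Clopper–Pearson) interval used for the primary endpoint additionally requires a beta-distribution inverse such as scipy.stats.beta.ppf and is omitted here.

```python
from math import sqrt

def wilson_ci(successes, n, z=1.959964):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Overall success rate reported in this study: 53 of 103 sessions.
lo, hi = wilson_ci(53, 103)
```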
Comparisons between AML and CML used Fisher’s exact tests with risk ratios and 95% confidence intervals. Following standard guidance, Fisher’s exact test was selected whenever any expected cell count was below five, ensuring valid inference in small-sample categorical comparisons; chi-square tests were used otherwise.
Platform-level differences in overall success rates were assessed using chi-square tests, and within-platform disease differences were evaluated using Fisher’s exact tests.
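For illustration, the two-sided Fisher’s exact p-value and the accompanying risk ratio can be reproduced with standard-library arithmetic. The 2 × 2 counts below (33/53 CML vs 20/50 AML successes) are those implied by the success rates reported in the Results, not an independent tabulation.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p for the 2x2 table [[a, b], [c, d]]:
    the sum of hypergeometric table probabilities no larger than the
    observed table's probability (the rule SciPy also uses)."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):  # P(top-left cell = k) with all margins fixed
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo_k, hi_k = max(0, col1 - row2), min(row1, col1)
    return sum(prob(k) for k in range(lo_k, hi_k + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# CML vs AML overall success, counts implied by the reported rates.
p = fisher_exact_two_sided(33, 20, 20, 30)
risk_ratio = (33 / 53) / (20 / 50)
```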
Student satisfaction was analyzed by comparing mean scores on each of the eight dimensions against the neutral midpoint (μ = 3.0) using one-sample t-tests, consistent with established practices for Likert scale analysis in educational research96,97. In addition, we examined whether student satisfaction varied according to expert-assessed content quality. For three conceptually related domain-dimension pairs (clinical accuracy and safety with clinical practice realism; clinical reasoning fidelity with both clinical confidence and diagnostic skills improvement), we calculated mean student dimension scores stratified by whether their corresponding sessions passed or failed expert evaluation criteria.
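A sketch of the midpoint comparison follows, using invented Likert responses and a normal approximation to the t distribution. The approximation is adequate at this study’s sample sizes but is not exactly what SciPy’s ttest_1samp computes.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_test_vs_midpoint(scores, mu=3.0):
    """t statistic for H0: mean = mu, with a two-sided p-value from the
    standard normal (a large-sample stand-in for the t distribution)."""
    n = len(scores)
    t = (mean(scores) - mu) / (stdev(scores) / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Invented 5-point Likert responses for illustration only.
scores = [4, 5, 4, 3, 5, 4, 4, 3, 5, 4, 2, 4, 5, 4, 4, 3]
t, p = one_sample_test_vs_midpoint(scores)
```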
Statistical significance was defined as α = 0.05. Multiple comparisons were corrected using the Benjamini–Hochberg procedure98. All analyses were conducted in Python 3.12 (SciPy 1.11, Statsmodels 0.14).
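The Benjamini–Hochberg adjustment can also be reproduced without statsmodels; the stdlib sketch below should match the adjusted p-values returned by multipletests(..., method='fdr_bh') on the invented example.

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value downward
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values for illustration.
adj = benjamini_hochberg([0.010, 0.040, 0.030, 0.005])
```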
Data availability
All datasets generated and/or analyzed during the current study are available in the supplementary material.
Code availability
Not applicable.
References
Gharib, A. M., Bindoff, I. K., Peterson, G. M. & Salahudeen, M. S. Computer-based simulators in pharmacy practice education: a systematic narrative review. Pharmacy 11, 8 (2023).
Korayem, G. B. et al. Simulation-based education implementation in pharmacy curriculum: a review of the current status. Adv. Med. Educ. Pract. 13, 649–660 (2022).
Seybert, A. L. et al. Evidence for simulation in pharmacy education. J. Am. Coll. Clin. Pharm. 2, 686–692 (2019).
Reiser, B. J. Scaffolding complex learning: the mechanisms of structuring and problematizing student work. J. Learn. Sci. 13, 273–304 (2004).
Van Merriënboer, J. J. G. & Sweller, J. Cognitive load theory and complex learning: recent developments and future directions. Educ. Psychol. Rev. 17, 147–177 (2005).
Mesquita, A. R. et al. Developing communication skills in pharmacy: a systematic review of the use of simulated patient methods. Patient Educ. Couns. 78, 143–148 (2010).
Lin, K., Travlos, D. V., Wadelin, J. W. & Vlasses, P. H. Simulation and introductory pharmacy practice experiences. Am. J. Pharm. Educ. 75, 1–9 (2011).
Vyas, D., Bray, B. S. & Wilson, M. N. Use of simulation-based teaching methodologies in US colleges and schools of pharmacy. Am. J. Pharm. Educ. 77, 53 (2013).
Cook, D. A. Creating virtual patients using large language models: scalable, global, and low cost. Med. Teach. 47, 40–42 (2025).
Cook, D. A. et al. Virtual patients using large language models: scalable, contextualized simulation of clinician-patient dialogue with feedback. J. Med. Internet Res. 27, e68486 (2025).
Brügge, E. et al. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial. BMC Med. Educ. 24, 1391 (2024).
Mehan, N., Desinghe, T. D. & Saha, A. Development and evaluation of large-language models (LLMs) for oncology: a scoping review. PLoS Digit. Health. https://doi.org/10.1371/journal.pdig.0000980 (2025).
Kung, T. H. et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLoS Digit. Health 2, e0000198 (2023).
Safranek, C. W., Sidamon-Eristoff, A. E., Gilson, A. & Chartash, D. The role of large language models in medical education: applications and implications. JMIR Med. Educ. 9, e50945 (2023).
Sharma, S., Mittal, P., Kumar, M. & Bhardwaj, V. The role of large language models in personalized learning: a systematic review of educational impact. Discov. Sustain. 6, 1–24 (2025).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Alkalbani, A. M. et al. A systematic review of large language models in medical specialties: applications, challenges and future directions. Information 16, 489 (2025).
Abramski, K., Improta, R., Rossetti, G. & Stella, M. The “LLM world of words” English free association norms generated by large language models. Sci. Data 12, 803 (2025).
Yazdani, S., Henry, R. C., Byrne, A. & Henry, I. C. Utility of word embeddings from large language models in medical diagnosis. J. Am. Med. Inform. Assoc. 32, 526–534 (2025).
Sauder, M. et al. Exploring generative artificial intelligence-assisted medical education: assessing case-based learning for medical students. Cureus 16, e51961 (2024).
Patil, N. G., Kou, N. L., Baptista-Hon, D. T. & Monteiro, O. Artificial intelligence in medical education: a practical guide for educators. MedComm Futur. Med. 4, e70018 (2025).
Shaw, K., Henning, M. A. & Webster, C. S. Artificial intelligence in medical education: a scoping review of the evidence for efficacy and future directions. Med. Sci. Educ. 35, 1803–1816 (2025).
Ahsan, Z. Integrating artificial intelligence into medical education: a narrative systematic review of current applications, challenges, and future directions. BMC Med. Educ. 25, 1187 (2025).
Benary, M. et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw. Open 6, E2343689 (2023).
Shimony, S., Stahl, M. & Stone, R. M. Acute myeloid leukemia: 2025 update on diagnosis, risk-stratification, and management. Am. J. Hematol. 100, 860–891 (2025).
Jabbour, E. & Kantarjian, H. Chronic myeloid leukemia: 2025 update on diagnosis, therapy, and monitoring. Am. J. Hematol. 99, 2191–2212 (2024).
DiNardo, C. D. & Wei, A. H. How I treat acute myeloid leukemia in the era of new drugs. Blood 135, 85–96 (2020).
Döhner, H. et al. Diagnosis and management of AML in adults: 2017 ELN recommendations from an international expert panel. Blood 129, 424–447 (2017).
Hochhaus, A. et al. European LeukemiaNet 2020 recommendations for treating chronic myeloid leukemia. Leukemia 34, 966–984 (2020).
Krippendorff, K. Content Analysis: An Introduction to Its Methodology. https://doi.org/10.4135/9781071878781 (Sage, 2019).
FDA. FDA Approves Blinatumomab as Consolidation for CD19-positive Philadelphia Chromosome-negative B-cell Precursor Acute Lymphoblastic Leukemia (FDA, 2024).
Astellas Pharma Global Development, Inc. NCT02997202. A trial of the FMS-like tyrosine kinase 3 (FLT3) inhibitor gilteritinib administered as maintenance therapy following allogeneic transplant for patients with FLT3/internal tandem duplication (ITD) acute myeloid leukemia (AML). Astellas Pharma Global Development, Inc. https://clinicaltrials.gov/show/NCT02997202 (2016).
Levis, M. J. et al. Gilteritinib as post-transplant maintenance for AML with internal tandem duplication mutation of FLT3. J. Clin. Oncol. 42, 1766–1775 (2024).
Hochhaus, A. et al. Chronic myeloid leukaemia: ESMO clinical practice guidelines for diagnosis, treatment and follow-up. Ann. Oncol. 28, iv41–iv51 (2017).
Döhner, H. et al. Diagnosis and management of AML in adults: 2022 recommendations from an international expert panel on behalf of the ELN. Blood 140, 1345–1377 (2022).
Mavrych, V., Yousef, E. M., Yaqinuddin, A. & Bolgova, O. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Med. Educ. Online 30, 2534065 (2025).
Dinc, M. T., Bardak, A. E., Bahar, F. & Noronha, C. Comparative analysis of large language models in clinical diagnosis: performance evaluation across common and complex medical cases. JAMIA Open 8, ooaf055 (2025).
Wilhelm, T. I., Roos, J. & Kaczmarczyk, R. Large language models for therapy recommendations across 3 clinical specialties: comparative study. J. Med. Internet Res. 25, e49324 (2023).
Zhu, C. et al. Is your LLM outdated? A deep look at temporal generalization. In Proc. 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 7433–7457. https://doi.org/10.18653/v1/2025.naacl-long.381 (Association for Computational Linguistics (ACL), 2025).
Suzgun, M. & Kalai, A. T. Meta-prompting: enhancing language models with task-agnostic scaffolding. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.12954 (2024).
Lai, J. W. M., De Nobile, J., Bower, M. & Breyer, Y. Comprehensive evaluation of the use of technology in education—validation with a cohort of global open online learners. Educ. Inf. Technol. 27, 9877–9911 (2022).
Lai, J. W. M. & Bower, M. How is the use of technology in education evaluated? A systematic review. Comput. Educ. 133, 27–42 (2019).
Sun, Y. & Liu, F. Real-world implementation of an AI learning tool-MetaGP-Edu in medical education: a multi-center cohort study. Comput. Educ. 237, 105388 (2025).
Dunlosky, J. et al. Improving students’ learning with effective learning techniques: promising directions from cognitive and educational psychology. Psychol. Sci. Public Interest Suppl. 14, 4–58 (2013).
Finn, B. & Metcalfe, J. Judgments of learning are influenced by memory for past test. J. Mem. Lang. 58, 19–34 (2008).
Sanchez, C. A. & Wiley, J. An examination of the seductive details effect in terms of working memory capacity. Mem. Cogn. 34, 344–355 (2006).
Rey, G. D. A review of research and a meta-analysis of the seductive detail effect. Educ. Res. Rev. 7, 216–237 (2012).
Bjork, E. L. & Bjork, R. A. Making things hard on yourself, but in a good way: creating desirable difficulties to enhance learning. Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society, 55–64 (2009).
Koriat, A. & Bjork, R. A. Illusions of competence in monitoring one’s knowledge during study. J. Exp. Psychol. Learn. Mem. Cogn. 31, 187–194 (2005).
Harp, S. F. & Mayer, R. E. How seductive details do their damage: a theory of cognitive interest in science learning. J. Educ. Psychol. 90, 414–434 (1998).
Almulla, M. A. Investigating influencing factors of learning satisfaction in AI ChatGPT for research: university students perspective. Heliyon 10, e32220 (2024).
Suchanek, P. & Kralova, M. Generative artificial intelligence expectations and experiences in management education: ChatGPT use and student satisfaction. J. Innov. Knowl. 10, 100781 (2025).
Elsayed, H. The impact of hallucinated information in large language models on student learning outcomes: a critical examination of misinformation risks in AI-assisted education. Northern reviews on algorithmic research. Theor. Comput. Complex 9, 11–23 (2024).
Ahn, S. A guide to evade hallucinations and maintain reliability when using large language models for medical research: a narrative review. Ann. Pediatr. Endocrinol. Metab. 30, 115–118 (2025).
Moëll, B. & Sand Aronsson, F. Harm reduction strategies for thoughtful use of large language models in the medical domain: perspectives for patients and clinicians. J. Med. Internet Res. 27, e75849 (2025).
McGaghie, W. C. et al. Does simulation-based medical education with deliberate practice yield better results than traditional clinical education? A meta-analytic comparative review of the evidence. Acad. Med. 86, 706–711 (2011).
Yudkowsky, R. et al. A patient safety approach to setting pass/fail standards for basic procedural skills checklists. Simul. Healthc. 9, 277–282 (2014).
Chen, Y. et al. The effect of high-fidelity simulation in medical nursing based on the healthcare simulation standards of best practice. Nurs. Commun. 8, e2024020 (2024).
The Agency for Healthcare Research and Quality Patient Safety Network (AHRQ PSNet). The National Coordinating Council for Medication Error Reporting and Prevention (NCCMERP). http://www.nccmerp.org/aboutMedErrors.html (2022).
ECA Academy. EU GMP Annex 22 (Draft 2025): Artificial Intelligence. https://www.gmp-compliance.org/guidelines/gmp-guideline/eu-gmp-annex-22-draft-2025-artificial-intelligence (2025).
Food and Drug Administration/U.S. Department of Health and Human. Clinical Decision Support Software (FDA, 2026).
The Agency for Healthcare Research and Quality Patient Safety Network (AHRQ PSNet). The Institute for Safe Medication (ISMP). https://psnet.ahrq.gov/issue/institute-safe-medication-practices (2019).
Chen, L., Zaharia, M. & Zou, J. How is ChatGPT’s behavior changing over time? Harvard Data Sci. Rev. https://doi.org/10.1162/99608f92.5317da47 (2024).
Messick, S. Standards of validity and the validity of standards in performance assessment. Educ. Meas. Issues Pract. 14, 5–8 (1995).
Kane, M. T. Validating the interpretations and uses of test scores. J. Educ. Meas. 50, 1–73 (2013).
Mertens, D. M. Mixed Methods Design in Evaluation. https://doi.org/10.4135/9781506330631 (Sage, 2018).
Seybert, A. L. & Barton, C. M. Simulation-based learning to teach blood pressure assessment to doctor of pharmacy students. Am. J. Pharm. Educ. 71, 48 (2007).
Kiersma, M. E. et al. Development of the 2025 ACPE accreditation standards leading to the Doctor of Pharmacy degree. Am. J. Pharm. Educ. https://doi.org/10.1016/j.ajpe.2024.101348 (2025).
Mackler, E. et al. 2018 Hematology/oncology pharmacist association best practices for the management of oral oncolytic therapy: pharmacy practice standard. J. Oncol. Pract. 15, e346–e355 (2019).
Pichai, S., Hassabis, D. & Kavukcuoglu, K. Google Introduces Gemini 2.0: A New AI Model for the Agentic Era. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#building-responsibly (2024).
Google AI Studio. Chat. https://aistudio.google.com/prompts/new_chat (2025).
OpenAI et al. GPT-4o system card. https://openai.com/index/gpt-4o-system-card/ (2024).
Anthropic. Claude 3.7 Sonnet and Claude Code. https://www.anthropic.com/news/claude-3-7-sonnet (2025).
DeepSeek-AI et al. DeepSeek-V2: a strong, economical, and efficient mixture-of-experts language model. https://arxiv.org/abs/2405.04434 (2024).
ChatGPT. https://chatgpt.com/ (accessed 15 March 2025).
Claude. https://claude.ai/new (accessed 15 March 2025).
DeepSeek. https://chat.deepseek.com/ (accessed 15 March 2025).
Wu, F., Dang, Y. & Li, M. A systematic review of responses, attitudes, and utilization behaviors on generative AI for teaching and learning in higher education. Behav. Sci. 15, 467 (2025).
Park, S. H. et al. Minimum reporting items for clear evaluation of accuracy reports of large language models in healthcare (MI-CLEAR-LLM). Korean J. Radiol. 25, 865–868 (2024).
Watts, P. I. et al. Healthcare Simulation Standards of Best Practice™ simulation design. Clin. Simul. Nurs. 58, 14–21 (2021).
Hughes, K. Simulation in nursing education. Am. J. Nurs. 118, 13 (2018).
Furze, J. et al. Clinical reasoning: development of a grading rubric for student assessment. J. Phys. Ther. Educ. 29, 34–45 (2015).
Leighton, K., Ravert, P., Mudra, V. & Macintosh, C. Updating the simulation effectiveness tool: item modifications and reevaluation of psychometric properties. Nurs. Educ. Perspect. 36, 317–323 (2015).
Polit, D. F. & Beck, C. T. The content validity index: are you sure you know what’s being reported? Critique and recommendations. Res. Nurs. Health 29, 489–497 (2006).
Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika 16, 297–334 (1951).
Welsh, D. et al. Development of the barriers to error disclosure assessment tool. J. Patient Saf. 17, 363–374 (2021).
Acker, R. C. et al. Belonging in surgery: a validated instrument and single institutional pilot. Ann. Surg. 280, 345–352 (2024).
Stiell, I. G. et al. Decision rules for the use of radiography in acute ankle injuries: refinement and prospective validation. J. Am. Med. Assoc. 269, 1127–1132 (1993).
Stiell, I. G. et al. A study to develop clinical decision rules for the use of radiography in acute ankle injuries. Ann. Emerg. Med. 21, 384–390 (1992).
Lee, J. H., Park, C. G., Kim, S. H. & Bae, J. Psychometric properties of a clinical reasoning assessment rubric for nursing education. BMC Nurs. 20, 177 (2021).
Jeffries, P. R. A framework for designing, implementing, and evaluating: simulations used as teaching strategies in nursing. Nurs. Educ. Perspect. 26, 96–103 (2005).
Unver, V. et al. The reliability and validity of three questionnaires: the student satisfaction and self-confidence in learning scale, simulation design scale, and educational practices questionnaire. Contemp. Nurse 53, 60–74 (2017).
Franklin, A. E., Burns, P. & Lee, C. S. Psychometric testing on the NLN student satisfaction and self-confidence in learning, simulation design scale, and educational practices questionnaire using a sample of pre-licensure novice nurses. Nurse Educ. Today 34, 1298–1304 (2014).
Clopper, C. J. & Pearson, E. S. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26, 404–413 (1934).
Norman, G. Likert scales, levels of measurement and the ‘laws’ of statistics. Adv. Health Sci. Educ. 15, 625–632 (2010).
Sullivan, G. M. & Artino, A. R. Analyzing and interpreting data from Likert-type scales. J. Grad. Med. Educ. 5, 541–542 (2013).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B Stat. Methodol. 57, 289–300 (1995).
Acknowledgements
The authors gratefully acknowledge the contribution of all PharmD students who participated in the learning project associated with this study. Their engagement and valuable input were instrumental in supporting the development and completion of this work. The authors received no funding for this work.
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Author information
Contributions
A.A. conceptualized the study. A.F. and A.A. designed the methodology. A.F. and A.A. conducted the data collection and formal analysis. Supervision and manuscript review and editing were provided by A.Z., A.A., and A.F. A.F. wrote the original draft and created the visualizations. All authors reviewed and edited the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Farrag, A.N., El-Zeiny, A. & Ali, A.M. Evaluating large language models for pharmacotherapy simulations: a mixed-methods study. npj Digit. Med. 9, 355 (2026). https://doi.org/10.1038/s41746-026-02626-1