Abstract
Simulation-based learning is essential in clinical pharmacy education but requires substantial faculty resources that limit scalability. Large language models (LLMs) offer promise for generating scalable simulations, yet their pedagogical rigor and clinical reliability remain unclear. In a mixed-methods, counterbalanced evaluation study, PharmD students (n = 104) engaged with acute myeloid leukemia (AML) or chronic myeloid leukemia (CML) cases generated by four LLMs using expert-guided meta-prompts; these two conditions require complex longitudinal management yet share substantial semantic similarity. Expert panels evaluated sessions across clinical authenticity, instructional design, and clinical reasoning; students completed satisfaction surveys. Of 103 completed sessions, 53 (51.5%) met passing criteria across all domains. Clinical accuracy and safety emerged as the limiting domain (58.3% passing), compared with clinical reasoning (81.6%) and instructional design (82.5%). CML sessions outperformed AML sessions (62.3% vs 40.0%; p = 0.031). Platform success rates ranged from 34.5% to 62.1%. Error analysis revealed guideline misalignment, pharmacotherapeutic inaccuracies, fabricated evidence, and cross-condition therapeutic recommendations, the last occurring exclusively in AML sessions. Students favored LLMs over traditional methods (49.8% vs 30.0%); however, we did not detect statistically significant alignment between student satisfaction and expert-assessed quality. Sessions more frequently met criteria for instructional design and clinical reasoning than for pharmacotherapeutic accuracy and guideline alignment. Expert oversight with platform-specific and disease-specific validation remains essential for safe educational deployment, and effectiveness trials assessing objective learning outcomes represent necessary subsequent work.
Introduction
Simulation-based learning plays an important role in clinical pharmacy education because it allows students to practice clinical reasoning and therapeutic decision-making in safe, controlled settings1,2,3. Well-designed simulations provide structured scenarios and timely feedback that support cognitive development and progressive skill acquisition4,5,6. However, traditional simulation methods are difficult to scale, requiring substantial faculty time, specialized expertise, and institutional resources that limit broad curricular integration7,8. Large language models (LLMs) offer a promising solution by generating sophisticated, interactive clinical simulations at scale9,10,11.
Despite their potential, LLMs’ clinical accuracy and educational validity remain uncertain in specialized therapeutic areas11,12. Most existing evaluations focus on short, discrete knowledge questions rather than extended clinical scenarios that require sustained reasoning, longitudinal decision-making, and contextual adaptation13,14,15. This gap is critical because LLMs rely on statistical learning processes that can introduce systematic errors, which may go undetected without rigorous domain-specific testing16.
One notable error pattern arises from how LLMs process semantically related information. Because these models learn statistical associations from training data, conditions sharing clinical features or terminology may be inappropriately conflated in generated content17,18. Recent work has shown that word embeddings can misattribute symptoms between distinct diseases when conditions share semantic similarities, with errors stemming from tangential associations rather than direct clinical relevance19,20. In medical contexts, such conflation can inadvertently merge management strategies for conditions requiring fundamentally different therapeutic approaches, a pattern we refer to as domain entanglement19. These errors are particularly concerning in educational settings, as they often appear coherent and authoritative, potentially reinforcing inaccurate or unsafe knowledge among learners who lack sufficient clinical experience to recognize the inaccuracies21,22.
To systematically assess these vulnerabilities, rigorous evaluation requires a strategic selection of test domains that combine authentic educational applications with conditions likely to expose systematic errors23,24. Hematologic malignancies offer this combination, involving complex, evidence-based treatment algorithms, frequent protocol updates, and challenging clinical decisions that students must master before practice25,26,27,28. Within this domain, acute myeloid leukemia (AML) and chronic myeloid leukemia (CML) provide a strategically designed stress test. These conditions share myeloid cell lineage and present with overlapping clinical and laboratory features, creating semantic similarity that may challenge LLMs’ ability to maintain appropriate therapeutic boundaries. However, their management approaches differ fundamentally: AML requires time-sensitive intensive chemotherapy with consolidation decisions guided by molecular features and remission status29, while CML requires chronic oral tyrosine kinase inhibitor therapy with ongoing molecular monitoring and specific criteria for adjusting treatment30.
This pairing enables direct evaluation of whether semantic overlap causes domain entanglement. If models inappropriately recommend CML-specific tyrosine kinase inhibitors for AML patients or apply AML induction regimens to CML cases, it would demonstrate a safety-critical failure mode. Conversely, successful boundary preservation despite semantic similarity would suggest that structured prompting can mitigate this vulnerability. Additionally, the complexity gradient between these conditions (CML following relatively linear therapeutic pathways while AML requires multi-variable conditional reasoning) enables assessment of whether therapeutic complexity independently affects LLM performance, with implications for predicting performance in other therapeutic areas.
In this study, we evaluated how well LLMs generate pharmacotherapy simulations requiring accurate reasoning, safe therapeutic recommendations, and sound instructional design. Our primary aims were to (1) characterize LLM performance across instructional design quality, clinical accuracy and safety, and clinical reasoning fidelity, and (2) compare performance between AML and CML to test whether semantic similarity challenges boundary preservation while assessing whether therapeutic complexity independently affects accuracy. Our secondary aims were to (1) compare performance across four major platforms to distinguish general model capabilities from platform-specific characteristics, and (2) examine whether student satisfaction aligns with expert-rated quality to inform oversight requirements for safe educational deployment.
Results
Session characteristics and inter-rater reliability
A total of 103 sessions were evaluated (one student did not complete the study), comprising 50 AML and 53 CML simulations distributed across four platforms: Gemini (n = 29, 28.2%), GPT-4o (n = 29, 28.2%), DeepSeek (n = 23, 22.3%), and Claude (n = 22, 21.4%). Inter-rater reliability was excellent, with an overall Krippendorff’s alpha of 0.83 (95% CI: 0.724–0.875), exceeding the prespecified threshold of 0.80. Pairwise agreement varied across rater pairs (Supplementary D), with near-perfect concordance between two reviewers (κ = 0.955) and moderate-to-substantial concordance for pairs involving the educator reviewer (κ = 0.633–0.656)31. Full scoring outputs are provided in Supplementary D, and complete session transcripts are available in Supplementary E.
Overall session success rate
Of 103 sessions evaluated, 53 (51.5%; 95% CI: 41.7–61.2%) met passing criteria across all three domains simultaneously (Fig. 1). Domain-specific success rates were 60/103 (58.3%) for clinical accuracy and safety, 84/103 (81.6%) for clinical reasoning fidelity, and 85/103 (82.5%) for instructional design quality, with clinical accuracy and safety emerging as the limiting domain.
Proportion of sessions meeting domain-specific pass/fail criteria: Clinical Accuracy & Safety and Clinical Reasoning Fidelity required all subdomains ≥4.0; Instructional Design Quality required all subdomains >3.0 with mean >4.0. A Overall success across the three domains. B Comparison across LLM platforms and overall success. C Comparison by disease type. Error bars = standard error.
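The mixed gating rule in the caption (non-compensatory thresholds for the two clinical domains, compensatory scoring with a floor for instructional design) can be sketched in code. This is an illustrative sketch, not the authors' scoring software; the score lists are hypothetical inputs on the study's 1–5 rubric scale.

```python
def passes_noncompensatory(subdomain_scores, threshold=4.0):
    """Clinical accuracy & safety and clinical reasoning fidelity:
    every subdomain must reach the threshold (no averaging)."""
    return all(s >= threshold for s in subdomain_scores)

def passes_compensatory(subdomain_scores, floor=3.0, mean_cut=4.0):
    """Instructional design quality: each subdomain must exceed a
    minimum floor (> 3.0), and the subdomain mean must exceed 4.0."""
    return (all(s > floor for s in subdomain_scores)
            and sum(subdomain_scores) / len(subdomain_scores) > mean_cut)

def session_passes(accuracy_safety, reasoning, design):
    """A session passes only if all three domains pass simultaneously."""
    return (passes_noncompensatory(accuracy_safety)
            and passes_noncompensatory(reasoning)
            and passes_compensatory(design))
```

Under this rule a strong instructional subdomain can offset a middling one (within the floor), whereas a single clinical subdomain below 4.0 fails the session regardless of the other scores.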
Performance by disease type
CML sessions demonstrated higher overall success rates than AML sessions: 33/53 (62.3%) versus 20/50 (40.0%; OR (CML vs AML) = 2.50, 95% CI: 1.12–5.56; RR (CML/AML) = 1.56, 95% CI: 1.04–2.32; Cohen’s h = 0.449; p = 0.031) (Fig. 1). Domain-specific comparisons showed consistent trends favoring CML: clinical accuracy and safety CML 35/53 (66.0%) versus AML 25/50 (50.0%); clinical reasoning fidelity CML 47/53 (88.7%) versus AML 37/50 (74.0%); instructional design quality remained nearly equivalent at CML 44/53 (83.0%) versus AML 41/50 (82.0%).
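The effect sizes above can be reproduced from the reported pass counts using the standard formulas. This is a verification sketch only; small rounding differences against the paper's unrounded data are possible (e.g., the sample odds ratio works out to ≈2.48 versus the reported 2.50, presumably reflecting the estimator or rounding used).

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

cml_pass, cml_n = 33, 53   # CML sessions passing all domains
aml_pass, aml_n = 20, 50   # AML sessions passing all domains

p_cml = cml_pass / cml_n   # 0.623
p_aml = aml_pass / aml_n   # 0.400
rr = p_cml / p_aml         # risk ratio, ~1.56
odds_ratio = (cml_pass / (cml_n - cml_pass)) / (aml_pass / (aml_n - aml_pass))
h = cohens_h(p_cml, p_aml) # ~0.449
```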
At the subdomain level, the largest performance gaps occurred within clinical accuracy and safety (Fig. 2): guideline alignment, CML 40/53 (75.5%) versus AML 28/50 (56.0%); pharmacotherapeutic accuracy, CML 36/53 (67.9%) versus AML 26/50 (52.0%); and domain specificity, CML 53/53 (100.0%) versus AML 46/50 (92.0%), with domain entanglement occurring exclusively in AML sessions (4/50, 8.0% vs 0/53, 0%). This relatively low event frequency limits robust mechanistic interpretation. Clinical reasoning fidelity and instructional design quality subdomains showed minimal disease-type variation.
A Overall success rates by domain for AML (top) and CML (bottom) cases. B Subdomain success rates by domain (rows) and disease type (columns: AML left, CML right). Radial axes apply a three-tier transformation to enhance separation at high-performance levels. Platforms are ranked by overall session success. AML acute myeloid leukemia, CML chronic myeloid leukemia.
Performance by platform
Platform-level overall success rates ranged from 34.5% to 62.1%: Gemini 2.0 Pro 18/29 (62.1%, 95% CI: 44.0–77.3%), Claude 3.7 Sonnet 13/22 (59.1%, 95% CI: 38.7–76.7%), DeepSeek V2 12/23 (52.2%, 95% CI: 33.0–70.8%), and GPT-4o 10/29 (34.5%, 95% CI: 19.9–52.7%) (Fig. 1). Chi-square analysis revealed no significant platform differences (p = 0.160, Cramér’s V = 0.224). Post hoc power analysis confirmed inadequate power for all platform comparisons (5.5–39.7%).
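The four-platform comparison corresponds to a standard chi-square test of independence on the pass/fail counts. A verification sketch (requires SciPy; for a 4 × 2 table, `chi2_contingency` applies no continuity correction):

```python
import math
from scipy.stats import chi2_contingency

# Rows: Gemini, Claude, DeepSeek, GPT-4o; columns: pass, fail
table = [[18, 11],
         [13,  9],
         [12, 11],
         [10, 19]]

chi2, p, dof, expected = chi2_contingency(table)
n = sum(sum(row) for row in table)  # 103 sessions
# Cramér's V with min(rows, cols) - 1 = 1 degree of association
cramers_v = math.sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))
```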
Domain-specific analysis revealed platform strengths and weaknesses (Fig. 1). For clinical accuracy and safety, performance spanned 39.4 percentage points: Claude 17/22 (77.3%), Gemini 20/29 (69.0%), DeepSeek 12/23 (52.2%), and GPT-4o 11/29 (37.9%). For instructional design quality: DeepSeek 22/23 (95.7%), Gemini 27/29 (93.1%), GPT-4o 25/29 (86.2%), and Claude 17/22 (77.3%). For clinical reasoning fidelity: DeepSeek 20/23 (87.0%), Claude 19/22 (86.4%), Gemini 24/29 (82.8%), and GPT-4o 21/29 (72.4%).
DeepSeek demonstrated marked disease-specific performance variation: CML sessions 11/13 (84.6%) versus AML sessions 1/10 (10.0%; OR (CML vs AML) = 50.0, 95% CI: 3.85–∞; RR (CML/AML) = 8.46; p < 0.001).
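With a cell count of one, Fisher's exact test is the natural check for this sparse comparison. A sketch (SciPy's `fisher_exact` reports the sample odds ratio, 49.5 here; the paper's 50.0 presumably reflects its estimator or rounding):

```python
from scipy.stats import fisher_exact

# DeepSeek sessions: rows CML, AML; columns pass, fail
table = [[11, 2],
         [ 1, 9]]

sample_or, p = fisher_exact(table, alternative="two-sided")
rr = (11 / 13) / (1 / 10)  # risk ratio, ~8.46
```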
Subdomain-level performance patterns
Success rates varied across twelve subdomains (Fig. 2), ranging from 60.2% to 99.0%. The weakest performers were pharmacotherapeutic accuracy (62/103, 60.2%) and guideline alignment (68/103, 66.0%), both within clinical accuracy and safety. The strongest performers were problem identification (102/103, 99.0%), scaffolding quality (101/103, 98.1%), instructional framing (100/103, 97.1%), and clinical narrative plausibility (100/103, 97.1%).
Platform-specific patterns (Fig. 2) revealed GPT-4o with consistent weaknesses in pharmacotherapeutic accuracy (12/29, 41.4%) and guideline alignment (17/29, 58.6%). Claude achieved the highest clinical accuracy and safety performance (17/22, 77.3%) with balanced subdomain scores. DeepSeek demonstrated near-perfect instructional design quality (22/23, 95.7%) but moderate clinical accuracy and safety performance. Gemini showed consistent performance across all subdomain categories (Fig. 3).
A Overall success rates with standard error bars; pharmacotherapeutic accuracy and guideline alignment were the lowest-performing domains. B Platform-level heatmap showing variation in success rates, with text optimized for contrast. C Disease-type heatmap comparing CML and AML performance. Asterisks indicate significance: ***p < 0.001, **p < 0.01, *p < 0.05. AML acute myeloid leukemia, CML chronic myeloid leukemia.
Clinical error analysis
Error analysis revealed three prominent failure patterns (Table 1). Domain entanglement occurred exclusively in AML sessions (4/50, 8.0%), where therapies from related hematologic conditions were inappropriately applied—including blinatumomab (a B-ALL-specific agent) recommended for AML and differentiation syndrome incorrectly attributed to standard chemotherapy32. Fabricated evidence emerged in 9 sessions, presenting invented clinical trials with specific statistical outcomes (e.g., “MORPHO trial33,34 NEJM 2023” with false gilteritinib data) and mathematically impossible risk scoring formulas. The most frequent errors involved guideline misalignment (22 AML, 13 CML) and pharmacotherapeutic inaccuracies (24 AML, 17 CML), including concurrent allopurinol with rasburicase, premature treatment-free remission attempts, and inappropriate therapy escalations at warning response milestones.
Student satisfaction and preference-safety alignment
Student satisfaction data were obtained from 102 participants (one completed the simulation but not the survey) with excellent internal consistency (Cronbach’s α = 0.939; Supplementary D). Overall mean satisfaction score was 3.41 (SD = 1.44), significantly above the neutral midpoint of 3.0 (p < 0.001, Cohen’s d = 0.282) (Fig. 4).
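The one-sample effect size against the neutral midpoint follows the usual standardized-difference formula. A sketch using the rounded summary statistics (which give ≈0.285; the paper's d = 0.282 presumably reflects unrounded data):

```python
def cohens_d_one_sample(mean, sd, mu0=3.0):
    """Standardized distance of the sample mean from a reference value."""
    return (mean - mu0) / sd

d = cohens_d_one_sample(3.41, 1.44)  # ~0.285
```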
A Overall 5-point Likert preferences (red = traditional, gray = neutral, green-blue = LLM). B Diverging bars for eight assessment dimensions (1–2 left, 4–5 right; neutral excluded). C Satisfaction by expert session quality (PASS vs FAIL; SEM; Mann–Whitney U, FDR-corrected). D Expert pass rates (green) vs student satisfaction (blue) across four LLM platforms. SE standard error, SEM standard error of the mean, FDR false discovery rate.
Preference distribution showed 406 responses (49.8%) favoring LLMs, 165 (20.2%) neutral, and 245 (30.0%) favoring traditional methods, differing significantly from uniform expectations (p < 0.001, Cohen’s w = 0.368).
Students significantly favored LLMs for ease of use (65/102, 63.7%, 95% CI: 54.1–72.4%, Cohen’s h = 0.278, p = 0.007) and time saving (63/102, 61.8%, 95% CI: 52.1–70.6%, Cohen’s h = 0.238, p = 0.022) but significantly favored traditional methods for clinical practice realism (38/102, 37.3%, 95% CI: 28.5–46.9%, Cohen’s h = −0.258, p = 0.013) (Fig. 4). No directional preferences emerged for the other dimensions, including diagnostic skills, clinical confidence, learning enjoyment, exam preparation, and future use intent. Mean satisfaction scores differed little across platforms, with no significant platform differences (p = 0.442, ε² ≈ 0); DeepSeek received the highest satisfaction (M = 3.68, SD = 1.14) and Claude the lowest (M = 3.11, SD = 1.03).
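For the per-dimension preference tests, Cohen's h compares the observed LLM-favoring proportion against the 0.5 null of no directional preference. A verification sketch for the ease-of-use and time-saving dimensions (the rounded counts give ≈0.237 for time saving versus the reported 0.238, presumably a rounding artifact):

```python
import math

def cohens_h_vs_null(p1, p0=0.5):
    """Cohen's h of an observed proportion against a null proportion."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p0))

h_ease = cohens_h_vs_null(65 / 102)  # ease of use, ~0.278
h_time = cohens_h_vs_null(63 / 102)  # time saving, ~0.237
```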
Stratified analyses did not detect statistically significant alignment between student satisfaction and expert-assessed content quality in this sample (Fig. 4), and these comparisons were not powered to exclude moderate associations. Effect sizes and confidence intervals for these stratified comparisons are provided in Supplementary D.
Students who experienced sessions meeting expert clinical accuracy and safety standards reported similar satisfaction on clinical practice realism compared with those who experienced failing sessions (3.15 vs 2.86, q = 0.626). For clinical confidence and diagnostic skills improvement, students who experienced sessions failing expert clinical reasoning fidelity criteria reported numerically higher satisfaction than those who experienced passing sessions (3.47 vs 3.11 and 3.68 vs 3.24, respectively), though the differences were not statistically significant (q = 0.653 and q = 0.259). With only four platforms, rank correlation analysis is underpowered to yield interpretable results; we therefore present platform-level satisfaction and safety pass rates descriptively in Fig. 4, and we report the observed rank correlation (ρ = −0.80) without a confidence interval because bootstrap resampling is unreliable at n = 4. DeepSeek received the highest student preference (62.5%) but demonstrated only a 52.2% safety pass rate, while Claude achieved the highest safety pass rate (77.3%) but received only 52.4% student preference.
Discussion
We systematically evaluated four major LLMs in generating pharmacotherapy simulations for AML and CML across instructional design quality, clinical accuracy and safety, and clinical reasoning fidelity. Overall success reached 51.5%, with notable domain heterogeneity. Models demonstrated relatively strong performance in instructional design (82.5%) and clinical reasoning fidelity (81.6%), while clinical accuracy and safety emerged as the primary limitation (58.3%). Current models appear capable of structuring learning experiences and modeling reasoning processes but demonstrate challenges with precise clinical content generation, indicating a need for targeted improvements and rigorous oversight before high-stakes educational deployment.
The clinical accuracy and safety domain showed uneven performance across its four subdomains. Clinical narrative plausibility and domain specificity both exceeded 90% success, indicating models can generate coherent scenarios and often maintain disease-specific boundaries when guided by structured prompts. However, pharmacotherapeutic accuracy and guideline alignment showed substantially lower success. Models appear capable of constructing believable clinical narratives but demonstrate difficulties translating evidence-based guidelines into accurate therapeutic recommendations. High narrative plausibility may explain why students engaged positively with content despite its pharmacotherapeutic inaccuracies, which highlights the need for expert review before educational deployment. In contrast, performance in instructional design quality and clinical reasoning fidelity remained relatively consistent across platforms and disease types. Scaffolding quality, instructional framing, learning objective clarity, problem identification, and knowledge integration all achieved relatively strong success with minimal variation. This consistency may indicate that pedagogical structure generation and reasoning process modeling represent more stable capabilities, potentially less sensitive to clinical complexity than content accuracy. The contrast between relatively stable pedagogical performance and variable clinical accuracy reinforces that content precision represents a primary technical challenge.
The disease-specific performance differences we observed provide insight into factors that influence model capabilities and suggest how findings may generalize to other conditions. Acute and chronic myeloid leukemia share myeloid cell lineage with overlapping clinical features, creating semantic similarity that prior literature suggested might challenge boundary preservation in language models16,19,35,36. Their management approaches differ fundamentally. CML follows relatively linear therapeutic pathways with tyrosine kinase inhibitor selection based on comorbidities and molecular monitoring35, while AML requires complex decision trees dependent on molecular subtype, induction response, and consolidation eligibility36.
This pairing allowed us to test whether semantic overlap causes domain entanglement and whether therapeutic complexity affects performance. Our findings are consistent with both concerns; however, the low absolute frequency of domain entanglement events limits mechanistic inference, and we treat entanglement as a hypothesis-generating failure mode warranting targeted experimental testing (e.g., prompt ablation studies and retrieval-augmented generation comparisons). Domain entanglement occurred exclusively in AML sessions, where models recommended blinatumomab, a B-cell acute lymphoblastic leukemia-specific agent, and incorrectly attributed differentiation syndrome to standard chemotherapy, demonstrating that semantic similarity may override explicit prompt constraints32. CML sessions succeeded more often, particularly in guideline alignment and pharmacotherapeutic accuracy.
These findings generate hypotheses for other domains. Conditions requiring multi-variable conditional reasoning may pose a higher risk for clinically important errors than conditions governed by more linear algorithms; however, generalizability beyond AML/CML requires direct empirical testing across additional disease states. Conceptually, conditions requiring complex multi-variable conditional reasoning (e.g., sepsis management, advanced heart failure therapy, or complex anticoagulation strategies) may present greater susceptibility to prompt override due to their non-linear decision pathways. In contrast, semantically related conditions governed by relatively structured or linear therapeutic algorithms (e.g., Crohn’s disease vs ulcerative colitis, type 1 vs type 2 diabetes, bacterial vs viral meningitis) may theoretically demonstrate higher discrimination accuracy. Establishing cross-domain robustness would require dedicated studies across additional disease pairs.
Platform patterns revealed meaningful heterogeneity. Gemini achieved the highest overall success with consistent cross-disease performance. Claude demonstrated the highest clinical accuracy with balanced disease-type performance, which may reflect conservative, guideline-adherent generation. DeepSeek demonstrated substantial disease-dependent variation, succeeding far more often with CML than AML, potentially reflecting training data imbalances or differential sensitivity to complexity. GPT-4o demonstrated lower clinical accuracy despite adequate instructional design. This variation is consistent with prior findings of domain-dependent LLM performance, indicating the need for disease-specific validation before implementation37,38,39.
Error analysis revealed patterns that may inform oversight and development priorities. Guideline misalignment reflected apparent temporal bias toward older training data, which may suggest models revert to statistically dominant training patterns when generating extended content40. Fabricated evidence manifested as confident citations of non-existent trials, which may indicate architectures lack robust mechanisms to verify claims or signal uncertainty. Pharmacotherapeutic inaccuracies included clinically concerning recommendations such as concurrent allopurinol-rasburicase administration, potentially reflecting failures in multi-step conditional reasoning39. Domain entanglement occurred despite explicit negative constraints, suggesting possible architectural limitations in maintaining categorical boundaries. These patterns indicate that while prompt engineering can elicit pedagogical structure, clinical safety requires model-level improvements, including citation verification, uncertainty quantification, temporal weighting, and enhanced boundary preservation.
Our meta-prompt framework showed promise in maintaining domain specificity and enabled systematic evaluation by standardizing content generation across platforms and disease types41. The framework incorporated five boundary-preservation mechanisms, including disease-specific guideline anchoring, negative constraints prohibiting cross-referencing related conditions, and consistency verification requirements. High domain specificity success (99/103, 96.1%) with relatively limited entanglement (3.9% of sessions) suggests these mechanisms helped maintain therapeutic boundaries in most cases. However, the persistence of entanglement errors in complex acute myeloid leukemia cases indicates that prompt engineering alone may be insufficient to overcome architectural limitations when semantic similarity is high. Beyond enabling systematic evaluation, the template-based design allows educators to adapt the framework through variable substitution while preserving safety constraints, thereby democratizing LLM-based simulation development and offering a potential approach for research on optimal prompt architecture in medical education.
The comprehensive expert validation conducted in this study may initially appear to contradict claims of scalability for LLM-generated simulations. However, in safety-critical clinical education, the appropriate goal is augmentation of expert judgment rather than its replacement. Inaccurate pharmacotherapy recommendations embedded in training simulations risk propagating incorrect clinical reasoning patterns to learners, creating downstream patient-safety concerns. Our findings empirically support this concern: 48.5% of unmoderated sessions failed to meet minimum expert safety and accuracy criteria, yet students experiencing these sessions reported satisfaction levels statistically indistinguishable from those experiencing passing sessions, suggesting learners may not reliably detect content quality deficiencies without expert guidance. Beyond expert evaluation, established frameworks recommend incorporating learner perspectives when assessing educational technologies42. We included student satisfaction assessment to capture authentic end-user experience, allowing natural interaction without experimental constraints to identify real-world engagement patterns and implementation challenges43,44. The stratified analysis examined whether student satisfaction on specific dimensions aligned with expert evaluation of corresponding session quality. We interpret satisfaction as feasibility evidence (acceptability and usability) rather than evidence of educational effectiveness, which requires objective learning outcome measures in subsequent studies.
Stratified analyses did not detect statistically significant alignment between student satisfaction and expert-assessed content quality in this sample. Students experiencing sessions meeting expert clinical accuracy and safety standards reported similar clinical practice realism satisfaction as those experiencing failing sessions. For clinical confidence and diagnostic skills improvement, students experiencing sessions failing expert clinical reasoning fidelity criteria reported numerically higher satisfaction than those experiencing passing sessions, though differences were not statistically significant. Although these comparisons were underpowered to exclude moderate effects, the observed dissociation—positive learner experience even in sessions with expert-identified safety or guideline violations—supports the governance interpretation that student satisfaction alone is insufficient for validating clinical content quality, consistent with cognitive theory on fluency-driven illusions of competence45,46,47,48,49,50,51.
Platform-level patterns reinforced this disconnect. Gemini achieved the highest expert pass rate but moderate student satisfaction, while GPT-4o demonstrated the lowest expert pass rate with similar student satisfaction52,53. Combined with stratified analysis showing a lack of alignment between individual satisfaction and session quality, this may suggest that students base preferences on factors unrelated to clinical accuracy, such as conversational style or response elaboration, rather than expert-identified educational value or safety. Students significantly favored models for ease of use but significantly favored traditional methods for clinical practice realism. The ease of use advantage likely reflects immediate accessibility and conversational interaction, while preference for traditional methods regarding clinical realism suggests students recognized that model-generated scenarios might lack the authenticity of faculty-developed cases. This dimensional variation indicates students may discern certain experiential aspects but appear unable to evaluate clinical accuracy.
This potential inability to discriminate content quality has implications for unsupervised deployment. Students may find content appealing based on polished presentation, confident tone, and engaging narratives while potentially unable to identify underlying clinical inaccuracies, fabricated evidence, or guideline violations. High narrative plausibility combined with low pharmacotherapeutic accuracy creates conditions where learners may encounter content appearing authentic while containing clinically inappropriate recommendations. Advanced pharmacy students, despite substantial clinical knowledge, showed no significant ability to identify quality differences that experts readily detected, which may suggest clinical background alone does not protect learners and that structured oversight may be necessary even for experienced students54.
For safe implementation and future development, educators and developers face different but complementary priorities. Developers need to improve models by integrating real-time guideline updates with temporal weighting to reduce reliance on outdated patterns, implementing citation verification to prevent fabricated evidence, strengthening boundary-preservation mechanisms to limit domain entanglement, and enhancing logical consistency checks for multi-step therapeutic reasoning55,56. Educators must recognize current limitations and ensure that all LLM-generated content undergoes expert review, with particular attention to pharmacotherapeutic recommendations and guideline adherence in complex therapeutic areas.
The patterns of errors observed suggest several directions for research and development. Validation of the meta-prompt framework across diverse therapeutic domains is necessary to determine whether it can generalize beyond hematologic malignancies. Comparative studies of retrieval-augmented generation with real-time guideline access may reveal whether architectural modifications can reduce gaps in accuracy. Development of automated error detection targeting guideline violations, fabricated citations, domain entanglement, and reasoning failures could support scalable quality assurance. Multi-institutional studies with larger samples would help clarify whether observed platform differences and complexity-dependent performance patterns are robust, providing guidance for both model refinement and educational deployment. Future work should also include error clustering stratified by AML molecular subtype/risk category (e.g., FLT3/NPM1/IDH-driven decision branches) to determine whether failures concentrate in specific therapeutic pathways.
We applied non-compensatory thresholds for clinical accuracy and safety and clinical reasoning fidelity, and compensatory scoring for instructional design quality. This approach is consequence-based: in clinically oriented domains, compensatory averaging can mask safety-critical deficiencies, a concern reflected in patient safety–oriented standard setting and mastery learning frameworks that treat critical actions as non-compensatory requirements57,58. In contrast, instructional design elements may compensate for one another within a minimum-quality floor without introducing direct patient-safety risk5,59. These thresholds should be interpreted as conservative deployment criteria for this Phase 1 validation stage rather than as a claim that all clinical errors have equivalent severity. Accordingly, we report a structured error taxonomy (Table 1), distinguishing failure modes with potentially different consequences. Future work should develop severity-weighted scoring and explicit “never-event” catalogs through formal expert consensus (e.g., modified Delphi) and psychometric calibration (e.g., Rasch modeling or item response theory), while retaining non-compensatory gating for errors judged unacceptable for learner exposure60.
This study has several limitations. Sample sizes per platform were sufficient for overall characterization but offered limited power to detect small differences, making platform comparisons exploratory. Single-institution implementation and evaluation limited to two hematologic malignancies constrain external generalizability. Although differences in therapeutic complexity may conceptually influence model performance—particularly in conditions requiring multi-variable conditional reasoning—such extrapolation remains hypothetical and requires empirical validation across additional disease domains. While our design ensures a focused evaluation of each model, it precludes direct learner-based comparative analysis. Expert reviewers were not blinded to platform identity, but structured rubrics and high inter-rater reliability may have reduced bias. Conservative pass-fail thresholds prioritized safety but require further empirical evaluation. The evaluation framework and meta-prompt require formal validation, and stratified analyses had limited power, particularly for clinical reasoning domains, which may have contributed to non-significant findings despite meaningful numerical trends.
Moreover, the review intensity used in this study reflects research-grade benchmarking intended to characterize failure modes and support early validity evidence for the rubric. In this dataset, 48.5% of sessions failed to meet minimum criteria for clinical accuracy and safety, while student satisfaction ratings were statistically similar between sessions that passed and failed expert criteria. Scalability should therefore be judged against the appropriate comparator workflow: LLM-assisted draft generation with structured expert verification versus traditional simulation development; the study did not quantify time or cost differences between the two. Drawing on patient-safety approaches to standard setting and critical-action gating, severity-stratified medication safety systems, and implementation governance considerations, we propose a risk-stratified quality assurance approach in which review intensity is calibrated to clinical consequence50,58,60,61,62,63; operational deployment would likely require such risk-stratified review rather than uniform research-grade scrutiny.
Additionally, because model behavior may change over time in proprietary web interfaces, replication using the same prompts and rubric at future time points is required to assess the temporal stability of observed failure modes. Unfortunately, this type of version drift is now typical for continuously deployed proprietary LLMs across both web and API access routes, so transparent reporting and longitudinal re-testing are currently the most practical mitigation strategies when version-pinned snapshots are unavailable64. Future work should also consider benchmarking smaller locally deployable models (or institution-hosted open-weight models) that can be maintained as frozen snapshots, enabling stronger reproducibility and implementation governance than continuously updated proprietary endpoints64.
A further key limitation is the expert panel composition (three co-authors from one institution involved in rubric and meta-prompt development), which may introduce confirmation bias. The reported reliability reflects internal scoring consistency rather than independent external validation; stronger support for independence requires replication with external and/or blinded raters. Phase 2 work should incorporate independent external clinical experts, ideally blinded to platform identity and study hypotheses, to strengthen extrapolation and generalizability inferences65,66. In addition, we did not perform prompt ablation experiments; therefore, we cannot attribute observed performance to any specific safeguard mechanism within the meta-prompt. DeepSeek’s marked disease-dependent variation suggests platform-specific sensitivity to therapeutic complexity or training data distribution; explaining these effects would require model-level access not available in web-interface evaluations.
Finally, because each student used only one platform, within-subject platform comparisons were not feasible in this Phase 1 content validation study, which prioritized independent expert evaluation of diverse platform outputs over learner-centered comparative usability testing. Future studies should evaluate multiple platforms per learner using counterbalanced designs with temporal spacing, disease context rotation, and burden management to strengthen evidence about individual preference patterns while controlling for carryover learning effects and fatigue. Additionally, evaluating LLM simulations across multiple professional years may clarify prerequisite knowledge thresholds and the optimal curricular timing for LLM-assisted simulation within vertically and horizontally integrated curricula.
In conclusion, using controlled meta-prompting, sessions more frequently met criteria for instructional design and clinical reasoning than for pharmacotherapeutic accuracy and guideline alignment, with performance varying by platform and disease context. Expert oversight with platform-specific and disease-specific validation remains essential for safe educational deployment, and effectiveness trials assessing objective learning outcomes represent necessary subsequent work.
Methods
Study design
This was a Phase 1 mixed-methods study evaluating educational materials generated during routine curricular activities from March 15 to April 10, 2025, at a single institution67. The study aimed to characterize LLM performance in generating pharmacotherapy simulations and to assess student perceptions of this learning modality, as outlined in the five-phase evaluation framework (Fig. 5). All clinical evaluations were benchmarked against 2022 National Comprehensive Cancer Network/European LeukemiaNet (NCCN/ELN)36 guidelines for AML and 2017 European Society for Medical Oncology/European LeukemiaNet (ESMO/ELN) guidelines for CML35, which served as the reference standards for assessing clinical accuracy and guideline adherence. This study was approved by the Research Ethics Committee of the Faculty of Pharmacy, Cairo University (Approval ID: CL 3931). All participants provided informed consent.
Five-phase systematic assessment of LLM-generated AML/CML simulations: (1–2) Expert panel established integrated pedagogical and clinical guidelines. (3) Guidelines encoded into a standardized meta-prompt with AI role, case structure, interaction, and assessment modules. (4) Deployment across four LLM platforms (ChatGPT, Claude, DeepSeek, Gemini). (5) Evaluation using HSSOBP, SDS, CRER, and SET-M. LLM large language model, AML acute myeloid leukemia, CML chronic myeloid leukemia, HSSOBP Healthcare Simulation Standards of Best Practice, SDS simulation design scale, CRER clinical reasoning evaluation rubric, SET-M simulation effectiveness tool–modified.
Framework development and meta-prompt engineering
LLMs demonstrate substantial sensitivity to prompt design, with well-structured prompts shown to reduce hallucination and improve performance16. However, effective prompt engineering typically requires specialized technical expertise that may limit accessibility for educators. To address this barrier while maintaining rigor, we developed a systematic meta-prompt framework that bridges clinical domain expertise with prompt engineering principles41. This approach enables educators to generate high-quality pharmacotherapy simulations through structured variable substitution rather than de novo prompt development, providing a transparent and replicable method adaptable across educational settings.
Development proceeded in three stages. In the first stage, a multidisciplinary team consisting of a clinical pharmacy educator (AA), board-certified oncology pharmacist (AZ), and clinical pharmacy researcher (AF) defined foundational principles for simulation design informed by scaffolding theory and cognitive load theory (Supplementary A)4,5,6,68,69,70. In the second stage, these principles were translated into a universal meta-prompt template with variable placeholders for leukemia type, specialty, and guideline source (Supplementary B). The template defined the model’s role as clinical pharmacy professor, structured learner engagement through scaffolded progression aligned with Bloom’s Taxonomy41, and required strict guideline adherence across diagnostic, therapeutic, and follow-up scenarios. Five safeguard mechanisms were embedded to minimize domain entanglement and other clinical errors: (1) disease-specific guideline anchoring, (2) negative constraints prohibiting reference to related conditions, (3) boundary reinforcement directing limitation statements when uncertain, (4) citation requirements linking recommendations to specific guideline sections, and (5) consistency verification before each recommendation.
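The variable-substitution step described above can be sketched in a few lines. The placeholder names and prompt wording below are illustrative inventions, not the exact fields of the published meta-prompt (see Supplementary B for the full syntax).

```python
from string import Template

# Illustrative sketch of structured variable substitution; the field names
# (leukemia_type, specialty, guideline_source) and wording are hypothetical,
# not the published meta-prompt text.
META_PROMPT = Template(
    "You are a clinical pharmacy professor running an interactive "
    "$leukemia_type simulation for $specialty learners. Anchor every "
    "diagnostic, therapeutic, and follow-up recommendation strictly to "
    "$guideline_source, cite the relevant guideline section for each "
    "recommendation, and state your limitations rather than speculate "
    "when the guideline is silent."
)

cml_prompt = META_PROMPT.substitute(
    leukemia_type="chronic myeloid leukemia (CML)",
    specialty="clinical pharmacy",
    guideline_source="the 2017 ESMO/ELN CML guidelines",
)
```

Educators can then generate a new disease context by changing only the three substituted values, leaving the pedagogical scaffolding untouched.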
In the third stage, final prompts were generated using Google AI Studio (Gemini 2.0 Pro), selected for its extended context window, stable output, and academic licensing (Supplementary B)71,72. Subject-matter experts reviewed and standardized these prompts to remove platform-specific phrasing before deployment across all four platforms.
Participants and randomization
We recruited 104 fourth-year Doctor of Pharmacy students after obtaining informed consent. Participants were initially randomized to one of two leukemia contexts (AML or CML) for their LLM-based simulation. To minimize sequence effects and ensure balanced academic preparedness across study arms, students were further randomized in a counterbalanced manner to receive either the LLM simulation first, followed by a traditional case-based session, or vice versa. Within each leukemia context, students assigned to the LLM condition were randomized to one of four platforms (ChatGPT GPT-4o73, Claude 3.7 Sonnet74, Gemini 2.0 Pro71, or DeepSeek V275), yielding approximately 13 participants per platform–disease combination. Each student completed both leukemia contexts overall (one via LLM simulation and one via traditional case study); however, only one LLM transcript per student was included in the final analysis to minimize carryover learning effects.
The study was conducted within a traditional (discipline-based) curriculum where clinical therapeutics is taught as the capstone of a longitudinal science sequence. To ensure adequate academic preparedness, all participating fourth-year PharmD students had completed a prerequisite series of at least two pharmacology courses. This sequential scaffolding provides vertical integration, bridging foundational drug science with the advanced clinical reasoning required for complex hematologic oncology management.
Deployment procedures
Students received 30-min training covering study protocol, procedures for initiating new sessions without prior conversation history to avoid context carryover, and documentation instructions. They deployed provided prompts on personal devices using publicly accessible web interfaces. ChatGPT GPT-4o was accessed via chat.openai.com76, Claude 3.7 Sonnet via claude.ai77, Gemini 2.0 Pro Experimental 0125 via aistudio.google.com72, and DeepSeek V2 via chat.deepseek.com78. All platforms used default web interface settings, including standard temperature parameters, no retrieval augmentation, and no custom instructions.
Students interacted naturally with platforms without experimental constraints, prioritizing external validity to enable identification of authentic implementation failure modes43,44. Participants exported sessions using native platform functions, while DeepSeek users saved full webpages. All session transcripts were pasted into structured forms with timestamped screenshots and session metadata to ensure complete documentation79.
Traditional case-based sessions mirrored the clinical complexity and modular design of the LLM simulations. Students received unique patient cases followed by self-learning questions and complete guideline documents35,36. They submitted answers with highlighted screenshots of cited references documenting evidence-based reasoning. Cases covered diverse AML and CML contexts, including varied patient demographics, disease subtypes, and molecular profiles. Essay-style questions guided students to interpret diagnostics, apply guideline reasoning, and propose management strategies. After expert review of all submitted materials, a reconciliation session was held with participants to identify and correct any clinical inaccuracies, preventing the propagation of misinformation. AML and CML were selected as a focused proof-of-concept pair because of their shared hematologic classification yet distinct therapeutic algorithms; this pairing was intended to probe potential domain entanglement within closely related malignancies rather than to establish generalizable conclusions across disease categories.
For transparency despite web-interface version drift, we report platform access routes, default interface settings, and full prompt syntax (Supplementary B), consistent with MI-CLEAR-LLM reporting guidance80. While backend updates cannot be controlled in web interfaces, this study provides benchmarking under ecologically valid access conditions and enables future longitudinal replication using identical prompts and the same rubric.
Performance assessment framework
We adapted four established instruments for LLM evaluation contexts: the Healthcare Simulation Standards of Best Practice (HSSOBP)81 for simulation design quality, the Simulation Design Scale (SDS)82 for design element assessment, the Clinical Reasoning Evaluation Rubric (CRER)83 for reasoning process evaluation, and the Simulation Effectiveness Tool–Modified (SET-M)84 for learner-reported effectiveness. The expert panel reviewed all items for face and content validity, calculating Content Validity Indices using established methods85. Items achieving item-level Content Validity Index (CVI) ≥ 0.78 were retained, while those below were revised or excluded. Final instruments achieved scale-level CVI values ≥ 0.90, indicating excellent content validity. The adaptation process addressed fundamental challenges in evaluating LLM-generated content, including the absence of real-time facilitation and the need to assess algorithmic rather than human reasoning processes; specific adaptations for each instrument are detailed in Table 2.
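For readers reproducing the item-retention step, the Polit and Beck indices85 reduce to simple proportions: an item-level CVI is the share of experts rating the item relevant (3 or 4 on a 4-point scale), and the scale-level average is the mean of the item-level indices. The ratings below are invented for illustration.

```python
# Content Validity Index sketch (per Polit & Beck); ratings are
# invented examples from a hypothetical three-member panel.
def item_cvi(ratings):
    """Proportion of experts rating the item 3 or 4 on a 4-point scale."""
    return sum(r >= 3 for r in ratings) / len(ratings)

def scale_cvi_ave(item_ratings):
    """Scale-level CVI as the average of item-level CVIs (S-CVI/Ave)."""
    return sum(item_cvi(r) for r in item_ratings) / len(item_ratings)

items = [
    [4, 4, 3],  # I-CVI = 1.00 -> retained (>= 0.78)
    [4, 3, 2],  # I-CVI = 0.67 -> revised or excluded (< 0.78)
    [3, 4, 4],  # I-CVI = 1.00 -> retained
]
i_cvis = [item_cvi(r) for r in items]
retained = [r for r, cvi in zip(items, i_cvis) if cvi >= 0.78]
```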
The adapted instruments created a three-domain framework encompassing twelve subdomains: instructional design quality (learning objective clarity, instructional framing, scaffolding quality, embedded feedback quality); clinical accuracy and safety (clinical narrative plausibility, guideline alignment, pharmacotherapeutic accuracy, domain specificity); and clinical reasoning fidelity (clinical prioritization, knowledge integration, evidence-based decision-making, outcomes modeling).
Student satisfaction was assessed using eight 5-point Likert items adapted from validated instruments84. Students completed satisfaction questionnaires immediately after both traditional and LLM-based learning modalities, with comparative items addressing clinical confidence building, diagnostic skills improvement, time efficiency, and future use intentions. Internal consistency was evaluated using Cronbach’s alpha86. Complete evaluation rubrics, validation indices, and error taxonomy are provided in Supplementary C.
Expert evaluation procedures
The three-member expert panel conducted independent evaluations of each session transcript using the adapted instruments described in the “Performance assessment framework” section. Each subdomain was rated on a 1-to-5 scale. Inter-rater reliability was assessed using Krippendorff’s alpha for ordinal ratings, with α ≥ 0.80 interpreted as strong agreement. Krippendorff’s α provides evidence of rating consistency but does not establish independence from shared systematic bias; stronger support for independence requires replication using external and/or blinded raters31. We therefore frame this work as Phase 1 content validation, consistent with staged approaches in instrument development in health professions education, where internal expert panels commonly define content boundaries and stabilize scoring prior to external replication87,88,89,90. Subdomain scores were calculated as the mean of independent ratings across the three reviewers. To minimize potential bias from non-blinding to platform identity, the rubrics emphasized objective, verifiable clinical criteria.
Given the absence of established standards for evaluating AI-generated medical education simulations, we developed domain-specific pass-fail criteria informed by our rubric’s scoring definitions and simulation assessment literature. We employed a categorical approach requiring each subdomain to meet minimum performance thresholds rather than averaging scores across subdomains, thereby preventing high performance in one area from masking critical deficiencies in another.
We applied domain-specific pass-fail criteria. Clinical accuracy and safety required all four subdomains to achieve mean scores ≥4.0 with no exceptions permitted (non-compensatory scoring), consistent with the Healthcare Simulation Standards of Best Practice (HSSOBP)81. Clinical reasoning fidelity similarly required all four subdomains ≥4.0 without compensation, reflecting the expectation that simulations model expert-level reasoning processes91,92. In contrast, instructional design quality used compensatory scoring based on Chen et al.’s criteria59 (all subdomains >3.0 with an overall domain mean >4.0), allowing strengths in some instructional elements to offset modest weaknesses in others while maintaining a minimum-quality floor59,93. Sessions were classified as successful only when all three domains met criteria simultaneously94. We report a structured error taxonomy (Table 1) distinguishing among guideline misalignment, pharmacotherapeutic inaccuracies, fabricated evidence citations, and domain entanglement.
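This gating logic can be stated compactly in code; the subdomain means below are hypothetical examples, not study data.

```python
# Domain-level pass/fail gating as described above; example means are invented.
def passes_noncompensatory(subdomain_means, floor=4.0):
    """Clinical accuracy/safety and reasoning fidelity: every subdomain >= 4.0."""
    return all(m >= floor for m in subdomain_means)

def passes_compensatory(subdomain_means, floor=3.0, domain_mean=4.0):
    """Instructional design: all subdomains > 3.0 AND overall mean > 4.0."""
    means = list(subdomain_means)
    return all(m > floor for m in means) and sum(means) / len(means) > domain_mean

def session_successful(accuracy, reasoning, design):
    """A session passes only when all three domains meet criteria simultaneously."""
    return (passes_noncompensatory(accuracy)
            and passes_noncompensatory(reasoning)
            and passes_compensatory(design))

ok = session_successful(
    accuracy=[4.3, 4.1, 4.0, 4.5],
    reasoning=[4.2, 4.0, 4.1, 4.4],
    design=[4.6, 3.9, 4.2, 4.1],   # one subdomain below 4.0 is tolerated here
)
```

Note that a design subdomain of 3.9 is tolerated because the domain mean (4.2) clears the floor, whereas the same 3.9 in the accuracy domain would fail the whole session.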
Statistical analysis
Our sample size provided 80% power to detect at least a 30-percentage-point difference in LLM performance between leukemia types at a two-sided α = 0.0595. The primary endpoint was the overall success rate, defined as the proportion of sessions meeting passing criteria across all three domains, reported with exact binomial 95% confidence intervals95. Domain-specific success rates were estimated overall and stratified by platform and leukemia type using Wilson score confidence intervals. Error frequencies were quantified by platform and disease type to identify systematic failure patterns.
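The Wilson score interval used for stratified estimates has a closed form computable with the standard library alone. The sketch below applies it to the overall counts reported above (53 of 103 sessions); the exact (Clopper–Pearson) interval used for the primary endpoint additionally requires a beta-distribution inverse such as scipy.stats.beta.ppf and is omitted here.

```python
from math import sqrt

def wilson_ci(successes, n, z=1.959964):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Overall success rate reported in this study: 53 of 103 sessions.
lo, hi = wilson_ci(53, 103)
```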
Comparisons between AML and CML used Fisher’s exact tests with risk ratios and 95% confidence intervals. Following standard guidance, Fisher’s exact test was selected whenever any expected cell count was below five, ensuring valid inference in small-sample categorical comparisons; chi-square tests were used otherwise.
Platform-level differences in overall success rates were assessed using chi-square tests, and within-platform disease differences were evaluated using Fisher’s exact tests.
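For illustration, the two-sided Fisher’s exact p-value and the accompanying risk ratio can be reproduced with standard-library arithmetic. The 2 × 2 counts below (33/53 CML vs 20/50 AML successes) are those implied by the success rates reported in the Results, not an independent tabulation.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p for the 2x2 table [[a, b], [c, d]]:
    the sum of hypergeometric table probabilities no larger than the
    observed table's probability (the rule SciPy also uses)."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):  # P(top-left cell = k) with all margins fixed
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo_k, hi_k = max(0, col1 - row2), min(row1, col1)
    return sum(prob(k) for k in range(lo_k, hi_k + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# CML vs AML overall success, counts implied by the reported rates.
p = fisher_exact_two_sided(33, 20, 20, 30)
risk_ratio = (33 / 53) / (20 / 50)
```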
Student satisfaction was analyzed by comparing mean scores on each of the eight dimensions against the neutral midpoint (μ = 3.0) using one-sample t-tests, consistent with established practices for Likert scale analysis in educational research96,97. In addition, we examined whether student satisfaction varied according to expert-assessed content quality. For three conceptually related domain-dimension pairs (clinical accuracy and safety with clinical practice realism; clinical reasoning fidelity with both clinical confidence and diagnostic skills improvement), we calculated mean student dimension scores stratified by whether their corresponding sessions passed or failed expert evaluation criteria.
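A sketch of the midpoint comparison follows, using invented Likert responses and a normal approximation to the t distribution. The approximation is adequate at this study’s sample sizes but is not exactly what SciPy’s ttest_1samp computes.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_test_vs_midpoint(scores, mu=3.0):
    """t statistic for H0: mean = mu, with a two-sided p-value from the
    standard normal (a large-sample stand-in for the t distribution)."""
    n = len(scores)
    t = (mean(scores) - mu) / (stdev(scores) / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Invented 5-point Likert responses for illustration only.
scores = [4, 5, 4, 3, 5, 4, 4, 3, 5, 4, 2, 4, 5, 4, 4, 3]
t, p = one_sample_test_vs_midpoint(scores)
```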
Statistical significance was defined as α = 0.05. Multiple comparisons were corrected using the Benjamini–Hochberg procedure98. All analyses were conducted in Python 3.12 (SciPy 1.11, Statsmodels 0.14).
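The Benjamini–Hochberg adjustment can also be reproduced without statsmodels; the stdlib sketch below should match the adjusted p-values returned by multipletests(..., method='fdr_bh') on the invented example.

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value downward
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values for illustration.
adj = benjamini_hochberg([0.010, 0.040, 0.030, 0.005])
```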
Data availability
All datasets generated and/or analyzed during the current study are available in the supplementary material.
Code availability
Not applicable.
References
Gharib, A. M., Bindoff, I. K., Peterson, G. M. & Salahudeen, M. S. Computer-based simulators in pharmacy practice education: a systematic narrative review. Pharmacy 11, 8 (2023).
Korayem, G. B. et al. Simulation-based education implementation in pharmacy curriculum: a review of the current status. Adv. Med. Educ. Pract. 13, 649–660 (2022).
Seybert, A. L. et al. Evidence for simulation in pharmacy education. J. Am. Coll. Clin. Pharm. 2, 686–692 (2019).
Reiser, B. J. Scaffolding complex learning: the mechanisms of structuring and problematizing student work. J. Learn. Sci. 13, 273–304 (2004).
Van Merriënboer, J. J. G. & Sweller, J. Cognitive load theory and complex learning: recent developments and future directions. Educ. Psychol. Rev. 17, 147–177 (2005).
Mesquita, A. R. et al. Developing communication skills in pharmacy: a systematic review of the use of simulated patient methods. Patient Educ. Couns. 78, 143–148 (2010).
Lin, K., Travlos, D. V., Wadelin, J. W. & Vlasses, P. H. Simulation and introductory pharmacy practice experiences. Am. J. Pharm. Educ. 75, 1–9 (2011).
Vyas, D., Bray, B. S. & Wilson, M. N. Use of simulation-based teaching methodologies in US colleges and schools of pharmacy. Am. J. Pharm. Educ. 77, 53 (2013).
Cook, D. A. Creating virtual patients using large language models: scalable, global, and low cost. Med. Teach. 47, 40–42 (2025).
Cook, D. A. et al. Virtual patients using large language models: scalable, contextualized simulation of clinician-patient dialogue with feedback. J. Med. Internet Res. 27, e68486 (2025).
Brügge, E. et al. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial. BMC Med. Educ. 24, 1391 (2024).
Mehan, N., Desinghe, T. D. & Saha, A. Development and evaluation of large-language models (LLMs) for oncology: a scoping review. PLoS Digit. Health. https://doi.org/10.1371/journal.pdig.0000980 (2025).
Kung, T. H. et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLoS Digit. Health 2, e0000198 (2023).
Safranek, C. W., Sidamon-Eristoff, A. E., Gilson, A. & Chartash, D. The role of large language models in medical education: applications and implications. JMIR Med. Educ. 9, e50945 (2023).
Sharma, S., Mittal, P., Kumar, M. & Bhardwaj, V. The role of large language models in personalized learning: a systematic review of educational impact. Discov. Sustain. 6, 1–24 (2025).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).
Alkalbani, A. M. et al. A systematic review of large language models in medical specialties: applications, challenges and future directions. Information 16, 489 (2025).
Abramski, K., Improta, R., Rossetti, G. & Stella, M. The “LLM world of words” English free association norms generated by large language models. Sci. Data 12, 803 (2025).
Yazdani, S., Henry, R. C., Byrne, A. & Henry, I. C. Utility of word embeddings from large language models in medical diagnosis. J. Am. Med. Inform. Assoc. 32, 526–534 (2025).
Sauder, M. et al. Exploring generative artificial intelligence-assisted medical education: assessing case-based learning for medical students. Cureus 16, e51961 (2024).
Patil, N. G., Kou, N. L., Baptista-Hon, D. T. & Monteiro, O. Artificial intelligence in medical education: a practical guide for educators. MedComm Futur. Med. 4, e70018 (2025).
Shaw, K., Henning, M. A. & Webster, C. S. Artificial intelligence in medical education: a scoping review of the evidence for efficacy and future directions. Med. Sci. Educ. 35, 1803–1816 (2025).
Ahsan, Z. Integrating artificial intelligence into medical education: a narrative systematic review of current applications, challenges, and future directions. BMC Med. Educ. 25, 1187 (2025).
Benary, M. et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw. Open 6, E2343689 (2023).
Shimony, S., Stahl, M. & Stone, R. M. Acute myeloid leukemia: 2025 update on diagnosis, risk-stratification, and management. Am. J. Hematol. 100, 860–891 (2025).
Jabbour, E. & Kantarjian, H. Chronic myeloid leukemia: 2025 update on diagnosis, therapy, and monitoring. Am. J. Hematol. 99, 2191–2212 (2024).
DiNardo, C. D. & Wei, A. H. How I treat acute myeloid leukemia in the era of new drugs. Blood 135, 85–96 (2020).
Döhner, H. et al. Diagnosis and management of AML in adults: 2017 ELN recommendations from an international expert panel. Blood 129, 424–447 (2017).
Hochhaus, A. et al. European LeukemiaNet 2020 recommendations for treating chronic myeloid leukemia. Leukemia 34, 966–984 (2020).
Krippendorff, K. Content Analysis: An Introduction to Its Methodology. https://doi.org/10.4135/9781071878781 (Sage, 2019).
FDA. FDA Approves Blinatumomab as Consolidation for CD19-positive Philadelphia Chromosome-negative B-cell Precursor Acute Lymphoblastic Leukemia (FDA, 2024).
Astellas Pharma Global Development, Inc. NCT02997202. A trial of the FMS-like tyrosine kinase 3 (FLT3) inhibitor gilteritinib administered as maintenance therapy following allogeneic transplant for patients with FLT3/internal tandem duplication (ITD) acute myeloid leukemia (AML). Astellas Pharma Global Development, Inc. https://clinicaltrials.gov/show/NCT02997202 (2016).
Levis, M. J. et al. Gilteritinib as post-transplant maintenance for AML with internal tandem duplication mutation of FLT3. J. Clin. Oncol. 42, 1766–1775 (2024).
Hochhaus, A. et al. Chronic myeloid leukaemia: ESMO clinical practice guidelines for diagnosis, treatment and follow-up. Ann. Oncol. 28, iv41–iv51 (2017).
Döhner, H. et al. Diagnosis and management of AML in adults: 2022 recommendations from an international expert panel on behalf of the ELN. Blood 140, 1345–1377 (2022).
Mavrych, V., Yousef, E. M., Yaqinuddin, A. & Bolgova, O. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Med. Educ. Online 30, 2534065 (2025).
Dinc, M. T., Bardak, A. E., Bahar, F. & Noronha, C. Comparative analysis of large language models in clinical diagnosis: performance evaluation across common and complex medical cases. JAMIA Open 8, ooaf055 (2025).
Wilhelm, T. I., Roos, J. & Kaczmarczyk, R. Large language models for therapy recommendations across 3 clinical specialties: comparative study. J. Med. Internet Res. 25, e49324 (2023).
Zhu, C. et al. Is your LLM outdated? A deep look at temporal generalization. In Proc. 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, 7433–7457. https://doi.org/10.18653/v1/2025.naacl-long.381 (Association for Computational Linguistics (ACL), 2025).
Suzgun, M. & Kalai, A. T. Meta-prompting: enhancing language models with task-agnostic scaffolding. Preprint at arXiv https://doi.org/10.48550/arXiv.2401.12954 (2024).
Lai, J. W. M., De Nobile, J., Bower, M. & Breyer, Y. Comprehensive evaluation of the use of technology in education—validation with a cohort of global open online learners. Educ. Inf. Technol. 27, 9877–9911 (2022).
Lai, J. W. M. & Bower, M. How is the use of technology in education evaluated? A systematic review. Comput. Educ. 133, 27–42 (2019).
Sun, Y. & Liu, F. Real-world implementation of an AI learning tool-MetaGP-Edu in medical education: a multi-center cohort study. Comput. Educ. 237, 105388 (2025).
Dunlosky, J. et al. Improving students’ learning with effective learning techniques: promising directions from cognitive and educational psychology. Psychol. Sci. Public Interest Suppl. 14, 4–58 (2013).
Finn, B. & Metcalfe, J. Judgments of learning are influenced by memory for past test. J. Mem. Lang. 58, 19–34 (2008).
Sanchez, C. A. & Wiley, J. An examination of the seductive details effect in terms of working memory capacity. Mem. Cogn. 34, 344–355 (2006).
Rey, G. D. A review of research and a meta-analysis of the seductive detail effect. Educ. Res. Rev. 7, 216–237 (2012).
Bjork, E. L. & Bjork, R. A. Making things hard on yourself, but in a good way: creating desirable difficulties to enhance learning. Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society, 55–64 (2009).
Koriat, A. & Bjork, R. A. Illusions of competence in monitoring one’s knowledge during study. J. Exp. Psychol. Learn. Mem. Cogn. 31, 187–194 (2005).
Harp, S. F. & Mayer, R. E. How seductive details do their damage: a theory of cognitive interest in science learning. J. Educ. Psychol. 90, 414–434 (1998).
Almulla, M. A. Investigating influencing factors of learning satisfaction in AI ChatGPT for research: university students perspective. Heliyon 10, e32220 (2024).
Suchanek, P. & Kralova, M. Generative artificial intelligence expectations and experiences in management education: ChatGPT use and student satisfaction. J. Innov. Knowl. 10, 100781 (2025).
Elsayed, H. The impact of hallucinated information in large language models on student learning outcomes: a critical examination of misinformation risks in AI-assisted education. Northern reviews on algorithmic research. Theor. Comput. Complex 9, 11–23 (2024).
Ahn, S. A guide to evade hallucinations and maintain reliability when using large language models for medical research: a narrative review. Ann. Pediatr. Endocrinol. Metab. 30, 115–118 (2025).
Moëll, B. & Sand Aronsson, F. Harm reduction strategies for thoughtful use of large language models in the medical domain: perspectives for patients and clinicians. J. Med. Internet Res. 27, e75849 (2025).
McGaghie, W. C. et al. Does simulation-based medical education with deliberate practice yield better results than traditional clinical education? A meta-analytic comparative review of the evidence. Acad. Med. 86, 706–711 (2011).
Yudkowsky, R. et al. A patient safety approach to setting pass/fail standards for basic procedural skills checklists. Simul. Healthc. 9, 277–282 (2014).
Chen, Y. et al. The effect of high-fidelity simulation in medical nursing based on the healthcare simulation standards of best practice. Nurs. Commun. 8, e2024020 (2024).
The Agency for Healthcare Research and Quality Patient Safety Network (AHRQ PSNet). The National Coordinating Council for Medication Error Reporting and Prevention (NCCMERP). http://www.nccmerp.org/aboutMedErrors.html (2022).
ECA Academy. EU GMP Annex 22 (Draft 2025): Artificial Intelligence. https://www.gmp-compliance.org/guidelines/gmp-guideline/eu-gmp-annex-22-draft-2025-artificial-intelligence (2025).
Food and Drug Administration/U.S. Department of Health and Human. Clinical Decision Support Software (FDA, 2026).
The Agency for Healthcare Research and Quality Patient Safety Network (AHRQ PSNet). The Institute for Safe Medication (ISMP). https://psnet.ahrq.gov/issue/institute-safe-medication-practices (2019).
Chen, L., Zaharia, M. & Zou, J. How is ChatGPT’s behavior changing over time? Harvard Data Sci. Rev. https://doi.org/10.1162/99608f92.5317da47 (2024).
Messick, S. Standards of validity and the validity of standards in performance assessment. Educ. Meas. Issues Pract. 14, 5–8 (1995).
Kane, M. T. Validating the interpretations and uses of test scores. J. Educ. Meas. 50, 1–73 (2013).
Mertens, D. M. Mixed Methods Design in Evaluation. https://doi.org/10.4135/9781506330631 (Sage, 2018).
Seybert, A. L. & Barton, C. M. Simulation-based learning to teach blood pressure assessment to doctor of pharmacy students. Am. J. Pharm. Educ. 71, 48 (2007).
Kiersma, M. E. et al. Development of the 2025 ACPE accreditation standards leading to the Doctor of Pharmacy degree. Am. J. Pharm. Educ. https://doi.org/10.1016/j.ajpe.2024.101348 (2025).
Mackler, E. et al. 2018 Hematology/oncology pharmacist association best practices for the management of oral oncolytic therapy: pharmacy practice standard. J. Oncol. Pract. 15, e346–e355 (2019).
Pichai, S., Hassabis, D. & Kavukcuoglu, K. Google Introduces Gemini 2.0: A New AI Model for the Agentic Era. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/#building-responsibly (2024).
Google AI Studio. Chat. https://aistudio.google.com/prompts/new_chat (2025).
OpenAI et al. GPT-4o system card. https://openai.com/index/gpt-4o-system-card/ (2024).
Anthropic. Claude 3.7 Sonnet and Claude Code. https://www.anthropic.com/news/claude-3-7-sonnet (2025).
DeepSeek-AI et al. DeepSeek-V2: a strong, economical, and efficient mixture-of-experts language model. https://arxiv.org/abs/2405.04434 (2024).
ChatGPT. https://chatgpt.com/ (accessed 15 March 2025).
Claude. https://claude.ai/new (accessed 15 March 2025).
DeepSeek. https://chat.deepseek.com/ (accessed 15 March 2025).
Wu, F., Dang, Y. & Li, M. A systematic review of responses, attitudes, and utilization behaviors on generative AI for teaching and learning in higher education. Behav. Sci. 15, 467 (2025).
Park, S. H. et al. Minimum reporting items for clear evaluation of accuracy reports of large language models in healthcare (MI-CLEAR-LLM). Korean J. Radiol. 25, 865–868 (2024).
Watts, P. I. et al. Healthcare Simulation Standards of Best Practice™ simulation design. Clin. Simul. Nurs. 58, 14–21 (2021).
Hughes, K. Simulation in nursing education. Am. J. Nurs. 118, 13 (2018).
Furze, J. et al. Clinical reasoning: development of a grading rubric for student assessment. J. Phys. Ther. Educ. 29, 34–45 (2015).
Leighton, K., Ravert, P., Mudra, V. & Macintosh, C. Updating the simulation effectiveness tool: item modifications and reevaluation of psychometric properties. Nurs. Educ. Perspect. 36, 317–323 (2015).
Polit, D. F. & Beck, C. T. The content validity index: are you sure you know what’s being reported? Critique and recommendations. Res. Nurs. Health 29, 489–497 (2006).
Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika 16, 297–334 (1951).
Welsh, D. et al. Development of the barriers to error disclosure assessment tool. J. Patient Saf. 17, 363–374 (2021).
Acker, R. C. et al. Belonging in surgery: a validated instrument and single institutional pilot. Ann. Surg. 280, 345–352 (2024).
Stiell, I. G. et al. Decision rules for the use of radiography in acute ankle injuries: refinement and prospective validation. J. Am. Med. Assoc. 269, 1127–1132 (1993).
Stiell, I. G. et al. A study to develop clinical decision rules for the use of radiography in acute ankle injuries. Ann. Emerg. Med. 21, 384–390 (1992).
Lee, J. H., Park, C. G., Kim, S. H. & Bae, J. Psychometric properties of a clinical reasoning assessment rubric for nursing education. BMC Nurs. 20, 177 (2021).
Jeffries, P. R. A framework for designing, implementing, and evaluating: simulations used as teaching strategies in nursing. Nurs. Educ. Perspect. 26, 96–103 (2005).
Unver, V. et al. The reliability and validity of three questionnaires: the student satisfaction and self-confidence in learning scale, simulation design scale, and educational practices questionnaire. Contemp. Nurse 53, 60–74 (2017).
Franklin, A. E., Burns, P. & Lee, C. S. Psychometric testing on the NLN student satisfaction and self-confidence in learning, simulation design scale, and educational practices questionnaire using a sample of pre-licensure novice nurses. Nurse Educ. Today 34, 1298–1304 (2014).
Clopper, C. J. & Pearson, E. S. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26, 404–413 (1934).
Norman, G. Likert scales, levels of measurement and the ‘laws’ of statistics. Adv. Health Sci. Educ. 15, 625–632 (2010).
Sullivan, G. M. & Artino, A. R. Analyzing and interpreting data from Likert-type scales. J. Grad. Med. Educ. 5, 541–542 (2013).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B Stat. Methodol. 57, 289–300 (1995).
Acknowledgements
The authors gratefully acknowledge the contribution of all PharmD students who participated in the learning project associated with this study. Their engagement and valuable input were instrumental in supporting the development and completion of this work. The authors received no funding for this work.
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Author information
Contributions
A.A. conceptualized the study. A.F. and A.A. designed the methodology. A.F. and A.A. conducted the data collection and formal analysis. Supervision and manuscript review and editing were provided by A.Z., A.A., and A.F. A.F. wrote the original draft and created the visualizations. All authors reviewed and edited the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Farrag, A.N., El-Zeiny, A. & Ali, A.M. Evaluating large language models for pharmacotherapy simulations: a mixed-methods study. npj Digit. Med. 9, 355 (2026). https://doi.org/10.1038/s41746-026-02626-1