Introduction

According to Kolb’s experiential learning cycle, students must undergo concrete experiences to effectively acquire skills1,2. In medical education, it is crucial for students to gain extensive exposure to real-world clinical scenarios3,4. However, the standard case-based learning approach often faces a significant shortage of materials featuring authentic clinical cases5,6. Furthermore, owing to a scarcity of clinical mentors and limited opportunities for hands-on practice, students frequently miss critical interactions with actual patients7,8. These limitations hinder the development of students’ medical history-taking abilities9.

To address this issue, standardized patients (SPs) were introduced into clinical teaching, allowing medical students to develop clinical skills independently of real cases; this approach has gained widespread global adoption10,11. However, individuals portraying SPs require specialized training, and their portrayals often fall short of authenticity12. This limitation reduces the efficiency of medical training, contributes to high teaching costs, and therefore does not fully resolve the issue of limited clinical educational resources13.

With the advancement of digital and multimedia technology, virtual patients (VPs) have emerged as a potential solution to these challenges14,15,16. While VPs effectively foster intrinsic motivation and baseline clinical reasoning competence, maintaining their utility requires continuous updates to reflect evolving clinical standards17,18,19,20; without such updates, students may encounter outdated or incomplete information, hindering their ability to grasp the latest clinical practices and knowledge, especially for rare cases. Moreover, guided by the Calgary-Cambridge communication framework12,21, which emphasizes structured communication skill development, VPs should ideally exhibit diverse personalities, linguistic habits, and communication styles, and teaching content should be adapted to regional differences, students’ knowledge levels, and variations in disease spectra to enhance learning experiences and improve knowledge retention. In practice, however, the fixed nature of responses presents a further limitation: virtual patients’ responses are typically preprogrammed and sometimes overly rigid, which prevents realistic doctor-patient interaction training.

Large language models (LLMs), such as ChatGPT (OpenAI, San Francisco, CA, USA), represent a recent significant advancement in healthcare22,23,24. LLM applications in medicine, such as answering medical examination questions25,26, providing diagnostic and treatment recommendations27,28, summarizing clinical research articles29, and generating patient education materials30, are being increasingly explored. Gray et al.31 demonstrated the use of LLMs to generate authentic patient-clinician dialogs, while Cook et al.32,33 and McKell et al.34 further supported their potential for enhancing medical education. The established effectiveness of LLMs in medical consultations lays the foundation for exploring their potential impact on medical education35,36,37. By processing extensive medical literature and electronic health records (EHRs), LLMs have demonstrated the ability to extract detailed clinical information from patient discharge summaries38. This capability is crucial for creating VPs that accurately simulate real patients for educational purposes39,40 and may address the shortage of authentic clinical scenarios prevalent in traditional case-based learning. While feasibility studies have explored the use of LLMs in medical education18,30, such as those by Potter & Jefferies41 and Yao et al.42, these studies primarily served as proofs of concept without rigorous randomized controlled trials (RCTs) or comprehensive validation. Using LLMs to build digital patients (DPs) that simulate real-world doctor-patient interaction is a promising area, but their effectiveness in medical education requires thorough evaluation43,44,45,46.

In this study, we developed an LLM-based digital patient (LLMDP) system that not only virtualizes case information, resembling a digital twin47, but also serves as a dynamic “digital patient” by simulating realistic, emotionally rich interactions akin to conversing with a real patient. Using our previously established retrieval-augmented LLM framework28 and a large-scale knowledge base, the system allowed medical students to practice ophthalmology history taking in Chinese conversation with DPs (Fig. 1a). The system’s training effectiveness and the agreement of its automated scoring with manual scoring were then preliminarily validated through controlled experiments (Fig. 1b, c). Finally, we conducted a randomized controlled trial to assess the system’s effect on students’ history-taking ability, measured by medical history-taking assessment scores (Fig. 1d).

Fig. 1: Overview of the study design.

The LLM-based digital patient (LLMDP) system development and validation process comprised four stages: a Clinical instructors used a large language model (LLM) to extract information from electronic health records and generate a digital patient knowledge base (KB). The KB was used to construct the LLMDP system. b A controlled experiment was conducted to preliminarily validate the effectiveness of the LLMDP system in medical history-taking training compared to traditional patient-based training methods. c The correlation of an automated scoring model, incorporated in the LLMDP system, with manual scoring was tested to show that it provides valid scores for trainee proficiency. d A randomized clinical trial was conducted to assess the effect of the LLMDP system on students’ medical history-taking abilities compared with that of traditional methods. EHR electronic health record, KB knowledge base, LLM large language model, LLMDP large language model-based digital patient, MHTA medical history-taking assessment.

Results

Ophthalmic data curation and LLM selection with model fine-tuning

We curated a comprehensive ophthalmic dataset by incorporating data from various sources, including electronic health records (EHRs), clinical guidelines, ophthalmology textbooks, multiple-choice questions, and patient-doctor consultation dialogs (Fig. 2a, b). The dataset was processed into an instruction-output format, with the accuracy of the data validated by ophthalmologists. The final dataset comprised 59,343 entries covering a wide range of ocular diseases (Fig. 2c). After evaluating multiple LLMs on 50 ophthalmic tasks spanning ophthalmic knowledge, information extraction, and instruction alignment, as described in Methods, Baichuan-13B-Chat was selected for the best performance among the open-source LLMs, achieving a win rate comparable to that of GPT-3.5 Turbo (Fig. 2d, e), while also addressing patient privacy considerations. Fine-tuning Baichuan-13B-Chat over 20 epochs further enhanced its performance, enabling its integration into the LLMDP system for effective simulation of digital patients in clinical consultations (Fig. 2f, g).

Fig. 2: Ophthalmic dataset construction, LLM selection, and model fine-tuning.

The ophthalmic dataset construction, LLM benchmarking, and model fine-tuning workflow includes: a The preprocessing workflow for the ophthalmic instruction training data. EHRs, clinical guidelines, ophthalmology textbooks, multiple-choice questions, and patient-doctor consultation dialogs were used to construct the ophthalmic dataset. After being processed into an instruction-output format, the data were sampled and checked for accuracy by ophthalmologists. b The source distribution of the ophthalmic instruction dataset. c The distribution of ocular diseases in the ophthalmic instruction dataset. d A workflow for comparing LLMs using the Chatbot Arena approach. The LLMs were given 50 instruction tasks in Chinese, covering ophthalmic knowledge, EHR information extraction, doctor’s question paraphrasing, patient’s answer paraphrasing, and consultation-related queries. Two ophthalmologists compared the outputs of the two anonymized LLMs. Finally, the win rates between the different models were calculated. e Human evaluation comparing the Baichuan-13B-Chat model with other models. The win, tie, and loss rates of Baichuan-13B-Chat versus the other models are shown. f The BLEU and ROUGE-L performance metrics on the test set across the fine-tuning epochs. The model trained for 20 epochs was selected for the study. g Human evaluation comparing the fine-tuned Baichuan-13B-Chat model with other models. LLM large language model, BLEU Bilingual Evaluation Understudy, ROUGE-L Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence, EHR electronic health record.

Knowledge base construction and LLMDP system development

We constructed knowledge bases (KBs) for simulating DPs by analyzing patient notes and integrating them with slit-lamp photographs, covering 24 key ocular diseases (Fig. 3a and Supplementary Note 1). The LLMDP system used these KBs to match student inquiries with accurate responses during history-taking simulations (Fig. 3b). When no direct match was found, the system generated answers using patient notes, translated into colloquial language for realistic interactions. Accessible through various devices, the system provided real-time feedback and personalized guidance (Fig. 3c, d). Additionally, an automatic scoring system was implemented based on objective structured clinical examination (OSCE) checklists to objectively evaluate students’ proficiency in medical history-taking (see Methods: “The development of the automatic scoring system”). This scoring system analyzed key points from the consultations, providing standardized and objective assessments.

Fig. 3: Development and demonstration of the LLMDP system.

The framework integrates knowledge base construction, clinical interaction simulation, multi-platform deployment, and automated feedback generation. a Using the LLM to construct a KB for DPs. After 24 eye diseases were selected from a clinical teaching syllabus, a clinical inquiry question set was compiled, and EHRs for relevant clinical cases were collected. An LLM was used to answer the questions based on patient notes and to create patient information sheets that were weighted according to clinical significance. The sheets were then combined with slit-lamp photographs to form a comprehensive KB for DPs. b An example of a history-taking process between the user and the LLMDP system. Medical students’ questions were transcribed and matched to the KB for answers. If a match was found, the LLMDP system provided the corresponding answer as output; otherwise, the LLMDP system created an answer from the patient notes. Responses were then converted into colloquial language by the LLM, transformed into speech, and provided to students through the HCI interface. c The LLMDP system can be accessed through a web browser on various devices (e.g., iPad, smartphone, PC). Users initiate the consultation by voice input, correct the inquiry text if necessary, and receive the LLMDP system’s response via voice output. d Concluding the consultation, reviewing the score and personalized guidance, and accessing detailed feedback. After completing the consultation, clicking “End” displays the exam results and personalized guidance; dragging the slider down reveals the full text of the feedback. LLM large language model, DP digital patient, KB knowledge base, TTS text-to-speech, ASR automatic speech recognition, HCI human-computer interface, LLMDP large language model-based digital patient, EHR electronic health record. The language used in the study was originally Chinese and was translated into English for presentation purposes.

Comparative validation experiment and correlation analysis

Two validation experiments were conducted at the end of LLMDP system development: one validated its effectiveness in medical history-taking training using SPs, and the other assessed the correlation between its automated scoring and manual scoring.

In validation experiment 1, participants were examined using SPs to compare LLMDP-based training with the traditional real patient-based training method. Preliminary results showed that the LLMDP group demonstrated improved medical history-taking skills compared to the control group (mean ± SD: 78.13 ± 8.35 vs. 67.08 ± 7.21, respectively), with an average score increase of 11.05 points (mean difference, 11.05; 95% CI, 6.61–15.48; p < 0.001; Fig. 4b).

Fig. 4: Comparison of the effectiveness between traditional teaching methods and the LLMDP system using standardized patients during the development phase.

A validation study demonstrates enhanced clinical competency through standardized assessments. a Study process: A group of 50 participants from the Zhongshan School of Medicine, Sun Yat-sen University, who had not received prior clinical training, were divided equally into two groups. The experimental group of 25 participants used the LLMDP for training, while the control group of 25 participants interacted with real patients. After completing their respective training sessions, both groups underwent a standardized evaluation using standardized patients and were scored by ophthalmologists. A total of 6 SPs were used for this examination, each uniformly trained to play the same patient role to ensure fairness. b Performance comparison evaluated using standardized patients and scored by ophthalmologists. The LLMDP group (78.13 ± 8.35, mean ± SD) demonstrated a performance improvement in comparison to the control group (67.08 ± 7.21, mean ± SD), with an average score increase of 11.05 points (mean difference, 11.05; 95% CI, 6.61–15.48; p < 0.001). LLMDP large language model-based digital patient, SPs standardized patients.

In validation experiment 2, participants were consecutively evaluated using both assessment methods: first by real patient encounters and then by the automatic scoring system. The results indicated a high positive correlation between the scores obtained from the LLMDP system and those from real patient encounters (Pearson’s correlation coefficient, r = 0.88; 95% CI: [0.79, 0.93]; p < 0.001; Fig. 5b).

Fig. 5: Convergent validity of LLMDP-based assessments against real patient evaluations.

Evaluation of the correlation between the assessment methods of real patient encounters and LLMDP-based scenarios. a Study process: Another set of 50 fourth-year medical students from the Zhongshan School of Medicine, Sun Yat-sen University, who were undergoing traditional clinical internships, were included. Participants were consecutively evaluated using both assessment methods (real patient encounters and LLMDP-based scenarios) for their medical history-taking assessment (MHTA). b Scatter plot comparing the evaluation scores measured using the two methods. Each data point represents a participant, with its position corresponding to the scores from the two methods. A high positive correlation between the two methods was observed, with Pearson’s correlation coefficient r = 0.88 (95% CI: [0.79, 0.93]). LLMDP large language model-based digital patient, MHTA medical history-taking assessment.

Clinical trial to evaluate LLMDP’s effectiveness in medical history taking

A total of 84 fourth-year medical students who had not received prior clinical training were enrolled and divided into a traditional real patient-based training group and an LLMDP-based training group. The baseline ophthalmology theoretical examination scores in the control group were similar to those in the LLMDP group (75.82 ± 8.44 vs. 77.05 ± 7.81), as were the levels of understanding of LLMs before the trial (Table 1).

Table 1 Baseline characteristics of participants in the two groups of the clinical trial

The effectiveness of these methods was assessed through the MHTA, which was designed to evaluate students’ proficiency in the medical history-taking process. The LLMDP group (64.62 ± 9.52) demonstrated a performance improvement in comparison to the control group (54.12 ± 8.80), with an average score increase of 10.50 points (mean difference, 10.50; 95% CI, 4.66–16.33; p < 0.001), as shown in Fig. 6b.

Fig. 6: A randomized controlled trial testing the efficacy of LLMDP training in clinical history taking education.

An RCT with 84 medical trainees with no prior clinical experience compared ophthalmology history-taking performance across training modalities. a The trial profile. All 84 assessed participants were included and randomized into the control and LLMDP groups with no withdrawals. They underwent initial training and the medical history-taking assessment (MHTA). b Participants’ clinical consultation scores on the MHTA were 54.12 ± 8.80 for the control group and 64.62 ± 9.52 for the LLMDP group, showing a mean difference of 10.50 (95% CI: 4.66–16.33, p < 0.001). Symbols: blue circle (control), orange circle (LLMDP). The error bars represent the SDs. The data are shown as the means ± SDs and were analyzed using independent t tests with two-sided p values (p < 0.05 was considered to indicate statistical significance). MHTA medical history-taking assessment, LLMDP large language model-based digital patient.

Additionally, in deconstructing the patient history-taking process, we noted that participants who received LLMDP training demonstrated significant improvements in all subitems of the history-taking process in the MHTA (Supplementary Table 1), including the collection of identification and demographic information (effect size: 0.78, 95% CI: 0.33–1.22), the history of present illness (0.56, 0.12–1.00), past medical history (1.04, 0.58–1.49), and other medical history (0.76, 0.31–1.20).

Students’ attitudes toward using LLMs in medical education

All participants completed a standardized quantitative questionnaire based on the widely used five-point Likert scale (Methods) about their attitudes toward the application of LLMs in medical education after using the LLMDP system during the trial.

The quantitative questionnaire results are summarized in Table 2. After using the LLMDP, participants were highly satisfied with the history-taking instruction it provided. It helped them overcome psychological barriers to learning how to ask questions (65.48% agreement vs. 17.85% disagreement) and saved both time (57.15% agreement vs. 22.62% disagreement) and the financial costs of instruction (66.67% agreement vs. 14.28% disagreement). Overall, participants felt that the LLMDP achieved better teaching and learning outcomes (48.81% agreement vs. 21.42% disagreement) and expressed a willingness to recommend the LLMDP for learning medical history taking to other students (44.05% agreement vs. 25.00% disagreement). However, participants also believed that the LLMDP could not fully imitate the interaction process with patients owing to the absence of a physical body (37.29% agreement vs. 20.34% disagreement) and that the guidance provided by the LLMDP for medical image analysis was insufficient (40.68% agreement vs. 16.95% disagreement).

Table 2 Participants’ attitudes to using large language models in medical history-taking courses after the trial

At the end of the trial, the participants had a group discussion about the merits and drawbacks of employing LLMs in medical education. Three aspects of using LLMs in medical education, namely, accessibility, interaction, and effectiveness, were discussed. Their representative opinions are summarized in Supplementary Table 2 and educators’ attitudes toward using LLMs in medical education are shown in Supplementary Table 3.

After the MHTA, the expert panel assessed all students’ empathy performance based on the backend recordings from the MHTA process. The results showed that students in the LLMDP group exhibited better empathy (Supplementary Fig. 1b).

Discussion

In this study, we developed an LLM-based digital patient system for medical education and validated its efficacy in controlled experiments and a prospective randomized controlled trial. Our study has three main implications: it (1) demonstrates the feasibility of developing an LLMDP system for medical education, (2) provides evidence of its effectiveness in improving history-taking skills, and (3) explores students’ experiences with the system.

The potential of the LLMDP system extends beyond merely replicating traditional medical education methods; it offers possibilities for transcending the limitations of the traditional training paradigm. Building on our prior LLM augmentation framework28, we incorporated clinical guidelines, EHRs, and doctor-patient conversations. This enabled the simulation of dynamic digital patients that significantly surpass earlier digital models reliant on static, scripted interactions, which lacked the capability to interpret nuanced language or complex instructions effectively22. Our system’s advanced semantic understanding facilitates dynamic, contextually relevant dialogs across a wide range of clinical scenarios, offering scalability and realism unattainable by earlier virtual patient models15,19. Importantly, the LLMDP uses real, anonymized EHR data to generate clinically authentic patient scenarios, a key advantage supported in the literature48. We took great care to ensure that the generated cases reflect actual patient conditions, incorporating diverse symptoms, medical histories, and responses; all patient data and scenario content underwent rigorous review and validation by expert clinicians before being integrated into the LLMDP system, ensuring their clinical relevance and educational value.

Recent advances have begun to tackle the rigidity of traditional history-taking training. Conventional methods, whether involving real standardized patients or static computer-based cases, are resource-intensive and offer limited scenario variability49. Correspondingly, prior work has highlighted that fixed virtual cases often lack interactivity and adaptive feedback15,19. To address these gaps, several studies have applied LLMs to simulation. For example, Potter and Jefferies showed that a GPT-driven agent could serve as a generative “virtual patient” for communication training41; Yao et al. developed a related LLM-powered patient dialog system42; Borg et al. combined an LLM with a social robot to create more authentic case encounters50; and Holderried et al. found that a ChatGPT chatbot acting as a simulated patient produced highly plausible history responses51. Our LLMDP differs from these earlier systems in several key ways. First, every case is instantiated from real EHR data rather than a scripted scenario. Second, the digital patient can exhibit diverse personalities and emotional styles, an advance specifically noted as desirable in prior work50. Third, the LLMDP continuously tracks the student’s history-taking progress throughout the encounter. Fourth, the system automatically generates feedback on the completeness and correctness of the history, building on findings that LLMs can score student interviews with high agreement with expert raters49. By grounding simulation in authentic data and delivering real-time tracking and feedback, the LLMDP directly addresses the limitations of earlier approaches, in particular the static, one-shot nature of prior virtual cases, and thus extends the existing literature on AI-based patient simulators.

The state-of-the-art semantic understanding capabilities of LLMs52 enable the system to effectively interpret student inputs while role-playing patients with diverse personalities and communication styles. This allows the system to interact with students in a manner that closely mimics real patient encounters, helping trainees overcome psychological barriers often associated with real-life medical history taking, such as anxiety or hesitation, and thereby creating a more conducive environment for learning. Unlike conventional methods, which often rely on standardized cases and static teaching materials15,19, the LLMDP can dynamically generate patient scenarios tailored to students’ specific learning needs. By aligning with the Calgary-Cambridge communication standards, which emphasize adapting communication strategies to meet the unique needs of each patient, the LLMDP system offers a more dynamic and personalized learning experience. It draws on a vast database of real cases, including rare and atypical presentations often underrepresented in conventional curricula, preparing trainees to recognize and manage a broad spectrum of conditions they might encounter in clinical practice. Additionally, the system can simulate patients with different personality traits, enhancing the realism and complexity of doctor-patient interactions. In line with Kolb’s experiential learning cycle, the automatic scoring system embedded within the LLMDP facilitates a feedback loop that is integral to the learning process. It not only evaluates student performance but also provides context-specific suggestions that highlight areas for improvement. This adaptability allows for targeted reinforcement of weak areas, ensuring that students receive a more personalized and effective learning experience. As they navigate complex patient histories and evolving clinical situations, they develop critical thinking and decision-making skills in real time.

The empathy findings are also noteworthy. Students who trained with the LLMDP demonstrated significantly higher empathic communication in subsequent assessments, consistent with our system’s design as an emotionally responsive “digital patient” rather than a static case repository. This finding parallels prior work showing that AI-driven virtual patient simulations can boost learner empathy: for example, trainees interacting with virtual patients in low-pressure, interactive settings produce more empathic responses53, and virtual patient encounters that include empathic feedback have yielded higher empathy scores in later standardized-patient interviews54. By offering immersive, lifelike patient dialogs, the LLMDP likely reduced student anxiety and built confidence, enabling more natural patient-centered communication. Overall, our results underscore that emphasizing realistic emotional dynamics in communication training can meaningfully enhance empathic skills, in line with the current evidence.

We selected history-taking as the primary clinical skill to assess because it is a foundational component of the medical consultation process, serving as the gateway to other clinical competencies. While medical training encompasses a wide range of skills, effective history-taking is crucial for accurate diagnosis and management. Additionally, current-generation AI, including LLMs, faces significant limitations in generating highly authentic disease images and facilitating complex physical examinations. Given these constraints, focusing on history-taking represents the most practical application of available technology at this stage. Future work will extend the LLMDP system to incorporate additional clinical skills, such as physical examination and diagnosis, as AI capabilities in these areas evolve.

Our study was conducted in the context of ophthalmology clinical teaching, but the approach has the potential to be applied to medical history-taking training across disciplines. While a significant portion of the DP cases in this study involved eye diseases, the clinical diagnostic process we employed adheres to a holistic approach, encompassing disease characteristics, treatment history, past medical history, and family history. This diagnostic thinking process is consistent across medical specialties, focusing on primary symptoms, their onset, potential triggers, duration, and associated symptoms. Moreover, many ophthalmic conditions, such as diabetic retinopathy, thyroid-associated ophthalmopathy, and steroid-induced cataracts, are related to systemic diseases55, enhancing students’ understanding of the interconnectedness of different body systems. However, the current capabilities of LLMs, including advanced models such as GPT-4V, remain limited in understanding and processing medical images56,57, making it difficult for current LLMs to mimic a human physician who is simultaneously skilled at verbal communication and at guiding medical students in disease image recognition and diagnosis, as pointed out by the participants. This limitation reflects a broader challenge at the cutting edge of industry innovation, where foundational technological advancements are required. The inability of LLMs to simultaneously excel in language interaction and medical image recognition impedes their real-world deployment, highlighting an urgent need for breakthroughs in foundational AI technology.

Feedback from both students and educators underscored the transformative impact and potential limitations of the LLMDP system in clinical education. Students particularly highlighted the convenience and cost-effectiveness of the LLMDP system, noting that it provides an economical means of practicing logical thinking in various clinical scenarios. Moreover, the system significantly reduces the reliance on human actors for SP roles, thereby cutting both manpower needs and associated costs. The LLMDP system, accessible via mobile devices such as smartphones, tablets, and laptops, facilitates a personalized and flexible educational experience, enabling students to practice independently. This flexibility allows students to engage in natural oral interactions anytime and anywhere, which is especially beneficial for beginners and those lacking confidence, as it provides them with ample opportunities to refine their skills and build confidence in real-world settings. At the same time, students reported that the LLMDP system alleviated their anxiety when dealing with patients and enhanced their empathy. This feature is particularly valuable because real patients are often affected by physical and psychological conditions that make it challenging to participate in educational activities. Moreover, SPs may vary in their performance, especially under heavy teaching workloads. In contrast, the LLMDP system is designed to provide more consistent teaching assistance, contributing to a more standardized learning experience.

We acknowledge several limitations. First, our system employs a local language model to preserve data privacy, which comes at some cost to performance; the local model may not match the performance of commercial models58,59. To counterbalance this, engineering techniques such as fine-tuning, disease knowledge bases, and prompt engineering, as applied in this study, are essential to enhance the local model’s capabilities in patient simulation38. Second, the system focuses on text-based history taking only, without simulating a physical pathophysiological body that would allow for physical examination, interpretation of imaging or laboratory results, or coverage of broader diagnostic and management processes. A promising future direction is to integrate basic pathophysiological models into the LLMDP, enabling realistic simulation of underlying disease processes60. Third, our study was conducted at a single institution and focused on a specific history-taking skillset. Further research is needed to explore the system’s utility in a wider variety of contexts and with more extensive interventions, as well as to establish its long-term impact across different institutions and broader, more complex clinical skill sets.

In conclusion, we developed the LLMDP system and validated its effectiveness in enhancing medical history taking skills through a series of experiments. Using the LLMDP system significantly improved students’ abilities in this area and achieved high satisfaction among both students and teachers. By integrating advanced generative AI, the LLMDP system has the potential to significantly advance the current approach to medical education, providing a valuable blueprint for future development and expansion by other educators and researchers in the field.

Methods

Ophthalmic dataset construction, LLM selection, and model fine-tuning

First, we curated a comprehensive ophthalmic dataset by incorporating data from various sources, including EHRs, clinical guidelines, ophthalmology textbooks, multiple-choice questions, and patient-doctor consultation dialogs (Fig. 2a, b). The dataset was processed into an instruction-output format, with the accuracy of the data being validated by ophthalmologists. The final dataset comprised 59,343 entries, covering a wide range of ocular diseases (Fig. 2c).

Using a previously established method28, we designed multiturn prompt templates and constructed a dataset in an instruction-output format to process the online medical consultation dialog data, using the historical dialog and patient notes as context and the patient’s actual response as the output target. For non-dialog text data such as EHRs, clinical guidelines, and medical textbooks, we utilized text headings as weak supervision signals and designed rule-based templates to automatically map the text into the dataset using LLMs. The detailed processes are provided in Supplementary Note 2.
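For illustration, a minimal sketch of how a multiturn consultation dialog might be flattened into such instruction-output pairs is shown below. This is not the authors’ released code; the field names, the prompt wording, and the `dialog_to_instruction_pairs` helper are illustrative assumptions consistent with the description above.

```python
# Minimal sketch (not the authors' code) of converting a multiturn
# patient-doctor consultation dialog into instruction-output pairs:
# the patient note plus the dialog history form the instruction context,
# and the patient's actual reply is the output target.

def dialog_to_instruction_pairs(patient_note: str, turns: list[dict]) -> list[dict]:
    """turns: chronological list of {"role": "doctor"|"patient", "text": str}."""
    pairs = []
    history = []
    for turn in turns:
        if turn["role"] == "patient" and history:
            context = "\n".join(f"{t['role']}: {t['text']}" for t in history)
            pairs.append({
                "instruction": (
                    "You are a patient. Answer the doctor's question "
                    "based on your medical record and the dialog so far.\n"
                    f"Medical record: {patient_note}\n"
                    f"Dialog history:\n{context}"
                ),
                "output": turn["text"],  # the patient's actual response
            })
        history.append(turn)
    return pairs
```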

To accurately reflect the real-world performance and nuanced differences of these models, the performance of several LLMs was evaluated by human doctors using previously established evaluation standards28. Briefly, we adopted an approach similar to the model comparison method used in the Chatbot Arena project, in which humans evaluate the outputs of LLMs61. We prepared 50 Chinese instruction tasks to evaluate the accuracy of the LLMs in terms of ophthalmic knowledge, information extraction capabilities, and Chinese instruction alignment abilities (Fig. 2d). Considering the models’ capabilities and usability, we included Baichuan-13B-Chat, LLaMA2-Chinese-Alpaca2 (13B), LLaMA2-70B-Chat, ChatGLM2-6B, and GPT-3.5 Turbo (ChatGPT); detailed model information is provided in Supplementary Note 3. Among the five models evaluated, Baichuan-13B-Chat demonstrated the best overall performance in understanding and responding to Chinese instructions apart from GPT-3.5 Turbo and, being deployable locally, was therefore chosen for our study (Fig. 2e).
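A simple tally of pairwise human judgments into the reported win/tie/loss rates might look as follows; the judgment record format and model names are hypothetical, sketched in the spirit of the Chatbot Arena comparison described above.

```python
# Illustrative tally of pairwise human judgments into win/tie/loss rates.
from collections import Counter

def win_tie_loss_rates(judgments: list[tuple[str, str, str]], model: str) -> dict:
    """judgments: (model_a, model_b, verdict), verdict in {model_a, model_b, "tie"}."""
    counts = Counter()
    for model_a, model_b, verdict in judgments:
        if model not in (model_a, model_b):
            continue  # this comparison did not involve the model of interest
        if verdict == "tie":
            counts["tie"] += 1
        elif verdict == model:
            counts["win"] += 1
        else:
            counts["loss"] += 1
    total = sum(counts.values()) or 1
    return {k: counts[k] / total for k in ("win", "tie", "loss")}
```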

To enhance the model’s ability to perform effectively in our specific context, we fine-tuned it using 8 Nvidia Tesla A800 GPUs (each with 80 GB of memory). Fine-tuning adapts a pre-trained model to a specialized task by updating its parameters based on new data. To make this process more efficient, we used LoRA (Low-Rank Adaptation)62, a technique that reduces the number of trainable parameters by introducing low-rank matrices, thereby lowering computational requirements. This was implemented within DeepSpeed63, an advanced deep learning framework designed to optimize large-scale model training. The key training settings were as follows: batch size, 4 (number of samples processed at a time); learning rate, 5e-5 (determines how much the model updates during training); weight decay, 0 (no additional penalty for large weight values); scheduler, cosine (gradually adjusts the learning rate over time); and training epochs, 25 (one epoch means the model has seen the entire dataset once). The training process lasted approximately 19 h. The best results were observed at epoch 20, where the model demonstrated superior performance on automated evaluation metrics (BLEU and ROUGE-L, Fig. 2f) and was preferred in human assessments (Fig. 2g). Based on these findings, we selected this version for further studies.
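A condensed sketch of such a LoRA fine-tuning run, using Hugging Face transformers and peft with a DeepSpeed configuration file, is given below. The stated hyperparameters (batch size 4, learning rate 5e-5, weight decay 0, cosine schedule, 25 epochs) are taken from the text; the LoRA rank and alpha, the target module name, the `ds_config.json` file, and the `train_dataset` object are assumptions, not the authors’ exact training script.

```python
# Sketch of LoRA fine-tuning with the stated hyperparameters (assumed
# details noted inline); not the authors' exact training script.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "baichuan-inc/Baichuan-13B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Wrap the base model with low-rank adapters so that only a small number
# of parameters are trained. Rank/alpha and the target module ("W_pack",
# Baichuan's fused attention projection) are illustrative assumptions.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["W_pack"],
                                         task_type="CAUSAL_LM"))

args = TrainingArguments(
    output_dir="llmdp-lora",
    per_device_train_batch_size=4,   # batch size 4
    learning_rate=5e-5,              # learning rate 5e-5
    weight_decay=0.0,                # weight decay 0
    lr_scheduler_type="cosine",      # cosine schedule
    num_train_epochs=25,             # 25 epochs
    save_strategy="epoch",           # per-epoch checkpoints (epoch 20 was chosen)
    deepspeed="ds_config.json",      # assumed DeepSpeed ZeRO config file
)
# train_dataset: instruction-output pairs prepared as described above (assumed).
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```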

The development of the LLMDP system

In developing the LLMDP system, we utilized the fine-tuned LLM to analyze patient notes and generate a knowledge base (KB) for simulating digital patients (Fig. 3). First, in accordance with the syllabus requirements, 24 ocular diseases, all of which are prioritized for student mastery, were included (Supplementary Note 1). Next, a clinical inquiry question set for these 24 ocular diseases was compiled by enumerating all questions concerning symptoms, signs, and pertinent notes based on ophthalmology textbooks, standardized patient notes, and the outlines and scoring criteria of previous standardized history-taking exams. EHRs, including patient notes and slit-lamp photographs as the ocular images, were gathered for patients with the 24 diseases. Using the fine-tuned LLM, all questions in the clinical inquiry question set corresponding to each disease were answered based on the patient notes, and a patient information sheet was generated using a standard prompt template (Supplementary Fig. 2a). Importance weights were assigned to each question in the patient information sheet based on its diagnostic significance after the expert panel discussion64 (Supplementary Tables 6 and 7), with specific scores calculated according to a predefined rubric: 5 points for patient identification and demographics, 50 points for history of present illness, 25 points for past medical history, and 20 points for other medical history. The sheet was then combined with the slit-lamp photographs to form a KB for the DPs.
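A hypothetical KB entry illustrating this structure is sketched below. The section weights follow the stated rubric (5/50/25/20); the field names, the example disease, and the per-question points are illustrative, not taken from the study data.

```python
# Hypothetical structure of one digital-patient knowledge-base entry,
# reflecting the weighted patient information sheet described above.
kb_entry = {
    "disease": "acute angle-closure glaucoma",
    "images": ["slitlamp_case012.jpg"],  # paired slit-lamp photograph
    "sections": {                         # rubric weights from the text
        "identification_demographics": {"weight": 5},
        "history_of_present_illness": {"weight": 50},
        "past_medical_history": {"weight": 25},
        "other_medical_history": {"weight": 20},
    },
    "qa_items": [
        {   # one entry from the clinical inquiry question set
            "section": "history_of_present_illness",
            "question": "When did the eye pain start?",
            "answer": "It started suddenly last night, with blurred vision.",
            "points": 4,  # per-question score assigned by the expert panel
        },
        # ... remaining questions for this case
    ],
}
```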

When participants interacted with the LLMDP system, their inquiry audio was converted into text through an automatic speech recognition (ASR) algorithm65. Then, the system matched the inquiry questions with the KB of the DPs using Sentence Bidirectional Encoder Representations from Transformers (SBERT) embeddings (distiluse-base-multilingual-cased-v1)66 as a semantic textual similarity model. If a match was found, then the corresponding answer was output. If no match was found, then the LLMDP generated an answer based on the entire patient note using a prompt template (Supplementary Fig. 2b). Notably, the LLMDP system is designed to understand the meaning and intent behind students’ questions rather than requiring fixed wording patterns. Next, the answer generated in the previous phase was refined into more colloquial language using the LLM, simulating the patient’s tone to enhance realism; at the same time, the LLM could randomly assign personality traits to the patient, further enriching the authenticity of the dialog scenario (Supplementary Fig. 2c). Finally, the response was converted into speech through the text-to-speech (TTS) algorithm67 and provided as feedback to the students via the human-computer interaction (HCI) interface (Fig. 3b).
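A minimal sketch of this matching-and-fallback step, assuming the KB structure from the previous sketch, is shown below. The 0.7 similarity threshold is an assumption, and `generate_from_note` and `rewrite_colloquially` are stand-ins for the fine-tuned LLM calls described above (the ASR and TTS stages are omitted).

```python
# Sketch of the semantic matching step: embed the student's (transcribed)
# question with the SBERT model named in the text and retrieve the closest
# knowledge-base question; fall back to LLM generation on a miss.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("distiluse-base-multilingual-cased-v1")
kb_questions = [item["question"] for item in kb_entry["qa_items"]]
kb_embeddings = sbert.encode(kb_questions, convert_to_tensor=True)

def answer_inquiry(student_question: str, patient_note: str,
                   threshold: float = 0.7) -> str:
    query = sbert.encode(student_question, convert_to_tensor=True)
    scores = util.cos_sim(query, kb_embeddings)[0]
    best = int(scores.argmax())
    if float(scores[best]) >= threshold:
        answer = kb_entry["qa_items"][best]["answer"]  # direct KB match
    else:
        # No match: the LLM generates an answer from the full patient note
        # (generate_from_note is a stand-in for the fine-tuned model call).
        answer = generate_from_note(student_question, patient_note)
    # rewrite_colloquially: stand-in for the LLM rewrite into patient-like speech.
    return rewrite_colloquially(answer)
```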

The development of the automatic scoring system

To establish a scoring system for evaluating participants’ proficiency in medical history taking, we first identified key points in the clinical consultation for each patient based on the objective structured clinical examination (OSCE) checklists for ophthalmic history taking68,69, which are widely used in the National Medical Licensing Examination70, along with ophthalmology textbooks and standardized medical records (Supplementary Fig. 3). Then, we preliminarily formulated medical inquiry questions about each key point and recruited experts in the relevant subspecialties for group discussions to determine the final medical inquiry questions and evaluate their clinical importance for each case. The expert panel consisted of 9 professors, including 3 cataract specialists, 2 glaucoma specialists, 1 retina specialist, 1 cornea specialist, 1 orbit specialist, and 1 trauma specialist. The specific discussion process64 is outlined in Supplementary Fig. 3: first, the importance of each category of medical history was determined (Supplementary Table 6), followed by the assignment of scores to each individual question, with an illustrative example shown in Supplementary Table 7.

Upon the completion of history taking, the dialog between the participant and the DP was used to calculate the medical history-taking assessment (MHTA) score. During the history-taking practice phase, the system displayed real-time scores after each session, including points earned and points lost, to facilitate improvement; in the examination phase, this feedback feature was turned off.
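The scoring idea can be sketched as follows: credit each checklist question whose meaning was covered by some student inquiry during the dialog, then sum the expert-assigned points. This reuses `sbert`, `util`, and `kb_entry` from the previous sketches; the similarity threshold is again an assumption, and the exact scoring logic of the deployed system may differ.

```python
# Sketch of the automated MHTA scoring idea (assumed details noted above).
def score_consultation(student_questions: list[str], kb_entry: dict,
                       threshold: float = 0.7) -> dict:
    asked = sbert.encode(student_questions, convert_to_tensor=True)
    earned, missed = 0, []
    for item in kb_entry["qa_items"]:
        target = sbert.encode(item["question"], convert_to_tensor=True)
        if float(util.cos_sim(target, asked).max()) >= threshold:
            earned += item["points"]          # key point was covered
        else:
            missed.append(item["question"])   # missed key point -> feedback
    return {"score": earned, "missed_questions": missed}
```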

User instructions for the LLMDP system

The LLMDP system interface includes a voice chat window where users verbally interact with the system, an avatar with an image of the real patient’s eye representing the DP’s visual presence, and a feedback window showing user progress (Fig. 3c, d). After entering their student ID and choosing a topic, users are directed to the conversational interface designed for practicing medical history taking. This interface comprises three main components. First, a dynamic virtual character is presented on the main screen, enhancing the realism of the interactions and simulating real-life patient conversations. Second, the response section appears beneath the virtual character. Here, students can engage in continuous dialog by either using the “speak” button for voice commands or typing their responses directly. Upon submitting their responses, the system provides immediate feedback through both text and voice outputs. Finally, the sidebar features question indicators that track the student’s progress. As students address each question in a section, the corresponding checkboxes are marked, offering instant feedback on their progress. During the practice period, the system displays in real time which medical history sections the participant has completed; in the examination period, this feedback feature was turned off. During the training phase, our digital patient simulation deliberately presents scenarios reflecting realistic emotional responses (e.g., a patient experiencing an acute angle-closure glaucoma attack feeling anxious). The system assesses students’ responses, scores their empathetic communication skills, and provides targeted feedback (Supplementary Note 4). Users can access the LLMDP system via smartphones, tablets, or computer browsers. The user guide for the LLMDP system can be found in Supplementary Note 5.

System validation

At the end of development of the LLMDP system, we conducted two initial experiments to preliminarily validate its effectiveness in medical history-taking training and assess the correlation of its automated scoring function with manual scoring.

Validation experiment 1: Effectiveness in medical history taking training

This experiment involved 50 fourth-year medical students from the Zhongshan School of Medicine, Sun Yat-sen University. The goal was to explore the potential of the LLMDP system to enhance medical history-taking skills compared to traditional methods. Participants were divided into a traditional real patient-based training group and an LLMDP-based training group (Fig. 4a). Traditional real patient-based training refers to the conventional teaching method in which students, in groups of five, practice medical history taking under the guidance of a physician instructor while interacting with actual patients in a clinical environment. Under traditional training, participants were trained with three real patients over 1 h, with 10 min of training and 10 min of explanation for each case. In contrast, LLMDP-based training refers to each student independently utilizing the LLMDP system for individual practice; three cases were provided, and the training duration was likewise 1 h. During the examination, all participants conducted medical history-taking assessments with SPs. Their performance was evaluated by attending doctors following the OSCE criteria and compared between the two groups with an independent t test.

Validation experiment 2: Correlation of automated scoring with manual scoring

This experiment involved a different cohort of 50 fourth-year students who were undergoing traditional clinical internships at the same institution. The goal was to assess the correlation between the LLMDP system’s automated scoring and manual scoring by attending doctors. Participants were consecutively evaluated using both assessment methods: first by real patient assessment (manual scoring), followed immediately by an assessment using the LLMDP system (automated scoring), without any feedback between the two assessments (Fig. 5a). To minimize recall bias and ensure a fair comparison, different cases were used for the real patient and LLMDP assessments. In the real patient assessment, students conducted medical history taking with actual patients in a clinical environment, and their performance was evaluated by attending doctors following the OSCE criteria. Pearson’s correlation coefficient and a scatter plot were used to determine the correlation between the scores obtained from the LLMDP system and those from real patient encounters.
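The correlation analysis can be reproduced in miniature with scipy (version 1.9 or later for the confidence-interval method); the score arrays below are placeholders, not study data.

```python
# Pearson's r with a 95% CI for paired automated vs. manual scores.
from scipy import stats

manual_scores = [72, 65, 80, 58, 77]      # placeholder attending-doctor (OSCE) scores
automated_scores = [70, 68, 83, 55, 79]   # placeholder LLMDP system scores

res = stats.pearsonr(manual_scores, automated_scores)
print(f"r = {res.statistic:.2f}, p = {res.pvalue:.3g}")
print("95% CI:", res.confidence_interval(confidence_level=0.95))
```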

Design and procedures of the randomized controlled trial

The predefined protocol of this trial was approved by the Institutional Review Board/Ethics Committee of Zhongshan Ophthalmic Center (identifier: 2023KYPJ283) and was prospectively registered at ClinicalTrials.gov (registration number: NCT06229379). The inclusion criterion for this study was being a fourth-year student majoring in clinical medicine at the Zhongshan School of Medicine, Sun Yat-sen University. Having recently completed their theoretical courses and exams in ophthalmology, these students were about to embark on the clinical rotation stage; prior to this phase, they had not received any clinical training. After online and offline enrollment, a total of 84 students ultimately participated in this study and signed informed consent forms.

In the trial, conducted from Nov. 14 to Dec. 7, 2023, 84 participants were divided into two groups: 42 individuals were assigned to the control group, which underwent traditional real patient-based training, and 42 to the LLMDP group, which underwent LLMDP-based training (Fig. 6a). Traditional real patient-based training refers to the conventional teaching method in which students, in groups of five, practice medical history taking under the guidance of a physician instructor while interacting with actual patients in a clinical environment. Following standard clinical teaching practices, participants in the traditional training group engaged with three real patients over the course of 1 h, with each case allocated 10 min for history-taking training and 10 min for instructor feedback. The LLMDP system mirrored this structure, providing three cases within the same training duration as the control group. Additionally, before taking exams with the LLMDP system, each student received a 10-min orientation to ensure familiarity with the LLMDP exam system.

Outcome measures

The primary outcome was the participants’ MHTA scores. The scores for MHTA were automatically calculated by the system based on the dialogs, according to predetermined scoring criteria (Supplementary Table 5). Additionally, we analyzed the impact of LLMDP training on specific history taking components by deconstructing MHTA into key subitems, including identification and demographic information, history of present illness, past medical history, and other medical history.

The secondary outcomes were participants’ attitudes toward the application of LLMs in medical education and the differences in empathy toward patients between the two groups. All participants completed a standardized quantitative questionnaire based on the widely used five-point Likert scale about their attitudes toward the application of LLMs in medical education after the trial (Supplementary Note 6). Additionally, at the end of the trial, all 84 students and their 8 supervising teachers were asked to independently discuss questions about the use of the LLMDP in medical education. The results of the students’ discussion were recorded, transcribed, and coded by three different authors (MJL, SWB, LXL). A panel of 9 experts participated in the discussion to provide further insights. Following discussions in regular meetings, a category system consisting of main and subcategories (according to Mayring’s qualitative content analysis) was agreed upon71. Text passages shown in Supplementary Table 2 were used as quotations to illustrate each category. Inductive category formation was performed to reduce the content of the material to its essentials (bottom-up process). In the first step, the goal of the analysis and the theoretical background were determined. From this, questions about the LLMDP in medical education were developed and presented to the students for discussion. Two main topics were identified, namely positive and negative attitudes toward the LLMDP in medical education. In the second step, we worked through the individual statements of the students systematically and derived three categories: accessibility, interaction, and effectiveness. To avoid disrupting the MHTA process, we recorded the entire session in the system’s backend; after the assessment, an expert panel used the recordings to assess students’ empathy performance according to the scoring criteria shown in Supplementary Fig. 1a.

Randomization and blinding

The participants were randomly assigned in a 1:1 ratio to either the traditional teaching group or the LLMDP teaching group. Stratified block randomization was utilized, with theoretical ophthalmology exam scores serving as the stratification factor. The randomization process was carried out by a central randomization system.
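For illustration, a stratified block randomization of this kind might look as follows in Python. The stratum cut-point (75) and block size (4) are illustrative assumptions; the paper specifies only the stratification factor, the 1:1 ratio, and the use of a central randomization system.

```python
# Minimal sketch of stratified block randomization: within each stratum of
# baseline theory-exam scores, permuted blocks keep the two arms balanced.
import random

def stratified_block_randomize(participants: list[dict], block_size: int = 4) -> dict:
    """participants: [{"id": str, "theory_score": float}, ...] -> {id: arm}."""
    strata: dict[str, list] = {}
    for p in participants:  # stratify by exam score (cut-point assumed)
        stratum = "high" if p["theory_score"] >= 75 else "low"
        strata.setdefault(stratum, []).append(p)
    allocation = {}
    for members in strata.values():
        for i in range(0, len(members), block_size):
            block = ["control", "LLMDP"] * (block_size // 2)
            random.shuffle(block)  # permuted block preserves the 1:1 ratio
            for p, arm in zip(members[i:i + block_size], block):
                allocation[p["id"]] = arm
    return allocation
```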

The collection, entry, and monitoring of primary and secondary outcome data were performed by staff who were blinded to group assignment. Participants were not blinded due to the nature of the interventions. Statistical analyses were performed by an independent statistician blinded to group allocation.

Statistical analysis

The trial sample size calculation was based on the primary outcome (MHTA). In the pilot study, the control group (n = 10) had a mean score of 54.59 ± 7.33, while the experimental group (n = 10) had a mean score of 62.05 ± 9.13. Based on these findings, we assumed a seven-point improvement and an SD of 9.5, corresponding to an estimated effect size of 0.74, which yielded a sample size of 40 participants per group, 80 in total. The pilot data were not included in the final analysis. Ultimately, 84 participants were enrolled in the formal trial to account for potential exclusions. The sample size was calculated using PASS 16 software (NCSS, LLC, USA).
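The arithmetic can be checked with statsmodels: d = 7/9.5 ≈ 0.74. The paper does not state the target power; assuming 90% power at two-sided alpha = 0.05 approximately reproduces the reported 40 per group (PASS and this approximation may differ slightly).

```python
# Power-analysis check of the reported sample size (assumed 90% power).
from statsmodels.stats.power import TTestIndPower

d = 7 / 9.5  # assumed mean difference / assumed SD
n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.9,
                                alternative="two-sided")
print(f"d = {d:.2f}, n per group = {n:.1f}")  # roughly 39-40 per group
```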

The intention-to-treat population was identical to the per-protocol population in this trial, as no students discontinued or withdrew after recruitment. Descriptive statistics, including means and SDs for continuous variables and frequencies and percentages for categorical variables, were employed to present the data. The baseline characteristics of the participants in the two groups were compared using t tests for continuous variables and χ2 tests for categorical variables. Continuous data were examined for normality using the Shapiro–Wilk test. Independent t tests were conducted to assess differences in MHTA scores between the study groups. To address multiplicity, all p values and confidence intervals for the subitem scores of the MHTA were corrected using the Benjamini–Hochberg procedure to control the false discovery rate (FDR) at 0.05 (ref. 72).
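A sketch of the two main analyses, the independent t test for the overall MHTA scores and the Benjamini–Hochberg correction across the subitem p values, follows; all numbers below are placeholders, not study data.

```python
# Independent t test plus Benjamini-Hochberg FDR correction (placeholders).
from scipy import stats
from statsmodels.stats.multitest import multipletests

control = [52, 55, 49, 60, 54]  # placeholder MHTA scores, control group
llmdp = [63, 66, 61, 70, 64]    # placeholder MHTA scores, LLMDP group

t, p = stats.ttest_ind(llmdp, control)  # two-sided by default
print(f"t = {t:.2f}, p = {p:.4f}")

# Correct the four MHTA subitem p values at FDR 0.05 (values illustrative).
subitem_pvals = [0.001, 0.02, 0.0005, 0.004]
reject, p_adj, _, _ = multipletests(subitem_pvals, alpha=0.05, method="fdr_bh")
print(p_adj, reject)
```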

All the statistical analyses were conducted using SPSS statistical software for Windows (version 24.0, IBM Corporation, Armonk, New York, USA), GraphPad Prism (version 9, GraphPad Software, San Diego, USA), and R (version 4.3.1, R Foundation for Statistical Computing, Vienna, Austria).