Introduction

Breast cancer remains a critical threat to public health because of its high incidence and heavy socioeconomic burden1. Out-of-hospital management for breast cancer consists of patient education, psychological management, and answering disease-related queries. Appropriate out-of-hospital management can effectively reduce the recurrence rate among low-risk breast cancer patients2, but it still needs improvement in availability, personalization, and other areas3. Home-based multidimensional survivorship programs (HBMS) for breast cancer survivors, one kind of out-of-hospital management, have been proposed to improve the quality of life (QoL) of breast cancer patients through education and training provided by health care professionals, but their benefit is still restricted by availability and cost4. Therefore, more accessible, low-cost, and personalized out-of-hospital management for breast cancer patients is urgently needed.

The emergence of large language models (LLMs) has begun to reshape breast cancer care. Chatbots based on LLMs have been developed to collect patient-reported outcomes (PROs)5, to respond to top cancer-related search queries6, and to assist tumor board decision making7. For instance, Pan et al. evaluated the understandability of four AI chatbots (ChatGPT version 3.5 (OpenAI), Perplexity (Perplexity.AI), Chatsonic (Writesonic), and Bing AI (Microsoft)) on the five most searched queries for skin, lung, breast, colorectal, and prostate cancer. The results showed that the four chatbots gave accurate responses but offered limited actionability6. Similarly, another study reported the feasibility of a medical LLM in assisting disease diagnosis; this medical LLM outperformed other baseline LLMs and specialized models8. However, personalized out-of-hospital management, such as education on limb rehabilitation, psychological management, and other breast cancer-related queries, remains unsolved in clinical practice because of the distance between patients and caregivers. Could LLMs powered by sophisticated reasoning models, such as DeepSeek-R1 (DS-R1) and GPT-o3, be the key to solving these problems in the era of AI? How do human physicians rate the performance of these reasoning-enhanced LLMs? To date, no study has answered these questions.

To explore the role of reasoning-enhanced LLMs in out-of-hospital management for breast cancer, we conducted a multi-phase, randomized, single-blind study. First, we established a data cleaning and classification pipeline to summarize three major scenarios of out-of-hospital management, and we constructed an electronic health record (EHR) dataset from authentic patient data. Second, we matched the EHR dataset with the three out-of-hospital management scenarios to create virtual patients (VPs) (N = 10) for conversation generation with GPT-o3 and DS-R1. Third, we recruited four human specialists to evaluate the LLMs’ responses in five dimensions, which highlighted the potential of these LLMs in out-of-hospital management scenarios. To the best of our knowledge, this is the first study evaluating the role of reasoning-enhanced LLMs in out-of-hospital management for breast cancer.

Method

Ethics

This study was approved and supervised by the Institutional Review Board of Sir Run Run Shaw Hospital (SRRSH) (IRB#: 20210910-30). All patients enrolled in this study were fully informed of their rights and gave consent, and their identifying information was masked. The study was performed in accordance with the Declaration of Helsinki.

Overview

As shown in Fig. 1, the methodology consists of three steps: Data Collection, Conversation Generation, and Human Evaluation.

First, in the data collection phase, we established a data cleaning and classification pipeline to summarize a question dataset from publicly available breast cancer-related dialogue datasets. This dataset covers three typical out-of-hospital management application scenarios: disease consultation, rehabilitation guidance, and psychological management. Additionally, we selected authentic patient data with ID information masked from the Breast Cancer Database of Affiliated Sir Run Run Shaw Hospital, Zhejiang University to construct an EHR dataset.

Based on these two datasets, in the conversation generation phase, we matched each patient’s EHR data with several sets of questions corresponding to the three out-of-hospital management scenarios. Using the EHR data and its corresponding question set, we designed a prompt engineering framework that drives LLMs to act as VPs, simulating out-of-hospital management scenarios and engaging in multi-round dialogues with GPT-o3 and DS-R1 (released on 2025.01.20). This process generates a question-and-answer dataset that mimics real-world interactions.

In the human evaluation phase, the question-and-answer datasets generated by GPT-o3 and DS-R1 were randomly and evenly divided into two parts, each assigned to one of two groups of evaluators. A Likert scale9 was used to obtain subjective evaluations across five dimensions: effectiveness, safety, accuracy, personalization, and emotional care. Further details of these three steps are elaborated in the following three subsections.

We have open-sourced the code, anonymized dataset, and evaluation results used in this experiment on GitHub (https://github.com/Maxin-C/LLM-Evaluation).

Fig. 1 Overview of methodology.

Data collection

In this section, we aim to identify issues frequently raised by patients during out-of-hospital management of breast cancer and to obtain real patient EHR data, so that in subsequent steps a model can act as a virtual patient interacting with the LLM under evaluation. To construct the common issues dataset, we selected as raw data the Huatuo-BC dialogue dataset extracted from Huatuo-26M. Huatuo-26M was derived from Qianwen Health and offers more than 26 million real-world doctor-patient dialogues. Huatuo-BC is its breast cancer-related subset, comprising 208K question-answer pairs. First, using a dataset filtering prompt (Appendix A1), we called the application programming interface (API) to instruct DeepSeek-V3-0324 (DeepSeek-V3) to extract dialogue content related to out-of-hospital breast cancer management from the original dataset, resulting in 29K question-answer pairs. Then, after clearly defining three scenarios (disease consultation, rehabilitation guidance, and psychological management), we designed classification and summarization prompts (Appendix A2) to drive DeepSeek-V3 to categorize the question-answer data. Ultimately, we obtained common question datasets of 256, 309, and 350 questions for the three scenarios, respectively.
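The two-stage filtering and classification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: `ask_llm` is a hypothetical stand-in for the DeepSeek-V3 API call, the `FILTER:` and `CLASSIFY:` prefixes are placeholders for the prompts in Appendix A1 and A2, and the keyword stub exists only so that the example runs offline.

```python
from collections import defaultdict
from typing import Callable, Dict, List

SCENARIOS = (
    "disease consultation",
    "rehabilitation guidance",
    "psychological management",
)


def filter_and_classify(
    pairs: List[dict], ask_llm: Callable[[str], str]
) -> Dict[str, List[dict]]:
    """Keep out-of-hospital question-answer pairs, then bucket them by scenario."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for pair in pairs:
        # Stage 1: filtering prompt (placeholder for Appendix A1).
        verdict = ask_llm(f"FILTER: {pair['question']}")
        if verdict.strip().lower() != "yes":
            continue
        # Stage 2: classification prompt (placeholder for Appendix A2).
        label = ask_llm(f"CLASSIFY: {pair['question']}").strip().lower()
        if label in SCENARIOS:
            buckets[label].append(pair)
    return dict(buckets)


def stub_llm(prompt: str) -> str:
    """Offline keyword stub standing in for the real model (assumption)."""
    task, _, text = prompt.partition(": ")
    text = text.lower()
    if task == "FILTER":
        return "yes" if "after surgery" in text else "no"
    if "exercise" in text:
        return "rehabilitation guidance"
    if "anxious" in text:
        return "psychological management"
    return "disease consultation"


# Hypothetical demo questions, not taken from Huatuo-BC.
demo = [
    {"question": "Which arm exercise is safe after surgery?", "answer": "..."},
    {"question": "I feel anxious after surgery, what can I do?", "answer": "..."},
    {"question": "What does a mammogram cost?", "answer": "..."},
]
result = filter_and_classify(demo, stub_llm)
```

In practice the stub would be replaced by a real API client, and labels falling outside the three scenarios would need a retry or manual review step.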

Data from ten patients with complete indicators and different conditions (using stage as the criterion) were extracted as EHR background information to support personalized question answering. The main data distribution of the EHR dataset is shown in Table 1.

Table 1 Main data distribution table of the EHR dataset.

Conversation generation

The process of conversation generation is illustrated in Fig. 2. Based on the question dataset and the EHR dataset, we matched the questions with patient data, assigning 10 sets of questions to each patient, with each set including one question from each of the three clinical scenarios. Considering that patients at different stages may raise different questions, we designed a question matching prompt (Appendix A3) to instruct DeepSeek-V3 to identify the 10 most likely sets of questions based on the patient’s clinical background, thereby achieving dataset pairing. Through this process, we obtained background information for 100 VPs, which includes EHR data and one likely question from each of the three scenarios.
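The pairing arithmetic above (10 patient records, each with 10 question sets, giving 100 VPs) can be sketched as follows. This is an illustrative sketch only: a seeded random draw replaces the LLM-based matching of Appendix A3, and all records and questions are hypothetical placeholders.

```python
import random
from typing import Dict, List


def build_virtual_patients(
    patients: List[dict],
    questions: Dict[str, List[str]],
    sets_per_patient: int = 10,
    seed: int = 0,
) -> List[dict]:
    """Pair every patient record with several three-question sets."""
    rng = random.Random(seed)
    virtual_patients = []
    for patient in patients:
        for set_index in range(sets_per_patient):
            # Assumption: a random draw stands in for the LLM-based
            # question matching (Appendix A3), which is not reproduced here.
            question_set = {
                scenario: rng.choice(pool) for scenario, pool in questions.items()
            }
            virtual_patients.append(
                {
                    "patient_id": patient["id"],
                    "set_index": set_index,
                    "ehr": patient,
                    "questions": question_set,
                }
            )
    return virtual_patients


# Hypothetical placeholder data: 10 records and a few questions per scenario.
patients = [{"id": i, "stage": "placeholder"} for i in range(10)]
questions = {
    "disease consultation": ["consultation q1", "consultation q2"],
    "rehabilitation guidance": ["rehabilitation q1", "rehabilitation q2"],
    "psychological management": ["psychological q1", "psychological q2"],
}
vps = build_virtual_patients(patients, questions)
```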

For each set of virtual patient information, the virtual patient prompt (Appendix A4) drives DeepSeek-V3 to act as a patient undergoing post-operative out-of-hospital management for breast cancer and to converse with the model under evaluation. To make the virtual patient’s questions more closely resemble those of real patients, we extracted a total of 1,775 dialogue histories from a WeChat group chat focused on post-operative management for breast cancer patients, spanning June 2023 to January 2025. After private information was removed, these dialogues were used as few-shot inputs in the virtual patient prompt, instructing the model to mimic their conversational style. Upon receiving the needs raised by the virtual patient, the model under evaluation acts as a doctor providing out-of-hospital management services through the virtual doctor prompt (Appendix A5). The virtual patient and virtual doctor engage in multiple rounds of conversation. To avoid meaningless conversations, we additionally used a conversation monitoring prompt (Appendix A6) to instruct DeepSeek-V3 to determine whether the conversation so far has addressed the needs raised by the virtual patient. If the judgment is affirmative or the conversation exceeds 8 rounds, the conversation is terminated and the next one begins. In this experiment, the inference parameters of the models were kept consistent, with the temperature set to 0.1 and top-p set to 1, to ensure stable text generation and enhance reproducibility.
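The dialogue loop and its two stopping conditions can be sketched as below. This is a minimal sketch under assumptions: the three callables are hypothetical stand-ins for the virtual patient, virtual doctor, and monitoring prompts (Appendix A4 to A6), and the round limit is interpreted here as a hard cap of 8 rounds, which is one reading of "exceeds 8 rounds".

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (patient message, doctor reply)
MAX_ROUNDS = 8  # assumed hard cap on the number of rounds


def run_conversation(
    patient_turn: Callable[[List[Turn]], str],
    doctor_turn: Callable[[List[Turn], str], str],
    needs_met: Callable[[List[Turn]], bool],
    max_rounds: int = MAX_ROUNDS,
) -> List[Turn]:
    """Alternate patient and doctor turns until the monitor is satisfied
    or the round cap is reached."""
    history: List[Turn] = []
    while len(history) < max_rounds:
        question = patient_turn(history)
        answer = doctor_turn(history, question)
        history.append((question, answer))
        # The monitor checks whether the patient's needs have been addressed.
        if needs_met(history):
            break
    return history


def stub_patient(history: List[Turn]) -> str:
    """Hypothetical virtual patient: numbers its questions."""
    return f"patient question {len(history) + 1}"


def stub_doctor(history: List[Turn], question: str) -> str:
    """Hypothetical model under evaluation: echoes the question."""
    return f"reply to: {question}"


# The monitor is satisfied after three rounds in the first run and never in the second.
resolved = run_conversation(stub_patient, stub_doctor, lambda h: len(h) >= 3)
capped = run_conversation(stub_patient, stub_doctor, lambda h: False)
```

Passing the three roles as callables keeps the loop independent of any particular API client, so the same loop can be run against each model under evaluation.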

Fig. 2 Conversation generation process.

Human evaluation

Based on the 100 sets of virtual patient information, evaluating GPT-o3 and DS-R1 yielded 200 sets of conversations. Four breast specialists were invited to assess the dataset. The dataset was evenly and randomly divided into two parts, each containing 50 sets of conversations from GPT-o3 and 50 sets from DS-R1. Each part was evaluated by two doctors who were aware of the EHR data but unaware of the model sources. Because GPT-o3’s reasoning process is in English, which would distinguish it from DS-R1’s content, a translation prompt (Appendix A7) was used to instruct DeepSeek-V3 to convert the English content into Chinese, and model responses were uniformly formatted as “Reasoning: … Answer: …”. Translated content (shown in Appendix B) was manually reviewed by non-evaluators with advanced English-Chinese bilingual competence (Master’s degree or higher) to ensure fidelity and mitigate potential misinterpretations. The conversations were rendered into images using unified HTML rendering code for display. The models were evaluated using a five-point Likert scale9 (as shown in Table 2). The evaluation results were collected through electronic forms.
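One way to obtain such an even, random, model-blinded split is sketched below. The procedure is an assumption for illustration, since the exact randomization method is not described here: conversations are shuffled within each model, halved, and stripped of the model label, while a separate key is kept for unblinding. Field names and identifiers are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def split_for_blinded_review(
    conversations: List[dict], seed: int = 42
) -> Tuple[List[List[dict]], Dict[str, str]]:
    """Split conversations into two parts, balanced by model, and hide the model."""
    rng = random.Random(seed)
    by_model: Dict[str, List[dict]] = {}
    for conv in conversations:
        by_model.setdefault(conv["model"], []).append(conv)

    parts: List[List[dict]] = [[], []]
    for model_convs in by_model.values():
        shuffled = list(model_convs)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        parts[0].extend(shuffled[:half])
        parts[1].extend(shuffled[half:])

    # The key is kept away from the raters and used only for unblinding.
    key = {c["conv_id"]: c["model"] for c in conversations}
    blinded = []
    for part in parts:
        rng.shuffle(part)  # mix the two models within each part
        blinded.append([{"conv_id": c["conv_id"], "text": c["text"]} for c in part])
    return blinded, key


# Hypothetical placeholder conversations with opaque identifiers.
conversations = [
    {"conv_id": f"c{n:03d}", "model": model, "text": "placeholder"}
    for n, model in enumerate(["GPT-o3"] * 100 + ["DS-R1"] * 100)
]
blinded, key = split_for_blinded_review(conversations)
```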

Table 2 LLM evaluation Likert scale.

Result

Human evaluation result

We compiled the four evaluators’ ratings and calculated the average scores of DS-R1 and GPT-o3 across the five dimensions, as shown in Fig. 3a. The radar chart indicates that the mean values of DS-R1 are higher than those of GPT-o3 in all five dimensions. We tested the normality of the score distributions for both LLMs across the five dimensions using the Shapiro-Wilk test, which showed that the data did not follow a normal distribution (P < 0.01). We therefore applied the Mann-Whitney U test, which revealed statistically significant differences between DS-R1 and GPT-o3 in accuracy, personalization, and emotional care (P < 0.01). However, the P-values for effectiveness and safety were 0.231 and 0.086, respectively. Additionally, we used the Hodges-Lehmann estimator as a non-parametric measure of effect size to estimate the median difference between the two LLMs. Consistent with the Mann-Whitney U test, the two LLMs differed significantly in accuracy, personalization, and emotional care (P < 0.01), but not in effectiveness (P = 0.231) or safety (P = 0.086). Figure 3b shows the average evaluation scores from each of the four raters. GPT-o3 held slight advantages over DS-R1 in Rater 1’s emotional care, Rater 2’s safety, Rater 3’s safety, and Rater 4’s effectiveness, none of which were statistically significant; DS-R1 outperformed GPT-o3 in all other dimensions, consistent with the overall conclusion. We plotted the score distributions of the two LLMs across the five dimensions as bar charts, as shown in Fig. 3c. Both models performed well, with scores concentrated primarily at 3 and 4 points. Furthermore, DS-R1 had a lower proportion of scores below 3 and a significantly higher proportion of scores at 4 than GPT-o3 across all five dimensions, indicating that DS-R1 received more positive evaluations and a higher proportion of high scores.
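For readers unfamiliar with the two non-parametric procedures named above, the following sketch computes a Mann-Whitney U statistic with a two-sided p-value and the Hodges-Lehmann shift estimate using only the Python standard library. It is illustrative rather than the analysis code of this study: the p-value uses a tie-corrected normal approximation without continuity correction (an assumption, so values may differ slightly from statistical packages such as SciPy's `mannwhitneyu`), and the ratings are hypothetical.

```python
import math
from collections import Counter
from statistics import median
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs where x beats y; ties count one half.
    u1 = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n = n1 + n2
    # Tie correction: sum of t^3 - t over groups of tied values.
    ties = sum(t**3 - t for t in Counter(list(x) + list(y)).values())
    variance = n1 * n2 / 12.0 * ((n + 1) - ties / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0  # all observations identical
    z = (u1 - n1 * n2 / 2.0) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value


def hodges_lehmann(x: Sequence[float], y: Sequence[float]) -> float:
    """Median of all pairwise differences x_i - y_j (location shift estimate)."""
    return median(a - b for a in x for b in y)


# Hypothetical Likert ratings for two models on one dimension.
model_a = [4, 4, 3, 4, 5, 4, 3, 4]
model_b = [3, 3, 2, 3, 4, 3, 3, 2]
u_stat, p_value = mann_whitney_u(model_a, model_b)
shift = hodges_lehmann(model_a, model_b)
```

On ordinal Likert data with many ties the tie correction matters, which is why it is included explicitly rather than using the simpler untied variance formula.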

Fig. 3 Human evaluation results: (a) The average scores of LLMs, where the blue area and numbers represent DS-R1, and the yellow area and numbers represent GPT-o3. The labels with “*” indicate statistically significant differences in that dimension; (b) The average scores from each of the four raters. The color correspondence with the models remains consistent with (a), and the labels with “*” indicate statistically significant differences in that dimension.

Time and economic costs evaluation result

As shown in Fig. 4, whether measured per single-round dialogue or per entire conversation of a virtual patient, the number of tokens generated by DS-R1 is approximately 1.8 times that of GPT-o3. This suggests that, independent of differences in the number of rounds, DS-R1 tends to generate more tokens for reasoning and explanation.

On one hand, the verbose content makes DS-R1’s responses harder to scan and understand quickly, which may explain why its effectiveness score did not differ significantly from that of GPT-o3 in the human evaluation. On the other hand, despite its lower per-token cost, the total economic expense of DS-R1 is about 1.6 times that of GPT-o3, which is counterintuitive.
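The relation between the two ratios above can be made explicit with a back-of-the-envelope calculation. This sketch assumes, as a simplification not stated in the source, that total cost scales with the number of generated tokens alone; charges for input tokens are ignored, so the result is only a rough indication.

```python
def implied_price_ratio(token_ratio: float, cost_ratio: float) -> float:
    """Effective per-token price ratio implied by token and total-cost ratios.

    Assumes total cost = generated tokens x effective price per token.
    """
    if token_ratio <= 0:
        raise ValueError("token_ratio must be positive")
    return cost_ratio / token_ratio


# Ratios reported in the text: about 1.8x the tokens and about 1.6x the total cost.
ratio = implied_price_ratio(token_ratio=1.8, cost_ratio=1.6)
```

Under this simplification the effective per-token price of DS-R1 comes out at roughly 0.89 times that of GPT-o3, so a lower per-token price and a higher total cost are compatible once output length is taken into account.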

Thanks to DS-R1’s faster inference speed, the time costs of the two LLMs are nearly identical even though DS-R1 generates more tokens. Moreover, DS-R1 has a shorter total response time, enabling it to meet demands more quickly.

Fig. 4 LLMs’ time and economic costs.

Discussion

In this study, we systematically evaluated the performance of mainstream LLMs in out-of-hospital management scenarios for breast cancer patients. We simulated 100 VPs from real breast cancer cases and conducted multi-round dialogues with GPT-o3 and DS-R1 under out-of-hospital management scenarios. The performance of the LLMs was evaluated in five dimensions: effectiveness, safety, accuracy, personalization, and emotional care. Both LLMs performed satisfactorily in out-of-hospital management. According to the human specialists, DS-R1 performed better than GPT-o3 in all dimensions except Rater 1’s emotional care, Rater 2’s safety, Rater 3’s safety, and Rater 4’s effectiveness. DS-R1 also generated more tokens in a similar time at a lower per-token cost, although its total cost was higher, and it had a shorter response time than GPT-o3. Therefore, this study suggests that LLMs could be deployed in out-of-hospital management scenarios for breast cancer patients, and that DS-R1 appears to perform better than GPT-o3.

The role of LLMs in out-of-hospital management of cancer patients remains under debate. In our study, the majority of human physicians rated the LLMs’ responses at a score of 3, which indicates satisfactory performance in out-of-hospital management. However, problems remain, such as hallucinatory responses. During the evaluation, we occasionally encountered hallucinatory responses (2.0%, 4/200), which could severely mislead patients and cause hazardous events. For instance, the LLMs sometimes advised a HER2-negative patient to receive targeted therapy, or advised a stage 0 VP to receive chemotherapy. Although such cases are rare, they could have irreversible consequences. This is in accordance with another study employing GPT-3.5 and GPT-4, which conducted an intrinsic evaluation of 60 GPT-powered VP-clinician conversations to evaluate the clinical performance of LLMs and to rate the quality of dialogues and feedback. The results showed that the quality of LLM-generated ratings of feedback was similar to that of human physicians, but with drawbacks such as lower authenticity, verbose vocabulary, and failure to mention important weaknesses or strengths10. Similar conclusions were reached by other studies focusing on Alzheimer’s disease management11 and pain management12. To cope with inaccurate LLM responses, Ge et al. proposed a liver disease-specific LLM, “LiVersa”, which enhances the LLM with retrieval-augmented generation (RAG). LiVersa demonstrated better performance than GPT-4 in answering hepatology-related questions13. Therefore, we believe LLMs have a promising future in out-of-hospital management of cancer patients; however, before they can be universally deployed, problems such as hallucinatory responses need to be addressed. Domain-specific LLMs enhanced with RAG could be a future direction for improving the performance of LLMs in various medical scenarios.

Both GPT-o3 and DS-R1 demonstrated substantial potential in assisting out-of-hospital management, but DS-R1 had better overall performance and a lower per-token cost than GPT-o3 in our study. As a newly emerged model, DS-R1 has been little studied in breast cancer, whereas GPT has the most applications among existing LLMs across multiple practice scenarios. One retrospective, cross-sectional study reported that over one-third of the recommendations for breast, prostate, and lung cancer made by GPT-3.5 were not consistent with the standard of care set by the National Comprehensive Cancer Network (NCCN)14, though the updated GPT-4 showed significant improvement in the accuracy and detail of its recommendations15. Another cross-sectional study assessed the responses of four mainstream AI chatbots to the five most searched queries on Google. GPT-3.5 demonstrated relatively high readability (DISCERN score) and understandability (PEMAT score), but relatively low actionability6. This is consistent with our findings that, according to human specialists, GPT achieved satisfactory accuracy, personalization, safety, effectiveness, and emotional care in out-of-hospital management, though it was inferior to DS-R1. Further, DS-R1 generated more tokens than GPT-o3 in a similar time, though at a higher total cost. Compared with human care providers, applying LLMs in remote care management could save substantial resources16; however, few studies have compared existing LLMs in terms of cost-effectiveness17. We believe that with the fast iteration of LLMs, the cost of chatbots derived from them will continue to decrease while their efficacy increases significantly.

Although LLMs have demonstrated promising applications in out-of-hospital management for breast cancer, limitations still exist. According to the human specialists, the responses of the evaluated LLMs carry a moderate risk of misleading patients (Likert scale 2.92/2.77). This risk may stem from incorrect suggestions made for the VPs, which is consistent with previous literature indicating the limited applicability of LLMs18,19. Deng et al. reported that GPT-4 was superior to GPT-3.5 and Claude2 in quality, relevance, and applicability when analyzing breast cancer cases; however, its applicability remained limited according to human raters18. Similarly, another study reported that an LLM at times made clearly erroneous decisions, which could mislead multidisciplinary tumor board (MTB) decisions for breast cancer19. Therefore, LLMs are not yet ready for full application in out-of-hospital management for breast cancer, and more research is warranted to improve the accuracy of their responses.

Our study indicated that LLMs could provide personalized, empathetic, and accurate suggestions in out-of-hospital management for breast cancer patients. LLMs could identify patients’ emotional needs and provide support for psychological problems. This is consistent with a previous study showing that a GPT-based chatbot could generate empathetic, high-quality, and readable responses to patient questions on social media, compared with human physicians20. Another study reported that the chatbot “Vivibot” could deliver personalized positive psychology skills to young adults who had undergone cancer treatment, which could significantly reduce anxiety21. Therefore, LLMs could strongly support both physicians and cancer patients during treatment and out-of-hospital management.

Our study has the advantages of a randomized, multi-phase design and of evaluation and validation of the LLMs’ outputs by human physicians. However, we also acknowledge several limitations. First, we evaluated only two of the most up-to-date reasoning-enhanced LLMs; other LLMs, such as Grok3, were not included. Second, only 10 VPs were simulated for the test; although over 100 question datasets were created, the sample size remains limited. Third, four human physicians evaluated the LLMs’ responses, and inter-rater heterogeneity could affect the results. To assess this potential bias, we calculated Cohen’s Kappa for inter-rater agreement. In Dataset A, Cohen’s Kappa was 0.52 for DS-R1 scores (P < 0.01) and 0.68 for GPT-o3 scores (P < 0.01). In Dataset B, it was 0.80 for DS-R1 scores (P < 0.01) and 0.54 for GPT-o3 scores (P < 0.01). Fourth, limited by the follow-up time, we have no prognostic events, which restricts our exploration of the association between LLM-deployed out-of-hospital management and disease prognosis. Last, this is a single-center study, and additional validation is required.
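As an illustration of the agreement statistic reported above, the sketch below computes unweighted Cohen's Kappa for two raters. Whether the study used the unweighted or a weighted variant is not stated, so the unweighted form is an assumption, and the ratings in the example are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater1: Sequence[Hashable], rater2: Sequence[Hashable]) -> float:
    """Unweighted Cohen's Kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater1)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal score frequencies.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[k] * counts2[k] for k in counts1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical Likert scores from two raters on the same ten conversations.
rater_a = [3, 4, 4, 3, 2, 4, 3, 3, 4, 4]
rater_b = [3, 4, 3, 3, 2, 4, 3, 4, 4, 4]
kappa = cohens_kappa(rater_a, rater_b)
```

Because Likert scores are ordinal, a weighted Kappa that gives partial credit for near-misses is a common alternative; the unweighted form shown here treats any disagreement as complete.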

Conclusion

Our findings demonstrate that LLMs like GPT-o3 and DS-R1 show significant promise for the out-of-hospital management of breast cancer patients. Both models delivered personalized and empathetic responses, with DS-R1 showing superior overall performance, particularly in personalization, emotional care, and accuracy. However, the critical barrier to their autonomous application is the risk of generating factually incorrect and dangerous medical advice. These “hallucinations,” while infrequent, pose an unacceptable threat to patient safety, limiting the current applicability of LLMs. Therefore, before LLMs can be safely integrated into the era of digital healthcare, future research must prioritize improving the safety and reliability of their answers. The focus must be on eliminating these critical errors, potentially through advanced methods like Retrieval-Augmented Generation (RAG), to increase their real-world clinical applicability.