Large language model (LLM) applications are promising artificial intelligence (AI) tools for reducing administrative burden and supporting clinical decision-making in medicine1. Conversational LLM chatbots can provide high-quality, empathetic responses to questions in general medicine2 and oncology3,4, as evaluated by clinicians. As chatbots are deployed in patient-facing roles, debate remains about whether patients also perceive that chatbots can demonstrate empathy, a core competency in medicine5. Empathy, defined as the ability to understand and share the feelings of others, is central to establishing trustworthy patient-provider relationships, which have been linked to improved patient outcomes6. However, patients, rather than clinicians, should serve as the benchmark for determining whether their experiences have been understood, shared, and addressed7.

State-of-the-art methods for designing empathetic chatbots primarily integrate emotional intelligence into LLMs through specialized training regimens, model architectures, and attention mechanisms that improve context-dependent empathetic reasoning8,9,10,11,12. These approaches have yielded substantial advances in generating contextually appropriate and emotionally attuned chatbot outputs but remain limited by the computationally prohibitive cost of training and fine-tuning large-scale LLMs. Moreover, training increasingly complex LLMs on larger datasets may be subject to scaling laws of diminishing improvement in empathic responses13. An alternative approach to designing more empathetic LLM applications is multi-step processing of emotional dialogue, in which emotion is first recognized in the user input and appropriate emotions are then integrated into the generated response14. Adapting chain-of-thought (CoT) prompting to elicit emotional reasoning may be an effective way to improve the human-perceived empathy of foundation LLMs while decreasing resource demands, but this has not been evaluated by real-world patients15.

This study evaluated the empathy of chatbots compared to physicians in responding to oncology-related patient questions from the perspective of people with cancer, and tested CoT prompting as a method to elicit empathy in chatbot responses.

In total, 45 patient participants completed the survey. Participants were primarily White (40/45, 88.89%), male (33/45, 73.33%), older than 65 years (32/45, 71.11%), and post-secondary educated (35/45, 77.78%) (Supplementary Table 2). Descriptive statistics are shown in Supplementary Table 3.

Responses generated by Claude V1, Claude V2, and Claude V2 with CoT were all rated by participants as more empathetic than physician responses (Fig. 1). The best-performing AI chatbot, Claude V2 with CoT (mean, 4.11 [95% CI, 3.99–4.22]), was rated as more empathetic than Claude V2 (mean, 3.72 [95% CI, 3.62–3.81]; P < 0.001; d = 3.46), Claude V1 (mean, 3.35 [95% CI, 3.23–3.48]; P < 0.001; d = 3.01), and physicians (mean, 2.01 [95% CI, 1.88–2.13]; P < 0.001; d = 2.11) (Fig. 1, Supplementary Fig. 1).

Fig. 1: Empathy rating of physician and chatbot responses to patient questions about cancer by survey participants.

People with cancer (n = 45) rated the overall empathy of both physician and chatbot (Claude V1, Claude V2, Claude V2 with CoT) responses.

Participant ratings of empathy were consistently lower than physician ratings of empathy for the same set of responses generated by physicians (mean, 2.01 [95% CI, 1.88–2.13] vs 2.24 [95% CI, 2.11–2.37]; P = 0.00214; d = −0.36) (Fig. 2A) and by Claude V1 (mean, 3.35 [95% CI, 3.23–3.48] vs 3.51 [95% CI, 3.41–3.60]; P < 0.001; d = −0.28) (Fig. 2B). Participant-rated empathy was moderately correlated with physician-rated empathy for Claude V1 and physician responses (Supplementary Fig. 2).

Fig. 2: Empathy rating of physician and Claude V1 chatbot responses to patient questions about cancer by people with cancer (n = 45) and physicians (n = 3).

A Empathy rating of physician responses by people with cancer and physicians. B Empathy rating of Claude V1 chatbot responses by people with cancer and physicians.

The word counts of Claude V1 (mean, 192.45 [95% CI, 183.35–201.55]; P < 0.001), Claude V2 (mean, 152.31 [95% CI, 147.46–157.16]; P < 0.001), and Claude V2 with CoT (mean, 186.72 [95% CI, 183.08–190.36]; P < 0.001) responses were higher than those of physician responses (mean, 99.71 [95% CI, 74.34–125.08]) (Supplementary Fig. 3). Word count was associated with participant-rated empathy for physician and Claude V1 responses but not for Claude V2 and Claude V2 with CoT responses, based on correlation (Supplementary Fig. 4) and OLS analyses (Supplementary Table 4).

The reading grade levels of Claude V1 (mean, 9.66 [95% CI, 9.19–10.12]; P < 0.001), Claude V2 (mean, 8.79 [95% CI, 8.49–9.08]; P < 0.001), and Claude V2 with CoT (mean, 8.55 [95% CI, 8.27–8.82]; P < 0.001) responses were higher than that of physician responses (mean, 8.13 [95% CI, 7.50–8.76]) (Supplementary Fig. 5). Readability of physician or chatbot responses was not correlated with participant-rated empathy (Supplementary Fig. 6).
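For reference, the reading grade level reported here is the Flesch-Kincaid Reading Grade Level (see Methods), which in its standard form estimates the United States school grade required to comprehend a text from average sentence length and syllables per word: FKRGL = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59.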

In this cross-sectional study, we observed that patient participants rated responses authored by Claude V1 as more empathetic than those authored by physicians, consistent with previous results from the physician perspective3. We hypothesize that chatbots can consistently provide empathetic responses by appropriately responding to emotional cues in patient questions and offering supportive language, without the time pressures of clinical workload or the emotional variability due to human-oriented stressors that physicians may experience. We caution that LLMs generate outputs that convey perceived empathy through linguistic mimicry based on probabilistic text prediction from learned patterns, rather than through emotional cognition or the empathic experiences inherent to human interactions16. Fine-tuning chatbot-generated outputs to prioritize empathy may inadvertently affect medical accuracy. Notably, using the same dataset as this study, Chen et al. found that physicians rated chatbot-generated responses as higher quality, more empathetic, and more readable than physician-generated responses3. Our study extends these findings by assessing empathy from the perspective of people with cancer, indicating that both physicians and people with cancer may perceive chatbots as more empathetic than physicians.

Compared to physicians, participants rated the same set of Claude V1 and physician responses as less empathetic, suggesting that patients and physicians may perceive empathy differently. Given previous findings of discordance between physician and patient perceptions of empathy17, we speculate that physicians and patients may prioritize different elements of message content and delivery in clinical care. Further research is needed to characterize the prioritization of messaging elements across diverse patient demographics and to evaluate how patient-facing chatbots convey empathy in real-world clinical oncology scenarios18.

We assessed the Claude LLM because of prior evidence demonstrating its superior empathy for cancer-related inquiries3 and to minimize confounding when comparing LLMs with distinct architectures and training regimens. Although we assessed only one family of LLMs, benchmarks comparing chain-of-thought prompting against baseline LLMs on complex reasoning tasks suggest that it is a model-agnostic approach that may generalize broadly to other LLMs19. The superior empathy of Claude V2 with CoT prompting compared to the other chatbots and physicians is promising evidence that prompt engineering can optimize chatbot outputs with limited technical expertise required. Prompt engineering techniques have achieved state-of-the-art success in encoding clinical knowledge20, motivating the design of structured prompts that encode human psychosocial cues to help chatbots respond to patient emotion in messaging.

This study was limited to static, single-time-point interactions, restricting insights into the longitudinal or real-time dynamics of patient-chatbot interactions. Future research can employ longitudinal designs, real-time conversational analysis, and established physiological or psychometric assessments to systematically explore LLM empathy and its clinical implications in medicine. Translating beyond research benchmarks towards clinician adoption of chatbots may require clinician education on how to design prompts that steer the responses of LLMs, each of which has its own capabilities and limitations21,22.

While AI systems have demonstrated the ability to generate responses perceived as empathetic23,24, potentially enhancing patient engagement and alleviating clinician workload, their deployment raises critical concerns. These include safeguarding patient privacy, ensuring informed consent, establishing oversight and liability for AI-generated outputs, mitigating biases to promote health equity, and managing changes to clinical workflows involving AI that may affect the patient-provider relationship25. The growing popularity of AI tools that mimic humanistic traits such as empathy raises concerns about misinformation and about misidentification of AI as an authoritative expert or empathetic peer, which should be considered in future real-world assessments26. Addressing these issues is essential for the responsible integration of AI technologies in oncology, ensuring that they augment rather than diminish the empathetic communication central to effective patient care.

The primary limitations of this study include the use of isolated, single-time-point interactions on an online forum to model physician-patient interactions; chatbot responses that were, on average, longer than physician responses despite instructions to limit word count, which may confound response length with perceived empathy; and the skewed demographic representation of survey participants, who were primarily White, male, well educated, and high income, reflecting the patient population available at our recruitment site. In addition, Reddit-derived data are an imperfect proxy for real-world oncology consultations, since online anonymity may alter the patient-physician relationship by removing nonverbal cues, reducing trust and accountability, and encouraging different self-presentation.

Participant subgroup characteristics, including whether participants had a former or a current diagnosis of cancer, were not collected because this pilot study aimed to capture a broad range of perspectives from individuals with cancer experience. Given evidence that biological and social contexts may contribute to differences in empathy between genders27, among other sociocultural factors, the skewed demographics of the study sample may limit the generalizability of our findings on empathic perception to non-represented populations, motivating future investigation of the association between subgroup characteristics, such as demographic factors, and the perceived empathy of chatbot-patient interactions. This study exclusively assessed the Claude LLM, which may limit the generalizability of our findings to other LLMs. Perceived empathy in written responses may also differ from that in real-world clinical settings, which include non-verbal cues of empathy and constraints imposed by time and clinical workload.

Methods

Dataset

In this prospective, cross-sectional study, we surveyed oncology patients at a tertiary cancer center about the empathy of physician and chatbot responses to patient questions related to cancer. Two authors (D.C., R.P.) reviewed the external database3 in duplicate and included randomly sampled patient questions if they mentioned a cancer diagnosis and contained clinical context, such as symptom descriptions and diagnostic details, typical of oncological concerns observed in clinical settings (n = 100). Patient questions were originally posted on Reddit's r/AskDocs from January 1, 2018, to May 31, 2023. The survey collected participant demographic information and empathy ratings between May 1, 2024, and October 31, 2024. This study followed the STROBE reporting guidelines and was approved by the UHN Research Ethics Board.

Procedure

We generated responses from three AI chatbots (Claude V1, Claude V2, and Claude V2 with chain-of-thought prompt engineering) to each patient question, with an additional prompt limiting the word length of the chatbot response to the mean physician response word count (approximately 100 words). Claude V1 and Claude V2 were prompted using the CoT-1 prompt alone, while Claude V2 with CoT was prompted with the CoT-1, CoT-2, and CoT-3 prompts in succession, as described in Supplementary Table 1.
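For illustration, the multi-turn prompting pipeline can be sketched as follows using the Anthropic Python SDK. This is a minimal sketch rather than the exact implementation: the prompt texts and model identifier below are placeholders (the CoT-1, CoT-2, and CoT-3 prompts used in the study are provided in Supplementary Table 1), and the legacy Claude V1/V2 model versions evaluated here may require older API endpoints.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Placeholder prompt texts; the actual CoT-1/2/3 prompts are in Supplementary Table 1.
COT_1 = ("Respond to the patient question below in roughly 100 words.\n\n"
         "Patient question: {question}")
COT_2 = "List the emotions expressed in the question and how your draft addresses them."
COT_3 = "Revise your draft to explicitly acknowledge those emotions, in roughly 100 words."


def complete(messages, model="claude-2.1", max_tokens=512):
    """Single chat completion over the accumulated conversation."""
    response = client.messages.create(model=model, max_tokens=max_tokens, messages=messages)
    return response.content[0].text


def baseline_response(question):
    """Claude V1/V2 condition: the CoT-1 prompt alone."""
    return complete([{"role": "user", "content": COT_1.format(question=question)}])


def cot_response(question):
    """Claude V2 with CoT condition: CoT-1, CoT-2, and CoT-3 in succession,
    carrying prior turns forward so each step can reason over the last draft."""
    messages = [{"role": "user", "content": COT_1.format(question=question)}]
    for follow_up in (COT_2, COT_3):
        messages.append({"role": "assistant", "content": complete(messages)})
        messages.append({"role": "user", "content": follow_up})
    return complete(messages)  # final, emotion-aware response
```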

Survey

Patient participants were included if they were above 18 years of age, could read and understand English, and had a former or current diagnosis of cancer. Eligible participants were approached in clinics, consented to the study, and completed the survey on REDCap, a digital survey platform. We originally planned a sample size of 200 participants, but after 45 participants, an interim analysis demonstrated a significant and meaningful effect size in empathy ratings between physician and chatbot responses that prompted early study termination. Each participant was randomly assigned 10 of the 100 patient questions in the database. For each question, participants rated perceived empathy for three chatbot responses and one physician response, presented in a random order and blinded. For a random sample of 10 out 100 total patient questions about cancer, participants rated the perceived empathy of three chatbot and one physician response to each patient question (Supplementary Note 1). Empathy was scored on a Likert scale, ranging from 1 to 5 (1: very poor, 2: poor, 3: acceptable, 4: good, and 5: very good). Participant empathy scores for each response were averaged to determine a consensus participant empathy score. Physician empathy scores rated in triplicate for each response were sourced from Chen et al.3 and averaged to determine a consensus physician empathy score.
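As a minimal sketch of the consensus-score calculation, assuming ratings are stored in a long-format table with hypothetical column names:

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (rater, response) empathy rating.
ratings = pd.DataFrame({
    "question_id":     [1, 1, 1, 1],
    "response_source": ["physician", "physician", "claude_v1", "claude_v1"],
    "rater_id":        ["p01", "p02", "p01", "p02"],
    "empathy":         [2, 3, 4, 4],  # Likert 1 (very poor) to 5 (very good)
})

# Consensus empathy score: mean rating across raters for each response.
consensus = (
    ratings
    .groupby(["question_id", "response_source"], as_index=False)["empathy"]
    .mean()
    .rename(columns={"empathy": "consensus_empathy"})
)
print(consensus)
```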

Statistical analysis

Using the Wilcoxon test with Benjamini-Hochberg correction, we compared participant-rated mean empathy scores for responses generated by chatbots and physicians. We also compared our participant-rated mean empathy scores with published physician-rated mean empathy scores for the same set of physician and Claude V1 responses3. Physician-rated empathy scores for Claude V2 were not analyzed because Claude V2 had not been released at the time of publication of this external dataset3. Spearman correlation and ordinary least squares (OLS) regression were used to measure the associations of word count and Flesch-Kincaid Reading Grade Level (FKRGL), a measure of readability, with participant-rated empathy. Cohen's d was used to measure effect sizes between groups. Statistical analyses were conducted with Python 3.8.9 and scipy 1.11.3.
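A minimal sketch of this analysis on hypothetical toy data is shown below. It assumes the paired (signed-rank) form of the Wilcoxon test, a pooled-standard-deviation formulation of Cohen's d, and scipy 1.11 or later for the Benjamini-Hochberg adjustment; it is illustrative rather than the exact analysis code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical toy data: consensus empathy scores and word counts for the
# same 100 questions answered by physicians and by one chatbot.
physician_empathy = np.clip(rng.normal(2.0, 0.5, 100), 1, 5)
chatbot_empathy = np.clip(rng.normal(3.4, 0.5, 100), 1, 5)
chatbot_word_count = rng.normal(190, 25, 100)

# Paired Wilcoxon test for each chatbot-vs-physician comparison (one shown),
# followed by Benjamini-Hochberg adjustment across comparisons (scipy >= 1.11).
p_values = [stats.wilcoxon(chatbot_empathy, physician_empathy).pvalue]
p_adjusted = stats.false_discovery_control(p_values, method="bh")

# Cohen's d with a pooled standard deviation.
pooled_sd = np.sqrt((physician_empathy.var(ddof=1) + chatbot_empathy.var(ddof=1)) / 2)
cohens_d = (chatbot_empathy.mean() - physician_empathy.mean()) / pooled_sd

# Association of word count with participant-rated empathy:
# Spearman correlation and a simple OLS regression.
rho, rho_p = stats.spearmanr(chatbot_word_count, chatbot_empathy)
ols = stats.linregress(chatbot_word_count, chatbot_empathy)

print(p_adjusted, cohens_d, rho, ols.slope, ols.pvalue)
```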

Conclusion

This study found that oncology patients, like oncology physicians, perceive chatbot responses as more empathetic than physician responses to patient questions about cancer. However, patients may prioritize different elements of clinical messages in their evaluation of empathy than physicians. Further research is required to optimize the integration of empathy in clinical messaging and evaluate the implementation and scope of patient-facing chatbots.