Introduction

The internet has become a primary source of health information for people in Western countries1. However, many online resources fail to adhere to evidence-based health communication standards2. As the practical implementation of guidelines for evidence-based health communication3,4 remains challenging, much of the health information available online does not support informed health decision-making, lacks the necessary accuracy and rigor, and in some cases even spreads misinformation, particularly in areas such as cancer5. The rapid emergence of artificial intelligence (AI) tools, particularly large language models (LLMs) such as OpenAI’s ChatGPT and Mistral AI’s Le Chat, has opened novel possibilities for digital health communication – even in professional settings6. Laypeople are increasingly turning to these platforms to seek answers to health-related questions7,8. However, can LLM users reliably obtain correct information that complies with established health communication guidelines? To address this, we conduct two complementary studies aimed at evaluating and enhancing the quality of health communication provided by LLMs.

Although LLMs are not designed for health communication, laypeople use them to gather health information. In a convenience online sample of adult panellists from February to March 2023, 7.2% reported regularly using LLMs for health-related topics—often in combination with search engines and online health communities9. This number may have increased, as indicated by studies with varying methodologies (e.g. assessment of frequency of use) and populations (e.g. up to 79% among LLM users8): 21.2% of an online community convenience sample in May 202410, 20.3% of an online representative sample for Germany in a preprint11, and 32.6% of a U.S. online panel in February 2024 reported using LLMs for health information12.

Though often correct, LLM-generated responses to health-related queries show substantial variability13,14. The ‘accuracy’ or ‘reliability’ of LLM responses has been the subject of numerous descriptive studies, including a recent systematic review of 88 patient education studies15. Yet, what does accuracy or reliability truly mean in this context? LLM-generated responses to health-related questions have been assessed in various ways, such as readability16,17,18,19,20,21, expert-rated appropriateness of factual explanations18,21,22,23,24,25,26,27,28,29,30, and agreement with health authorities and guidelines31,32,33,34,35,36,37,38. What remains missing, however, is the use of the best available evidence as the ground truth.

LLM information and communication studies consistently lack a well-defined healthcare premise (e.g. informed decision-making). This absence undermines both the criteria for response assessment (e.g. compliance with evidence-based health communication standards) and the structured selection of prompts (e.g. alignment with information requirements). Moreover, these studies often fail to ground their assessments in the best available evidence, such as high-quality evidence syntheses. Evidence-based guidelines, such as those developed by the Working Group for Good Practice in Health Information (GPHI), provide a structured framework for health communication4,39. They emphasize, for instance, the importance of using numerical data to present risks and benefits based on the best available evidence. Regarding validated instruments for assessing information quality, at least three LLM studies13,16,40 have applied the DISCERN standards, which, from today’s perspective, are considered outdated, while one study utilized a validated patient information quality instrument41.

The primary aim of Study 1, therefore, is to assess the extent to which content generated by state-of-the-art LLMs—specifically OpenAI’s ChatGPT, Google Gemini, and Mistral AI’s Le Chat—adheres to guideline-based criteria, with a particular focus on essential aspects of health risk communication for informed decision-making. These criteria encompass clear communication of benefits and harms, inclusion of reference class information, appropriate presentation of numerical effects, the quality of evidence, and accurate interpretation of screening results. Additionally, they cover background information, such as proper referencing and disclosure of conflicts of interest. The first research question (RQ1) asks directly: Does the health risk communication provided by LLMs reflect the best available evidence? The hypothesis is that LLMs largely fail to meet standard criteria for evidence-based health risk communication, as indicated by more than 50% of responses deviating, on average, from these guidelines. Using both MAPPinfo, an established assessment instrument42, and ebmNucleus, an assessment proposal derived from the Guideline Evidence-Based Health Communication3, Study 1 systematically examines the quality of outputs generated in response to repeated prompts concerning mammography for breast cancer (BC) screening and the PSA test for prostate cancer (PC) screening.

Prompt engineering refers to the effective design of inputs to LLMs in order to obtain optimal responses. The effect of prompt quality on LLM-generated responses has been demonstrated, for instance, in a March 2023 study on cardiovascular health advice, which used a non-validated expert-rating system to assess correctness, including the dimension of simplification in presentation43. However, it remains unknown whether prompt engineering also leads to more evidence-based responses for health decision-making. Therefore, Study 1 explores whether more informed prompts containing contextual knowledge increase LLM response quality. Accordingly, our RQ2 asks: Does a systematic variation of the contextual informedness of prompting increase the proportion of evidence-based LLM responses? We assume at least a moderately strong association between the degree of prompting informedness and the extent to which LLM responses are scored as evidence-based.

We expect that our findings from this systematic variation of LLM prompting will align with prior studies that have used different accuracy benchmarks. Building on this, we aim to provide causal evidence that supporting laypeople in prompting can improve the quality of LLM responses they receive. However, since little is known about how effectively laypeople prompt without guidance, Study 2 first describes their LLM performance by assessing the compliance of LLM outputs with evidence-based health communication standards—using the same instruments and conditions as in Study 1. Second, earlier studies have shown that prompt engineering training can improve response quality, as demonstrated, for example, with journalists44. However, training programs face hurdles in practical implementation, such as time requirements, costs, and the need for human involvement. Beyond traditional education, boosting45 could be a more feasible approach. Boosts – interventions that modify human cognition or the environment using behavioral insights to enhance people’s competences – have been proposed for navigating digital information architectures, including applications in fake news detection46 and health information search47. Study 2 examines RQ3: Does a minimal boosting intervention that encourages laypeople to provide more informed prompts increase the proportion of evidence-based responses?

To sum up, we aim to describe the limitations of LLMs with regard to the standards of evidence-based health communication, demonstrate the potential for improvement through contextually more informed prompting, confirm the expected shortcomings of human prompting, and provide evidence that a simple intervention can help laypeople obtain more evidence-based health information. A study protocol was published in advance48.

Results

Study 1

LLMs’ responses had a median length of 211 (IQR: 158–278) words per prompt. All LLMs provided a response to every prompt. Our systematic variation of prompt informedness revealed the expected pattern across the tested LLMs (Fig. 1a–c), demonstrating that more informed prompting led to higher scores on both assessment schemes, MAPPinfo (F(2,351) = 40.25, p < 0.001, \(\eta_{\mathrm{p}}^{2}=0.19\)) and ebmNucleus (F(2,351) = 528.41, p < 0.001, \(\eta_{\mathrm{p}}^{2}=0.75\)). The effect of informedness varied depending on the LLM, as indicated by interaction effects for MAPPinfo (F(4,351) = 3.48, p = 0.008, \(\eta_{\mathrm{p}}^{2}=0.04\)) and ebmNucleus (F(4,351) = 12.11, p < 0.001, \(\eta_{\mathrm{p}}^{2}=0.12\)).
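For illustration, the following minimal sketch shows how a two-way analysis of variance of this kind (prompt informedness × LLM) can be computed with statsmodels; it is not the original analysis code, and the file name and column names (rate, informedness, llm) are assumptions.

```python
# Minimal sketch of the informedness x LLM ANOVA (illustrative, not the original code).
# Assumes a long-format table with one row per scored response and hypothetical
# columns: 'rate' (proportion of fulfilled criteria), 'informedness', 'llm'.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("study1_rates.csv")  # hypothetical file

model = ols("rate ~ C(informedness) * C(llm)", data=df).fit()
aov = anova_lm(model, typ=2)  # F tests for both main effects and the interaction

# Partial eta squared: SS_effect / (SS_effect + SS_residual)
aov["eta_p_sq"] = aov["sum_sq"] / (aov["sum_sq"] + aov.loc["Residual", "sum_sq"])
print(aov)
```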

Fig. 1: Results of study 1.
figure 1

Information quality, as assessed with two metrics (MAPPinfo and ebmNucleus), is shown as a function of prompt informedness for Gemini (a), Le Chat (b), and ChatGPT (c). Error bars show the standard error of the average rate of fulfilled criteria across trials.

Study 2

The sample (Table 1) was 52.3% female and included no non-binary participants. Age ranged from 19 to 78 years, with a mean of 46.4 years (SD = 15.5). The majority of participants were highly educated: 16.3% held a Master’s degree, 43.3% a Bachelor’s degree, and 4.3% a PhD. Additionally, 20.7% had completed school education up to the age of 18, while 9.7% had obtained professional or technical qualifications. A small proportion (5.7%) reported having no formal education beyond the age of 16. A total of 63.0% of participants reported having at least some experience with LLMs, and notably, 31.7% used them at least once per month to seek health information. On average, participants submitted 3.0 prompts (SD = 1.0), with a median prompt length of seven words (IQR: 5–9). All LLMs generated a response for every prompt, and the intervention increased the prompt word count (p = 0.032). The LLMs’ responses to these prompts had a median length of 188 words (IQR: 126–254) per response.

Table 1 Sample description according to intervention condition

Independent of the specific LLM, the information elicited in the control condition reached only approximately 17% of the possible maximum score in MAPPinfo and 13% in ebmNucleus. A regression analysis of the control condition (F(6,144) = 7.75, p < 0.001) revealed that participants elicited higher-quality information (ebmNucleus) when they had a higher level of education (standardized β = 0.16, p = 0.032) and more experience with LLMs (β = 0.26, p = 0.007). However, more frequent LLM usage for seeking health information was negatively associated with information quality (β = −0.25, p = 0.009). Neither age (p = 0.219) nor gender (p = 0.952) had a significant influence when controlling for the chosen topic, with prostate cancer screening prompts yielding higher-quality information (β = 0.41, p = 0.003).
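A minimal sketch of how such a regression can be set up is shown below, assuming the control-condition data are in a data frame with hypothetical column names; continuous predictors are z-standardized so their coefficients can be read as standardized betas. This is illustrative, not the original analysis code.

```python
# Sketch of the control-condition regression on ebmNucleus scores (illustrative only;
# column names are assumptions, not those of the original data set).
import pandas as pd
from statsmodels.formula.api import ols

df = pd.read_csv("study2_control.csv")  # hypothetical file, control condition only

# z-standardize continuous predictors so coefficients correspond to standardized betas
for col in ["education", "llm_experience", "health_use_freq", "age"]:
    df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std()

model = ols(
    "ebm_score ~ education_z + llm_experience_z + health_use_freq_z + age_z"
    " + C(gender) + C(topic)",
    data=df,
).fit()
print(model.summary())  # overall F test plus coefficient-level p values
```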

The boosting intervention for more informed prompting moderately improved the quality of elicited health information across all LLMs (Fig. 2a–c), as indicated by ANOVA main effects for MAPPinfo (F(1,294) = 15.26, p < 0.001, \(\eta_{\mathrm{p}}^{2}=0.05\)) and ebmNucleus (F(1,294) = 12.12, p = 0.001, \(\eta_{\mathrm{p}}^{2}=0.04\)). The effect of the intervention did not vary across LLMs, either for MAPPinfo (F(2,294) = 0.32, p = 0.726) or for ebmNucleus (F(2,294) = 1.18, p = 0.307).

Fig. 2: Results of study 2.
figure 2

The improvement in information quality through the boost, as assessed with two metrics (MAPPinfo and ebmNucleus), is shown for Gemini (a), Le Chat (b), and ChatGPT (c). Error bars show the standard error of the average rate of fulfilled criteria across participants.

Discussion

Our findings highlight several key insights into the quality of health information provided by LLMs and the role of prompt-informedness in improving response quality.

First, our findings reinforce previous concerns that LLMs are not designed to inform health decision-making and, in particular, do not comply with the standards of evidence-based health information. Even when well-informed prompts were crafted by researchers—ensuring maximal alignment with the information requirements—ChatGPT and Gemini responses did not reach even 50% of maximum compliance (Le Chat performed somewhat better).

Second, our results confirm that the quality of elicited information directly depends on prompt informedness. This was systematically demonstrated by the gradual prompt manipulation, which aligned with the information requirements for informed decision-making. Importantly, this relationship was observed across both assessment tools—ebmNucleus, which explicitly focuses on these information requirements, and MAPPinfo, an established but broader quality assessment instrument.

Third and most notably, we provide causal evidence that a simple boosting intervention—reminding users to inquire about the possible consequences of medical options—improved the quality of LLM-generated health information. This extends previous research on digital navigation boosts45,47, demonstrating that such behavioral interventions can help users seek better information. Moreover, our boost could easily be implemented in LLM interfaces. Theoretically, a hybrid approach would also be conceivable, in which evidence-based health communication prompting structures like the OARS rule are not only shown to users but also embedded into system-side chain-of-thought prompts49. This could benefit users who do not engage with optional guidance. However, such integration should be designed carefully, as fully automated scaffolding may reduce users’ awareness and active involvement in their own decision-making, particularly in health contexts that require personal judgment and reflection.

These insights may also inform the development of guidelines for responsible LLM health communication, which should cover health prompt engineering, the potential of simple user-side prompting interventions, and standards and tools for output quality assessment, such as MAPPinfo and ebmNucleus.

Despite the strengths of our study, several limitations warrant consideration. The relatively low scores obtained with MAPPinfo may suggest that traditional health information assessment tools need to be carefully adapted for evaluating LLM outputs. However, this does not imply that the fundamental quality criteria for evidence-based health information should change. These criteria are derived from established evidence and codified in guidelines, forming the basis of our investigation. MAPPinfo was originally designed for structured health information sources, such as websites and brochures. While LLM-generated responses are dynamic and interactive, they should still adhere to the same evidence-based standards. If LLM-generated health information consistently fails to meet these standards, this raises concerns about their appropriateness as a source for informed medical decision-making rather than the validity of the assessment tools themselves.

The lower scores observed in participant-driven prompting with ebmNucleus may be explained by the task instructions. Participants were asked to seek general information about BC or PC screening, rather than specifically about mammography or PSA testing. This methodological choice influenced scoring outcomes because broad inquiries (e.g., “Tell me about breast cancer screening”) may generate responses covering multiple screening technologies, leading to less targeted benefit-harm presentations. While this directly affected only 8 out of 38 points in MAPPinfo, it influenced 18 out of 27 points in ebmNucleus, where a structured benefit-risk assessment is central (the partial conceptual overlap between the instruments is reflected in their limited convergent validity: r = 0.45 in Study 1 and r = 0.31 in Study 2, based on the proportions of maximum scores). Nevertheless, these findings reflect real-world search behaviors, where users may not always phrase their questions in the most targeted manner.

Another limitation is that Study 1 relied on single-turn prompts rather than multi-turn conversations. This design ensured controlled and systematic prompt variation, making it easier to isolate the effects of prompt informedness. Given that conversational interactions with LLMs can improve information quality by allowing users to clarify, elaborate, or challenge information, future studies should explore the effect of conversation. In contrast, Study 2 provided a more naturalistic setting, enabling participants to engage in multi-turn conversations with the LLMs. Notably, these conversations did not lead to responses of high quality in terms of evidence-based health communication. Future research could investigate how information quality evolves over the course of a more extensive conversation.

Furthermore, it is possible that more recent LLMs yield more evidence-based responses. However, there is currently no indication that LLMs are explicitly trained on evidence-based health communication principles. Additionally, while some newer models have real-time web access, this does not necessarily imply better compliance with evidence-based standards. Future studies should assess how real-time retrieval mechanisms influence response accuracy and bias in health-related LLM interactions.

Our study suggests that differences in education level and prior LLM experience influence the quality of information users retrieve. While more experienced users generated higher-quality responses, those who frequently used LLMs for health-related searches retrieved lower-quality information. This highlights a potential risk of digital inequality, where less-experienced users may unknowingly rely on incomplete or misleading AI-generated health information. Future research should investigate whether LLM interfaces could be adapted to mitigate such disparities, for example, by providing adaptive guidance based on user expertise levels.

LLMs generate output based on probabilistic language modeling, meaning they can reinforce existing biases in health communication. For instance, certain populations may receive different risk-benefit framing depending on how they phrase their queries. Given that health decisions disproportionately affect marginalized groups, fairness considerations in AI-generated health information should be systematically addressed. One possible approach is to audit LLM-generated health responses and related effects across demographic groups to detect and correct disparities. A prerequisite for this lies in future research that focuses on the impact of improved LLM-generated responses – how they shape users’ actual knowledge. Here it is important to distinguish studies that assess how well participants are actually prepared for making an informed decision from assessments of perceived informedness. Subjective impressions of being well-informed may not necessarily align with communication quality. In the case of intervention studies (e.g. boosting) this might promote misleading interpretations of effectiveness.

To sum up, LLMs are used as health information sources, yet our findings emphasize that they do not inform appropriately. While prompt informedness plays a crucial role in improving response quality, even very informed prompts fail to ensure full compliance with the standards of evidence-based health information. As LLMs continue to evolve, AI developers must focus not only on integrating the best available clinical evidence but also on leveraging behavioral science insights to optimize how information is summarized and presented in a balanced and transparent way50. Only then can AI-driven health communication support knowledge acquisition, correct risk perceptions, and enable informed decision-making. But even in light of further improvement, laypeople must also be informed about how well LLMs perform and where these models fall short, as users may hold unrealistic expectations about algorithm quality51.

Finally, as is true for every algorithmic system that is to be widely implemented in healthcare, stakeholders need to understand the consequences for different groups52, ideally with the help of large randomized trials.

Methods

Study 1

Study 1 employed systematic prompting variations to assess the compliance of LLM-generated health risk information with evidence-based health communication standards. Reporting followed the STROBE53 and TRIPOD-LLM54 guidelines.

Design and data collection

This content analysis aimed to systematically evaluate the quality of responses generated by three LLMs—OpenAI’s ChatGPT (gpt-3.5-turbo), Google Gemini (1.5-Flash), and Mistral AI’s Le Chat (mistral-large-2402)—when prompted with health-related queries. These LLMs were selected based on their accessibility and public awareness in Germany during the summer of 2024, as they were available without payment or the need to disclose personal data (except for an email address). At that time, specialized medical LLMs did not meet these criteria. The study followed a pre-registered content analysis design (AsPredicted, Registration No. 180732) and was conducted between June and July 2024. Since no human participants were involved and no personal data were used, ethical review was not required.

Our systematic prompting strategy (Table 2) included 18 prompts each for mammography and for the PSA test (Supplementary Tables 1 and 2). The strategy varied prompts across three levels of contextual informedness: (1) highly informed prompts, which included technical keywords, (2) moderately informed prompts, which used more generic synonyms, and (3) low-informed prompts, which omitted key terms entirely. Furthermore, the strategy incorporated six information requirements. These were chosen – in contrast to patients’ subjective information needs55 – because they are essential for deliberately weighing the probabilistic consequences of utilizing versus not utilizing a medical option (decision-analytic approach56), while also taking into account uncertainties and the special case of screening-test-specific conditional probabilities (reliability of the option), as intended by patient decision aids57. Fulfilling all selected requirements may enable an individual to make an informed medical decision58 – one of the key principles of evidence-based medicine.

Table 2 Prompting strategy underlying systematic testing
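The 18 prompts per topic can be read as a grid crossing the three informedness levels with the six information requirements. The sketch below illustrates such an organization under that assumption; the requirement labels and prompt wording are placeholders, not the original prompts (see Supplementary Tables 1 and 2).

```python
# Illustrative organization of the prompt grid: 3 informedness levels x 6 information
# requirements = 18 prompts per topic. Labels and wording are placeholders, not the
# original prompts (see Supplementary Tables 1 and 2).
from itertools import product

INFORMEDNESS_LEVELS = ["high", "moderate", "low"]
REQUIREMENTS = [  # hypothetical short labels for the six information requirements
    "benefits", "harms", "reference_class",
    "numerical_effects", "evidence_quality", "test_result_interpretation",
]

prompt_grid = {
    (level, requirement): f"[{level}-informedness prompt addressing {requirement}]"
    for level, requirement in product(INFORMEDNESS_LEVELS, REQUIREMENTS)
}
assert len(prompt_grid) == 18  # prompts per screening topic
```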

LLMs are probabilistic models that produce variable outputs even when given the exact same prompt. To ensure a valid assessment—whereas many previous studies have relied on insufficient prompt repetitions (e.g., around 1–4)14,31—we designed our study with twenty independent trials per prompt to account for low-probability responses. The three LLMs were accessed via API, without output length limitations, resulting in a total of 2160 prompt submissions (18 prompts × 2 topics × 3 LLMs × 20 trials).
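As an illustration of this data collection, the following sketch shows the repeated-trials loop for one provider using the OpenAI Python client; the other two models were queried analogously via their respective APIs. The output file, helper function, and default sampling settings are assumptions, not the original collection script.

```python
# Sketch of the repeated-sampling procedure for one provider (illustrative only;
# Gemini and Le Chat were queried analogously via their respective APIs).
import csv
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
N_TRIALS = 20      # independent trials per prompt, as described above

def collect_responses(prompts, out_path="chatgpt_responses.csv"):
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "trial", "response"])
        for prompt in prompts:
            for trial in range(1, N_TRIALS + 1):
                reply = client.chat.completions.create(
                    model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": prompt}],
                )
                writer.writerow([prompt, trial, reply.choices[0].message.content])
```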

Mammography for BC screening and the PSA test for PC screening were selected as study topics because they address two of the most common cancers worldwide, affecting millions of individuals and posing significant health implications for screening decisions. Both involve complex risk-benefit considerations in screening and treatment and require clear, evidence-based health communication to support informed decision-making. Additionally, BC and PC screening have been subject to distorted public perceptions59,60, as well as ongoing debate due to concerns over overdiagnosis, false positives, and varying guidelines. These factors make them ideal case studies for assessing whether LLMs can provide accurate, guideline-adherent health information.

Scoring and analysis of LLM responses

Two independent raters (researchers and research assistants), blinded to each other’s assessments as well as to the specific LLM, the prompt, and its level of informedness, scored each LLM response using two instruments: MAPPinfo42, a validated scoring instrument for evidence-based health information quality, and ebmNucleus, a proposal that draws on the guideline for evidence-based health information3 (Supplementary Table 3). Both tools share certain criteria (Table 3) but differ in their focus: while MAPPinfo provides a broad assessment of information quality, ebmNucleus concentrates on the core aspects of health decision-making, minimizing the influence of contextual factors irrelevant to most LLMs (e.g., authorship, graphical elements). Each LLM response was systematically labeled and scored accordingly.

Table 3 Criteria of the health information quality scoring instruments used in Studies 1 and 2

The evidence underlying the content was assessed against the best available evidence (summarized under www.hardingcenter.de/en/transfer-and-impact/fact-boxes) as of 2022, aligning with the training corpora of the LLMs used. Responses were consistently scored on a zero-to-two-point scale in MAPPinfo. In ebmNucleus, responses concerning patient-relevant benefits and harms, the referencing of single-event probabilities, benefit and harm numbers, evidence quality, and the interpretation of screening results were scored on a zero-to-three-point scale, while all other aspects were scored on a binary zero-to-one-point scale. Responses classified as ‘hallucinations’ or absurd statements were scored accordingly.

The raters were research assistants (RAs) and experienced researchers who had previously conducted similar coding tasks in multiple studies; the latter had expertise in evidence-based health communication. Each rater was instructed in how to apply the scoring instruments. Two RAs (each covering 50% of the responses) and one researcher independently scored the LLM responses according to MAPPinfo; dyads from a pool of five RAs independently scored them according to ebmNucleus (Supplementary Table 3). The highest possible score (three points) for a test result interpretation required, for instance, providing the probability that a positive result is due to underlying disease (the positive predictive value). The lowest possible score was applied when the limits of the screening test were not mentioned at all or only one-sidedly (e.g., false negatives only).

Interrater reliability across all LLM outputs, calculated using Cohen’s kappa coefficient, was ϰ = 0.41 for MAPPinfo and ϰ = 0.38 for ebmNucleus. Because this reliability was too low, a third rater, an experienced researcher (FGR and CW, respectively) who was blinded to the previous ratings, rated all cases of disagreement. These ratings each reached ϰ = 0.94 agreement with the respective prior raters.
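Interrater agreement of this kind can be computed, for example, with scikit-learn; the following minimal sketch assumes the two raters’ item-level scores are available as parallel lists (the values shown are purely illustrative).

```python
# Minimal sketch of the interrater-reliability computation (illustrative values only).
from sklearn.metrics import cohen_kappa_score

rater_a = [2, 0, 1, 3, 0, 1]  # hypothetical item scores from rater A
rater_b = [2, 1, 1, 3, 0, 0]  # hypothetical item scores from rater B

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```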

Raw sum scores were calculated for both MAPPinfo and ebmNucleus. Average sum scores were calculated for each LLM and prompt category and converted into the proportion of the maximum evidence-based score achieved. Descriptive statistics and analyses of variance were conducted to examine the effect of prompting informedness on scoring outcomes.
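A minimal sketch of this aggregation step is shown below, assuming a long-format table of raw sum scores with hypothetical column names; the maximum scores of 38 (MAPPinfo) and 27 (ebmNucleus) points are those referred to in the Discussion.

```python
# Sketch of the score aggregation (illustrative): raw sum scores are converted into
# the proportion of the maximum achievable score and averaged per LLM and prompt
# category. File and column names are assumptions.
import pandas as pd

MAX_SCORE = {"MAPPinfo": 38, "ebmNucleus": 27}  # maximum points per instrument

df = pd.read_csv("study1_scores.csv")  # hypothetical file with raw sum scores
df["rate"] = df["sum_score"] / df["instrument"].map(MAX_SCORE)

summary = df.groupby(["llm", "informedness", "instrument"])["rate"].agg(["mean", "sem"])
print(summary)  # mean rate of fulfilled criteria and its standard error
```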

Study 2

Study 2 employed systematic investigations of laypeople’s prompting to assess the extent to which LLM-generated responses were evidence-based and how this could be improved through a boosting intervention. Reporting followed the STROBE53 and TRIPOD-LLM54 guidelines.

Sample

Without prior evidence on the effect of a boost on LLM prompting, we focused on detecting a meaningful effect rather than any small difference. To detect a moderate ANOVA main effect (based on Cohen’s61 convention of eta squared = 0.06, which equals partial eta squared in a one-way ANOVA) when comparing two between-subjects conditions (with and without intervention), we required a minimum sample size of n = 237 participants. To approximate simplified census data of Great Britain in terms of sex, age, and ethnicity, we recruited n = 300 adult participants from an online-representative participant pool via Prolific. Participants received €1.30 upon completion of the study.
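The following sketch shows how such an a priori power calculation can be approximated with statsmodels; Cohen’s f is derived from eta squared = 0.06, but the assumed power level is an assumption of the sketch, so its output may differ from the reported minimum of n = 237.

```python
# Sketch of an a priori power analysis for a two-group ANOVA main effect
# (illustrative; the power level below is an assumption, so the result may
# differ from the reported minimum sample size of n = 237).
from math import sqrt
from statsmodels.stats.power import FTestAnovaPower

eta_sq = 0.06                       # Cohen's convention for a moderate effect
f = sqrt(eta_sq / (1 - eta_sq))     # convert eta squared to Cohen's f (~0.25)

n_total = FTestAnovaPower().solve_power(
    effect_size=f, k_groups=2, alpha=0.05, power=0.95
)
print(f"Required total sample size: {n_total:.0f}")
```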

Design

We used a 2 × 3 between-subjects design: participants were randomly assigned either to standard prompting instructions (control) or enhanced prompting instructions (boosting intervention) and to one of three LLMs: OpenAI’s ChatGPT (gpt-3.5-turbo), Google Gemini (1.5-Flash), or Mistral AI’s Le Chat (mistral-large-2402). To ensure that participants could generate detailed prompts and receive more comprehensive responses, the survey was restricted to tablets and computers; typing on smartphones is not equivalent to using a full keyboard, which could affect both the quality and complexity of the prompts as well as the resulting responses.

A minimal boosting intervention encouraged participants to consider the possible consequences of their choices as follows: “Please consider the OARS rule: You need to know your Options, the Advantages and Risks of each, and how Steady they are to happen.” In accordance with the competence principle of boosting45, the researchers developed the intervention on the basis of a prior shared decision-making intervention for patients (“Ask three questions”62) and included it in a pre-test. Participants in the control group did not receive this intervention.

To ensure a balanced distribution of participants across study groups, we employed the built-in block randomization (algorithmically managed, without fixed block-size declarations) of the platform SoSci Survey, on which the study was hosted. Both participants and researchers were blinded to the assigned LLM and the type of prompting instructions to prevent bias in interaction and response evaluation.
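For illustration, block randomization over the six cells of the 2 × 3 design can be sketched as follows; this is a generic illustration of the principle, not the SoSci Survey implementation used in the study.

```python
# Generic illustration of block randomization over the 2 x 3 design
# (not the SoSci Survey implementation used in the study).
import random
from itertools import product

CONDITIONS = list(product(["control", "boost"], ["ChatGPT", "Gemini", "Le Chat"]))

def block_randomize(n_participants, seed=None):
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        block = CONDITIONS.copy()  # one complete block containing all six conditions
        rng.shuffle(block)         # shuffle within the block to keep groups balanced
        assignments.extend(block)
    return assignments[:n_participants]

allocation = block_randomize(300, seed=42)  # e.g., 300 participants -> 50 per cell
```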

The study was pre-registered with the German Clinical Trials Register (DRKS) (Registration No. DRKS00035228, registered on 15 October 2024) and received ethical approval from the Ethics Committee of the University of Potsdam (Approval No. 52/2024). It was conducted in October 2024 in accordance with the Declaration of Helsinki. All participants provided electronic informed consent.

Pre-test

A pre-test involving n = 20 participants was conducted to verify the effectiveness of the randomization procedures and to test the technical functionality of the survey platform and its integration with the LLM APIs. Additionally, it measured the average completion time and identified potential issues related to participant burden. User feedback was collected to address any usability concerns, and necessary adjustments were made to optimize the study protocol for the main trial. The pre-test also assessed the study materials, particularly the clarity and comprehensibility of the survey questions and instructions.

Procedure and material

Participants received a brief introduction outlining the study’s purpose, which involved interacting with one of three LLMs, all preset as a “helpful assistant”, to obtain health information about either BC or PC screening. After giving informed consent, participants completed the questionnaire, providing basic demographic details such as gender, age, and education level. Participants could choose between BC and PC screening, regardless of their self-reported gender; this approach aimed to enhance engagement and data relevance while maintaining the integrity of the study design. They were then instructed to gather information on their chosen topic, which was expected to increase variance in information quality compared to focusing solely on mammography and PSA testing.

Participants interacted with the chatbot by entering their queries (prompts) into the questionnaire interface (Supplementary Fig. 1), which was connected to the respective LLM via its API; the generated responses were collected alongside the prompts in pre-defined internal variables in SoSci Survey for systematic analysis. The interface allowed for multi-turn conversations: participants could submit follow-up questions within the same session, and the LLMs had access to the full chat history to generate context-sensitive responses. To ensure comparability across participants and maintain a controlled interaction length, each participant could submit a maximum of four prompts. At the end of the session, participants reported their frequency of LLM usage, prior experience with such models, and their attitude toward shared decision-making. On average, participants completed the survey (Supplementary Table 4) in five minutes.
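A minimal sketch of how a multi-turn exchange with access to the full chat history can be realized is shown below for the OpenAI API (the other providers were accessed analogously); the system message mirrors the “helpful assistant” preset and the four-prompt cap follows the procedure described above, while the function itself is illustrative rather than the survey platform’s implementation.

```python
# Sketch of a multi-turn session that keeps the full chat history (shown for the
# OpenAI API; illustrative only, not the SoSci Survey integration used in the study).
from openai import OpenAI

client = OpenAI()
MAX_PROMPTS = 4  # each participant could submit at most four prompts

def run_session(user_prompts):
    history = [{"role": "system", "content": "You are a helpful assistant."}]
    responses = []
    for prompt in user_prompts[:MAX_PROMPTS]:
        history.append({"role": "user", "content": prompt})
        reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
        answer = reply.choices[0].message.content
        history.append({"role": "assistant", "content": answer})  # keep full context
        responses.append(answer)
    return responses
```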

Measures

In addition to demographic measures such as self-reported gender, age, and education level, we assessed LLM usage frequency, prior experience, and the preferred approach to medical decision-making using specific items. LLM usage frequency was measured with the question, “How often have you used computer programs like ChatGPT (large language models) to get health information?” Response options were: Never, About 1–5 times per year, About 1–2 times per month, About 1–2 times per week, More frequently. Prior general experience was assessed with the item, “Please rate your experience with computer programs like ChatGPT (large language models).” Participants responded on a five-point scale: Definitely no experience, Rather few experience, Some experience, Rather much experience, Definitely much experience. Their attitude toward shared decision-making was elicited with statements about how decision responsibility is shared between the GP and the patient.

Scoring and analysis

To evaluate the responses generated by the LLMs, scoring followed the same procedure as in Study 1. Two raters (one RA and one researcher for MAPPinfo; two researchers for ebmNucleus) independently scored the 300 LLM outputs. Interrater reliability, calculated using Cohen’s kappa coefficient, was ϰ = 0.44 for MAPPinfo and ϰ = 0.71 for ebmNucleus. A third rater, a researcher (FGR or CW) who was blinded to the previous ratings, scored cases of disagreement. These ratings reached ϰ = 0.99 (MAPPinfo) and ϰ = 0.98 (ebmNucleus) agreement with the respective prior raters.

Raw sum scores were calculated for both MAPPinfo and ebmNucleus. The average sum scores were computed for each LLM and prompt category and converted into the proportion of the maximum evidence-based score achieved. Descriptive statistics were calculated for demographic variables such as self-reported gender, age, and education, as well as for LLM usage frequency and prior experience. Analyses of variance were conducted to examine the impact of the prompting instructions (control vs. boosting intervention) on response quality, and interaction effects between LLM type and prompting instructions were also evaluated. Regression analyses explored the influence of demographic characteristics, LLM usage, and prior experience.

All statistical tests were conducted at a significance level of α = 0.05, and effect sizes were calculated to determine the practical significance of the findings.