Introduction

Generative Artificial Intelligence (GenAI) is being explored for its potential to enhance patient-facing healthcare communication. Large language models (LLMs) could significantly impact the informed consent (IC) process, a critical step in medical decision-making that requires clear, accurate, and accessible patient information. However, traditional IC documents often exceed recommended readability levels, limiting patient understanding and leading to a suboptimal consent process1,2,3,4.

LLMs, such as ChatGPT-3.5 and GPT-4, may help address these challenges by adapting content to patient literacy levels and reducing the explanatory burden on healthcare providers5,6. However, their effectiveness in generating IC materials remains underexplored, with existing studies reporting mixed results. While GPT-4 can enhance readability with structured prompting7,8,9,10,11, its performance varies based on disease complexity, prompt specificity, and information accuracy12,13,14,15. Some studies suggest GPT-4-generated IC materials meet minimum standards11, while others highlight critical omissions, particularly regarding procedural risks, benefits, and testing alternatives16.

Most studies assessing GPT-4’s performance focus on IC forms for surgical procedures7,10,11. However, the IC process differs from an IC form: the former is a dynamic, interactive exchange between patients and providers, while the latter is a static document with predefined content. This distinction raises questions about GPT-4’s adaptability across different IC structures. Additionally, concerns remain about LLM performance across languages17,18.

This study evaluates GPT-4’s ability to generate IC materials for two widely administered genetic tests: Non-Invasive Prenatal Testing (NIPT) and hereditary breast and ovarian cancer (BRCA) testing. While LLMs have shown some utility in conveying genetic information19,20, they remain limited, particularly in understanding inheritance patterns19. To our knowledge, no prior study has systematically evaluated the accuracy, quality, and accessibility of GPT-4-generated IC materials for genetic testing. Additionally, no research has directly compared GPT-4-generated and human-generated IC materials across different languages. By addressing these gaps, we aim to determine whether LLMs can support the IC process across different genetic tests and linguistic backgrounds by replacing human-generated IC materials.

Results

Demographics

A total of 65 participants started the survey and 25 completed it (completion rate: 38%; dropout rate: 62%). The average completion time was approximately 30 minutes, excluding responses with survey times exceeding 24 hours (N = 6). Table 1 summarizes participant demographics. Of the 25 participants, 48% offered NIPT and 52% offered BRCA testing. Most (76%) were female, with an average of 17.5 years of clinical experience across genetic and non-genetic specialties. Greek and German were the most common language groups; English was the least represented, with three participants (12%).

Table 1 Demographics

Readability

Tables 2a (NIPT) and 2b (BRCA) present readability scores assessed using standard metrics (see Methods section); GPT-4-generated materials were consistently shorter than human-generated ones. Among the analyzed languages, Italian materials were the most difficult to read, followed by German. Greek materials were the easiest, with GPT-4 achieving a lower readability score (3.13) than human-generated material (4.71).

Table 2 a Readability scores for NIPT; b Readability scores for BRCA

For NIPT (Table 2a, excluding English), GPT-4-generated material was consistently more readable than human-generated material, with the largest difference in German (9.9 vs. 11.3, respectively).

In contrast, readability varied by language for BRCA (Table 2b). GPT-4-Zero-Shot materials, that is, material generated without feeding GPT-4 prior examples or domain-specific context21, were slightly less readable in English and Italian than GPT-4-RAG-generated material. In German, both GPT-4-Zero-Shot and GPT-4-RAG had lower (i.e., easier) readability scores than human-generated material.

Table 3 List of “necessary and critical” concepts for Informed Consent for genetic testing (modified from Ormond et al.46)

Error evaluation

For NIPT, participants identified a mean of 1.0 errors (range: 0–3) in GPT-4-generated material, with one outlier reporting more than four errors in the patient-facing material. Participants identified more errors in the human-generated material, with a mean of 2.0 (range: 0–4).

For BRCA, participants reported a mean of 1.0 errors in both GPT-4-Zero-Shot (range: 0–3) and GPT-4-RAG (range: 0–2). One participant reported more than four errors in GPT-4-RAG-generated material. Errors in the human-generated material were relatively infrequent (mean 0.7, range 0–2).

IC components

Participants evaluated the extent to which various IC components for genetic testing were included in the NIPT (Fig. 1a) and BRCA materials (Fig. 1b), based on a predefined framework (see Table 3 and Methods section). The figures present data combined across the four languages.

Fig. 1

a Inclusion of components of Informed Consent in NIPT materials. 1 – This topic is not applicable to material related to this genetic test; 2 – Applicable but not addressed at all in the material; 3 – Indirectly addressed; 4 – Briefly addressed but not clearly stated; 5 – Sufficiently addressed. X – Median – Outliers. Blue boxes: GPT-4-generated material. Orange boxes: Human-generated material. Variables on the y-axis correspond to the 15-point list we developed based on Ormond et al.46 (pg. 8). The sequence of the variables on the boxplot follows both the original framework and our survey design. b Inclusion of components of Informed Consent in BRCA materials. Scale, median/outlier markers, and y-axis labels as in (a). Blue boxes: GPT-4-Zero-Shot generated material. Green boxes: GPT-4-RAG generated material. Orange boxes: Human-generated material.

For NIPT, GPT-4- and human-generated materials performed similarly in Test reason, Test aim, Test benefit, General results, and Clinical limitations. However, human-generated material provided more comprehensive coverage in Test risk and Actions after the results components. Conversely, GPT-4-generated material scored higher in covering the Prognosis and management, Impact on families, Future steps, and Who gets the results components.

Both approaches covered key components well up to General results, though human-generated material received lower ratings in Voluntariness and Sign the consent form. The downward shift in the boxplots indicates a clear decline in overall ratings for the Non-primary results, Decision for returning non-primary results, Prognosis and management, and Impact on families components. While GPT-4-generated material exhibited greater variability in ratings, human-generated material maintained more stable ratings across participants.

For BRCA, participant ratings varied considerably across GPT-4- and human-generated material. The GPT-4-Zero-Shot approach exhibited the greatest variability, particularly in Test benefits, Actions after the results, and Who gets the results. GPT-4-RAG-generated material showed more consistent evaluations, though it received slightly lower ratings overall.

Despite this variability, GPT-4-Zero-Shot outperformed GPT-4-RAG in eight of 15 categories, particularly in core components (Voluntariness, Sign the consent form), risk and result disclosure (Test risk, General results, Non-primary results, Decision for returning non-primary results), and personal and familial implications (Impact on families, Who gets the results). However, while GPT-4-Zero-Shot outperformed GPT-4-RAG in covering Non-primary results and Decision for returning them, both approaches still received low ratings in these areas, suggesting a shared weakness in handling these IC components.

Human-generated material performed best in Test reason (median: 4.5), Test risk (median: 4.0), and Actions after the results (median: 4.0). By contrast, RAG-generated material showed inconsistencies, with extreme outliers in Test reason, Non-primary results, Decision for returning non-primary results, and Future steps, where ratings fluctuated significantly from the medians.

Participants’ preferred choice

Participants preferred human-generated material for both NIPT (3.45 ± 1.63 vs. 3.36 ± 1.37 for GPT-4) and BRCA (4.00 ± 1.33 vs. 3.9 ± 1.16 for GPT-4 Zero-Shot, 3.5 ± 1.04 for RAG). For NIPT, participants were more accurate in distinguishing human- from GPT-4-generated material. However, for BRCA, they were more likely to misidentify GPT-4-RAG-generated material as human-written (66.6%), suggesting that GPT-4-RAG outputs more closely resembled human writing in this context.

Case study: Analysis of the effect of language on BRCA materials

The corresponding results are presented in Fig. 2. In the German BRCA case, human-generated material received the highest mean scores for Test reason (4.6 ± 0.8) and Test benefit (4.8 ± 0.4). In Greek, it consistently scored higher across IC components and showed less variability in participants’ responses (range: 3.0–4.5).

Fig. 2

Inclusion of components of Informed Consent: German versus Greek. 1 – This topic is not applicable to material related to this genetic test; 2 – Applicable but not addressed at all in the material; 3 – Indirectly addressed; 4 – Briefly addressed but not clearly stated; 5 – Sufficiently addressed. Blue line: GPT-4-Zero-Shot generated material. Purple line: GPT-4-RAG generated material. Red dotted line: Human-generated material. The figure compares the average responses to Likert-scale questions in German (top line chart) and Greek (bottom line chart) regarding the inclusion of IC components in the BRCA materials.

Among GPT-4-generated material, GPT-4-RAG scored highest in German for Test aim, General results, and Prognosis and management. GPT-4-Zero-Shot outputs were consistently inadequate in covering IC components in Greek, while GPT-4-RAG’s scores aligned more with human-generated material. Notably, GPT-4-RAG outperformed other methods in Test risk (mean: 4.0, SD: 0.63) and Future steps (mean: 3.75, SD: 1.09), but this was not observed across all IC components.

Non-Primary results and Decision-making for returning non-primary results components were consistently underrepresented in both languages. In German BRCA material, all three versions (human, GPT-4-Zero-Shot, and GPT-4-RAG) scored poorly, with little to no discussion of these topics. A review of the translated texts confirmed their absence. In contrast, Greek human-generated material provided more coverage, scoring higher (Non-Primary results: 3.0 ± 1.0; Decision-Making: 3.5 ± 0.87) than both GPT-4-Zero-Shot and GPT-4-RAG-generated material.

Discussion

Our findings showed that GPT-4-generated materials for both NIPT and BRCA cases remained difficult to read according to established readability thresholds for patient-facing information22,23. However, when evaluated using the same metrics, they were, in some cases, easier to read than the human-generated materials. The model did not hallucinate in any of its outputs. Our analysis showed notable differences in how respondents evaluated the materials, particularly those generated by GPT-4. These inconsistencies raise key questions about the model’s reliability and potential role in the IC process. Part of this variation could also reflect the inherent variability in both human- and GPT-4-generated IC materials. Human-generated text is shaped by the author’s background, communication style, and institutional norms, which significantly limits its standardization as a benchmark. Similarly, GPT-4-generated outputs are subject to variability due to the probabilistic nature of the model, meaning outputs can differ across sessions even with identical prompts. This dual variability should be considered when interpreting such findings, as it reflects the real-world variation that occurs in both clinical communication and LLM-generated text. Acknowledging it strengthens the transparency of our methodological approach and contextualizes the inconsistencies observed in both content coverage and evaluation.

GPT-4 struggled to generate NIPT and BRCA materials at the readability levels recommended by leading health organizations24,25 when no specific instructions were provided. Readability varied widely, particularly in GPT-4-generated material. These results align with previous research in general medical fields7,26,27,28,29,30 but contrast with studies where ChatGPT-3.5 and GPT-4 were explicitly instructed to simplify consent forms7,10,11. Since GPT-4 was not given pre-written text to simplify, it may have generated material at a higher reading level than expected. GPT-4 Zero-shot learning resulted in slightly harder-to-read material than GPT-4 RAG in both English and Italian, partially aligning with Lai et al.31, who found that GPT-4 zero-shot learning underperforms across different languages. GPT-4 generated the most readable texts in Greek. However, this is possibly due to limitations in existing readability assessment tools (e.g., SMOG) that are not optimized for Greek’s morphology and semantic density. Overall, GPT-4’s readability scores closely matched those of human-generated IC material, suggesting that its default text generation mimics human writing unless explicitly instructed otherwise.

GPT-4-generated materials included some IC components but omitted others, with variation across languages. In German, GPT-4-RAG-generated material outperformed human-written materials, likely due to RAG’s ability to retrieve structured knowledge from reliable sources32,33. This is particularly relevant for BRCA testing, where rising demand34,35 may have expanded databases, improving RAG’s accuracy. However, GPT-4-RAG struggled with non-primary results, yielding lower scores in this area. These findings align with research showing ChatGPT’s difficulty in addressing complex genetics-related questions36, further highlighting its limitations in capturing nuanced IC components regardless of the prompting technique used. GPT-4’s challenges in these components may stem from the distinction between an IC form and an IC process. As a model trained on large text corpora, GPT-4 may be better suited to generating static IC forms that follow standardized formats, such as those commonly used in medical settings where risks, benefits, and procedures are relatively consistent and widely documented. Moreover, as previously noted, most studies evaluating LLMs for IC have focused on surgical settings. These studies7,11 typically involve feeding the model existing IC forms, often sourced from large medical centers11, and prompting it to simplify the content. This may contribute to GPT-4’s stronger performance with such materials, as it aligns closely with both its training data and the structure of the evaluation tasks. In contrast, genomic testing often requires more individualized, context-sensitive information, raising concerns about GPT-4’s ability to generate consent materials that go beyond the scope of standard form templates.

Additionally, GPT-4 underperformed in Greek, highlighting language biases in AI development. Similar disparities have been observed in Japanese17 and Spanish18 compared to English. This supports the assumption that AI performance is weaker for less commonly spoken languages across domains17,18,37,38. Language-based disparities in healthcare have been linked to reduced primary care utilization and poorer health outcomes37,38, raising concerns about the potential for AI-driven inequities in medical information accessibility.

Participants’ preference for human-generated materials for both NIPT and BRCA was modest. This trend was more pronounced for BRCA, where GPT-4-RAG-generated material was frequently misidentified as human-written. This suggests that RAG may not only improve informational content but also enhance tone, structure, and style in ways that closely resemble human-generated materials. It also implies, however, that users need to provide relevant information, which can be more challenging for those new to the task or those with limited time. Overall, it indicates that prompting should be viewed not merely as a technical step to produce an output, but as a critical design decision that can significantly influence both content quality and audience reception.

While participants preferred the human-generated material, the relatively small differences in ratings and the difficulty distinguishing GPT-4-generated from human text suggest that, in some contexts, GPT-4 may already be producing content that meets user expectations, at least at the level of surface communication. This reinforces our approach of evaluating materials not only in terms of readability, but also in terms of content completeness and clinical relevance. Finally, the variation in participants’ ability to identify GPT-4-generated text in NIPT compared to BRCA further suggests that topic complexity or familiarity may influence how GPT-4-generated materials are interpreted. These findings challenge the assumption that human authorship is inherently superior across different contexts39. However, they also align with existing literature demonstrating that humans have difficulty distinguishing ChatGPT-generated medical manuscripts from those written by humans. This has important implications for the medical community, particularly regarding the circulation of inaccurate material and the risk of increased public distrust40.

This study has several limitations. First, we deliberately sought evaluations from healthcare providers. While this ensured expert assessments, it excluded patients, the primary end-users of IC material. This is a critical limitation, as patient feedback is essential to evaluate whether GPT-4-generated content is clear, relevant, and accessible to its intended audience. Future studies should incorporate patient-centered evaluation. For example, small-scale cognitive interviews or think-aloud sessions could help assess how patients interpret and engage with GenAI-generated IC materials. We are currently conducting a separate study using a think-aloud protocol with patients, which directly builds on this limitation by exploring how patients engage with GenAI-generated IC materials.

Second, our analysis focused on two genetic tests (NIPT and BRCA), limiting the generalizability of these findings to other genetic contexts. However, we observed key patterns, including the omission of certain IC components without explicit prompting, elevated readability levels, and variable expert confidence in patient-facing material quality. These challenges are not unique to NIPT or BRCA and may similarly affect GPT-4-generated materials in other contexts, including carrier screening or whole-genome sequencing, suggesting broader relevance. Future research should assess whether these patterns persist across additional testing scenarios.

Third, our small sample size (N = 25) and uneven distribution across languages pose constraints, and readers should consider the findings preliminary. Specifically, the sample size limits the ability to capture the full range of variation in GPT-4-generated material and provider assessments, thereby reducing the generalizability of the results, and it precludes exploring variation across professional roles, language groups, and test types. The uneven representation of languages further limits our ability to draw robust conclusions about language-specific patterns or to generalize across linguistic and cultural contexts; some findings may therefore reflect features unique to specific language groups rather than broader trends. In addition, convenience sampling may have attracted individuals with a stronger interest in IC processes or genetic education, introducing selection bias. This could have influenced how participants engaged with the materials, possibly resulting in evaluations that are not fully representative of the broader clinical community. The findings should therefore be interpreted with caution, and future studies should use larger and more diverse samples to capture broader perspectives.

Another limitation of this study is the exclusive use of the GPT-4 model. We did not include other LLMs, such as Gemini, Copilot, or medical models like Med-PaLM and Med-Mistral, as they were outside the scope of the study. Such LLMs are likely to differ in clinical accuracy, terminology, and style. For instance, the Mistral 8x22B LLM has shown promise in enhancing the readability, understandability, and actionability of IC forms without compromising accuracy or completeness41. While this highlights the potential of domain-specific models, our focus on a general-purpose model like GPT-4 strengthens the relevance of our findings to broader, real-world clinical contexts where fine-tuned models may not be readily accessible. Finally, while we carefully designed our prompts, they did not account for patient-specific factors, such as literacy level, clinical history, gender, or age. Although our approach enabled a controlled comparison between GPT-4- and human-generated IC material, it did not allow for personalized content generation. Future research should explore LLM-generated IC material tailored to individual patient needs through personalized prompts.

To conclude, GPT-4 struggled to produce comprehensive IC material, failing to address all IC components for NIPT and BRCA testing. Similar results were observed across both testing scenarios and all examined languages, including English. Despite these limitations, the model performed well in structured IC components, such as explaining the purpose of the test, its intended benefits, and the general aim of testing. These components often follow standardized formats and appear in publicly available patient-facing health materials. Considering this, GPT-4 may be most effective in generating standardized patient instructions, medical protocols, or discharge summaries rather than IC materials. GPT-4-RAG-generated materials were more often perceived as human-authored, showed better readability than human-written materials in German and than zero-shot outputs in English and Italian, and received more consistent evaluations from participants. Although these differences were not formally tested for statistical significance, they suggest that RAG may offer practical advantages over zero-shot prompting in complex clinical communication tasks, such as IC for genetic testing, particularly in non-English languages. Integrating explicit instructions through RAG may improve model performance by ensuring more complete coverage of IC components. GPT-4’s performance in German, Italian, and Greek was poorer than in English. If LLM-generated IC materials favor English-language content, non-English-speaking patients may receive lower-quality health information, further exacerbating existing inequities. Addressing these challenges requires a multifaceted strategy: improving dataset curation, applying multilingual fine-tuning using high-quality, domain-specific texts from underrepresented languages, and designing culturally adapted prompts that reflect local examples, idioms, and healthcare structures. These, along with post-generation validation techniques, should be prioritized as technical, methodological, and ethical imperatives. For now, a hybrid approach, in which GPT-4 generates material and clinicians review and refine it, may be more effective for the IC process in genetic testing.

Methods

We generated IC materials using GPT-4, first in English and then in German, Italian, and Greek. GPT-4 was selected because it was the most current model available at the time of the study and was claimed to surpass its predecessors in reasoning capabilities21. Most prior studies relied on ChatGPT-3 or ChatGPT-3.57,8,12,18,19,26. We then conducted an online survey in which healthcare providers evaluated GPT-4-generated IC materials compared to human-generated ones. The study was approved by the ETH Zurich ethics review board (EK 2024-N-154).

Justification for test and language selection

We used GPT-442 to generate accessible information for two types of genetic tests: NIPT and BRCA. We selected these genetic test scenarios because they are common forms of genetic testing, including in Switzerland43. Additionally, referrals for these tests are relatively straightforward23, and both genetic specialists (medical geneticists and genetic counselors) and non-genetic specialists (obstetricians, midwives, and oncologists) can offer them44. This information was produced in English, German, Italian, and Greek. We selected these languages for three reasons. First, English is the most well-documented language in the literature on GenAI and the primary language of most scientific publications. Second, we included German, Italian, and Greek, as these are the primary languages spoken by our research team and are commonly spoken languages in Switzerland (German and Italian) and throughout Europe. Third, research suggests that GenAI systems typically perform worse in languages other than English17,18, highlighting a gap that needs to be addressed further.

Generation of IC materials

We conducted all experiments using the ‘gpt-4-turbo-2024-04-09’ model. We briefly experimented with GPT-4o, but found that it generated less accurate and comprehensive IC sheets. For increased reproducibility, we set the temperature to zero in all experiments. Prompt development was an iterative process led by a Natural Language Processing (NLP) engineer (DS) and a genetic counselor (KEO) with expertise in genetic testing, IC, and bioethics. All prompts used in this study are detailed in the Supplemental Material.
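For illustration, the block below sketches how such a deterministic generation call could be issued. It assumes the OpenAI Python SDK; the wrapper function and example prompt handling are ours rather than the study’s exact code (the actual prompts are provided in the Supplemental Material).

```python
# Minimal sketch (assumes the OpenAI Python SDK, v1.x); not the study's exact code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_ic_material(system_prompt: str, user_prompt: str) -> str:
    """Generate patient-facing IC text with deterministic decoding (temperature = 0)."""
    response = client.chat.completions.create(
        model="gpt-4-turbo-2024-04-09",   # model used in the study
        temperature=0,                    # set to zero for reproducibility
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```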

GPT-4 generated adequate patient information for NIPT without requiring additional instructions, indicating that zero-shot prompting, which refers to generating output without prior examples or domain-specific context21, was sufficient. However, the output for BRCA testing lacked clinical relevance and included vague statements, such as “ask questions” or “consider your options” without addressing key elements of IC. Moreover, outputs often focused narrowly on BRCA1 or BRCA2, failing to reflect current multigene testing practices. These limitations prompted us to introduce retrieval-augmented generation (RAG)21,22,23. RAG is a technique used to enhance AI-generated responses by enabling the model to access information from external sources, such as databases or documents. By adopting RAG, the generated outputs are usually more accurate and reliable, as they derive from the model’s built-in language abilities, which utilize real-world information retrieved as needed23. In this case, by using RAG, we supplemented the system prompt with information from the National Cancer Institute’s website45.
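As a sketch of this retrieval step, the snippet below shows one simple way the system prompt could be supplemented with reference text; the file of National Cancer Institute excerpts and the helper name are hypothetical, and the study does not report its exact retrieval implementation.

```python
# Simple RAG-style prompt augmentation (sketch; file name and helper are hypothetical).
from pathlib import Path

def build_rag_system_prompt(base_system_prompt: str, excerpts_file: str) -> str:
    """Append retrieved reference text (e.g., NCI excerpts) to the system prompt."""
    reference_text = Path(excerpts_file).read_text(encoding="utf-8")
    return (
        f"{base_system_prompt}\n\n"
        "Base the information on the following reference material about BRCA testing:\n"
        f"{reference_text}"
    )

# Usage (hypothetical file of NCI excerpts), reusing generate_ic_material() from above:
# rag_prompt = build_rag_system_prompt(base_prompt, "nci_brca_excerpts.txt")
# brca_text = generate_ic_material(rag_prompt, "Write patient information for BRCA genetic testing.")
```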

English prompts were first translated into German, Italian, and Greek using DeepL Pro, with accuracy verified by native speakers (Supplemental Material). These translated prompts were then input into GPT-4 to generate the corresponding information in each language, reflecting how healthcare providers might use LLMs in real-world settings. The model consistently responded in the language of the input prompt.
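The snippet below sketches how this translation step could be scripted. It assumes the DeepL Python client; the study used DeepL Pro, and whether the web interface or the API was used is not specified, so the key placeholder and language codes are illustrative.

```python
# Sketch of prompt translation via the DeepL Python client (assumption; the study
# used DeepL Pro and verified translations with native speakers).
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder key

def translate_prompt(prompt_en: str, target_lang: str) -> str:
    """Translate an English prompt into German ('DE'), Italian ('IT'), or Greek ('EL')."""
    result = translator.translate_text(prompt_en, source_lang="EN", target_lang=target_lang)
    return result.text
```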

Development of the survey

We developed the survey drawing on the framework for IC components in genetic testing by Ormond et al.46 (see Table 3). The survey (see Supplemental Material) evaluated the accuracy, relevance, accessibility, and inclusion of these IC components in the GPT-4-generated and human-generated materials. We used five-point Likert scales and multiple-response formats to assess the materials comprehensively. Participants were not informed that they were partially evaluating GPT-4-generated material, but were debriefed upon survey completion per the study protocol. Not disclosing the origin of the patient-facing material ensured that participants were not influenced by preconceptions about the source of the text when evaluating the content.

Participants and recruitment

Our target population included medical and laboratory geneticists, genetic counselors, obstetricians, midwives providing NIPT, oncologists, breast specialists, and surgeons offering BRCA testing. Participants were required to obtain IC for at least one of these tests and to see patients in English, German, Italian, or Greek. Recruitment was conducted via convenience sampling from June 2024 to January 2025. We used available staff lists of Swiss and Greek hospitals. We also distributed study information through the Transnational Alliance for Genetic Counselors (TAGC) emailing list, the Swiss Genetic Counselor Association email list, and the SNPPET Newsletter (N = 498). To supplement these recruitment approaches, members of the research team personally recruited potential participants attending the European Society of Human Genetics (ESHG) conference (June 2024) and a Rare Disease Justice Workshop held in January 2025. Finally, study information was posted on several researchers’ LinkedIn pages in January 2025. All participants provided IC electronically prior to survey participation.

Administration of the survey

We administered the survey using Qualtrics, an online survey tool. We developed four survey versions, each tailored to one of the languages assessed. We used the software’s branch, display, skip, and randomization features to ensure that participants only evaluated the consent material for the genetic test they offer (either NIPT or BRCA) and that the human- and GPT-4-generated scenarios were presented in randomized order. This randomization minimized bias and increased the validity of responses.

Data analysis

We conducted a descriptive analysis of completed surveys (N = 25; NIPT: 12, BRCA: 13) to evaluate expert assessments. Readability was assessed using standard metrics: Flesch-Kincaid47 for English, German, and Italian material and the Simple Measure of Gobbledygook48 (SMOG) for Greek.
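For reference, both metrics are simple functions of word, sentence, and syllable counts. The helper below shows the standard formulas with their usual English coefficients; it is an illustrative sketch (syllable counting is language-dependent and not shown), not the exact implementation used in the study.

```python
import math

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level (standard English coefficients)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_grade(polysyllables: int, sentences: int) -> float:
    """SMOG grade; polysyllables are words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291
```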

We summarized participants’ evaluations of the accuracy, relevance, and inclusion of IC components using means, standard deviations, and interquartile ranges (IQRs). Given the small sample size in each language, we opted against comparative or inferential analyses due to low statistical power. Data analysis was conducted using IBM SPSS Statistics (Version 29.0.2.0) and Excel (Version 16.93.1), with Python (Version 3.12.0) used to generate line charts for the BRCA case study.
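To illustrate how the Fig. 2 line charts could be produced with Python, the sketch below plots mean Likert scores per IC component for the three material types. The component subset and score values are placeholders, not the study data.

```python
# Sketch of a Fig. 2-style line chart (values are placeholders, not study data).
import matplotlib.pyplot as plt

components = ["Test reason", "Test aim", "Test benefit", "Test risk", "General results"]
mean_scores = {                      # hypothetical mean Likert scores (1-5 scale)
    "GPT-4-Zero-Shot": [3.2, 3.5, 3.8, 3.0, 3.4],
    "GPT-4-RAG":       [3.6, 3.9, 3.7, 4.0, 3.8],
    "Human":           [4.6, 4.2, 4.8, 4.1, 4.0],
}

fig, ax = plt.subplots(figsize=(8, 4))
for label, scores in mean_scores.items():
    ax.plot(components, scores, marker="o", label=label)
ax.set_ylim(1, 5)
ax.set_ylabel("Mean Likert score")
ax.set_title("BRCA materials, German respondents (illustrative values)")
ax.legend()
plt.tight_layout()
plt.show()
```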

A genetic counselor and native English speaker (KEO) reviewed all materials, checking human-generated material for accuracy and GPT-4-generated material for hallucinations, meaning confidently presented information that diverges from source inputs, lacks factual grounding, and may be misleading, inaccurate, or irrelevant, often due to encoding and decoding errors in the LLM49,50,51. We present a case study comparing Greek and German BRCA materials, as both had equal sample sizes (N = 9). This enabled a structured comparison of GPT-4-generated and human-generated materials.