Introduction

First-trimester screening for fetal trisomies 21, 18, and 13 can be effectively conducted between 11 and 13 weeks of gestation. This is achieved through a combined assessment of maternal age, fetal nuchal translucency (NT) thickness, fetal heart rate (FHR), and maternal serum biomarkers, including free β-human chorionic gonadotropin (hCG) and pregnancy-associated plasma protein-A (PAPP-A)1. In addition, non-invasive prenatal testing (NIPT) is used to screen for anomalies; it offers high sensitivity and specificity for the most prevalent aneuploidies, enhancing the accuracy of prenatal assessments2. First-trimester combined screening and NIPT are valuable tools for detecting common aneuploidies. However, interpreting the results and selecting the most suitable screening method can be intricate and may require collaboration among specialists, including genetic counselors, obstetricians, and pediatricians/neonatologists3. This multidisciplinary approach ensures informed decision-making and the implementation of appropriate follow-up care. Decision-making tools and informational materials about prenatal testing have been developed to enhance the counseling process4. Research indicates that these tools help patients better understand their options, minimizing confusion about their choices5,6.

Artificial intelligence (AI) is increasingly being integrated into healthcare to enhance patient education, improve access to medical information, and support clinical decision-making7. In prenatal care specifically, AI-powered chatbots are being explored as innovative tools for enhancing patient education8. However, the accuracy and reliability of AI-generated medical information remain areas of concern, particularly in sensitive fields such as prenatal care.

This study aimed to assess the reliability and readability of ChatGPT-4o’s responses on first-trimester prenatal screening and to evaluate its potential to assist healthcare providers in prenatal counseling.

Materials and methods

The present study formulated a series of structured clinical scenarios to assess the reliability of AI-generated counseling in prenatal screening contexts. The study was approved by the Gaziantep City Hospital Local Ethics Committee (Approval No: 112/2024, dated 15 January 2025), and all procedures were carried out in accordance with relevant guidelines and regulations, including the Declaration of Helsinki. The clinical scenarios used in this study were developed based on the recommendations of the American College of Obstetricians and Gynecologists (ACOG) Practice Bulletin on first-trimester screening and NIPT9. The risk groups were delineated based on NT measurements, maternal serum biochemical markers, specifically PAPP-A and free β-hCG, and the calculated aneuploidy risk ratios. Figure 1 presents an overview of the study design, including the risk stratification process and evaluation framework.

Fig. 1
figure 1

Study design.

A total of 14 perinatologists participated in this study, specializing in prenatal care and high-risk pregnancy management. The criteria for inclusion required that participants possess a minimum of 1 year of clinical experience in prenatal screening, be well-versed in the current prenatal screening guidelines, and willingly agree to participate in the research. To ensure a diverse and experienced panel for the evaluation process, participants were identified through academic institutions, professional medical networks, and obstetric associations.

Fifteen clinical scenarios were systematically developed to represent three distinct categories of pregnancy risk: low, intermediate, and high. The low-risk group includes cases with typical combined screening results, where the calculated aneuploidy risk is no higher than 1 in 1000. The intermediate-risk group comprises cases with a risk between 1 in 100 and 1 in 1000, often characterized by borderline test results or ambiguous findings. In contrast, the high-risk group consists of pregnancies with a calculated risk higher than 1 in 100, abnormal NIPT results, or significant ultrasound anomalies10,11. Table 1 summarizes the classification for these clinical scenarios.
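The stratification above reduces to a simple threshold rule. The sketch below is purely illustrative (the helper name and the handling of the exact 1 in 100 and 1 in 1000 boundary cases are our assumptions, not part of the study protocol); the risk is passed as the denominator N of a "1 in N" ratio, so larger N means a more remote risk.

```python
def classify_risk(risk_denominator: int) -> str:
    """Classify a pregnancy by its calculated aneuploidy risk '1 in N'.

    Thresholds follow the study's stratification:
      low:          1 in 1000 or more remote  (N >= 1000)
      intermediate: between 1 in 100 and 1 in 1000  (100 < N < 1000)
      high:         higher than 1 in 100  (N <= 100)
    """
    if risk_denominator >= 1000:
        return "low"
    if risk_denominator > 100:
        return "intermediate"
    return "high"
```

For example, a combined-screening result of 1 in 1500 would fall in the low-risk group, 1 in 500 in the intermediate group, and 1 in 50 in the high-risk group.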

Table 1 Categorization of clinical scenarios based on prenatal risk assessment.

Each scenario was designed to ensure clinical realism, incorporating biochemical markers, ultrasound findings, and genetic risk factors. To ensure authenticity, all scenarios were constructed based on current clinical guidelines and real-case examples. The scenarios were phrased in a patient-oriented manner to simulate real-world questions that clinicians commonly encounter during first-trimester prenatal counseling, while maintaining medical accuracy and consistency with guidelines. Each inquiry was submitted to ChatGPT-4o in a distinct session, with each response meticulously documented. This methodology facilitated an independent assessment of the AI-generated replies.

Each scenario was clearly and concisely presented to ChatGPT-4o using the following prompt: “You are assisting as a medical advisor specializing in prenatal screening and counseling. Your task is to provide a concise, evidence-based response to the question below. Your answer should be based on the latest ACOG guidelines and not exceed 150 words unless additional clarification is necessary”. A full transcript of all AI-generated responses is provided in the supplementary material. The prompt format reflects the study’s focus on AI as a clinical counseling support tool for healthcare providers.

The DISCERN instrument, commonly used to evaluate the reliability of health information12, and the Global Quality Scale (GQS), a tool frequently applied in assessing the quality of online health content13, were adapted in this study to evaluate AI-generated responses. The modified DISCERN (mDISCERN) version was tailored to focus on scientific reliability, adherence to clinical guidelines, objectivity, clarity, and clinical applicability. Similarly, the adapted GQS emphasized completeness, accuracy, clarity, and clinical usefulness to ensure relevance to the study objectives. The specific evaluation criteria for both scales are summarized in Table 2. For each question, the final GQS and mDISCERN scores were determined by summing the individual ratings provided by the 14 experts; since each expert could assign a maximum score of 5, the highest possible total score per question was 70. The final scores therefore represent the cumulative ratings given by the experts, reflecting their overall assessment of the AI-generated responses.
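As a worked example of the aggregation described above, the following sketch (the function name is hypothetical) sums 14 expert ratings on a 1-5 Likert scale, yielding the 70-point maximum per question:

```python
def total_score(ratings):
    """Aggregate per-expert ratings (1-5 scale) into the cumulative
    per-question score used in the study; with 14 experts the maximum
    attainable total is 14 * 5 = 70."""
    assert all(1 <= r <= 5 for r in ratings), "ratings are on a 1-5 scale"
    return sum(ratings)
```

So a question rated 5 by every expert would score 70, while one rated 4 across the board would score 56.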

Table 2 Comprehensive quality, reliability, and readability assessment metrics.

Readability was measured using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG, Gunning Fog Index, and Coleman-Liau Index. The methodology for calculating these readability metrics is outlined in Table 2, providing a standardized approach to evaluating text complexity. These analyses facilitated a comparative evaluation between detailed and summary-format responses, offering insights into their accessibility and clarity. Readability scores were obtained using the Readability Test Tool14.
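Two of the indices listed in Table 2 can be computed directly from word, sentence, and syllable counts using their standard published formulas; the sketch below (function names are illustrative) shows the arithmetic the Readability Test Tool applies.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # Standard FRE formula: higher scores indicate easier text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # Standard FKGL formula: result approximates a US school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For a 100-word passage with 5 sentences and 150 syllables, these formulas give an FRE of about 59.6 ("fairly difficult") and an FKGL of about 9.9 (roughly tenth grade), illustrating how sentence length and syllable density drive both indices.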

Statistical analyses were conducted to assess the reliability of expert ratings and examine potential relationships between response quality and readability metrics. Inter-rater reliability was evaluated using the Intraclass Correlation Coefficient (ICC) to measure agreement among expert raters, while Cronbach’s Alpha was calculated to assess the internal consistency of the ratings. Correlation analyses were performed in R to explore potential relationships between mDISCERN, GQS, and readability scores, with statistical significance set at p < 0.05.
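The analyses themselves were run in R; purely as an illustrative sketch, Cronbach's Alpha can be computed by treating the 14 raters as "items" and comparing the sum of per-rater variances with the variance of the per-question totals (the function name and input layout below are our assumptions):

```python
from statistics import pvariance

def cronbach_alpha(ratings_by_rater):
    """ratings_by_rater: list of k lists, one per rater, each holding that
    rater's scores for the same ordered set of questions.

    alpha = k/(k-1) * (1 - sum(rater variances) / variance(question totals))
    """
    k = len(ratings_by_rater)
    question_totals = [sum(col) for col in zip(*ratings_by_rater)]
    sum_rater_vars = sum(pvariance(r) for r in ratings_by_rater)
    return (k / (k - 1)) * (1 - sum_rater_vars / pvariance(question_totals))
```

Two raters who score every question identically yield an alpha of 1.0, the perfect-consistency ceiling that the study's observed value of 0.975 approaches.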

Additionally, one-way ANOVA was conducted in R to compare differences in GQS, mDISCERN, and readability scores across the three risk groups (low, intermediate, and high risk). When a statistically significant difference was found (p < 0.05), post-hoc Tukey tests were performed to determine pairwise differences between groups. GQS and mDISCERN scores were reported as Mean (SD), while readability indices were presented as Median (Min–Max) to account for skewness in the data. Results were reported with mean differences, confidence intervals, and adjusted p-values to identify specific group differences.
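Again for illustration only (the study's analyses were run in R), the one-way ANOVA F-statistic compares between-group to within-group variability; the helper name below is hypothetical.

```python
def one_way_anova_f(*groups):
    """F = (SSB / (k - 1)) / (SSW / (N - k)) for k groups of observations."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    # Between-group sum of squares: group sizes weight squared mean offsets.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: deviations from each group's own mean.
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))
```

A large F (as with the study's GQS and mDISCERN comparisons) indicates that group means differ by more than within-group noise would explain, after which pairwise Tukey tests localize the differences.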

All visualizations were created using Lucidchart and ChatGPT-4o to enhance data interpretation.

Results

The findings indicate that expert evaluations of the AI-generated content showed a high level of inter-rater agreement (ICC = 0.998) and excellent internal consistency (Cronbach’s Alpha = 0.975). Although Global Quality Scale (GQS) and mDISCERN scores varied markedly across the risk groups, the readability indices did not differ substantially.

Table 3 presents the comparison of GQS, mDISCERN, and readability indices across the three risk groups (low, intermediate, and high). The GQS and mDISCERN scores differed among the risk groups. The high-risk group had the highest scores (GQS: 61.64 ± 1.59; mDISCERN: 62.00 ± 1.78), while the low-risk group had the lowest (GQS: 58.20 ± 2.72; mDISCERN: 59.44 ± 3.51). Further pairwise comparisons, detailed in Table 4, indicate that GQS scores were significantly higher in the high-risk group compared to the low-risk group (p = 0.002), while the difference between the intermediate and high-risk groups was not statistically significant (p = 0.199). Similarly, mDISCERN scores were significantly higher in the high-risk group compared to both the intermediate (p = 0.010) and low-risk groups (p < 0.001). Additionally, the intermediate-risk group had significantly higher mDISCERN scores than the low-risk group (p < 0.001). For readability indices, the results presented in Table 3 show that there were no statistically significant differences between risk groups for any of the indices (all p > 0.05).

Table 3 Comparison of GQS, mDISCERN, and readability indices across risk groups.
Table 4 Multiple comparisons of GQS and mDISCERN between risk groups.

Table 5 presents the correlation analysis between readability metrics and GQS/mDISCERN scores. The results indicate that none of the readability indices showed a statistically significant correlation with GQS/mDISCERN scores (all p > 0.05). The strongest correlation was observed between FKGL and GQS/mDISCERN (r = 0.43, p = 0.10), while the weakest correlation was found for CLI (r = −0.09, p = 0.79).

Table 5 Correlation analysis between readability metrics and GQS/mDISCERN scores.

A supplementary analysis was performed to examine whether the readability of the prompts influenced the readability of the AI-generated responses. No significant correlations were found across the five indices (FRE: r = 0.44, p = 0.10; FKGL: r = 0.32, p = 0.24; GFI: r = 0.05, p = 0.87; CLI: r = 0.26, p = 0.34; SMOG: r = 0.15, p = 0.59), indicating that prompt readability did not drive response readability.

Figure 2 presents the GQS/mDISCERN average scores and readability metrics (FRE, FKGL, GFI, CLI, and SMOG) for each question. The variation in readability scores across questions is evident, with FRE scores ranging from 1.7 to 38.3 and FKGL scores between 11.7 and 17.7. The highest FRE score was observed for Q10 (38.3), while Q15 had the lowest (1.7), indicating substantial differences in readability across the dataset. Similarly, the GQS/mDISCERN scores also varied, with Q5 having the highest average score (4.59) and Q1 the lowest (4.03). However, no clear pattern emerged between readability and quality scores, suggesting that higher readability did not necessarily correspond to higher GQS/mDISCERN ratings.

Fig. 2
figure 2

GQS/mDISCERN scores and readability metrics for each question.

For instance, in a low-risk case (Q1), ChatGPT-4o correctly reassured that no further testing was necessary but did not mention the importance of continued routine screening, resulting in a lower quality score. In contrast, in a high-risk scenario (Q13) involving markedly increased nuchal translucency, ChatGPT-4o provided a comprehensive explanation of possible genetic causes and follow-up options such as CVS and amniocentesis, which contributed to higher GQS and mDISCERN ratings. These examples illustrate that the model performed more accurately and thoroughly in clinically complex, high-risk situations.

Discussion

Our study demonstrates that ChatGPT-4o offers structured and clinically relevant responses concerning first-trimester prenatal screening methods, including combined screening and NIPT. Notably, the accuracy of responses varies by risk group, showing higher precision in high-risk cases, satisfactory performance in intermediate-risk scenarios, and relatively lower—yet still informative—responses for low-risk cases.

Previous studies have emphasized the importance of genetic education and psychological factors in prenatal decision-making, highlighting that maternal anxiety and societal norms influence the autonomy of choice regarding prenatal screening15. This aligns with our findings, which indicate that AI-generated responses could be further refined to support both clinicians and expectant parents in navigating complex decisions regarding prenatal genetic screening.

A study investigating computerized decision aids for aneuploidy screening demonstrated that such tools can be as effective as genetic counseling in improving patient knowledge and reducing decisional conflict4. Our findings suggest that AI-driven tools, such as ChatGPT-4o, may serve as a complementary resource in prenatal screening counseling, particularly in clinical settings where access to genetic counselors is limited.

A recent study comparing ChatGPT and Google Bard AI emphasized the necessity for continuous refinement of these models, particularly in sensitive healthcare domains, due to observed discrepancies in information accuracy and responsiveness16. Similarly, our findings highlight that while ChatGPT-4o provides clinically relevant and guideline-adherent prenatal screening responses, ensuring AI-generated content’s reliability requires ongoing evaluation and expert oversight.

Chatbot-based tools in patient education have been shown to enhance knowledge and improve satisfaction among patients and healthcare providers17. In line with these findings, our study highlights the potential role of ChatGPT-4o in facilitating prenatal counseling and patient education. The clinical scenarios used in this study were derived from real-world patient inquiries, ensuring that the questions presented to the AI model reflected genuine concerns encountered in prenatal screening. Our results demonstrate that ChatGPT-4o provided high-quality responses, with expert evaluations ranking the AI-generated answers four or above on the quality scale, as illustrated in Fig. 2.

A recent study assessing ChatGPT’s performance as a fertility counseling tool demonstrated that the model provides relevant and meaningful responses comparable to established sources18. However, the study highlighted key limitations, including the inability to reliably cite sources and the risk of generating fabricated information, which may restrict its direct clinical applicability. These findings emphasize the necessity for continuous improvements in AI models to ensure transparency and trustworthiness in medical communication. Similarly, our study found that ChatGPT-4o generates clinically relevant responses in prenatal screening counseling. While the model consistently provided structured and evidence-based information, its limitations remain, particularly in ensuring the traceability of its sources and mitigating potential inaccuracies.

A study conducted in 2019 indicated that while many patients expressed a willingness to utilize AI-based health chatbots, notable hesitation remained a considerable barrier to engagement19. In our study, the responses generated by ChatGPT-4o were evaluated using the GQS and mDISCERN assessment tools. The findings showed that the model provided clinically relevant and comprehensible answers, achieving high scores on both evaluation metrics.

A recent study evaluating the application of ChatGPT in femoroacetabular impingement syndrome emphasized that while AI-driven chatbots hold significant promise as medical resources, their integration into clinical practice must be cautiously approached. The study highlighted the necessity of ongoing validation and expert oversight to minimize the risk of misinformation and ensure that AI-generated content adheres to stringent medical accuracy standards20. In line with this perspective, our study found that ChatGPT-4o’s response quality improved as the risk level increased in prenatal screening scenarios. As shown in Table 3, there were statistically significant differences in GQS (F = 10.98, p = 0.002) and mDISCERN (F = 50.45, p < 0.001) scores across the three risk groups.

The correlation analysis between GQS and mDISCERN scores highlights a strong internal consistency within expert evaluations, suggesting that response quality and medical accuracy were systematically assessed. However, the moderate correlation between GQS and mDISCERN implies that ChatGPT-4o’s generally well-structured and readable responses may not always align perfectly with evidence-based medical guidelines. This finding reinforces the need for expert oversight when utilizing AI-generated medical content. Furthermore, the lack of significant correlation between readability and expert-evaluated quality suggests that readability alone cannot indicate response accuracy or clinical relevance (Table 5).

The AI-generated information was generally reliable, with expert evaluations indicating that responses were clinically relevant and evidence-based. However, regarding readability, most responses required an advanced reading level, which may present challenges for users with lower health literacy. Because the system prompt instructed the model to act as a clinical advisor, this may have contributed to the professional tone of the responses. Additional analysis showed no significant relationship between the readability of the questions and the AI-generated responses. This suggests that the complex language used by the model mainly results from how it constructs its answers rather than from the way the questions were written. Simplifying AI-generated content before sharing it with users could therefore improve understanding and accessibility.

This study has several limitations. First, the evaluation was conducted by 14 perinatologists, which, while providing a valuable professional perspective, may not fully represent the diversity of opinions within a broader clinical community. Additionally, 15 standardized clinical scenarios were developed to simulate real-world patient inquiries; however, they may not entirely capture the complexity and nuances of actual patient interactions in clinical practice. The scenarios were phrased in a patient-oriented manner but contained structured clinical details to facilitate expert evaluation, which may differ from the language typically used in real consultations. Another limitation is that AI models, including ChatGPT-4o, are continuously evolving; thus, the performance observed in this study may not directly apply to future versions. In addition, the evaluation panel did not include genetic counselors. Although prenatal counseling is primarily provided by perinatologists in our national healthcare setting, the absence of genetic counselors may limit the generalizability of the findings to contexts where multidisciplinary counseling teams are standard practice. Furthermore, the study did not include patient perspectives, preventing an assessment of how AI-generated responses are perceived and understood by the intended audience. Finally, while this study focused on ChatGPT-4o, a comparison with other AI-based models such as Bard or MedPaLM could provide further insight into the relative strengths and weaknesses of different AI chatbots in prenatal counseling.

Conclusion

In conclusion, ChatGPT-4o demonstrates significant potential in providing information on first-trimester combined screening and NIPT, offering structured and clinically relevant responses. It may serve as a supplementary tool to support clinicians in genetic counseling and prenatal decision-making. However, response quality varies across risk groups, with the highest accuracy in high-risk cases, good performance in intermediate-risk cases, and relatively lower—but still informative—responses in low-risk cases. While AI-generated content shows promise for enhancing counseling quality, continued improvements in reliability and consistency are needed across all clinical scenarios. Ultimately, AI can complement but not replace human judgment; expert supervision is indispensable to prevent misinformation and maintain ethical standards in prenatal counseling.