Introduction

Patient education materials (PEMs) play a crucial role in dental care, serving as an essential resource for educating individuals about their oral health, post-treatment care, and emergency management [1, 2]. These materials help bridge the gap between professional dental advice and patient understanding, ensuring that individuals can follow appropriate self-care practices at home [2]. In a dental setting, well-structured educational content can enhance patient compliance with treatment recommendations, reduce the risk of complications, and improve overall oral health outcomes [1]. Effective patient education materials should be clear, accurate, accessible, and actionable, enabling individuals to easily comprehend and apply the information provided [1, 3].

With advancements in artificial intelligence (AI), Large Language Models (LLMs) have emerged as a potential tool for generating PEMs efficiently and at scale [4, 5]. LLMs, such as ChatGPT-4.0, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Llama 3.1-405B, are trained on vast amounts of textual data and can generate human-like responses to various prompts. These AI-powered models are increasingly being explored for their ability to simplify complex medical information, personalize health education, and improve accessibility for patients with diverse literacy levels [6,7,8]. While these models can generate fluent and coherent text, concerns remain regarding their accuracy, reliability, and readability when applied to healthcare contexts [9]. Given the high stakes of medical and dental information, it is critical to assess whether LLM-generated materials meet the standards of clarity, medical accuracy, and practical usability.

This study aims to evaluate the effectiveness of LLM-generated PEMs for common dental scenarios, focusing on their reliability, readability, and actionability. By assessing materials generated for four key dental situations, it seeks to determine whether AI-generated content aligns with the principles of effective health communication. The findings provide insights into the strengths and limitations of LLMs in generating PEMs, helping to inform future applications of AI in dental communication and patient care.

Methodology

Study design and selection of large language models

This study used a comparative analytical design to evaluate the reliability, readability, and actionability of patient education materials generated by four LLMs: ChatGPT-4.0, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Llama 3.1-405B. Figure 1 summarizes the key features of these LLMs. The selection of these models ensured a balanced evaluation of both proprietary and open-source LLMs in dental health communication. Due to the nature of the study, ethical approval was not sought. However, adherence to the latest Declaration of Helsinki guidelines was maintained.

Fig. 1: Large language models (LLMs) evaluated in this study.

The figure summarizes key attributes of four LLMs analyzed. Developer information and feature highlights are provided for each model.

Each model was prompted to generate patient education handouts for four specific dental scenarios:

  • Post-operative instructions following a tooth extraction

  • Immediate steps for managing an avulsed tooth

  • Proper daily tooth brushing technique for optimal oral hygiene

  • Self-examination for oral cancer screening

For consistency, the prompts were carefully structured so that each LLM received identical instructions without additional context or examples. They were kept succinct and clear so that anyone using the same set of instructions could easily understand and replicate them, minimizing biases or variations that might arise from differing interpretations or added context. This ensured that each model’s performance reflected only the input provided, allowing an unbiased and uniform evaluation of the generated materials, which were then assessed using multiple standardized evaluation metrics.
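The study does not specify whether the prompts were submitted through the models’ web interfaces or programmatically. As a hedged illustration only, the sketch below shows how an identical scenario prompt could be sent to one model through the OpenAI Python client; the model identifier and the exact prompt wording are placeholders, not the study’s inputs.

```python
# Illustrative sketch only: the study does not state how prompts were delivered.
# Shows one scenario prompt sent programmatically; model name and wording are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = ("Create a patient education handout with post-operative "
          "instructions following a tooth extraction.")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model identifier
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```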

Assessment of readability, actionability and understandability

The Patient Education Materials Assessment Tool (PEMAT) [3] was used to assess the understandability and actionability of the materials generated by each LLM. Five independent dental professionals rated the four materials generated by each model against the PEMAT criteria. The ratings evaluated how easy the content was to understand (understandability) and how clearly patients could identify and apply the recommended actions or steps (actionability). For each material, mean scores were calculated for both understandability and actionability, allowing a comprehensive evaluation of each LLM’s output. For understandability, a score of 70% or above indicates that the material is understandable for most patients; for actionability, a score of 70% or higher is considered good, meaning the material clearly outlines actions that are easy to follow [3].
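As a minimal sketch of the scoring step, assuming PEMAT’s published convention of Agree = 1, Disagree = 0, with not-applicable items excluded from the denominator, the snippet below converts item-level ratings into a percentage score and averages across five raters; all ratings shown are invented for illustration, not the study’s data.

```python
# Minimal sketch of PEMAT scoring, assuming the published convention:
# Agree = 1, Disagree = 0, and N/A items excluded from the denominator.
# All ratings below are invented for illustration.

def pemat_score(item_ratings):
    """item_ratings: list of 1 (Agree), 0 (Disagree), or None (N/A)."""
    scored = [r for r in item_ratings if r is not None]
    return 100 * sum(scored) / len(scored)

# Hypothetical understandability ratings from five raters for one handout
raters = [
    [1, 1, 0, 1, None, 1],
    [1, 0, 0, 1, None, 1],
    [1, 1, 1, 1, None, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, None, 1],
]

mean_score = sum(pemat_score(r) for r in raters) / len(raters)
print(f"Mean understandability: {mean_score:.1f}%")  # compared against the 70% benchmark
```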

In addition to the PEMAT, readability metrics, namely the Flesch Reading Ease score and the corresponding Flesch-Kincaid grade level, were calculated to evaluate the linguistic complexity and accessibility of the materials. These scores were obtained using freely available online calculators and allowed further comparison between the LLMs, focusing on ease of reading and the suitability of the language for various patient populations.
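The specific online calculators are not named in this study; for transparency, the published Flesch formulas that such calculators typically implement are sketched below. The syllable counter is a crude vowel-group heuristic used only for illustration, so its output will differ slightly from dictionary-based tools.

```python
# Sketch of the published Flesch formulas; the online tools used in the study
# are not named here, and the syllable heuristic is an approximation.
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_metrics(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # average words per sentence
    spw = syllables / len(words)   # average syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

ease, grade = flesch_metrics("Rinse gently with warm salt water. Avoid hard or hot foods today.")
print(f"Flesch Reading Ease: {ease:.1f}, Flesch-Kincaid grade level: {grade:.1f}")
```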

Inter-rater reliability

To assess the consistency and agreement among the five raters, Fleiss’ Kappa was used to measure inter-rater reliability (a computational sketch follows the interpretation scale below). The level of agreement was categorized according to the standard Fleiss’ Kappa interpretation:

  • No agreement (≤0)

  • Slight (0.01–0.20)

  • Fair (0.21–0.40)

  • Moderate (0.41–0.60)

  • Substantial (0.61–0.80)

  • Almost perfect agreement (0.81–1.00)
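As noted above, the sketch below is a minimal, self-contained illustration of how Fleiss’ Kappa is computed from a subjects-by-categories count table; the statistical software actually used for the analysis is not specified in this study, and the example ratings are invented.

```python
# Illustrative pure-Python computation of Fleiss' Kappa. table[i][j] holds the
# number of raters who assigned subject i to category j; every row must sum to
# the same number of raters. Example ratings are invented.

def fleiss_kappa(table):
    N = len(table)            # number of subjects rated (e.g., PEMAT items)
    n = sum(table[0])         # number of raters per subject
    k = len(table[0])         # number of rating categories

    # Proportion of all assignments that fall into each category
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]

    # Observed agreement for each subject, then its mean
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N

    # Agreement expected by chance
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Example: five raters classifying four items as Disagree (column 0) or Agree (column 1)
ratings = [[1, 4], [0, 5], [2, 3], [0, 5]]
print(f"Fleiss' Kappa: {fleiss_kappa(ratings):.2f}")
```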

Results

All LLMs provided responses to each of the scenarios outlined, and these responses are presented in Supplementary File 1.

Inter-rater reliability

Llama 3.1-405B demonstrated the highest level of inter-rater reliability, with Fleiss’ Kappa values ranging from 0.78 to 0.89, indicating substantial to almost perfect agreement among the five raters for both understandability and actionability. ChatGPT-4.0 displayed substantial agreement in the ratings for actionability (κ = 0.69) but moderate agreement in the other areas, with Fleiss’ Kappa values ranging from 0.52 to 0.57. Claude 3.5 Sonnet exhibited moderate to substantial inter-rater reliability, with Fleiss’ Kappa values ranging from 0.45 to 0.66. Gemini 1.5 Flash showed consistently substantial agreement, with Fleiss’ Kappa values ranging from 0.73 to 0.79, though not as strong as Llama 3.1-405B. These findings suggest that while all models demonstrated acceptable inter-rater reliability, Llama 3.1-405B emerged as the most reliable model, particularly in generating materials rated with high consistency across raters (Supplementary File 2). The radar plot is presented in Fig. 2.

Fig. 2: Inter-rater reliability across scenarios for each LLM.

Radar chart comparing Fleiss’ Kappa scores of the four large language models (LLMs) across the four scenarios. Each axis represents a scenario, with Fleiss’ Kappa values plotted radially from the center, and each model is shown as a distinct line. Higher values toward the outer edge indicate stronger inter-rater agreement.

Understandability and actionability

Scenario 1- post-operative instructions following tooth extraction

The understandability scores for this scenario varied across the models. ChatGPT-4.0 scored 61% for understandability, indicating moderate clarity, while Llama 3.1-405B scored 49%, suggesting that its content may have been more complex and less clear. ChatGPT-4.0 also had the highest actionability score at 71%, indicating a high level of practical guidance. Llama 3.1-405B and Gemini 1.5 Flash scored lower at 60%, suggesting that their instructions were less actionable and may have lacked specific details for patients to follow effectively (Fig. 3).

Fig. 3: PEMAT understandability and actionability scores across scenarios for each LLM.

Line graphs comparing Patient Education Materials Assessment Tool (PEMAT) scores for four large language models (LLMs) across four scenarios. Scenarios are plotted along the X-axis, with PEMAT percentage scores (0–100%) on the Y-axis.

Scenario 2- immediate steps for managing an avulsed tooth

ChatGPT-4.0 scored 72% for understandability, indicating clear instructions for handling the situation, while Gemini 1.5 Flash scored the lowest at 48%, reflecting potentially more technical language or unclear phrasing. Llama 3.1-405B led in actionability with a score of 65%, demonstrating practical guidance for handling the avulsed tooth and seeking dental care. Claude 3.5 Sonnet, Gemini 1.5 Flash, and ChatGPT-4.0 had relatively lower actionability scores (54%, 51%, and 62%, respectively), indicating instructions that were less detailed or actionable (Fig. 3).

Scenario 3- proper daily tooth brushing technique

Gemini 1.5 Flash scored relatively low for understandability at 51%, whereas ChatGPT-4.0 and Claude 3.5 Sonnet scored higher (75%), reflecting more accessible content. Llama 3.1-405B scored the highest for actionability at 65%, indicating that its material included clear, step-by-step instructions (Fig. 3).

Scenario 4- self-examination for oral cancer screening

ChatGPT-4.0 achieved the highest understandability score (70%), indicating clearer communication of the steps involved in oral cancer self-examination. Claude 3.5 Sonnet and Gemini 1.5 Flash led in actionability, each scoring 60%, indicating actionable guidance on how to perform the self-examination and seek professional care when necessary (Fig. 3). Among the evaluated models, only ChatGPT-4.0 and Claude 3.5 Sonnet reached understandability scores of 70% or above in certain scenarios. ChatGPT-4.0 met or exceeded the 70% understandability threshold in Scenarios 2, 3, and 4, and it was the only model to achieve an actionability score above 70% (Scenario 1). Claude 3.5 Sonnet also performed well in understandability, exceeding 70% in Scenarios 1 and 3. However, none of the models consistently met the 70% benchmark for both understandability and actionability across all scenarios, highlighting the variability in performance depending on the context.

Readability

ChatGPT-4.0 showed a moderate readability range, with Flesch scores between 52.2 and 69.9 across all scenarios, suggesting that its responses were readable at an 8th- or 9th-grade level. Claude 3.5 Sonnet displayed considerable variability: Scenario 2 was the easiest to read (74.0), Scenario 1 was more difficult (57.4), and Scenarios 3 and 4 were the hardest (41.7 and 49.8), which could present challenges for a general audience. Gemini 1.5 Flash produced text that was relatively difficult to read across all scenarios, indicating a high reading difficulty level (10th- to 12th-grade level). Llama 3.1-405B exhibited more variation, with Scenario 3 showing the best readability (76.8, equivalent to a 7th-grade level), while the other scenarios scored lower (Table 1).

Table 1 Characteristics of responses received.

Word count and sentence structure

The length and sentence structure of the responses varied across the models. ChatGPT-4.0 generated the longest responses, ranging from 505 to 655 words per scenario, with sentence lengths of 9.9 to 12.2 words, indicating that it tended to produce more detailed responses with relatively long sentences. Claude 3.5 Sonnet generated shorter responses, between 456 and 551 words, with a more concise sentence structure of 4.3 to 5.8 words per sentence, indicating a more direct, to-the-point approach. Gemini 1.5 Flash produced responses of similar length to Claude 3.5 Sonnet, averaging 470 words. Llama 3.1-405B produced responses of 362 to 467 words, with sentence lengths ranging from 9.1 to 11.4 words (Table 1).
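For reproducibility, counts like those summarized in Table 1 can be derived with very simple text processing; the sketch below uses deliberately naive tokenization rules, which are assumptions and may not match the tool used to produce Table 1.

```python
# Simple sketch of how word counts and average sentence lengths could be derived
# for a generated handout; the counting rules used for Table 1 are not specified,
# so the tokenization below is an assumption for illustration.
import re

def response_stats(text):
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words), len(words) / max(1, len(sentences))

word_count, avg_sentence_len = response_stats(
    "Bite on the gauze for 30 minutes. Do not rinse today. Rest and drink fluids."
)
print(f"Words: {word_count}, average sentence length: {avg_sentence_len:.1f} words")
```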

Discussion

The aim of this study was to evaluate the performance of four LLMs (ChatGPT-4.0, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Llama 3.1-405B) in generating dentistry-related content across four different scenarios. The focus was on assessing inter-rater reliability, understandability, actionability, readability, and response characteristics. The findings indicate notable variations in model performance based on these criteria. Llama 3.1-405B demonstrated superior inter-rater reliability, indicating consistent ratings across raters, but it performed less well in understandability and actionability than ChatGPT-4.0.

Both the American Medical Association (AMA) and the National Institutes of Health (NIH) recommend a 6th- to 8th-grade reading level for patient materials [1, 10, 11]. This range is recommended because many patients have reading skills at or below this level, and health materials above this threshold risk being too complex, potentially limiting comprehension and effective self-care. The results of this analysis showed mixed performance across the LLMs. Llama 3.1-405B and Claude 3.5 Sonnet came closest to meeting this recommendation, with one scenario each falling within the 7th- to 8th-grade range. However, ChatGPT-4.0 and Gemini 1.5 Flash tended to produce content at a higher grade level for all scenarios, which may make the material more challenging for patients to understand. Readability formulas like Flesch-Kincaid provide quantitative estimates but may not fully capture complexity arising from medical jargon or sentence structure. This highlights the importance of human oversight to ensure the language is appropriately simple and clear for diverse patient populations. While the models performed well in many respects, none consistently reached the ideal 6th-grade level, underscoring the need for human intervention to simplify the content to align with the recommended readability levels.

Our findings highlight notable differences in readability, word count, and sentence structure across the LLMs evaluated. Interestingly, these factors can be influenced by how the prompts are framed. For example, explicitly instructing the models to “use simple and easy words so that a sixth grader can understand” or to “limit responses to 100 words” may improve readability and conciseness. Such strategies are valuable for tailoring LLM outputs to different audiences or scenarios, especially in health communication or patient education contexts. Future work could systematically explore how prompt modifications affect readability and length across models and scenarios.

Patient education materials should be clear, concise, and easily understandable to ensure effective communication [12]. Key features include simple, non-technical language that is accessible to a wide range of literacy levels, along with a logical structure that guides the reader through the content [13]. Visual aids, such as diagrams, infographics, or images, are crucial in enhancing understanding and clarifying complex medical concepts [14]. Actionable steps or instructions should be prominently highlighted to help patients follow through with care recommendations. Furthermore, the material should be culturally sensitive and tailored to the patient’s specific needs, ensuring that it resonates with their background and health conditions [15, 16]. It should also include clear contact information for further questions or assistance, fostering patient engagement and empowerment. Lastly, materials should be visually appealing, with a clean layout and ample white space, so that patients can navigate them easily and focus on the important information.

The responses received from all four models in this study did not include any images, infographics, or other visual representations, primarily because these models are designed to generate and process text-based content only. While they excel at producing written responses, they are not inherently equipped to produce or interpret visual elements such as images or diagrams [17]. It is worth noting that ChatGPT-4.0 can generate images in some contexts, depending on the platform and settings used. Nevertheless, the models evaluated here remain focused on generating human-readable text for a variety of applications, including healthcare communication, and generally lack integrated image creation or editing functionality [17,18,19]. As a result, their output is limited to textual information, and human intervention is needed to add visual aids, such as images or infographics, during the final stages of content development; this is especially important for PEMs, where visual aids play a crucial role in improving comprehension.

In addition to images, LLMs cannot offer personalized content tailored to an individual’s specific health condition, demographics, or preferences, as they rely on general inputs. To overcome the general-purpose nature of these models and improve their domain specificity, recent efforts have focused on adapting LLMs using approaches such as Retrieval Augmented Generation (RAG). RAG combines an LLM with external knowledge retrieval, allowing the model to access up-to-date and specialized information relevant to a user’s query; this can enhance the accuracy and contextual relevance of generated content in healthcare settings. Batool et al. [20] demonstrated an embedded GPT model tailored for post-operative dental care, showing improved performance compared with standard ChatGPT. Similarly, Umer et al. [21] applied RAG-enhanced LLM techniques to transform educational journal clubs, addressing specific learning challenges. Incorporating such domain-adapted models may bridge the gap between generalist LLM outputs and the need for precise, personalized patient education materials. LLMs also lack the ability to generate real-time updates or access live data, meaning that their content may not reflect the most current clinical guidelines or patient outcomes. They do not provide clinical decision support or patient-specific instructions, nor do they ensure compliance with local healthcare regulations, making human oversight necessary. Furthermore, LLMs cannot replicate the human element of empathy, which is essential for reassuring patients, nor do they always account for cultural sensitivities or provide reliable citations [22, 23]. As a result, while LLMs can generate informative content, they are not fully equipped to produce dynamic, personalized, and compliant patient information materials without human intervention.
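To make the retrieval-augmentation idea discussed above concrete, the sketch below is a deliberately simplified, keyword-based illustration of the RAG pattern: a small set of vetted dental texts is ranked against the patient’s question and the best match is placed in the prompt before the model is called. It is not the pipeline used in the cited studies, and real systems typically rely on embedding-based vector search rather than keyword overlap.

```python
# Deliberately simplified RAG illustration: keyword-overlap retrieval over a
# small set of vetted dental texts, followed by prompt augmentation. Not the
# pipeline from the cited studies; production systems typically use
# embedding-based vector search instead of keyword overlap.
import re

KNOWLEDGE_BASE = [
    "After an extraction, bite on gauze for 30 to 45 minutes and avoid rinsing for 24 hours.",
    "For an avulsed permanent tooth, handle it by the crown and reimplant it or store it in milk.",
    "Brush twice daily for two minutes with a soft brush held at 45 degrees to the gumline.",
]

def terms(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, documents, top_k=1):
    q = terms(question)
    ranked = sorted(documents, key=lambda d: len(q & terms(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    context = "\n".join(retrieve(question, documents))
    return ("Using only the reference material below, write a patient-friendly answer.\n"
            f"Reference material:\n{context}\n\nPatient question: {question}")

print(build_prompt("My tooth was knocked out, what should I do?", KNOWLEDGE_BASE))
```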

One limitation of the current study relates to the simplicity of the prompts provided to the LLMs. Although identical base prompts were used for all models in our study to maintain consistency and minimize variability due to prompt design, these prompts were intentionally kept basic. It is well-established in the literature that the quality of LLM outputs depends heavily on the quality and specificity of the prompts given [24,25,26]. More complex or detailed prompts could potentially elicit more accurate or nuanced responses from the models [27]. However, we deliberately chose simple prompts to simulate typical real-world scenarios where users may not craft elaborate instructions. This approach reflects practical conditions under which PEMs might be generated by users with limited expertise in prompt engineering. Future research could explore how varying prompt complexity impacts the quality of generated health communication materials.

This study evaluated LLM performance using only four dental scenarios. While these scenarios were chosen for their clinical relevance and diversity—covering preventive care, emergency management, routine post-treatment instructions, and early detection—they represent only a subset of the broad range of patient education needs in dentistry. Consequently, the findings may have limited generalizability to other dental topics or more complex clinical situations. Future research should include a wider variety of scenarios to better assess the comprehensive capabilities of LLMs in dental patient education.

In conclusion, while LLMs demonstrate promising capabilities in generating patient education materials, their current limitations underscore the critical need for human oversight and intervention. Although these models excel at producing coherent text-based content, they generally lack the ability to create visual aids, tailor information to individual patient characteristics, or integrate real-time clinical data. Additionally, LLMs cannot fully replicate essential human qualities such as empathy and cultural sensitivity, which are crucial for effective healthcare communication. Recent advancements, including domain-adaptation approaches such as RAG, offer pathways to enhance model specificity and relevance in healthcare domains. However, even with these improvements, LLM-generated content should be considered a supportive tool for healthcare professionals rather than a standalone solution. Ensuring optimal patient understanding and engagement requires continued refinement of these models combined with active human involvement to address their current shortcomings.