Hospital discharge summaries, also known as discharge referrals or clinical handovers, communicate key information about a patient related to their hospital admission. Details often include presenting complaints, investigations, diagnoses, treatments received, procedures completed and instructions for continuity of care after discharge, such as new medications and follow-up actions including appointments, tests and wound care1. Discharge summaries are designed to be the main communication tool supporting the safe transition from hospital to community care. Most information in the discharge summary is intended for a primary care physician rather than for the patient. Evidence suggests that while primary care physicians support patients receiving a copy of their discharge summary2, only one third of discharge summaries contain patient-centred information3.

Medication errors after discharge are common. More than half of patients misunderstand the indication, dose, or frequency of the medications they take after discharge, and these misunderstandings are more common among patients with low health literacy4. A meta-analysis of emergency department discharge summaries found that only 58% of patients could correctly recall their written discharge instructions5. In a study of 254 older adults in the United States, 22% of participants did not understand how to take their medication, and the rate was 48% among participants with low health literacy6.

Discharge summaries can be designed or augmented to improve patient safety in care transitions. Patient-centred language in discharge instructions has been associated with lower rates of readmission and fewer patient calls to the hospital7. Health literacy guidelines recommend reducing medical jargon and using everyday language to improve understanding of health information8,9,10. The Universal Medication Schedule (UMS) is a specific format that explains medication dosage and timing in relation to four time periods (morning, noon, evening, bedtime), for example ‘Take 1 tablet in the morning and 1 tablet at bedtime’11. While evidence of its impact is mixed, its use is associated with improved medication adherence, particularly for older adults with more complex medication regimens12,13,14,15.

Language models can be used in tools that simplify online health information for patients, though current evidence does not yet provide a clear picture of their value16,17,18,19,20. One study examined the use of ChatGPT to simplify radiology reports into plain language that could be used by patients and healthcare providers21. We know of no studies that have examined the use of generative artificial intelligence models with discharge summaries to generate new patient-centred discharge instructions supporting patients’ medication use, ongoing actions and appointments.

In this study, we evaluated the safety, accuracy and language simplification of patient-centred discharge instructions generated by a GPT-based model. To do this, we developed a prompt to generate patient-centred discharge instructions using GPT-3.5-turbo-16k (2023-07-01-preview version; hereafter, GPT-3.5). Three prompt strategies were developed and evaluated to find a prompt that balanced language simplification with the correctness of medications and follow-up actions in the AI-generated response (see “Methods”). Discharge summaries were used as reference documents from which to generate responses. Clinicians then compared the descriptions of medications and follow-up actions for 100 pairs of AI-generated discharge instructions and their original discharge summaries.

The median length of the original discharge summaries was 1506 words (interquartile range [IQR] 1096–1987). The median number of medications was 9 (IQR 6–12) and the median number of actions was 5 (IQR 3–7). Across the original discharge summaries, the mean grade reading level was 10.7 (standard deviation [SD] 0.5) and the mean language complexity was 40.3% (SD 3.9). Patients were generally older; patients over 60 years of age comprised 48% of the examples.

The AI-generated responses were shorter and simpler than the original discharge summaries. The median length of the responses was 267 words (IQR 197–355), with a mean grade reading level of 10.1 (SD 1.0) and a mean language complexity of 31.2% (SD 4.4). Grade reading level (p < 0.001, t = 5.96) and language complexity (p < 0.001, t = 15.7) were both lower in the patient-centred discharge instructions than in the original discharge summaries. A median of 25% of medications (IQR 0–50%) could be presented in UMS format. While the results show a significant reduction in grade reading level and language complexity, the proportion of medications that could be written in UMS format and were correctly represented in that format in the AI-generated responses was relatively low, suggesting that future studies may need additional ways to measure how well outputs align with patient needs and health literacy levels.

Clinicians, including pharmacists and primary care physicians, compared the text of the original discharge summaries and the patient-centred discharge instructions. The responses captured most of the relevant medications and follow-up actions correctly (Table 1). For example, the median percentage of medications correctly summarised in the patient instructions was 100% (IQR 81–100%), while the median for follow-up actions was 86% (IQR 67–100%). The responses rarely added medications that were not in the original discharge summary (3% of cases) but introduced new actions in 42% of cases (Supplementary Tables 3, 4).

Table 1 Performance of the AI-generated patient discharge instructions

A range of safety issues were identified in the responses (Fig. 1). Safety issues attributable to the AI-generated response were identified in 18% (18 of 100) of the patient-centred discharge instructions. Other issues considered less severe and unlikely to cause harm were identified in 28% (28 of 100). In one case, an AI-generated response included ‘Carbamazepine 400 mg: Take 2 tablets by mouth twice daily’, whereas the original discharge summary specified ‘one 400 mg tablet twice daily’ (Supplementary Tables 3, 4). In a post-hoc analysis of factors associated with safety issues, we found no evidence of differences by patient age, gender, total number of medications, or type of care service (Supplementary Table 1).

Fig. 1: A review of safety issues identified in the AI-generated patient-centred discharge instructions.

Safety issues were identified across the 100 patient-centred discharge instructions, and descriptions include the severity and provenance of each issue. Four examples are highlighted in the figure based on their error category and source.

To our knowledge, this study represents the first investigation of the safety, accuracy and language of patient-centred discharge instructions generated from discharge summaries using an AI tool. The results showed that nearly all medications from the original discharge summaries were correctly reflected in the AI-generated responses, though only around half of the follow-up actions were included and new actions were often added. The AI-generated responses were better aligned with health literacy principles than the discharge summaries but only some of the medication instructions could be simplified into a form that is known to be easier for patients to follow. Importantly, potential safety issues were introduced into the instructions.

In related work, the use of generative AI to support the production of discharge summaries has been proposed22, and early tests for producing discharge summaries have had some positive results23. Other research has examined the use of ChatGPT to simplify surgical consent forms16, and radiology reports21, as well as general health information available online17,18,19,20. Others propose generative AI tools as possible solutions to healthcare communication issues but—consistent with our findings—suggest caution in relation to their safety24,25,26,27.

Recent advances in generative AI have enabled general-purpose generative AI tools to be used in clinical workflows, but our findings suggest that more work needs to be done to ensure the tools are adopted safely in practice and avoid unintended consequences. For producing patient-centred instructions, generative AI could be used to produce a ‘first draft’, but the need to review the instructions may add to clinical workload. Future research to improve the safety of generative AI used with discharge summaries may benefit from the development and use of tools that help identify the source of safety issues in summarisation tasks, including hallucination, the balance between extracting information and generating new text, and meaning changed through summarisation. Future practice and implementation directions may consider broader goals in transitions of care, including generating discharge summaries from medical records22,23 and generating multiple discharge documents intended for primary care physicians, other care providers, and patients with different levels of health literacy and from culturally and linguistically diverse backgrounds.

This study had several limitations. The discharge summaries were from one dataset (MIMIC-IV), which represents one location, and results may not generalise to other healthcare systems where discharge summaries differ. Data from MIMIC-IV are de-identified, which means some details were missing, introducing ambiguity in the original discharge summaries that occasionally hindered evaluation of the responses. While we used a robust approach to prompt engineering, evaluations of other prompts may have yielded different results. It may be useful to treat information extraction and summarisation with language simplification as two separate tasks, but this approach would also need to consider how best to incorporate contextual information from other sections of the discharge summary. We used GPT-3.5 as the basis for generating responses, and other language models may have yielded different results. Future studies could replicate the methods used here and compare different combinations of language models, prompts and discharge summaries from other locations.

Generative AI tools may be used to support discharge planning by generating new patient-centred discharge instructions, filling an important current gap in communication with patients as they leave hospital. While there is a clear need to improve communication with patients on discharge, AI-generated patient discharge instructions can introduce incorrect information, which in some cases could lead to harm. New language models and advances in prompt engineering may help to balance constraints related to health literacy, accuracy and safety. Before considering the use of AI-generated patient-centred discharge instructions with patients, processes for ensuring safety are needed.

Methods

The study design was a comparison between patient-centred discharge instructions generated by prompting a GPT-based model and the doctor-written discharge summaries on which they were based. Evaluations included manual review by experts, and all evaluations of accuracy and safety were undertaken by investigators with qualifications in medicine or pharmacy.

The University of Sydney Research Integrity and Ethics Administration confirmed that the methodology of the study meets ethical review exception guidelines, as per the National Health and Medical Research Council National Statement on Ethical Conduct in Human Research. The study involved the use of existing collections of data or records that contain only non-identifiable data and was deemed to be of negligible risk.

Data sources

Discharge summaries were sourced from the Medical Information Mart for Intensive Care IV (MIMIC-IV) version 2.2 database28,29,30. The database includes de-identified electronic medical records from over 40,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, between 2008 and 2019. All investigators interacting with data from MIMIC-IV were credentialed users of the PhysioNet database. Discharge summaries were randomly sampled from the MIMIC-IV database and used in development and analysis if they were written in English and if the patients were discharged from hospital alive (Supplementary Table 5). Ten discharge summaries were used to help develop prompts and train investigators on the evaluations, and 100 discharge summaries were used in the main evaluation (Supplementary Table 3). Other information from MIMIC-IV related to the patients in the discharge summaries was not accessed or used.
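As an illustration of the sampling step, the sketch below draws a random sample of discharge notes with pandas. It is a minimal sketch only, not the study's actual pipeline: the file name and column follow the public MIMIC-IV-Note distribution (an assumption here), and the eligibility filter is a placeholder, since the study's criteria (written in English, patient discharged alive) require checking against admission records.

import pandas as pd

# MIMIC-IV-Note distributes discharge notes as "discharge.csv.gz" with a
# free-text "text" column (assumed here; adjust to the local copy).
notes = pd.read_csv("discharge.csv.gz", compression="gzip")

# Placeholder eligibility filter; the real inclusion criteria need a join
# against admission records to confirm the patient was discharged alive.
eligible = notes[notes["text"].notna()]

# Sample 100 summaries without replacement; a fixed seed makes the draw
# repeatable.
sample = eligible.sample(n=100, random_state=42)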

Prompt development and selection

The GPT-3.5 model was accessed via the Microsoft Azure OpenAI service, which met the requirements for safe use of MIMIC-IV data. A ChatGPT-like interface was developed to allow safe access to GPT-3.5 for testing prompts on examples of discharge summaries from MIMIC-IV (Supplementary Figs. 1–3, Supplementary Boxes 1–3).

The language model takes a prompt and an entire discharge summary as inputs and generates a response. The response is not an extraction of the text in the discharge summary but newly generated text in response to the instructions provided in the prompt. Language models are known to be sensitive to small changes in prompts, so the prompt used in the analysis was developed through a process of iterative refinement and testing.
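For context, the sketch below shows how a prompt and a full discharge summary might be passed to the model through the Azure OpenAI Python client. This is a hedged illustration rather than the study's code: the deployment name, credentials and temperature setting are assumptions.

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="...",                     # credentialed Azure key (placeholder)
    api_version="2023-07-01-preview",  # API version named in this study
    azure_endpoint="https://<resource>.openai.azure.com",
)

def generate_instructions(prompt: str, discharge_summary: str) -> str:
    # The response is newly generated text guided by the prompt,
    # not an extraction of text from the discharge summary.
    response = client.chat.completions.create(
        model="gpt-35-turbo-16k",      # Azure deployment name (assumed)
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": discharge_summary},
        ],
        temperature=0,                 # assumed, to reduce run-to-run variation
    )
    return response.choices[0].message.content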

First, expert-derived examples of patient instructions were created. Five discharge summaries from the MIMIC-IV database were used by two investigators to derive patient discharge instructions (including medication and action lists). Disagreements were resolved by discussion with the broader group of investigators. Following this step, prompts were iteratively refined and tested to produce responses that most closely matched five of the expert-derived examples, using three prompt design approaches: ‘direct’, ‘multi-stage’ and ‘worked example’ (Supplementary Figs. 1–3, Supplementary Boxes 1–3). Investigators with clinical expertise then scored each of the three prompts on each of five additional examples.

The prompt with the best balance between language complexity and accuracy was selected for the main analysis. The selected prompt used the ‘direct’ approach, which more often represented medications correctly and included more of the follow-up actions than the other prompts, while still reducing grade reading score and language complexity. A two-step process, in which information is first extracted from the original discharge summary and then simplified to match the needs of patients, may seem like a useful approach. However, the challenge with splitting the approach into two stages beginning with information extraction (rather than retrieval-augmented generation) is that the whole discharge summary provides contextual information that may be important to the details of the medications and follow-up instructions; direct information extraction would not capture that context in the same way.

Analysis and outcome measures

Each response was independently scored by two investigators with expertise in medicine or pharmacy, comparing each response against the information available in the original discharge summary. Inter-rater reliability was calculated using Cohen’s kappa for dichotomous variables and the intraclass correlation coefficient for proportional variables. Disagreements were resolved by discussion among the group, producing a final set of scores for each of the 100 discharge summaries. Descriptive statistics were also recorded, including the number of words, medications, and actions in the original discharge summaries and the responses.
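As a minimal sketch of how these reliability statistics can be computed, the example below uses scikit-learn for Cohen’s kappa and the pingouin package for the intraclass correlation coefficient; the ratings shown are fabricated placeholders, not study data, and the choice of packages is ours rather than the study’s.

import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Dichotomous item (e.g. "all medications included"), one rating per summary.
rater_a = [1, 1, 0, 1, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(rater_a, rater_b))

# Proportional item (e.g. percentage of medications in UMS format),
# scored by both raters for each summary.
scores = pd.DataFrame({
    "summary": [1, 1, 2, 2, 3, 3],
    "rater": ["a", "b", "a", "b", "a", "b"],
    "rating": [0.25, 0.30, 0.00, 0.00, 0.50, 0.40],
})
icc = pg.intraclass_corr(data=scores, targets="summary",
                         raters="rater", ratings="rating")
print(icc[["Type", "ICC"]])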

Agreement between experts was higher for whether all discharge medications from the original discharge summary were included in the response (Cohen’s kappa 0.889), whether no new medications were added (Cohen’s kappa 0.852) and the percentage of medications presented in UMS format (intraclass correlation coefficient 0.738). Agreement was lower for whether all actions from the original discharge summary were included in the response (Cohen’s kappa 0.521), whether no new actions were added (Cohen’s kappa 0.569), the percentage of medications that were correct (intraclass correlation coefficient 0.438) and the percentage of actions that were correct (intraclass correlation coefficient 0.512).

Clinicians made note of any potential safety issues while evaluating the completeness and accuracy of the medications and follow-up actions, and these notes were discussed as a group to determine severity and provenance. Errors were categorised as errors of omission, such as missing instructions, or errors of commission or translation, such as a changed dose or route of a medication, inclusion of medications used during the hospital stay but not intended for use after discharge, or introduction of a new medication or follow-up action as a form of hallucination from the AI model.

The accuracy of the AI-generated responses was evaluated using three measures (Table 2): whether all medications and actions in the original discharge summary were included in the patient instructions; whether the responses included additional medications or actions that were not present in the post-discharge instructions within the original discharge summary; and the percentage of medications and actions from the original discharge summary that were included and correct in terms of dose, route, frequency and duration.

Table 2 Study outcome measures and assessment method

Health literacy was evaluated using three outcome measures (Table 2). Grade reading level and language complexity were measured using the Sydney Health Literacy Lab Health Literacy Editor24,31. Grade reading score estimates the level of education most people would need to correctly understand a given text. The Editor calculates grade reading score using the Simple Measure of Gobbledygook (SMOG), which is widely used in health literacy research32. Language complexity is the percentage of words in the text that are considered medical jargon, acronyms, or uncommon English words; this calculation was based on existing medical and public health thesauri and an English-language word frequency list. For both measures, lower values correspond to simpler text that should be easier to understand. Paired-sample t-tests were used to compare grade reading level and language complexity scores between the original discharge summaries and the AI-generated patient-centred discharge instructions. For medications prescribed up to four times a day, we manually determined the percentage of medications presented in the patient-centred discharge instructions in UMS format.
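To illustrate this comparison, the sketch below substitutes the textstat package’s SMOG implementation for the Editor’s grade reading score and applies a paired t-test with scipy. The substitution is an assumption (the study used the Health Literacy Editor itself) and the texts are placeholders.

import textstat
from scipy.stats import ttest_rel

# Placeholder lists; in the study there were 100 matched document pairs.
originals = ["Full text of original discharge summary 1.",
             "Full text of original discharge summary 2.",
             "Full text of original discharge summary 3."]
responses = ["Matching AI-generated patient instructions 1.",
             "Matching AI-generated patient instructions 2.",
             "Matching AI-generated patient instructions 3."]

# SMOG grade for each document (a stand-in for the Editor's measure).
grade_orig = [textstat.smog_index(t) for t in originals]
grade_resp = [textstat.smog_index(t) for t in responses]

# Paired comparison across matched pairs (original vs AI-generated).
t_stat, p_value = ttest_rel(grade_orig, grade_resp)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")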