Abstract
Radiology reporting is a time-intensive process, and artificial intelligence (AI) shows potential for textual processing in radiology reporting. In this study, we proposed a keyword-based AI-assisted radiology reporting paradigm and evaluated its potential for clinical implementation. Using MRI data from 100 patients with intracranial tumors, two radiology residents independently wrote both a routine complete report (routine report) and a keyword report for each patient. AI-generated reports were then produced from the keyword reports using a designed prompt. The results demonstrated median reporting time reduction ratios of 27.1% and 28.8% (mean, 28.0%) for the two residents, with no significant difference in quality scores between AI-generated and routine reports (p > 0.50). AI-generated reports showed primary diagnosis accuracies of 68.0% (Resident 1) and 76.0% (Resident 2) (mean, 72.0%). These findings suggest that the keyword-based AI-assisted reporting paradigm has substantial potential for clinical translation.
Introduction
Radiologists usually serve as scouts in the clinical diagnosis and treatment process, and radiology reports represent their documented observations and professional interpretations of specific imaging examinations. Writing radiology reports is not only time-consuming and labor-intensive but also frequently suffers from incompleteness and inconsistent terminology1.
With the advancement of artificial intelligence (AI) technology, an increasing number of studies have focused on using AI for radiology report generation to alleviate radiologists’ workload. The basic methods for report generation include early retrieval-based and template-based methods, as well as more recent generative models. Retrieval-based methods are limited by the need to prepare large databases2,3. In comparison with human-written reports, template-based approaches exhibit limited coverage, creativity, and complexity2. With the development of deep learning research, generative models have become important methods for report generation. One study showed that AI-generated draft descriptions of chest radiographs reduced radiologists’ average reading time from 34.2 to 19.8 seconds4. Another study indicated that with the assistance of AI-generated reports, residents reported chest radiographs significantly faster than with normal templates (283 seconds vs. 347 seconds on average)5. However, this end-to-end image-to-caption mapping approach requires a large amount of high-quality text annotation and is influenced by language6. Additionally, such models often lack the ability to integrate detailed information about the location and characteristics of abnormalities2.
Recent advances in AI, particularly large language models (LLMs), offer significant capabilities in text processing applications. An LLM, as a deep learning pretrained model, is trained on extensive textual datasets and acquires a comprehensive understanding of fundamental linguistic patterns and contextual relationships7. Consequently, LLMs have theoretical potential for textual processing in radiology reporting. However, due to the potential for ‘hallucinations’, these AI models may produce inaccurate information or fabricate facts, which could pose significant risks in radiology reporting5. Therefore, how to optimize the use of AI for radiology report generation warrants further investigation.
In this study, we propose a human-AI collaborative radiology report generation paradigm: radiologists provide keywords, and LLMs generate complete radiology reports using designed prompts. This paradigm theoretically differs from existing approaches in that it is agnostic to anatomical regions, imaging modalities, and disease types and does not require specialized medical image training data. By capitalizing on the strengths of AI while mitigating its limitations, this method offers a novel framework to accelerate the integration of AI into radiology reporting workflows.
Results
Patient characteristics
A total of 100 patients with intracranial tumors were included in the study. Demographic and clinical data are presented in Table 1.
Reporting time
Resident 1 required a median time of 48 s (range: 36-64 s) for routine reports and 35 s (range: 28-50 s) for keyword reports. Resident 2 demonstrated median completion times of 52 s (mean ± standard deviation [SD]: 52 ± 7 s; range: 40-72 s) for routine reports and 37 s (range: 25-60 s) for keyword reports (Fig. 1).
An intraresident (within each resident) comparison revealed statistically significant differences between routine and keyword report completion times for both residents (both p < 0.001). Interresident comparisons revealed significant differences in both routine and keyword report times (both p < 0.001). These results indicate that keyword reports significantly reduced reporting time compared with routine reports, with interindividual variation.
Qualitative evaluation of the reports
The two senior radiologists’ evaluations of Resident 1’s routine reports yielded scores ranging from 3 to 5, with good interrater agreement (κ = 0.791). For Resident 2’s routine reports, the evaluation scores ranged from 2 to 5, which also demonstrated good interrater agreement (κ = 0.687).
For AI-generated reports derived from Resident 1’s keyword-based inputs, the senior radiologists’ evaluations yielded quality scores ranging from 3 to 5, showing good interrater consistency (κ = 0.675). Similarly, for AI-generated reports derived from Resident 2’s keyword reports, scores ranged from 3 to 5, with good interrater consistency (κ = 0.786).
The two senior radiologists’ ratings were significantly different between Resident 1’s and Resident 2’s routine reports (p < 0.01) and between the AI-generated reports generated from their respective keyword reports (p < 0.02). However, no significant differences were observed between Resident 1’s routine reports and their corresponding AI-generated reports (p > 0.835) or between Resident 2’s routine reports and their corresponding AI-generated reports (p > 0.509).
The primary diagnosis accuracy rates of the AI-generated reports by the two residents were 68.0% and 76.0%, respectively, and the top-two-diagnosis accuracy rates reached 81.0% and 84.0%, respectively.
Keyword ordering for AI-generated reports
For the 20 randomly selected keyword-based reports, AI-generated reports with the shuffled keyword order consistently achieved scores of 4-5 in evaluations by the senior radiologists, demonstrating good interrater reliability (κ = 0.78). No statistically significant difference in report quality was observed among routine reports, AI-generated reports with original keyword order, and AI-generated reports with randomly shuffled keyword order (p > 0.928). The primary diagnosis accuracy rate and top-two-diagnosis accuracy rate of AI-generated reports with shuffled keyword order were 70.0% and 90.0%, respectively.
Validation of AI-generated report quality using LLM B
For the 20 randomly selected keyword reports, AI-generated reports using LLM B consistently achieved scores of 4-5 in evaluations by the senior radiologists, demonstrating good interrater reliability (κ = 0.80). There was no statistically significant difference in report quality among routine reports, AI-generated reports using LLM A, and AI-generated reports using LLM B (p > 0.637). The primary diagnosis accuracy rate and top-two-diagnosis accuracy rate of AI-generated reports with LLM B were 70.0% and 85.0%, respectively.
Discussion
In this study, we proposed a keyword-based AI-assisted radiology reporting paradigm and assessed its clinical applicability. The results demonstrate that this paradigm significantly reduces radiologists’ reporting time while maintaining report quality. Collectively, these findings strongly highlight the paradigm’s substantial potential for clinical implementation.
Imaging findings and impressions are key components of radiology reports7. The imaging findings section, in particular, often serves as the main component of the report text and forms the basis for making accurate diagnoses or impressions. For radiologists, describing any observed abnormalities is a fundamental requirement of report writing, but this requires strong linguistic organizational skills and typically constitutes a significant source of clerical burden. For AI, text processing is a strength; however, the task poses challenges due to the need to understand different imaging modalities/sequences, recognize numerous anatomical structures and pathological signs, and apply extensive medical expertise. This study showed that this human-AI collaborative radiology reporting paradigm significantly reduced radiologists’ reporting time while maintaining report quality, demonstrating complementary advantages.
Additionally, as large-scale, clinician-annotated datasets are needed to train robust models8, current research on AI-generated radiology reports remains predominantly limited to anatomically simple regions4,9, and faces challenges in reporting rare or uncommon diseases6. Our paradigm requires no additional model training, facilitating its application across all anatomical regions and various disease conditions.
The uncertainty in LLM outputs and the presence of ‘hallucinations’ could raise concerns for medical documentation; however, this study demonstrates that the quality of AI-generated reports is comparable to that of routine reports. This may be attributed to our rigorous prompting strategies, which mainly included generic prompt phrases, structured report templates, complete output examples, and the handling of exceptional cases such as missing items. Designing simpler and more effective prompts also represents an important research direction for AI-assisted radiology report generation in the future.
In our study, AI demonstrated high diagnostic accuracy based on keywords, though these results are preliminary and derived from keyword reports written by residents. We expect that more accurate keyword use will yield higher diagnostic accuracy. If the reporting paradigm introduced in this study is applied clinically in the future, it may affect resident education. For instance, when trainees encounter images without a clear diagnostic direction, entering simple keywords might yield diagnostic insights for reference, thereby fostering exploratory learning among residents. Furthermore, since AI’s diagnostic performance is likely to improve with more precise input descriptions, this reporting approach places greater emphasis on trainees’ ability to identify key imaging features.
Interestingly, our study also revealed that altering the order of keywords did not affect the overall quality of the reports. This may allow more flexible reporting workflows. When reviewing images, we can prioritize documenting midline shift or edema as the initial observation rather than rigidly following a template requiring a systematic description starting from the lesion location. This flexibility is attributable to the underlying mechanism of LLMs. During text generation, while the sequential word order is not strictly preserved, the models maintain contextual relationships through positional encoding, a mathematical representation assigned to each word’s location10.
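As a toy illustration of the positional-encoding idea mentioned above, the sketch below computes the classic sinusoidal positional encoding from the original Transformer architecture; it is illustrative only and is not claimed to be the encoding scheme of the specific LLMs used in this study.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Classic sinusoidal positional encoding (Vaswani et al., 2017).

    Each row encodes one token position, so the model can recover token
    order even though attention itself is order-agnostic.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]             # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates               # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])    # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])    # odd dimensions: cosine
    return encoding

# Two keyword sequences that differ only in item order receive the same token
# embeddings but different positional encodings, so context is still modeled.
pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```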
While this study demonstrates the potential of keyword-based AI-generated radiology reporting to streamline workflows, the keyword-based reporting does not represent a routine workflow for radiologists at present, and several considerations merit attention. Our findings, derived from a limited single-center case series with only two residents contributing reports, warrant validation in larger, multi-institutional cohorts with radiologists of diverse experience levels. The current workflow assumes ideal conditions without accounting for real-world variables such as physician emotional state, environmental factors, or network latency in AI deployment. The use of simplified cases with predetermined findings and unstandardized keywords may not fully reflect clinical complexity. Additionally, potential evaluation biases exist due to partial unblinding from spelling errors, subtle report format discrepancies, and researcher involvement in both template design and report evaluation. Further establishing detailed quality assessment frameworks for AI-generated reports and conducting real-world clinical integration studies to validate and optimize this reporting paradigm are needed in the future. These refinements will better determine the paradigm’s true clinical utility and generalizability.
Furthermore, it is critical to recognize the potential risks inherent in AI-assisted reporting. This paradigm relies substantially on meticulously designed prompt systems. When inputs deviate from predefined frameworks, AI-generated inferences may lead to inaccuracies. Thus, responsible AI utilization and enhanced supervision of AI-generated content are paramount. Highlighting AI-generated content, especially inferential statements, via formatting (e.g., font styles or colors) in generated reports would render subsequent reviews more convenient.
Methods
This study was approved by the Ethics Committee of the Second Affiliated Hospital of Zhejiang University School of Medicine (2020-151), which waived the requirement for written informed consent.
Study patients
We retrospectively included 100 patients with intracranial tumors, comprising both primary and metastatic tumors, who underwent preoperative brain MRI at our hospital between January 2024 and July 2024.
MRI acquisition and imaging presentation
MRI was performed using 1.5 T Aera, Avanto, and MAGNETOM Vida scanners (Siemens Medical Solutions), 3.0 T Signa HDxt and Discovery MR 750 scanners (GE Healthcare), and a 3.0 T uMR 790 scanner (United Imaging Healthcare). The axial images were acquired with a slice thickness of 5.5 mm or 6 mm, while the coronal images used a 5 mm slice thickness.
To minimize the time discrepancies in report writing caused by excessive images and layer-by-layer image review, a senior radiologist (Radiologist A) selected one representative image from each of the following sequences for each patient: axial T1WI, T2WI, contrast-enhanced T1WI, and coronal contrast-enhanced T1WI (Fig. 2). These selected images were saved in TIFF format and inserted into a PowerPoint presentation, with each slide displaying the four selected images for a single case. To reduce interference from diagnostic and differential diagnostic processes, particularly the time differences arising from determining whether a lesion is intra-axial or extra-axial, the readers were informed of the pathological diagnosis of the tumor before report writing began. Additionally, to minimize time variations caused by lesion measurements, we marked the maximum and minimum perpendicular diameters of the lesions on the images.
Report writing by residents
The radiology reports were written by two residents: one third-year resident (Resident 1) and one second-year resident (Resident 2). The residents independently wrote a pair of reports for each patient, encompassing both routine reports that simulated actual clinical practice and keyword reports that included only key diagnostic terminology. Both the routine and keyword reports were typed on computers. When writing the reports, they could use routine structured report templates or keyword templates, and were required to complete them as quickly as possible. The reporting time was recorded with a stopwatch and defined as the duration from report initiation to completion.
After completing the routine reports and a 6-week washout period, the two residents commenced writing keyword reports. Prior to this, they underwent pretraining on keyword report writing. These reports included only key terminology, such as location, shape, cystic/solid composition, size, signal characteristics and enhancement features, surrounding edema, mass effects on adjacent structures, and midline shifts. The keyword reports were designed to concisely summarize imaging findings using the simplest terms possible, with permitted abbreviations. For example: “Right frontal, cystic-solid, 15, 10, irregular, T1 hypo, T2 hyper, marked ring enhance, severe ede, shift left” indicates the following: The lesion is located in the right frontal lobe and appears as a mixed cystic and solid mass measuring approximately 15 mm × 10 mm with irregular shape. On T1WI, it is hypointense, and on T2WI, it is hyperintense compared with normal brain parenchyma. Contrast-enhanced imaging revealed marked ringlike enhancement. Significant perilesional edema is present, along with a leftward midline shift.
AI-generated reports
In this study, we leveraged DeepSeek (https://www.deepseek.com/) as the primary LLM (LLM A) for AI assistance in report generation, using keyword reports as input. The AI-generated radiology reports were required to include a complete imaging findings section and an impression section, with the impression section consisting of the two most probable diagnoses: a primary consideration and the main differential diagnosis (Fig. 3).
To reduce uncertainty in the generated reports, we specifically designed prompts consisting of four key components: generic prompt phrases, structured report templates, complete report examples, and protocols for handling exceptional cases. The AI generation process was supervised by two radiologists (Radiologist A and Radiologist B).
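The exact prompt wording is not reproduced here. The sketch below illustrates, under stated assumptions, how the four prompt components described above could be assembled around a single keyword report; the placeholder texts and the `call_llm` wrapper are hypothetical and do not reproduce the study's actual prompt or any particular vendor API.

```python
# A minimal sketch (not the study's actual prompt) of assembling the four
# prompt components around one keyword report before sending it to an LLM.

GENERIC_INSTRUCTIONS = (
    "You are a radiology reporting assistant. Expand the keyword findings "
    "into a complete brain MRI report with Findings and Impression sections."
)  # hypothetical generic prompt phrases
REPORT_TEMPLATE = (
    "Findings: <systematic description of the lesion and adjacent structures>\n"
    "Impression: <primary consideration>; <main differential diagnosis>"
)  # hypothetical structured report template
WORKED_EXAMPLE = (
    "Keywords: Right frontal, cystic-solid, 15, 10, irregular, T1 hypo, ...\n"
    "Report: The lesion is located in the right frontal lobe ..."
)  # hypothetical complete report example
EXCEPTION_RULES = (
    "If an item is missing from the keywords, state that it was not assessed; "
    "do not invent findings."
)  # hypothetical handling of exceptional cases

def build_prompt(keyword_report: str) -> str:
    """Concatenate the four prompt components with one keyword report."""
    return "\n\n".join([
        GENERIC_INSTRUCTIONS,
        "Template:\n" + REPORT_TEMPLATE,
        "Example:\n" + WORKED_EXAMPLE,
        "Rules:\n" + EXCEPTION_RULES,
        "Keywords:\n" + keyword_report,
    ])

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper; replace with whichever LLM client is available."""
    raise NotImplementedError

keywords = ("Right frontal, cystic-solid, 15, 10, irregular, "
            "T1 hypo, T2 hyper, marked ring enhance, severe ede, shift left")
prompt = build_prompt(keywords)
# report = call_llm(prompt)  # generation step, reviewed by radiologists
```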
To evaluate the impact of keyword ordering on AI-assisted report generation, we randomly selected 20 keyword reports written by the two residents and altered the keyword order randomly in each report. For example, we changed the original report “Right frontal, cystic-solid, 15, 10, irregular, T1 hypo, T2 hyper, marked ring enhance, severe ede, shift left” to a randomized version: “T1 hypo, Right frontal, cystic-solid, 15, irregular, T2 hyper, marked ring enhance, 10, severe ede, shift left”.
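A minimal sketch of the shuffling step, assuming each keyword report is a comma-separated string: the items are split, randomly reordered, and rejoined before being passed to the same prompt. The delimiter and the fixed seed are assumptions made for reproducibility of the example.

```python
import random
from typing import Optional

def shuffle_keywords(keyword_report: str, seed: Optional[int] = None) -> str:
    """Randomly reorder comma-separated keyword items within one report."""
    items = [item.strip() for item in keyword_report.split(",")]
    rng = random.Random(seed)
    rng.shuffle(items)
    return ", ".join(items)

original = ("Right frontal, cystic-solid, 15, 10, irregular, T1 hypo, "
            "T2 hyper, marked ring enhance, severe ede, shift left")
print(shuffle_keywords(original, seed=42))
```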
Qualitative evaluation of the reports
Two senior radiologists (Radiologists A and C) independently evaluated the routine reports written by the residents and the AI-generated reports. When evaluating the reports, the senior radiologists were blinded to the authorship of the reports. The overall quality of the descriptions was assessed using a five-point Likert scale: 1 (very poor), 2 (poor), 3 (average), 4 (good), and 5 (excellent)4. Diagnostic performance was evaluated using accuracy metrics, including both primary-consideration accuracy and top-two accuracy (accounting for the two most likely diagnoses).
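For illustration, the sketch below shows how the two accuracy metrics could be computed, assuming each AI-generated impression is reduced to an ordered pair of candidate diagnoses and compared against the pathological ground truth; in the actual study this matching was judged by the radiologist evaluators rather than by string comparison.

```python
from typing import List, Sequence

def diagnosis_accuracy(ground_truth: Sequence[str],
                       predictions: Sequence[List[str]]) -> dict:
    """Primary-diagnosis and top-two-diagnosis accuracy.

    predictions[i] is the ordered pair [primary consideration,
    main differential diagnosis] for case i.
    """
    n = len(ground_truth)
    primary = sum(gt == preds[0] for gt, preds in zip(ground_truth, predictions))
    top_two = sum(gt in preds[:2] for gt, preds in zip(ground_truth, predictions))
    return {"primary_accuracy": primary / n, "top_two_accuracy": top_two / n}

# Toy usage with hypothetical labels:
truth = ["glioblastoma", "meningioma", "metastasis"]
preds = [["glioblastoma", "metastasis"],
         ["metastasis", "meningioma"],
         ["metastasis", "lymphoma"]]
print(diagnosis_accuracy(truth, preds))
# {'primary_accuracy': 0.67, 'top_two_accuracy': 1.0} (approximately)
```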
The 20 randomly selected keyword reports were subsequently used to validate the quality of generated reports with a second LLM (LLM B, Doubao: https://www.doubao.com). Identical prompting strategies were employed to maintain consistency between the two LLM implementations.
Statistical analysis
Continuous variables were presented as means and/or medians according to their distribution normality. Categorical data were expressed as proportions. For comparisons of writing times and quality scores between two groups, either the paired t-test or the Wilcoxon signed-rank test was used, depending on whether the data followed a normal distribution. For comparisons of scores among three groups of reports, the Kruskal-Wallis H test was applied. Interrater agreement in report evaluation was assessed using kappa statistics, with κ values ≥ 0.81, 0.61-0.80, and ≤ 0.60 indicating excellent, good, and poor agreement, respectively11. The median reporting time reduction ratio was defined as the difference between the median time for routine reports and the median time for keyword reports, divided by the median time for routine reports (expressed as a percentage). All statistical tests were two-sided, with statistical significance set at p < 0.05. Analyses were conducted using R software (version 4.0.4; http://www.r-project.org/).
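The analyses were performed in R; purely for illustration, the sketch below shows in Python (SciPy and scikit-learn) how the paired time comparison, the three-group score comparison, the interrater agreement, and the median reporting time reduction ratio could be computed on hypothetical data. Whether weighted or unweighted kappa was used in the study is not specified here; the example uses the unweighted statistic.

```python
import numpy as np
from scipy.stats import wilcoxon, kruskal
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired reporting times (seconds) for one resident.
routine = np.array([48, 52, 45, 60, 50, 47, 55, 49])
keyword = np.array([35, 40, 33, 44, 36, 34, 41, 37])

# Paired, non-parametric comparison of writing times.
_, p_time = wilcoxon(routine, keyword)

# Median reporting time reduction ratio, as defined above.
reduction = (np.median(routine) - np.median(keyword)) / np.median(routine) * 100

# Three-group comparison of quality scores (hypothetical Likert ratings).
scores_routine  = [4, 5, 4, 3, 5, 4]
scores_original = [4, 4, 5, 4, 5, 4]
scores_shuffled = [5, 4, 4, 4, 5, 5]
_, p_scores = kruskal(scores_routine, scores_original, scores_shuffled)

# Interrater agreement between the two senior radiologists (unweighted kappa).
rater1 = [4, 5, 3, 4, 5, 4]
rater2 = [4, 4, 3, 4, 5, 5]
kappa = cohen_kappa_score(rater1, rater2)

print(f"Wilcoxon p = {p_time:.3f}, median reduction = {reduction:.1f}%")
print(f"Kruskal-Wallis p = {p_scores:.3f}, kappa = {kappa:.2f}")
```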
Data availability
The raw data supporting the conclusions of this article are available from the corresponding author upon reasonable request.
Code availability
All code for this study is publicly available.
References
Soleimani, M., Seyyedi, N., Ayyoubzadeh, S. M., Kalhori, S. R. N. & Keshavarz, H. Practical evaluation of ChatGPT performance for radiology report generation. Acad. Radiol. 31, 4823–4832 (2024).
Reale-Nosei, G., Amador-Dominguez, E. & Serrano, E. From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation. Med. Image Anal. 97, 103264 (2024).
Zhu, Q. et al. Utilizing longitudinal chest x-rays and reports to pre-fill radiology reports. Med. Image Comput. Comput. Assist. Interv. 14224, 189–198 (2023).
Hong, E. K. et al. Value of using a generative AI model in chest radiography reporting: a reader study. Radiology 314, e241646 (2025).
Zhang, Y. et al. Comparison of Chest Radiograph Captions Based on Natural Language Processing vs Completed by Radiologists. JAMA Netw. Open 6, e2255113 (2023).
Liu, F. et al. A multimodal multidomain multilingual medical foundation model for zero shot clinical diagnosis. npj Digit. Med. 8, 86 (2025).
Parillo, M., Vaccarino, F., Beomonte Zobel, B. & Mallio, C. A. ChatGPT and radiology report: potential applications and limitations. Radiol. Med. 129, 1849–1863 (2024).
Liu, F. et al. Aligning, autoencoding and prompting large language models for novel disease reporting. IEEE Trans. Pattern Anal. Mach. Intell. 47, 3332–3343 (2025).
Stephan, D. et al. AI in dental radiology-improving the efficiency of reporting with ChatGPT: comparative study. J. Med. Internet Res. 26, e60684 (2024).
Bhayana, R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology 310, e232756 (2024).
Li, Q., Dong, F., Jiang, B. & Zhang, M. Exploring MRI characteristics of brain diffuse midline gliomas with the H3 K27M mutation using radiomics. Front. Oncol. 11, 646267 (2021).
Acknowledgements
No funding was received for this study.
Author information
Contributions
Study conception and design: F.D., Q.L. Investigation of the data: F.D. Analysis and interpretation of data: F.D., S.N., M.L., F.X., Q.L. First draft of manuscript: F.D. Revision of manuscript: all authors. Approval of final manuscript: all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Dong, F., Nie, S., Chen, M. et al. Keyword-based AI assistance in the generation of radiology reports: A pilot study. npj Digit. Med. 8, 490 (2025). https://doi.org/10.1038/s41746-025-01889-4