Background & Summary

The recent evolution of large-scale foundation models1,2,3,4,5 has sparked significant interest in the development of generalist medical AI (GMAI) systems6,7,8,9,10, particularly within radiology, a crucial component of medical diagnostics. By training on large-scale visual-language medical datasets, i.e., medical scans paired with global clinical reports (for example, MIMIC-CXR11 contains chest X-ray scans from 227,835 studies, and CT-RATE12 contains chest CT scans from 20,000 patients), these medical models have demonstrated a preliminary ability to write clinical reports, supporting radiologists throughout their workflow and markedly reducing their workload.

While these datasets have advanced medical AI capabilities, they only provide global reports for each medical scan, which limits their usefulness for training models for grounded report generation or grounded question answering, i.e., describing regional abnormalities and relevant normal findings, or answering questions about specific regions. Several efforts have addressed this limitation for 2D chest X-rays. For example, Chest ImaGenome13 provides bounding box annotations for 39 anatomical regions and relation triplets extracted from radiological reports. PadChest-GR14 offers grounded radiology reports for chest X-rays, while MS-CXR15 delivers semantic segmentation masks for 10 anatomical regions paired with region-specific descriptions. However, for 3D medical imaging such as chest CT, visually grounded datasets remain notably scarce, primarily due to the complexity of 3D volumes and intricate anatomical structures. To address this gap and advance the development of more capable generalist models, we propose extending existing image-report datasets with region-wise descriptions by linking the descriptive findings in diagnostic reports to their corresponding anatomical regions in the images, using segmentation masks to enhance explainability.

In this paper, we introduce RadGenome-Chest CT16, a comprehensive, large-scale, fine-grained annotated dataset for 3D chest CT interpretation, built upon the publicly available CT-RATE12. First, we employ a recent, powerful text-prompted universal segmentation model, SAT17, to segment the primary anatomical targets in each image. Next, using large language models and NER models, we break all reports into an anatomically hierarchical structured format and link the reports' sentences to visual regions in the CT volume. Finally, we generate visual question-answering pairs closely related to the structured reports and segmented images, at both the region level and the case level. In summary, we extend the original image-report dataset in the following aspects:

  (i) Organ-level segmentation masks covering 197 categories, i.e., all the critical regions appearing in clinical CT reports;

  (ii) 665K multi-granularity grounded reports, with each sentence grounded to the corresponding anatomical region;

  (iii) 1.2M grounded VQA pairs, concerning both critical region-wise findings and comprehensive case-wise impressions. All questions and answers are linked to segmentation masks for reference.

We believe that RadGenome-Chest CT16, with its region-to-report associations, can advance the development of multimodal medical foundation models that generate text grounded on the corresponding visual regions, a capability unattainable with previous related datasets.

Methods

In this section, we first introduce the source dataset on which RadGenome-Chest CT is built. We then provide a detailed description of the procedure for obtaining segmentation masks and region-wise reports, as illustrated in Fig. 1. The outcomes of each step are presented in Fig. 2.

Fig. 1
figure 1

Data construction pipeline of RadGenome-Chest CT.

Fig. 2
figure 2

Overview of results obtained from each step of the data construction pipeline.

Data Source

We initiate our study with CT-RATE12,18,19, a dataset of 25,692 non-contrast 3D chest CT volumes from 21,304 unique patients; each volume is accompanied by a radiology text report and annotated with 18 distinct types of abnormalities. These 25,692 volumes have been reconstructed with various methods to accommodate different window settings, for a total of 50,188 images. For consistency in this paper, we standardize all CT volumes to a uniform voxel spacing of 1 × 1 × 3 mm, resulting in 25,692 paired CT volumes and reports. We follow the official split: 20,000 patients (24,128 volumes) are allocated to training and 1,304 patients (1,564 volumes) to validation.
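As a minimal sketch, resampling a volume to a uniform voxel spacing can be done with SimpleITK as below; this is an illustration of the standardization step under our own assumptions, not necessarily the exact preprocessing used for CT-RATE, and the file name is hypothetical.

  import SimpleITK as sitk

  def resample_to_spacing(image, new_spacing=(1.0, 1.0, 3.0)):
      # Compute the output grid size that preserves the physical extent of the scan.
      old_spacing, old_size = image.GetSpacing(), image.GetSize()
      new_size = [int(round(sz * sp / ns))
                  for sz, sp, ns in zip(old_size, old_spacing, new_spacing)]
      # Resample with linear interpolation; air (-1024 HU) fills out-of-field voxels.
      return sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                           image.GetOrigin(), new_spacing, image.GetDirection(),
                           -1024, image.GetPixelID())

  volume = sitk.ReadImage("example_volume.nii.gz")  # hypothetical file name
  volume_1x1x3 = resample_to_spacing(volume)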

Constructing RadGenome-Chest CT

The pipeline consists of three major stages, as shown in Fig. 1: (i) segmentation mask generation, where detailed masks for each anatomical region in the chest CT volumes are created; (ii) region-specific report division, which annotates and categorizes the sentences of the radiology text reports by the anatomical regions to which they refer; and (iii) rule-based question generation, which extracts entities from each sentence and formulates visual question answering (VQA) pairs linked to specific segmentation masks.

Segmentation Mask Generation

To segment as many anatomical regions as possible, we employ the recent SAT17 model, a knowledge-enhanced segmentation model that uses natural language prompts to segment 3D medical volumes. The model has been trained on 72 diverse segmentation datasets, covering 498 classes across various anatomical regions including the brain, head and neck, thorax, spine, abdomen, and limbs. In our work, SAT is used to perform detailed segmentation across all volumes of the CT-RATE dataset. Specifically, we segment 197 regions pertinent to chest CT scans, enabling precise anatomical analysis. The list of segmented anatomies is organized into a hierarchical tree, as shown in the Supplementary File, and includes several major regions such as the lungs, trachea and bronchi, mediastinum, heart, pleura, bones, thyroid, breasts, abdomen, and other areas.
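Conceptually, the segmentation step amounts to prompting the model once per anatomical name. The sketch below illustrates this loop under our own assumptions; the callable passed in stands in for SAT inference, whose real interface is defined by the SAT code base, and the region names shown are only a small subset of the 197 classes.

  import numpy as np
  from typing import Callable, Dict, Iterable

  # A small illustrative subset of the chest-relevant classes in the hierarchy.
  CHEST_REGIONS = ["left lung upper lobe", "left lung lower lobe",
                   "right lung middle lobe", "trachea", "esophagus", "heart"]

  def segment_regions(volume: np.ndarray,
                      segment_fn: Callable[[np.ndarray, str], np.ndarray],
                      regions: Iterable[str] = CHEST_REGIONS) -> Dict[str, np.ndarray]:
      """Run a text-prompted segmenter once per anatomical prompt.

      `segment_fn` is a stand-in for SAT inference, mapping (volume, text prompt)
      to a binary mask; the actual call signature depends on the SAT repository.
      """
      return {region: segment_fn(volume, region) for region in regions}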

Region-wise Reports Generation

The goal here is to break entire reports into region-wise descriptions. We observe that the OpenAI GPT-42 model can complete this task with very high accuracy20,21; however, running GPT-4 on the entire dataset would be prohibitively expensive, so we train a dedicated model for report division. Specifically, we first employ GPT-42 to annotate the anatomical regions of each sentence in the "FINDINGS" section of 2,500 radiology reports, which comprise all reports in the validation set of the original CT-RATE dataset. The prompt used is as follows.

This process results in 15,926 annotated sentences. Note that our approach fully supports multi-region annotations: a single sentence can be linked to multiple anatomical regions if it references them. We then divide these sentences into training and validation subsets in an 8:2 ratio and train a GPT-2 model, using each annotated sentence together with up to two preceding and two following sentences from the report as input, when available. The model is trained to output the list of anatomical regions associated with each sentence. For instance, for the target sentence "No pleural effusion was detected on the left.", the input is "There is minimal pleural effusion on the right. No pleural effusion was detected on the left. Atelectasis is observed in the middle lobe and lower lobe of the right lung. A malignant mass is observed around the lower lobe bronchi of the left lung.", and the expected output is "left lung". The model achieves an accuracy of 94.56% on the validation set, so we use it to perform inference on all sentences across the entire set of reports. More examples of segmentation results and structured reports are provided in the Supplementary File.
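As a minimal sketch of the input construction for this classifier, the snippet below joins the target sentence with up to two preceding and two following sentences from the same report; tokenization and any special formatting passed to the fine-tuned GPT-2 are omitted and would be implementation details of our own.

  from typing import List

  def build_division_input(sentences: List[str], idx: int, window: int = 2) -> str:
      """Concatenate the target sentence with up to `window` preceding and
      following sentences from the same report, as context for the region
      classifier."""
      start = max(0, idx - window)
      end = min(len(sentences), idx + window + 1)
      return " ".join(sentences[start:end])

  report = ["There is minimal pleural effusion on the right.",
            "No pleural effusion was detected on the left.",
            "Atelectasis is observed in the middle lobe and lower lobe of the right lung.",
            "A malignant mass is observed around the lower lobe bronchi of the left lung."]
  context = build_division_input(report, idx=1)
  # Expected model output for this target sentence: "left lung"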

Named Entity Recognition

Through the process described above, each "FINDINGS" section is divided into multiple sentences, each associated with one or multiple segmented regions. To facilitate the generation of detailed question-answer pairs from these sentences, we first apply the Named-Entity Recognition (NER) model from RaTEScore to all sentences. This step extracts entities categorized as "anatomy", "abnormality", and "non-abnormality". Here, "anatomy" refers to anatomical regions, "abnormality" refers to findings or diseases identified as present, and "non-abnormality" indicates findings or diseases reported as absent. Subsequently, all extracted "abnormality" and "non-abnormality" entities undergo quality evaluation with GPT-4, allowing us to filter out and revise any inaccuracies. For instance, when the NER model extracts a compound abnormality such as "structural distortion and volume loss", GPT-4 splits it into "structural distortion" and "volume loss" for more accurate categorization. We filter out abnormalities for which GPT-4 outputs "no" and update the rest to the revised versions provided by GPT-4. The prompt used is as follows:
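A minimal sketch of this verification call with the OpenAI chat API is given below; the actual verification prompt is the one reproduced in the paper, so VERIFY_PROMPT is only a placeholder, and the message formatting is our own assumption.

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  VERIFY_PROMPT = "..."  # placeholder; the real verification prompt is given above

  def verify_abnormality(sentence: str, extracted_entity: str, model: str = "gpt-4") -> str:
      """Ask the LLM whether an NER-extracted abnormality is correct and, if not,
      obtain revised entities (e.g. splitting 'structural distortion and volume
      loss' into two entries)."""
      response = client.chat.completions.create(
          model=model,
          temperature=0,
          messages=[
              {"role": "system", "content": VERIFY_PROMPT},
              {"role": "user",
               "content": f"Sentence: {sentence}\nExtracted abnormality: {extracted_entity}"},
          ],
      )
      return response.choices[0].message.content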

In addition, for the "IMPRESSION" section, we directly use GPT-4 to extract all disorders mentioned, together with detailed information on the presence of any abnormalities in specific anatomical regions. The prompt used is as follows.

Post-processing Pipeline

To ensure the quality and consistency of our dataset, we implemented a comprehensive post-processing pipeline using the state-of-the-art LLM "o3-mini". This verification system was designed to check anatomical hierarchy consistency and appropriate abnormality extraction. We validated that all identified regions adhered to our hierarchical anatomical reference structure; when discrepancies were detected, the LLM provided corrections while maintaining anatomical accuracy. For example, in a case where the sentence "It may be compatible with parenchymal calcification" had been assigned to the generic category "abdominal tissue", our validation system analyzed the preceding text ("A nonspecific density increase is observed in the subcapsular area at the dome level") to correctly assign it to the "liver", demonstrating the system's ability to leverage contextual information for improved anatomical precision. For the identification of abnormalities, we implemented a targeted validation process that examined each finding sentence and its associated abnormalities. The system differentiated between existing abnormalities and normal findings, checking that the extracted entities properly reflected the radiological description in the original report. A common improvement involved consolidating redundant or overlapping abnormality descriptions. For instance, for the finding "When examined in the lung parenchyma window; the mass appearance of the left lower lobe bronchi are stable.", our initial extraction had created multiple separate abnormality entries ("mass appearance" and "mass appearance of the left lower lobe bronchi"), which our validation system correctly merged into a single, more coherent abnormality description. The prompt is provided in the Supplementary File.
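The hierarchy-consistency part of this check can be pictured as testing whether a predicted region label is a node of the reference anatomy tree and flagging everything else for LLM correction. The sketch below uses a tiny hypothetical tree fragment of our own; the full reference hierarchy is the one in the Supplementary File.

  # Tiny hypothetical fragment; the full hierarchy is in the Supplementary File.
  ANATOMY_TREE = {
      "lung": ["left lung", "right lung"],
      "left lung": ["left lung upper lobe", "left lung lower lobe"],
      "abdomen": ["liver", "abdominal tissue"],
  }

  def valid_nodes(tree: dict) -> set:
      # Collect every parent and child label appearing in the hierarchy.
      nodes = set(tree)
      for children in tree.values():
          nodes.update(children)
      return nodes

  def region_is_consistent(predicted_region: str, tree: dict = ANATOMY_TREE) -> bool:
      """Return False for labels outside the reference hierarchy; such cases are
      sent to the LLM for correction together with the surrounding report context
      (see the 'liver' example above)."""
      return predicted_region in valid_nodes(tree)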

Rule-based Question Generation

Here, we describe the procedure for generating grounded visual question-answering (VQA) pairs from the "FINDINGS" section of the report, while case-level visual question-answering data is generated from the "IMPRESSION" section. Table 1 presents the five question types in the proposed dataset. The detailed rules are introduced in the following sections.

Table 1 Question and answer types in the proposed dataset.

First, after the region-wise report generation, the sentences in the findings section can be classified as follows:

  • Normal Findings: Sentences that report no significant changes from normal health conditions.

    • No abnormality entities in the sentence: Sentences that mention specific anatomical regions without noting any abnormalities. For example, “Thoracic aorta diameter is normal.”

    • No anatomy entities in the sentence: Sentences that solely note the absence of specific abnormalities without referring to any anatomical regions. For example, “No pleural effusion was detected.”

    • With anatomy and abnormality entities in the sentence: Sentences that explicitly state the absence of abnormalities in specific anatomical regions. For example, “Bilateral adrenal glands were normal and no space-occupying lesion was detected.”

  • Abnormal Findings: Sentences that report differences from normal anatomical conditions.

    • No anatomy entities in the sentence: Sentences that report an abnormal finding, but do not specify an anatomical region. For example, “Mild hiatal hernia is observed.”

    • With anatomy entities in the sentence: Sentences that include both an anatomical reference and describe an abnormality. For example, “There is narrowing of the spinal canal at the dorso-lumbar level.”

We then construct an anatomical disorder tree for each report, based on the anatomical hierarchical tree introduced in the Supplementary File: any abnormalities that are present are marked on the corresponding nodes of the hierarchy. As shown in Fig. 3, this data structure enables us to generate question-answer pairs. Taking inspiration from previous research22,23, we categorize the questions into four types: 1) abnormality, 2) presence, 3) location, and 4) size. Table 1 shows examples of the different question types. Note that we have designed 50 templates for each question type; all templates are provided in the Supplementary File. For instance, for a sentence indicating normal findings, such as "Bilateral adrenal glands were normal and no space-occupying lesion was detected", questions include "Is there any evidence of abnormality in adrenal glands?" and "What abnormality is seen in the adrenal glands?". Conversely, for a sentence with abnormal findings, "There is a narrowing of the spinal canal at the dorso-lumbar level.", questions include "Is there any evidence of narrowing in the spinal canal?", "What abnormality is seen at the dorso-lumbar level of the spinal canal?", and "Where in the spinal canal is the narrowing located?". For impression sentences, since we have already extracted disorders, we can generate case-level questions such as "What disorder can be identified from this volume?" for each case.

Fig. 3
figure 3

Partial visualization of the anatomical hierarchical tree with abnormalities annotated.
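A minimal sketch of the template-based generation is given below. Only a couple of illustrative templates per type are shown (the dataset uses 50 per type, listed in the Supplementary File), and the exact template wording and sampling strategy here are our own assumptions.

  import random

  # Illustrative templates only; the dataset defines 50 per question type.
  TEMPLATES = {
      "presence": ["Is there any evidence of {abnormality} in the {anatomy}?",
                   "Does the {anatomy} show any sign of {abnormality}?"],
      "abnormality": ["What abnormality is seen in the {anatomy}?",
                      "What finding is observed in the {anatomy}?"],
      "location": ["Where in the {anatomy} is the {abnormality} located?"],
  }

  def make_question(qtype: str, anatomy: str, abnormality: str = "") -> str:
      # Sample one template of the requested type and fill in the entities
      # taken from the anatomical disorder tree.
      template = random.choice(TEMPLATES[qtype])
      return template.format(anatomy=anatomy, abnormality=abnormality)

  # From the abnormal-finding example above.
  print(make_question("presence", anatomy="spinal canal", abnormality="narrowing"))
  print(make_question("location", anatomy="spinal canal", abnormality="narrowing"))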

In summary, we generated 1.2M grounded visual question answering (VQA) pairs for training and 85k for validation, along with 24,128 case-level VQA pairs for training and 1,564 for validation. Detailed counts for each type of VQA pair in the training and validation sets are provided in Table 1. We performed a thorough manual verification of all samples in the validation set. The verification was conducted systematically, with reviewers examining each sample for the accuracy of extracted sentences, anatomical structures, abnormalities, and non-abnormalities. This review covered 9,008 unique finding sentences from the validation set; only 0.58% of sentences contained abnormality extraction errors and 0.46% contained anatomical labeling errors. Since we applied identical post-processing across the entire dataset, these low error rates are also indicative of the quality of the training set. All identified errors were systematically corrected before finalizing the validation set, ensuring high data quality for benchmark evaluation.

Benchmark Construction

Utilizing the structured reports and question-answer pairs in RadGenome-Chest CT16, we have developed two benchmarking tasks to assess the performance of generalist medical AI models: Grounded Report Generation (RadGenome-Chest CT-GRG) and Grounded Visual Question Answering (RadGenome-Chest CT-GVQA). Both tasks take an image and a mask of the region of interest as inputs.

  • RadGenome-Chest CT-GRG: This task aims to generate a comprehensive report detailing the findings within a specified region of interest.

  • RadGenome-Chest CT-GVQA: This task involves answering specific questions regarding the presence, location, and size of abnormalities within the highlighted region, guided by the region mask.

We divided the training and validation sets according to the official split provided by CT-RATE. Specifically, 20,000 patients (24,128 volumes) were allocated for training, and 1,304 patients (1,564 volumes) were designated for validation. Consequently, 624,876 GRG pairs and 1,057,946 GVQA pairs are used for training, and 40,342 GRG pairs and 66,875 GVQA pairs for validation.

Baseline Models

We implement two baseline models, with and without region mask inputs, on RadGenome-Chest CT16. These models are evaluated on the two benchmark tasks: RadGenome-Chest CT-GRG and RadGenome-Chest CT-GVQA. Note that our experiments serve as a preliminary demonstration of the necessity and benefit of including region masks in the dataset. Due to computational limitations, we selected a small subset of the dataset for training, consisting of 2,000 volumes, 51,959 GRG pairs, and 87,435 GVQA pairs.

  • Baseline-Global: This baseline model responds solely to whole-image data. As shown in Fig. 4, the query image is input to the visual encoder to obtain the global image token, which is then concatenated with the question token before being input to the model. The visual encoder is a 3D ConvNet initialized from SAT17, while the language model is LLaMA 325, fine-tuned using LoRA24 with a rank of 8.

    Fig. 4
    figure 4

    The architecture of the Baseline-Grounded model. The image is input to the visual encoder to obtain the global image token. The mask feature, processed by the mask encoder, queries the global image token to generate the grounded image token, which is then used as input to the multimodal model. Note that for the Baseline-Global model, only the global image token is used as input to the multimodal model.

  • Baseline-Grounded: This baseline model supports region mask input, leveraging spatial information from the region of interest to improve the accuracy of generated reports and responses. As shown in Fig. 4, the query image is input to the visual encoder, and the mask is input to a lightweight mask encoder. The encoded mask embedding is used as 'queries', while the encoded image embedding serves as 'keys' and 'values'; the resulting grounded tokens are concatenated with the question token and input to the model (a sketch of this cross-attention step is given after this list). The visual encoder adopts the 3D ConvNet initialized from SAT17, while the language model is LLaMA 325, fine-tuned using LoRA24 with a rank of 8.
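The sketch below illustrates the mask-as-query cross-attention described above. All dimensions, module choices, and tensor shapes are our own assumptions for illustration, not the released implementation.

  import torch
  import torch.nn as nn

  class GroundedTokenBuilder(nn.Module):
      """Cross-attention where the encoded mask embedding acts as the query and
      the encoded image embedding provides keys/values, yielding the grounded
      image token fed to the language model. Dimensions are illustrative."""
      def __init__(self, dim: int = 768, num_heads: int = 8):
          super().__init__()
          self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

      def forward(self, mask_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
          # mask_emb: (B, Nm, dim) from the lightweight mask encoder
          # image_emb: (B, Ni, dim) from the 3D ConvNet visual encoder
          grounded, _ = self.attn(query=mask_emb, key=image_emb, value=image_emb)
          return grounded

  builder = GroundedTokenBuilder()
  grounded_tokens = builder(torch.randn(1, 4, 768), torch.randn(1, 256, 768))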

Evaluation Metrics

We use different evaluation metrics for different tasks. The metrics are listed as follows.

  • BLEU. BLEU stands for "BiLingual Evaluation Understudy"; it measures n-gram precision between the generated and reference reports. We use nltk.translate.bleu_score.sentence_bleu with weights of (1, 0, 0, 0) to calculate the 1-gram BLEU score (a computation sketch is given after this list).

  • ROUGE. ROUGE stands for “Recall-Oriented Understudy for Gisting Evaluation” and refers to a set of evaluation measures that assess the overlap between a generated summary and a set of reference summaries. We use ROUGE1 by default.

  • BERT-Sim. BERT-Sim stands for “BERT similarity score”, which measures the similarity between the generated report and the reference report using contextualized word embeddings provided by the BERT model. In this paper, we use bert_score.BERTScorer with model type as bert-base-uncased to calculate the BERT-Sim.

  • RadGraph F1. RadGraph F1 is an automatic metric that computes the overlap in clinical entities and relations between a model-generated report and a radiologist-generated report.

  • RadCliQ. RadCliQ predicts a radiologist-determined error score from a combination of automated metrics, including BLEU, BERTScore, CheXbert vector similarity, and RadGraph.

  • Hit Score: The Hit Score is a specific metric designed for evaluating abnormalities. It compares the model’s prediction with the ground truth. If the model’s prediction matches any abnormality specified in the ground truth, the score is set to 1; otherwise, it is 0. The overall score is then calculated as the average of these individual scores.

  • Accuracy: Accuracy is calculated for the presence task. If the model’s prediction matches the ground truth exactly (model prediction == ground truth), the score is 1; if not, the score is 0. The average of these scores is then calculated.

  • L1 (mm): For the size task, both the ground truth and model predictions are converted to millimeters (mm) to ensure uniformity. The L1 error is then calculated as the absolute difference between these values.
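A minimal sketch of the BLEU-1 and BERT-Sim computation with the libraries named above is given below; it assumes one reference per generated sentence, and the example strings are illustrative.

  from nltk.translate.bleu_score import sentence_bleu
  from bert_score import BERTScorer

  generated = "No pleural effusion was detected on the left."
  reference = "No pleural effusion is seen on the left side."

  # 1-gram BLEU, as described above (weights = (1, 0, 0, 0)).
  bleu1 = sentence_bleu([reference.split()], generated.split(), weights=(1, 0, 0, 0))

  # BERT-Sim with bert-base-uncased; the F1 component is reported.
  scorer = BERTScorer(model_type="bert-base-uncased", lang="en")
  _, _, f1 = scorer.score([generated], [reference])
  print(bleu1, f1.item())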

Data Records

The dataset is available at Huggingface16. The CT volumes after standardization are also uploaded. The training and validation sets are organized as one JSON record per case with the following fields (an illustrative record is shown after the list):

  • Volumename: string. The original volume name; each volume name matches a specific CT scan.

  • ClinicalInformation_EN: string. The original clinical information.

  • Technique_EN: string. The original technique information including the spacing and view position.

  • Findings_EN: string. The original finding section.

  • Impressions_EN: string. The original impression section.

  • Impressions_DIS: list. The disorders extracted from the impression section.

  • hierarchical_findings: json. The hierarchically structured findings.

    • lung

      • abnormalities: list. The abnormalities present in the lungs.

      • non_abnormalities: list. The abnormalities reported as absent in the lungs.

      • sentence: list. The finding sentences related to the lungs.

    • pleura

      • ...

    • bone

    • ...
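For illustration, a hypothetical record following the field list above might look as below; all values, including the volume name, are made up for demonstration and only a subset of regions is shown.

  {
    "Volumename": "example_volume.nii.gz",
    "ClinicalInformation_EN": "...",
    "Technique_EN": "...",
    "Findings_EN": "...",
    "Impressions_EN": "...",
    "Impressions_DIS": ["hiatal hernia"],
    "hierarchical_findings": {
      "lung": {
        "abnormalities": ["atelectasis"],
        "non_abnormalities": ["nodule"],
        "sentence": ["Atelectasis is observed in the middle lobe and lower lobe of the right lung."]
      },
      "pleura": {
        "abnormalities": [],
        "non_abnormalities": ["pleural effusion"],
        "sentence": ["No pleural effusion was detected."]
      }
    }
  }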

To support training on various tasks, we organize the data into separate files for the different subsets, as described below. The RadGenome-Chest CT-GRG dataset is provided as a CSV file with the following columns:

  • Volumename: string. Each volume name corresponds to a specific CT scan.

  • Anatomy: string. This details the specific anatomical region, such as lung, left lung or left lung lower lobe.

  • Sentence: string. This describes the specified anatomy.

RadGenome-Chest CT-GVQA is provided as several CSV files, one per task (abnormality, presence, and size), as described below; a sketch of building question-answer pairs from these files follows the field lists.

Abnormality: For a given image and specified anatomy, this task involves identifying any abnormalities present. The model is required to predict the specific abnormality located within the anatomy.

  • Volumename: string. Each volume name corresponds to a specific CT scan.

  • Anatomy: string. This specifies the anatomical region, similar to above.

  • Abnormality: string. This lists the abnormalities present in the specified anatomy.

  • Non-Abnormality: string. This lists the abnormalities absent in the specified anatomy.

Presence: For a given image and specified anatomy, this task involves detecting the presence of a specific abnormality. The model must accurately describe the abnormality and confirm its presence with a 'Yes' or 'No'.

  • Volumename: string. Each volume name corresponds to a specific CT scan.

  • Anatomy: string. This specifies the anatomical region.

  • Finding: string. The findings associated with the anatomy.

  • Presence: string. This indicates whether the finding is ‘Yes’ or ‘No’.

Size: For a given image and specified anatomy, this task involves predicting the dimensions of a specific abnormality in millimeters or centimeters.

  • Volumename: string. Each volume name corresponds to a specific CT scan.

  • Anatomy: string. This specifies the anatomical region.

  • Abnormality: string. This lists the abnormalities found in the specified anatomy.

  • Size: string. This provides the size of the abnormality, measured in mm or cm.
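As a minimal sketch, one row of the presence-task CSV can be turned into a question-answer pair as below; the file name and the question template are illustrative assumptions, not the released templates.

  import pandas as pd

  # Hypothetical file name; the presence-task CSV has the columns described above.
  df = pd.read_csv("gvqa_presence_train.csv")

  def row_to_qa(row) -> tuple:
      # Build a presence-style question from the Finding and Anatomy columns;
      # the answer is the 'Yes'/'No' value in the Presence column.
      question = f"Is there any evidence of {row['Finding']} in the {row['Anatomy']}?"
      return question, row["Presence"]

  qa_pairs = [row_to_qa(row) for _, row in df.iterrows()]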

Technical Validation

Dataset Analysis

In this section, we analyze the distribution of the proposed dataset. The hierarchically structured reports and abnormality trees enable systematic extraction and analysis of anatomical structures and abnormal findings for each case. First, we present the distribution of hierarchical anatomy structures in the structured reports in Fig. 5. We then compare the normal-to-abnormal case ratios for both the training and validation sets, as illustrated in Fig. 6(a),(b). Additionally, we visualize the extracted abnormalities from all anatomical regions using word clouds, presented in Fig. 6(c),(d). These visualizations highlight the frequency and diversity of abnormal findings across different anatomical structures in our dataset.

Fig. 5
figure 5

Distribution of the hierarchical anatomy structures in structured reports.

Fig. 6
figure 6

(a) Distribution of Normal vs. Abnormal Cases: Training set analysis. (b) Distribution of Normal vs. Abnormal Cases: Validation set analysis. (c) Abnormalities Word Cloud: Visual summary of frequent abnormalities from findings. (d) Disorders Word Cloud: Visual summary of frequent disorders from findings.

Benchmark Results

The performance of the two baseline models is shown in Tables 2 and 3. Specifically, Table 2 presents the results of these models across all test sets and tasks. For the abnormality task, we report results on all cases and on abnormal cases only; since half of the cases have no findings, focusing on abnormal cases better reflects the model's ability to identify abnormalities. The results show that the Baseline-Grounded model consistently outperforms the Baseline-Global model across all tasks. In particular, on the abnormality identification task, the Baseline-Grounded model demonstrates superior accuracy in predicting abnormalities in specific regions, both for all cases and for abnormal cases. Table 3 highlights the performance of the two baseline models on test subsets that use node anatomies as query regions, excluding large regions such as the lungs and mediastinum. Here, the superiority of the Baseline-Grounded model is more pronounced: the BLEU score for the GRG task improved from 34.76 to 38.43 (a 10.6% relative improvement), demonstrating that localized region information enables more precise and relevant descriptions. Similarly, presence detection accuracy increased from 95.67% to 98.05%, showing that region masks help the model better determine whether specific structures are present in the designated area. These results demonstrate that region-specific visual grounding provides essential context for radiological analysis, particularly for tasks requiring detailed anatomical understanding and abnormality detection in specific regions; the detailed region masks provide the model with stronger and more fine-grained clues, enhancing its predictive accuracy. We also provide a detailed case study of Baseline-Grounded and Baseline-Global in Fig. 7; as shown, Baseline-Grounded tends to output more detailed and accurate descriptions of abnormalities.

Table 2 Comparison of baseline models on all test sets.
Table 3 Comparison of baseline models on test subsets that focus on regions belonging to node anatomies.
Fig. 7
figure 7

Case study of RadGenome-Chest CT-GRG.