Background & Summary

The recent evolution of large-scale foundation models1,2,3,4,5 has sparked significant interest in the development of generalist medical AI (GMAI) systems6,7,8,9,10, particularly within radiology, a crucial component of medical diagnostics. By training on large-scale visual-language medical datasets, i.e., medical scans paired with global clinical reports (for example, MIMIC-CXR11 contains chest X-ray scans from 227,835 studies, and CT-RATE12 contains chest CT scans from 20,000 patients), these medical models have demonstrated a preliminary ability to write clinical reports, supporting radiologists throughout their workflow and markedly reducing their workload.

While these datasets have advanced medical AI capabilities, they only provide global reports for each medical scan, which limits their usefulness for training models for grounded report generation or grounded question answering, i.e., describing regional abnormalities and relevant normal findings, or answering questions about specific regions. Several efforts have addressed this limitation for 2D chest X-rays. For example, Chest ImaGenome13 provides bounding box annotations for 39 anatomical regions and relation triplets extracted from radiological reports. PadChest-GR14 offers grounded radiology reports for chest X-rays, while MS-CXR15 delivers semantic segmentation masks for 10 anatomical regions paired with region-specific descriptions. However, for 3D medical imaging such as chest CT, visually grounded datasets remain notably scarce, primarily due to the complexity of 3D volumes and intricate anatomical structures. To address this gap and advance the development of more capable generalist models, we propose extending existing image-report datasets with region-wise descriptions by linking the descriptive findings in diagnostic reports to their corresponding anatomical regions in the images, using segmentation masks to enhance explainability.

In this paper, we introduce RadGenome-Chest CT16, a comprehensive, large-scale, fine-grained annotated dataset for 3D chest CT interpretation, built upon the publicly available CT-RATE12. First, we employ a recent, powerful text-prompted universal segmentation model, SAT17, to segment the primary anatomical targets in each image. Next, using large language models and NER models, we break all reports into an anatomically hierarchical structured format and link the reports' sentences to visual regions in the CT volume. Finally, we generate visual question-answering pairs closely related to the structured reports and segmented images, at both the region level and the case level. In summary, we extend the original image-report dataset in the following aspects:

  (i) Organ-level segmentation masks covering 197 categories, i.e., all the critical regions appearing in clinical CT reports;

  (ii) 665K multi-granularity grounded reports, with each sentence grounded to the corresponding anatomical region;

  (iii) 1.2M grounded VQA pairs, concerning both critical region-wise findings and comprehensive case-wise impressions. All questions and answers are linked to segmentation masks for reference.

We believe that RadGenome-Chest CT16, with its region-to-report associations, can advance the development of multimodal medical foundation models that generate text grounded on the corresponding visual regions, a capability unattainable with previous related datasets.

Methods

In this section, we first introduce the source dataset on which RadGenome-Chest CT is built. We then provide a detailed description of the procedure for obtaining segmentation masks and region-wise reports, as illustrated in Fig. 1. The outcomes of each step are presented in Fig. 2.

Fig. 1
figure 1

Data construction pipeline of RadGenome-Chest CT.

Fig. 2
figure 2

Overview of results obtained from each step of the data construction pipeline.

Data Source

We initiate our study with CT-RATE12,18,19, a dataset of 25,692 non-contrast 3D chest CT volumes from 21,304 unique patients; each volume is accompanied by a radiology text report and annotated with 18 distinct types of abnormalities. These 25,692 volumes have been reconstructed with various methods to accommodate different window settings, for a total of 50,188 images. For consistency in this paper, we standardize all CT volumes to a uniform voxel spacing of 1 × 1 × 3 mm, resulting in 25,692 paired CT volumes and reports. We follow the official split: 20,000 patients (24,128 volumes) are allocated to training and 1,304 patients (1,564 volumes) to validation.
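As a minimal sketch, resampling a volume to a uniform voxel spacing can be done with SimpleITK as below; this is an illustration of the standardization step under our own assumptions, not necessarily the exact preprocessing used for CT-RATE, and the file name is hypothetical.

  import SimpleITK as sitk

  def resample_to_spacing(image, new_spacing=(1.0, 1.0, 3.0)):
      # Compute the output grid size that preserves the physical extent of the scan.
      old_spacing, old_size = image.GetSpacing(), image.GetSize()
      new_size = [int(round(sz * sp / ns))
                  for sz, sp, ns in zip(old_size, old_spacing, new_spacing)]
      # Resample with linear interpolation; air (-1024 HU) fills out-of-field voxels.
      return sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                           image.GetOrigin(), new_spacing, image.GetDirection(),
                           -1024, image.GetPixelID())

  volume = sitk.ReadImage("example_volume.nii.gz")  # hypothetical file name
  volume_1x1x3 = resample_to_spacing(volume)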

Constructing RadGenome-Chest CT

The pipeline consists of three major stages, as shown in Fig. 1: (i) segmentation mask generation, where detailed masks for each anatomical region in the chest CT volumes are created; (ii) region-specific report division, which annotates and categorizes the sentences of the radiology text reports by the anatomical regions to which they refer; and (iii) rule-based question generation, which extracts entities from each sentence and formulates visual question answering (VQA) pairs linked to specific segmentation masks.

Segmentation Mask Generation

To segment as many anatomical regions as possible, we employ the recent SAT17 model, a knowledge-enhanced segmentation model that uses natural language prompts to segment 3D medical volumes. The model has been trained on 72 diverse segmentation datasets, covering 498 classes across various anatomical regions including the brain, head and neck, thorax, spine, abdomen, and limbs. In our work, SAT is used to perform detailed segmentation across all volumes of the CT-RATE dataset. Specifically, we segment 197 regions pertinent to chest CT scans, enabling precise anatomical analysis. The list of segmented anatomies is organized into a hierarchical tree, as shown in the Supplementary File, and includes several major regions such as the lungs, trachea and bronchi, mediastinum, heart, pleura, bones, thyroid, breasts, abdomen, and other areas.
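Conceptually, the segmentation step amounts to prompting the model once per anatomical name. The sketch below illustrates this loop under our own assumptions; the callable passed in stands in for SAT inference, whose real interface is defined by the SAT code base, and the region names shown are only a small subset of the 197 classes.

  import numpy as np
  from typing import Callable, Dict, Iterable

  # A small illustrative subset of the chest-relevant classes in the hierarchy.
  CHEST_REGIONS = ["left lung upper lobe", "left lung lower lobe",
                   "right lung middle lobe", "trachea", "esophagus", "heart"]

  def segment_regions(volume: np.ndarray,
                      segment_fn: Callable[[np.ndarray, str], np.ndarray],
                      regions: Iterable[str] = CHEST_REGIONS) -> Dict[str, np.ndarray]:
      """Run a text-prompted segmenter once per anatomical prompt.

      `segment_fn` is a stand-in for SAT inference, mapping (volume, text prompt)
      to a binary mask; the actual call signature depends on the SAT repository.
      """
      return {region: segment_fn(volume, region) for region in regions}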

Region-wise Reports Generation

The goal here is to break entire reports into region-wise descriptions. We observe that the OpenAI GPT-42 model can complete this task with very high accuracy20,21; however, running GPT-4 on the entire dataset would be prohibitively expensive, so we train a dedicated model for report division. Specifically, we first employ GPT-42 to annotate the anatomical regions of each sentence in the "FINDINGS" section of 2,500 radiology reports, which comprise all reports in the validation set of the original CT-RATE dataset. The prompt used is as follows.

This process results in 15,926 annotated sentences. Note that our approach fully supports multi-region annotations: a single sentence can be linked to multiple anatomical regions if it references them. We then divide these sentences into training and validation subsets in an 8:2 ratio and train a GPT-2 model, using each annotated sentence together with up to two preceding and two following sentences from the report as input, when available. The model is trained to output the list of anatomical regions associated with each sentence. For instance, for the target sentence "No pleural effusion was detected on the left.", the input is "There is minimal pleural effusion on the right. No pleural effusion was detected on the left. Atelectasis is observed in the middle lobe and lower lobe of the right lung. A malignant mass is observed around the lower lobe bronchi of the left lung.", and the expected output is "left lung". The model achieves an accuracy of 94.56% on the validation set, so we use it to perform inference on all sentences across the entire set of reports. More examples of segmentation results and structured reports are provided in the Supplementary File.
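As a minimal sketch of the input construction for this classifier, the snippet below joins the target sentence with up to two preceding and two following sentences from the same report; tokenization and any special formatting passed to the fine-tuned GPT-2 are omitted and would be implementation details of our own.

  from typing import List

  def build_division_input(sentences: List[str], idx: int, window: int = 2) -> str:
      """Concatenate the target sentence with up to `window` preceding and
      following sentences from the same report, as context for the region
      classifier."""
      start = max(0, idx - window)
      end = min(len(sentences), idx + window + 1)
      return " ".join(sentences[start:end])

  report = ["There is minimal pleural effusion on the right.",
            "No pleural effusion was detected on the left.",
            "Atelectasis is observed in the middle lobe and lower lobe of the right lung.",
            "A malignant mass is observed around the lower lobe bronchi of the left lung."]
  context = build_division_input(report, idx=1)
  # Expected model output for this target sentence: "left lung"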

Named Entity Recognition

Through the process described above, each "FINDINGS" section is divided into multiple sentences, each associated with one or multiple segmented regions. To facilitate the generation of detailed question-answer pairs from these sentences, we first apply the Named-Entity Recognition (NER) model from RaTEScore to all sentences. This step extracts entities categorized as "anatomy", "abnormality", and "non-abnormality". Here, "anatomy" refers to anatomical regions, "abnormality" refers to findings or diseases identified as present, and "non-abnormality" indicates findings or diseases reported as absent. Subsequently, all extracted "abnormality" and "non-abnormality" entities undergo quality evaluation with GPT-4, allowing us to filter out and revise any inaccuracies. For instance, when the NER model extracts a compound abnormality such as "structural distortion and volume loss", GPT-4 splits it into "structural distortion" and "volume loss" for more accurate categorization. We filter out abnormalities for which GPT-4 outputs "no" and update the rest to the revised versions provided by GPT-4. The prompt used is as follows:
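A minimal sketch of this verification call with the OpenAI chat API is given below; the actual verification prompt is the one reproduced in the paper, so VERIFY_PROMPT is only a placeholder, and the message formatting is our own assumption.

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  VERIFY_PROMPT = "..."  # placeholder; the real verification prompt is given above

  def verify_abnormality(sentence: str, extracted_entity: str, model: str = "gpt-4") -> str:
      """Ask the LLM whether an NER-extracted abnormality is correct and, if not,
      obtain revised entities (e.g. splitting 'structural distortion and volume
      loss' into two entries)."""
      response = client.chat.completions.create(
          model=model,
          temperature=0,
          messages=[
              {"role": "system", "content": VERIFY_PROMPT},
              {"role": "user",
               "content": f"Sentence: {sentence}\nExtracted abnormality: {extracted_entity}"},
          ],
      )
      return response.choices[0].message.content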

In addition, for the "IMPRESSION" section, we directly use GPT-4 to extract all disorders mentioned, together with detailed information on the presence of any abnormalities in specific anatomical regions. The prompt used is as follows.

Post-processing Pipeline

To ensure the quality and consistency of our dataset, we implemented a comprehensive post-processing pipeline using the state-of-the-art LLM "o3-mini". This verification system was designed to check anatomical hierarchy consistency and appropriate abnormality extraction. We validated that all identified regions adhered to our hierarchical anatomical reference structure; when discrepancies were detected, the LLM provided corrections while maintaining anatomical accuracy. For example, in a case where the sentence "It may be compatible with parenchymal calcification" had been assigned to the generic category "abdominal tissue", our validation system analyzed the preceding text ("A nonspecific density increase is observed in the subcapsular area at the dome level") to correctly assign it to the "liver", demonstrating the system's ability to leverage contextual information for improved anatomical precision. For the identification of abnormalities, we implemented a targeted validation process that examined each finding sentence and its associated abnormalities. The system differentiated between existing abnormalities and normal findings, checking that the extracted entities properly reflected the radiological description in the original report. A common improvement involved consolidating redundant or overlapping abnormality descriptions. For instance, for the finding "When examined in the lung parenchyma window; the mass appearance of the left lower lobe bronchi are stable.", our initial extraction had created multiple separate abnormality entries ("mass appearance" and "mass appearance of the left lower lobe bronchi"), which our validation system correctly merged into a single, more coherent abnormality description. The prompt is provided in the Supplementary File.
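The hierarchy-consistency part of this check can be pictured as testing whether a predicted region label is a node of the reference anatomy tree and flagging everything else for LLM correction. The sketch below uses a tiny hypothetical tree fragment of our own; the full reference hierarchy is the one in the Supplementary File.

  # Tiny hypothetical fragment; the full hierarchy is in the Supplementary File.
  ANATOMY_TREE = {
      "lung": ["left lung", "right lung"],
      "left lung": ["left lung upper lobe", "left lung lower lobe"],
      "abdomen": ["liver", "abdominal tissue"],
  }

  def valid_nodes(tree: dict) -> set:
      # Collect every parent and child label appearing in the hierarchy.
      nodes = set(tree)
      for children in tree.values():
          nodes.update(children)
      return nodes

  def region_is_consistent(predicted_region: str, tree: dict = ANATOMY_TREE) -> bool:
      """Return False for labels outside the reference hierarchy; such cases are
      sent to the LLM for correction together with the surrounding report context
      (see the 'liver' example above)."""
      return predicted_region in valid_nodes(tree)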

Rule-based Question Generation

Here, we describe the procedure for generating grounded visual question-answering (VQA) pairs from the "FINDINGS" section of the report, while case-level visual question-answering data is generated from the "IMPRESSION" section. Table 1 presents the five question types in the proposed dataset. The detailed rules are introduced in the following sections.

Table 1 Question and answer types in the proposed dataset.

First, after the region-wise report generation, the sentences in the findings section can be classified as follows:

  • Normal Findings: Sentences that report no significant changes from normal health conditions.

    • No abnormality entities in the sentence: Sentences that mention specific anatomical regions without noting any abnormalities. For example, “Thoracic aorta diameter is normal.”

    • No anatomy entities in the sentence: Sentences that solely note the absence of specific abnormalities without referring to any anatomical regions. For example, “No pleural effusion was detected.”

    • With anatomy and abnormality entities in the sentence: Sentences that explicitly state the absence of abnormalities in specific anatomical regions. For example, “Bilateral adrenal glands were normal and no space-occupying lesion was detected.”

  • Abnormal Findings: Sentences that report differences from normal anatomical conditions.

    • No anatomy entities in the sentence: Sentences that report an abnormal finding, but do not specify an anatomical region. For example, “Mild hiatal hernia is observed.”

    • With anatomy entities in the sentence: Sentences that include both an anatomical reference and describe an abnormality. For example, “There is narrowing of the spinal canal at the dorso-lumbar level.”

We then construct an anatomical disorder tree for each report, based on the anatomical hierarchical tree introduced in the Supplementary File: any abnormalities that are present are marked on the corresponding nodes of the hierarchy. As shown in Fig. 3, this data structure enables us to generate question-answer pairs. Taking inspiration from previous research22,23, we categorize the questions into four types: 1) abnormality, 2) presence, 3) location, and 4) size. Table 1 shows examples of the different question types. Note that we have designed 50 templates for each question type; all templates are provided in the Supplementary File. For instance, for a sentence indicating normal findings, such as "Bilateral adrenal glands were normal and no space-occupying lesion was detected", questions include "Is there any evidence of abnormality in adrenal glands?" and "What abnormality is seen in the adrenal glands?". Conversely, for a sentence with abnormal findings, "There is a narrowing of the spinal canal at the dorso-lumbar level.", questions include "Is there any evidence of narrowing in the spinal canal?", "What abnormality is seen at the dorso-lumbar level of the spinal canal?", and "Where in the spinal canal is the narrowing located?". For impression sentences, since we have already extracted disorders, we can generate case-level questions such as "What disorder can be identified from this volume?" for each case.

Fig. 3
figure 3

Partial visualization of the anatomical hierarchical tree with abnormalities annotated.
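A minimal sketch of the template-based generation is given below. Only a couple of illustrative templates per type are shown (the dataset uses 50 per type, listed in the Supplementary File), and the exact template wording and sampling strategy here are our own assumptions.

  import random

  # Illustrative templates only; the dataset defines 50 per question type.
  TEMPLATES = {
      "presence": ["Is there any evidence of {abnormality} in the {anatomy}?",
                   "Does the {anatomy} show any sign of {abnormality}?"],
      "abnormality": ["What abnormality is seen in the {anatomy}?",
                      "What finding is observed in the {anatomy}?"],
      "location": ["Where in the {anatomy} is the {abnormality} located?"],
  }

  def make_question(qtype: str, anatomy: str, abnormality: str = "") -> str:
      # Sample one template of the requested type and fill in the entities
      # taken from the anatomical disorder tree.
      template = random.choice(TEMPLATES[qtype])
      return template.format(anatomy=anatomy, abnormality=abnormality)

  # From the abnormal-finding example above.
  print(make_question("presence", anatomy="spinal canal", abnormality="narrowing"))
  print(make_question("location", anatomy="spinal canal", abnormality="narrowing"))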

In summary, we generated 1.2M grounded visual question answering (VQA) pairs for training and 85k for validation, along with 24,128 case-level VQA pairs for training and 1,564 for validation. Detailed counts for each type of VQA pair in the training and validation sets are provided in Table 1. We performed a thorough manual verification of all samples in the validation set. The verification was conducted systematically, with reviewers examining each sample for the accuracy of extracted sentences, anatomical structures, abnormalities, and non-abnormalities. This review covered 9,008 unique finding sentences from the validation set; only 0.58% of sentences contained abnormality extraction errors and 0.46% contained anatomical labeling errors. Since we applied identical post-processing across the entire dataset, these low error rates are also indicative of the quality of the training set. All identified errors were systematically corrected before finalizing the validation set, ensuring high data quality for benchmark evaluation.

Benchmark Construction

Utilizing the structured reports and question-answer pairs in RadGenome-Chest CT16, we have developed two benchmarking tasks to assess the performance of generalist medical AI models: Grounded Report Generation (RadGenome-Chest CT-GRG) and Grounded Visual Question Answering (RadGenome-Chest CT-GVQA). Both tasks take an image and a mask of the region of interest as inputs.

  • RadGenome-Chest CT-GRG: This task aims to generate a comprehensive report detailing the findings within a specified region of interest.

  • RadGenome-Chest CT-GVQA: This task involves answering specific questions regarding the presence, location, and size of abnormalities within the highlighted region, guided by the region mask.

We divided the training and validation sets according to the official split provided by CT-RATE. Specifically, 20,000 patients (24,128 volumes) were allocated for training, and 1,304 patients (1,564 volumes) were designated for validation. Consequently, 624,876 GRG pairs and 1,057,946 GVQA pairs are used for training, and 40,342 GRG pairs and 66,875 GVQA pairs for validation.

Baseline Models

We implement two baseline models, with and without region mask inputs, on RadGenome-Chest CT16. These models are evaluated on the two benchmark tasks: RadGenome-Chest CT-GRG and RadGenome-Chest CT-GVQA. Note that our experiments serve as a preliminary demonstration of the necessity and benefit of including region masks in the dataset. Due to computational limitations, we selected a small subset of the dataset for training, consisting of 2,000 volumes, 51,959 GRG pairs, and 87,435 GVQA pairs.

  • Baseline-Global: This baseline model responds solely to whole-image data. As shown in Fig. 4, the query image is input to the visual encoder to obtain the global image token, which is then concatenated with the question token before being input to the model. The visual encoder is a 3D ConvNet initialized from SAT17, while the language model is LLaMA 325, fine-tuned using LoRA24 with a rank of 8.

    Fig. 4
    figure 4

    The architecture of the Baseline-Grounded model. The image is input to the visual encoder to obtain the global image token. The mask feature, processed by the mask encoder, queries the global image token to generate the grounded image token, which is then used as input to the multimodal model. Note that for the Baseline-Global model, only the global image token is used as input to the multimodal model.

  • Baseline-Grounded: This baseline model supports region mask input, leveraging spatial information from the region of interest to improve the accuracy of generated reports and responses. As shown in Fig. 4, the query image is input to the visual encoder, and the mask is input to a lightweight mask encoder. The encoded mask embedding is used as 'queries', while the encoded image embedding serves as 'keys' and 'values'; the resulting grounded tokens are concatenated with the question token and input to the model (a sketch of this cross-attention step is given after this list). The visual encoder adopts the 3D ConvNet initialized from SAT17, while the language model is LLaMA 325, fine-tuned using LoRA24 with a rank of 8.
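The sketch below illustrates the mask-as-query cross-attention described above. All dimensions, module choices, and tensor shapes are our own assumptions for illustration, not the released implementation.

  import torch
  import torch.nn as nn

  class GroundedTokenBuilder(nn.Module):
      """Cross-attention where the encoded mask embedding acts as the query and
      the encoded image embedding provides keys/values, yielding the grounded
      image token fed to the language model. Dimensions are illustrative."""
      def __init__(self, dim: int = 768, num_heads: int = 8):
          super().__init__()
          self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

      def forward(self, mask_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
          # mask_emb: (B, Nm, dim) from the lightweight mask encoder
          # image_emb: (B, Ni, dim) from the 3D ConvNet visual encoder
          grounded, _ = self.attn(query=mask_emb, key=image_emb, value=image_emb)
          return grounded

  builder = GroundedTokenBuilder()
  grounded_tokens = builder(torch.randn(1, 4, 768), torch.randn(1, 256, 768))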

Evaluation Metrics

We use different evaluation metrics for different tasks. The metrics are listed as follows.

  • BLEU. BLEU stands for "BiLingual Evaluation Understudy"; it measures n-gram precision between the generated and reference reports. We use nltk.translate.bleu_score.sentence_bleu with weights of (1, 0, 0, 0) to calculate the 1-gram BLEU score (a computation sketch is given after this list).

  • ROUGE. ROUGE stands for “Recall-Oriented Understudy for Gisting Evaluation” and refers to a set of evaluation measures that assess the overlap between a generated summary and a set of reference summaries. We use ROUGE1 by default.

  • BERT-Sim. BERT-Sim stands for “BERT similarity score”, which measures the similarity between the generated report and the reference report using contextualized word embeddings provided by the BERT model. In this paper, we use bert_score.BERTScorer with model type as bert-base-uncased to calculate the BERT-Sim.

  • RadGraph F1. RadGraph F1 is an automatic metric that computes the overlap in clinical entities and relations between a model-generated report and a radiologist-generated report.

  • RadCliQ. RadCliQ predicts a radiologist-determined error score from a combination of automated metrics, including BLEU, BERTScore, CheXbert vector similarity, and RadGraph.

  • Hit Score: The Hit Score is a specific metric designed for evaluating abnormalities. It compares the model’s prediction with the ground truth. If the model’s prediction matches any abnormality specified in the ground truth, the score is set to 1; otherwise, it is 0. The overall score is then calculated as the average of these individual scores.

  • Accuracy: Accuracy is calculated for the presence task. If the model’s prediction matches the ground truth exactly (model prediction == ground truth), the score is 1; if not, the score is 0. The average of these scores is then calculated.

  • L1 (mm): For the size task, both the ground truth and model predictions are converted to millimeters (mm) to ensure uniformity. The L1 error is then calculated as the absolute difference between these values.
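A minimal sketch of the BLEU-1 and BERT-Sim computation with the libraries named above is given below; it assumes one reference per generated sentence, and the example strings are illustrative.

  from nltk.translate.bleu_score import sentence_bleu
  from bert_score import BERTScorer

  generated = "No pleural effusion was detected on the left."
  reference = "No pleural effusion is seen on the left side."

  # 1-gram BLEU, as described above (weights = (1, 0, 0, 0)).
  bleu1 = sentence_bleu([reference.split()], generated.split(), weights=(1, 0, 0, 0))

  # BERT-Sim with bert-base-uncased; the F1 component is reported.
  scorer = BERTScorer(model_type="bert-base-uncased", lang="en")
  _, _, f1 = scorer.score([generated], [reference])
  print(bleu1, f1.item())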

Data Records

The dataset is available at Huggingface16. The CT volumes after standardization are also uploaded. The training and validation sets are organized as one JSON record per case with the following fields (an illustrative record is shown after the list):

  • Volumename: string. The original volume name; each volume name matches a specific CT scan.

  • ClinicalInformation_EN: string. The original clinical information.

  • Technique_EN: string. The original technique information including the spacing and view position.

  • Findings_EN: string. The original finding section.

  • Impressions_EN: string. The original impression section.

  • Impressions_DIS: list. The disorders extracted from the impression section.

  • hierarchical_findings: json. The hierarchically structured findings.

    • lung

      • abnormalities: list. The abnormalities present in the lungs.

      • non_abnormalities: list. The abnormalities reported as absent in the lungs.

      • sentence: list. The finding sentences related to the lungs.

    • pleura

      • ...

    • bone

    • ...
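For illustration, a hypothetical record following the field list above might look as below; all values, including the volume name, are made up for demonstration and only a subset of regions is shown.

  {
    "Volumename": "example_volume.nii.gz",
    "ClinicalInformation_EN": "...",
    "Technique_EN": "...",
    "Findings_EN": "...",
    "Impressions_EN": "...",
    "Impressions_DIS": ["hiatal hernia"],
    "hierarchical_findings": {
      "lung": {
        "abnormalities": ["atelectasis"],
        "non_abnormalities": ["nodule"],
        "sentence": ["Atelectasis is observed in the middle lobe and lower lobe of the right lung."]
      },
      "pleura": {
        "abnormalities": [],
        "non_abnormalities": ["pleural effusion"],
        "sentence": ["No pleural effusion was detected."]
      }
    }
  }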

To support training on various tasks, we organize the data into separate files for the different subsets, as described below. The RadGenome-Chest CT-GRG dataset is provided as a CSV file with the following columns:

  • Volumename: string. Each volume name corresponds to a specific CT scan.

  • Anatomy: string. This details the specific anatomical region, such as lung, left lung or left lung lower lobe.

  • Sentence: string. This describes the specified anatomy.

RadGenome-Chest CT-GVQA is provided as several CSV files, one per task (abnormality, presence, and size), as described below; a sketch of building question-answer pairs from these files follows the field lists.

Abnormality: For a given image and specified anatomy, this task involves identifying any abnormalities present. The model is required to predict the specific abnormality located within the anatomy.

  • Volumename: string. Each volume name corresponds to a specific CT scan.

  • Anatomy: string. This specifies the anatomical region, similar to above.

  • Abnormality: string. This lists the abnormalities present in the specified anatomy.

  • Non-Abnormality: string. This lists the abnormalities absent in the specified anatomy.

Presence: For a given image and specified anatomy, this task involves detecting the presence of a specific abnormality. The model must accurately describe the abnormality and confirm its presence with a 'Yes' or 'No'.

  • Volumename: string. Each volume name corresponds to a specific CT scan.

  • Anatomy: string. This specifies the anatomical region.

  • Finding: string. The findings associated with the anatomy.

  • Presence: string. This indicates whether the finding is ‘Yes’ or ‘No’.

Size: For a given image and specified anatomy, this task involves predicting the dimensions of a specific abnormality in millimeters or centimeters.

  • Volumename: string. Each volume name corresponds to a specific CT scan.

  • Anatomy: string. This specifies the anatomical region.

  • Abnormality: string. This lists the abnormalities found in the specified anatomy.

  • Size: string. This provides the size of the abnormality, measured in mm or cm.
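As a minimal sketch, one row of the presence-task CSV can be turned into a question-answer pair as below; the file name and the question template are illustrative assumptions, not the released templates.

  import pandas as pd

  # Hypothetical file name; the presence-task CSV has the columns described above.
  df = pd.read_csv("gvqa_presence_train.csv")

  def row_to_qa(row) -> tuple:
      # Build a presence-style question from the Finding and Anatomy columns;
      # the answer is the 'Yes'/'No' value in the Presence column.
      question = f"Is there any evidence of {row['Finding']} in the {row['Anatomy']}?"
      return question, row["Presence"]

  qa_pairs = [row_to_qa(row) for _, row in df.iterrows()]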

Technical Validation

Dataset Analysis

In this section, we analyze the distribution of the proposed dataset. The hierarchically structured reports and abnormality trees enable systematic extraction and analysis of anatomical structures and abnormal findings for each case. First, we present the distribution of hierarchical anatomy structures in the structured reports in Fig. 5. We then compare the normal-to-abnormal case ratios for both the training and validation sets, as illustrated in Fig. 6(a),(b). Additionally, we visualize the extracted abnormalities from all anatomical regions using word clouds, presented in Fig. 6(c),(d). These visualizations highlight the frequency and diversity of abnormal findings across different anatomical structures in our dataset.

Fig. 5
figure 5

Distribution of the hierarchical anatomy structures in structured reports.

Fig. 6
figure 6

(a) Distribution of Normal vs. Abnormal Cases: Training set analysis. (b) Distribution of Normal vs. Abnormal Cases: Validation set analysis. (c) Abnormalities Word Cloud: Visual summary of frequent abnormalities from findings. (d) Disorders Word Cloud: Visual summary of frequent disorders from findings.

Benchmark Results

The performance of the two baseline models is shown in Tables 2 and 3. Specifically, Table 2 presents the results of these models across all test sets and tasks. For the abnormality task, we report results on all cases and on abnormal cases only; since half of the cases have no findings, focusing on abnormal cases better reflects the model's ability to identify abnormalities. The results show that the Baseline-Grounded model consistently outperforms the Baseline-Global model across all tasks. In particular, on the abnormality identification task, the Baseline-Grounded model demonstrates superior accuracy in predicting abnormalities in specific regions, both for all cases and for abnormal cases. Table 3 highlights the performance of the two baseline models on test subsets that use node anatomies as query regions, excluding large regions such as the lungs and mediastinum. Here, the superiority of the Baseline-Grounded model is more pronounced: the BLEU score for the GRG task improved from 34.76 to 38.43 (a 10.6% relative improvement), demonstrating that localized region information enables more precise and relevant descriptions. Similarly, presence detection accuracy increased from 95.67% to 98.05%, showing that region masks help the model better determine whether specific structures are present in the designated area. These results demonstrate that region-specific visual grounding provides essential context for radiological analysis, particularly for tasks requiring detailed anatomical understanding and abnormality detection in specific regions; the detailed region masks provide the model with stronger and more fine-grained clues, enhancing its predictive accuracy. We also provide a detailed case study of Baseline-Grounded and Baseline-Global in Fig. 7; as shown, Baseline-Grounded tends to output more detailed and accurate descriptions of abnormalities.

Table 2 Comparison of baseline models on all test sets.
Table 3 Comparison of baseline models on test subsets that focus on regions belonging to node anatomies.
Fig. 7
figure 7

Case study of RadGenome-Chest CT-GRG.