Text-image alignment for ILD imaging: linking CXR evidence to CT quantification

  • Jiani Gao,
  • Yijiu Ren,
  • Fengjing Yang,
  • Xuefei Hu,
  • Changbo Sun,
  • Sihua Wang &
  • Chang Chen

npj Digital Medicine (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computational biology and bioinformatics
  • Diseases
  • Health care
  • Mathematics and computing
  • Medical research

Abstract

Assessment of Interstitial Lung Disease (ILD) relies on chest radiographs (CXR) for screening and computed tomography (CT) for definitive quantification. However, current AI pipelines typically treat these modalities in isolation, leading to report hallucinations and cross-modal inconsistencies. To address this fragmentation, we propose a framework (ARCTIC-ILD) that aligns CXR-derived textual evidence with CT-level segmentation and quantification. The system first employs a calibrated CXR evidence extractor to map radiographs to ILD-specific terminology, producing structured findings. These findings condition a terminology-to-mask module that utilizes lightweight cross-attention adapters to generate lobe-aware CT masks and burden estimates. Crucially, an explicit vision-language audit enforces consistency between the generated text and quantitative data. Evaluations on paired CXR-CT cohorts demonstrate that the framework significantly reduces text hallucination and improves phrase-to-mask alignment without incurring additional inference latency. By coupling reporting with quantification under an auditable protocol, this approach aligns with clinical workflows, serving as a robust assistant for triage, structured reporting, and longitudinal follow-up.
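To make the terminology-to-mask conditioning concrete, the following is a minimal sketch of a lightweight cross-attention adapter in PyTorch. The class name, tensor shapes, projection layer, and residual fusion are illustrative assumptions rather than the released ARCTIC-ILD implementation; the sketch only shows how CT feature tokens could attend to embeddings of CXR-derived ILD terms before a lobe-aware mask head.

# Minimal, hypothetical sketch of a terminology-conditioned cross-attention adapter.
# Class name, dimensions, and the residual fusion are illustrative assumptions,
# not the released ARCTIC-ILD code.
import torch
import torch.nn as nn

class TermCrossAttentionAdapter(nn.Module):
    """Let CT feature tokens attend to embeddings of CXR-derived ILD terms."""
    def __init__(self, feat_dim: int = 256, term_dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.proj_terms = nn.Linear(term_dim, feat_dim)  # map terminology embeddings into feature space
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, ct_tokens: torch.Tensor, term_emb: torch.Tensor) -> torch.Tensor:
        # ct_tokens: (B, N_tokens, feat_dim) flattened CT features
        # term_emb:  (B, N_terms, term_dim) embeddings of structured CXR findings
        kv = self.proj_terms(term_emb)
        attended, _ = self.attn(query=ct_tokens, key=kv, value=kv)
        return self.norm(ct_tokens + attended)  # residual fusion keeps the backbone output intact

# Toy usage: eight structured findings conditioning 1,024 CT feature tokens.
adapter = TermCrossAttentionAdapter()
fused = adapter(torch.randn(2, 1024, 256), torch.randn(2, 8, 256))  # (2, 1024, 256)

Because only a small adapter is added on top of an existing CT backbone, a design of this kind keeps the number of trainable parameters and the extra inference cost modest, which is the intuition behind describing such adapters as lightweight.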

Data availability

All datasets used in this study are publicly accessible from the following official sources:

  • MIMIC-CXR: https://physionet.org/content/mimic-cxr/2.0.0/
  • MIMIC-CXR-JPG (processed JPG version with standard splits): https://physionet.org/content/mimic-cxr-jpg/2.0.0/
  • HUG-ILD (HRCT with 3D annotations for interstitial lung disease): https://www.uhbs.ch/en/research/research-infrastructures/hug-ild-database
  • ReXGroundingCT (3D chest CT with text-finding-to-voxel-mask alignments): https://arxiv.org/abs/2507.22030

Code availability

All experiments were implemented in Python 3.10 using PyTorch (v2.3) with CUDA 12.1 and cuDNN 9, and were executed on four NVIDIA A100 GPUs (80 GB each) under Linux. Medical image input/output and sliding-window inference follow MONAI (v1.5.1), and evaluation metrics are computed with TorchMetrics using synchronized reduction on a single device. Mixed-precision training relies on the torch.amp autocast and GradScaler utilities, and all optimization, augmentation, and calibration settings are exactly as specified in the Training details section to ensure reproducibility. The full training and inference code, together with configuration files and the random seeds used for all reported runs, will be publicly released on GitHub after formal publication of the paper.
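As an illustration of the setup described above, here is a minimal sketch of a mixed-precision training step with torch.amp autocast plus a gradient scaler, and patch-wise CT inference with MONAI's sliding_window_inference. The model, loss function, batch keys, ROI size, and overlap are placeholder assumptions and do not reproduce the exact configuration reported in the Training details section.

# Minimal sketch: AMP training step plus MONAI sliding-window inference.
# Model, loss, batch keys, ROI size, and overlap are placeholders, not the released config.
import torch
from monai.inferers import sliding_window_inference

scaler = torch.cuda.amp.GradScaler()  # gradient scaler for mixed-precision training

def train_step(model, batch, optimizer, loss_fn, device="cuda"):
    images, masks = batch["image"].to(device), batch["label"].to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast(device_type="cuda"):  # forward pass in mixed precision
        loss = loss_fn(model(images), masks)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

@torch.no_grad()
def infer_ct_volume(model, volume, device="cuda"):
    # Patch-wise inference over a full CT volume with overlapping windows.
    model.eval()
    return sliding_window_inference(
        inputs=volume.to(device),
        roi_size=(96, 96, 96),
        sw_batch_size=4,
        predictor=model,
        overlap=0.25,
    )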


Acknowledgements

This study was supported by the Noncommunicable Chronic Diseases-National Science and Technology Major Project (2024ZD0529006).

Author information

Author notes
  1. These authors contributed equally: Jiani Gao, Yijiu Ren, Fengjing Yang.

Authors and Affiliations

  1. Department of Thoracic Surgery, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai, PR China

    Jiani Gao, Yijiu Ren, Xuefei Hu, Changbo Sun & Chang Chen

  2. Department of Thoracic Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, PR China

    Fengjing Yang & Sihua Wang

  3. Clinical Center for Thoracic Surgery Research, Tongji University, Shanghai, PR China

    Chang Chen


Contributions

J.G., Y.R., and F.Y. contributed equally to this work, having full access to all study data and assuming responsibility for the integrity and accuracy of the analyses (validation, formal analysis). J.G. conceptualized the study, designed the methodology, and participated in securing research funding (conceptualization, methodology, funding acquisition). Y.R. carried out data acquisition, curation, and investigation (investigation, data curation) and provided key resources, instruments, and technical support (resources, software). F.Y. drafted the initial manuscript and generated visualizations (writing—original draft, visualization). C.S., S.W., X.H., and C.C. supervised the project, coordinated collaborations, and ensured administrative support (supervision, project administration). All authors contributed to reviewing and revising the manuscript critically for important intellectual content (writing—review and editing) and approved the final version for submission.

Corresponding authors

Correspondence to Xuefei Hu, Changbo Sun, Sihua Wang or Chang Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Consent for publication

Not applicable. This work exclusively utilizes de-identified datasets available from public repositories.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Gao, J., Ren, Y., Yang, F. et al. Text-image alignment for ILD imaging: linking CXR evidence to CT quantification. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-025-02292-9


  • Received: 02 November 2025

  • Accepted: 16 December 2025

  • Published: 04 February 2026

  • DOI: https://doi.org/10.1038/s41746-025-02292-9


Associated content

Collection

Emerging Applications of Machine Learning and AI for Predictive Modeling in Precision Medicine
