Abstract
Assessment of Interstitial Lung Disease (ILD) relies on chest radiographs (CXR) for screening and computed tomography (CT) for definitive quantification. However, current AI pipelines typically treat these modalities in isolation, leading to report hallucinations and cross-modal inconsistencies. To address this fragmentation, we propose a framework (ARCTIC-ILD) that aligns CXR-derived textual evidence with CT-level segmentation and quantification. The system first employs a calibrated CXR evidence extractor to map radiographs to ILD-specific terminology, producing structured findings. These findings condition a terminology-to-mask module that utilizes lightweight cross-attention adapters to generate lobe-aware CT masks and burden estimates. Crucially, an explicit vision-language audit enforces consistency between the generated text and quantitative data. Evaluations on paired CXR-CT cohorts demonstrate that the framework significantly reduces text hallucination and improves phrase-to-mask alignment without incurring additional inference latency. By coupling reporting with quantification under an auditable protocol, this approach aligns with clinical workflows, serving as a robust assistant for triage, structured reporting, and longitudinal follow-up.
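The terminology-to-mask conditioning described in the abstract can be pictured with a minimal sketch, shown below. This is an illustrative assumption rather than the released implementation: the class name (TextConditionedAdapter), dimensions, and token counts are hypothetical, and only the general pattern is shown, with CT feature tokens cross-attending to embedded structured findings through a residual connection.

```python
# Hypothetical sketch of a lightweight cross-attention adapter of the kind
# the abstract describes: flattened CT feature tokens attend to embeddings
# of the structured CXR findings. All names and sizes are illustrative.
import torch
import torch.nn as nn

class TextConditionedAdapter(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ct_tokens: torch.Tensor, finding_tokens: torch.Tensor) -> torch.Tensor:
        # ct_tokens: (B, N_ct, dim) flattened CT feature tokens
        # finding_tokens: (B, N_terms, dim) embedded ILD terminology
        attended, _ = self.attn(ct_tokens, finding_tokens, finding_tokens)
        return self.norm(ct_tokens + attended)  # residual conditioning

# Shape check with random tensors.
adapter = TextConditionedAdapter()
ct = torch.randn(2, 512, 256)       # e.g., pooled 3D feature-map tokens
findings = torch.randn(2, 8, 256)   # e.g., eight finding embeddings
out = adapter(ct, findings)         # -> (2, 512, 256)
```

Keeping the adapter residual means the CT decoder degrades gracefully when no structured findings are available, which is one reason such adapters are described as lightweight.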
Data availability
All datasets used in this study are publicly accessible from the following official sources: MIMIC-CXR: https://physionet.org/content/mimic-cxr/2.0.0/. MIMIC-CXR-JPG (processed JPG version with standard splits): https://physionet.org/content/mimic-cxr-jpg/2.0.0/. HUG-ILD (HRCT with 3D annotations for interstitial lung disease): https://www.uhbs.ch/en/research/research-infrastructures/hug-ild-database. ReXGroundingCT (3D chest CT with alignments between free-text findings and voxel-level masks): https://arxiv.org/abs/2507.22030.
Code availability
All experiments were implemented in Python 3.10 using PyTorch (v2.3) with CUDA 12.1 and cuDNN 9, and were executed on four NVIDIA A100 GPUs (80 GB each) under a Linux environment. Medical image input/output and sliding-window inference follow MONAI (v1.5.1), and evaluation metrics are computed with TorchMetrics using synchronized reduction on a single device. Mixed-precision training relies on the torch.amp autocast and GradScaler utilities, and all optimization, augmentation, and calibration settings are exactly as specified in the Training details section to ensure reproducibility. The full training and inference code, together with the configuration files and random seeds used for all reported runs, will be publicly released on GitHub after formal publication of the paper.
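For illustration only, the following minimal sketch shows how the scaffolding described above fits together: a torch.amp mixed-precision training step and MONAI sliding-window inference over a CT volume. The network, loss, learning rate, and ROI size are placeholders, not the settings from the Training details section.

```python
# Minimal sketch of the mixed-precision training step and tiled CT inference
# described above. Network, loss, learning rate, and ROI size are
# placeholders, not the released configuration.
import torch
from torch.amp import autocast
from monai.inferers import sliding_window_inference
from monai.losses import DiceCELoss
from monai.networks.nets import UNet

device = "cuda"
model = UNet(spatial_dims=3, in_channels=1, out_channels=2,
             channels=(16, 32, 64), strides=(2, 2)).to(device)  # placeholder net
loss_fn = DiceCELoss(to_onehot_y=True, softmax=True)            # placeholder loss
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)      # placeholder lr
scaler = torch.cuda.amp.GradScaler()

def train_step(images: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    # images: (B, 1, D, H, W); masks: (B, 1, D, H, W) integer labels.
    optimizer.zero_grad(set_to_none=True)
    with autocast(device_type="cuda"):      # reduced-precision forward pass
        loss = loss_fn(model(images), masks)
    scaler.scale(loss).backward()           # scaled backward pass
    scaler.step(optimizer)                  # unscales grads; skips step on inf/NaN
    scaler.update()
    return loss.detach()

@torch.no_grad()
def infer_volume(ct_volume: torch.Tensor) -> torch.Tensor:
    # Tiled 3D inference over a full CT volume, stitched with overlap blending.
    return sliding_window_inference(
        inputs=ct_volume, roi_size=(96, 96, 96),
        sw_batch_size=4, predictor=model, overlap=0.5,
    )
```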
References
Raghu, G. et al. Diagnosis of idiopathic pulmonary fibrosis. An official ATS/ERS/JRS/ALAT Clinical Practice Guideline. Am. J. Respir. Crit. Care Med. 198, e44–e68 (2018).
Chae, K. J. et al. Central role of CT in management of pulmonary fibrosis. Radiographics 44, e230165 (2024).
Christensen, J. D. et al. ACR Appropriateness Criteria® Chronic Dyspnea-Noncardiovascular Origin: 2024 update. J. Am. Coll. Radiol. 22, S163–S176 (2025).
Hansell, D. M. et al. Fleischner Society: glossary of terms for thoracic imaging. Radiology 246, 697–722 (2008).
Jacob, J. et al. Mortality prediction in idiopathic pulmonary fibrosis: evaluation of computer-based CT analysis with conventional severity measures. Eur. Respir. J. 49, 1601011 (2017).
Yu, F. et al. Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4, 100802 (2023).
Bannur, S. et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 15016–15027 (IEEE, 2023).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. In Proc. Advances in Neural Information Processing Systems 34892–34916 (Curran Associates, Inc., 2023).
Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. In Proc. Advances in Neural Information Processing Systems 1240 (Curran Associates, Inc., 2023).
Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proc. AAAI Conference on Artificial Intelligence Vol. 33, 590–597 (AAAI, 2019).
Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).
Zou, X. et al. Segment everything everywhere all at once. In Proc. Advances in Neural Information Processing Systems 868 (Curran Associates, Inc., 2023).
Lüddecke, T. & Ecker, A. Image segmentation using text and image prompts. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 7086–7096 (IEEE, 2022).
Rao, Y. et al. DenseCLIP: language-guided dense prediction with context-aware prompting. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 18082–18091 (IEEE, 2022).
Ryu, J. S., Kang, H., Chu, Y. & Yang, S. Vision-language foundation models for medical imaging: a review of current practices and innovations. Biomed. Eng. Lett. 15, 809–830 (2025).
Wasserthal, J. et al. TotalSegmentator: robust segmentation of 104 anatomic structures in CT images. Radiol. Artif. Intell. 5, e230024 (2023).
Podolanczuk, A. J. et al. Approach to the evaluation and management of interstitial lung abnormalities: an official American Thoracic Society clinical statement. Am. J. Respir. Crit. Care Med. 211, 1132–1155 (2025).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proc. International Conference on Machine Learning 1321–1330 (PMLR, 2017).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Proc. International Conference on Learning Representations (International Conference on Learning Representations, 2022).
Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
Johnson, A. E. et al. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. Preprint at arXiv https://doi.org/10.48550/arXiv.1901.07042 (2019).
Taha, A. A. & Hanbury, A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 15, 29 (2015).
Antonelli, M. et al. The medical segmentation decathlon. Nat. Commun. 13, 4128 (2022).
Baharoon, M. et al. ReXGroundingCT: a 3D chest CT dataset for segmentation of findings from free-text reports. Preprint at arXiv https://doi.org/10.48550/arXiv.2507.22030 (2025).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In Proc. International Conference on Learning Representations (International Conference on Learning Representations, 2019).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. In Proc. Advances in Neural Information Processing Systems Vol. 33, 6840–6851 (Curran Associates, Inc., 2020).
Song, J., Meng, C. & Ermon, S. Denoising diffusion implicit models. In Proc. International Conference on Learning Representations (International Conference on Learning Representations, 2021).
Chen, Z., Song, Y., Chang, T.-H. & Wan, X. Generating radiology reports via memory-driven transformer. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1439–1449 (Association for Computational Linguistics, 2020).
Post, M. A call for clarity in reporting BLEU scores. In Proc. Third Conference on Machine Translation: Research Papers 186–191 (Association for Computational Linguistics, 2018).
Banerjee, S. & Lavie, A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization 65–72 (Association for Computational Linguistics, 2005).
Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).
Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: a neural image caption generator. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3156–3164 (IEEE, 2015).
Xu, K. et al. Show, attend and tell: neural image caption generation with visual attention. In Proc. International Conference on Machine Learning 2048–2057 (PMLR, 2015).
Lu, J., Xiong, C., Parikh, D. & Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 375–383 (IEEE, 2017).
Anderson, P. et al. Bottom-up and top-down attention for image captioning and visual question answering. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 6077–6086 (IEEE, 2018).
Chen, Z., Shen, Y., Song, Y. & Wan, X. Cross-modal memory networks for radiology report generation. In Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing 5904–5914 (Association for Computational Linguistics, 2021).
Liu, F., Wu, X., Ge, S., Fan, W. & Zou, Y. Exploring and distilling posterior and prior knowledge for radiology report generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 13753–13762 (IEEE, 2021).
You, D. et al. AlignTransformer: hierarchical alignment of visual regions and disease tags for medical report generation. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention 72–82 (Springer, 2021).
Yang, S., Wu, X., Ge, S., Zhou, S. K. & Xiao, L. Knowledge matters: chest radiology report generation with general and specific knowledge. Med. Image Anal. 80, 102510 (2022).
Wang, Z., Liu, L., Wang, L. & Zhou, L. METransformer: radiology report generation by transformer with multiple learnable expert tokens. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 11558–11567 (IEEE, 2023).
Huang, Z., Zhang, X. & Zhang, S. KiUT: knowledge-injected u-transformer for radiology report generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 19809–19818 (IEEE, 2023).
Nicolson, A., Dowling, J. & Koopman, B. Improving chest X-ray report generation by leveraging warm starting. Artif. Intell. Med. 144, 102633 (2023).
Bu, S., Song, Y., Li, T. & Dai, Z. Dynamic knowledge prompt for chest X-ray report generation. In Proc. 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 5425–5436 (ELRA and ICCL, 2024).
Smit, A. et al. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1500–1519 (Association for Computational Linguistics, 2020).
Sharma, H. et al. MAIRA-Seg: enhancing radiology report generation with segmentation-aware multimodal large language models. In Proc. 4th Machine Learning for Health Symposium 941–960 (PMLR, 2025).
Srivastav, S. et al. MAIRA at RRG24: a specialised large multimodal model for radiology report generation. In Proc. 23rd Workshop on Biomedical Natural Language Processing 597–602 (Association for Computational Linguistics, 2024).
Tanno, R. et al. Collaboration between clinicians and vision-language models in radiology report generation. Nat. Med. 31, 599–608 (2025).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention 234–241 (Springer, 2015).
Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N. & Liang, J. UNet++: a nested U-Net architecture for medical image segmentation. In Proc. International Workshop on Deep Learning in Medical Image Analysis 3–11 (Springer, 2018).
Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2021).
Hatamizadeh, A. et al. UNETR: transformers for 3D medical image segmentation. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 1748–1758 (IEEE, 2022).
Zhang, D. et al. Improving medical X-ray report generation by using knowledge graph. Appl. Sci. 12, 11111 (2022).
Sloan, P., Clatworthy, P., Simpson, E. & Mirmehdi, M. Automated radiology report generation: a review of recent advances. IEEE Rev. Biomed. Eng. 18, 368–387 (2025).
Acknowledgements
This study was supported by the Noncommunicable Chronic Diseases-National Science and Technology Major Project (2024ZD0529006).
Author information
Authors and Affiliations
Contributions
J.G., Y.R., and F.Y. contributed equally to this work, having full access to all study data and assuming responsibility for the integrity and accuracy of the analyses (validation, formal analysis). J.G. conceptualized the study, designed the methodology, and participated in securing research funding (conceptualization, methodology, funding acquisition). Y.R. carried out data acquisition, curation, and investigation (investigation, data curation) and provided key resources, instruments, and technical support (resources, software). F.Y. drafted the initial manuscript and generated visualizations (writing—original draft, visualization). C.S., S.W., X.H., and C.C. supervised the project, coordinated collaborations, and ensured administrative support (supervision, project administration). All authors contributed to reviewing and revising the manuscript critically for important intellectual content (writing—review and editing) and approved the final version for submission.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Consent for publication
Not applicable. This work exclusively utilizes de-identified datasets available from public repositories.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Gao, J., Ren, Y., Yang, F. et al. Text-image alignment for ILD imaging: linking CXR evidence to CT quantification. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-025-02292-9
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-025-02292-9