Abstract
Multi-label disease diagnosis in chest X-rays necessitates simultaneous consideration of both global organ structures and local lesion characteristics. However, current methodologies primarily utilize single-branch architectures and lack effective attention guidance mechanisms, which complicates the balance between global context and local details. Furthermore, multi-label datasets for chest X-rays often suffer from significant class imbalance. We propose CR-MSNet, a dual-branch multi-scale attention network designed for multi-label chest X-ray classification. The global branch is constructed using CoAtNet-2-rw to capture holistic semantic representations, while the local branch employs a residual convolutional neural network to extract detailed lesion features. We incorporate a cross-attention mechanism to facilitate adaptive interaction and information exchange between global and local representations. Additionally, we propose a Parallel Multi-Scale Channel-Spatial Attention (PMS-CSA) module to enhance both key semantic channels and potential lesion regions, thereby increasing the discriminative power of feature representations. A two-stage training strategy with an adjusted loss function is implemented to effectively alleviate the detrimental effects of class imbalance on model performance. Experimental results indicate that CR-MSNet achieves a macro-average AUC of 0.847 on the ChestX-ray14 dataset, confirming its effectiveness and potential for application in multi-label classification tasks for chest X-rays. By integrating a dual-branch architecture with multi-scale attention mechanisms, this study demonstrates the critical role of attention-guided feature interactions in reconciling global and local representations.
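The abstract does not give implementation details of the cross-attention interaction between the two branches, but the idea of letting global tokens attend to local lesion features (and vice versa) can be sketched as follows. This is a minimal illustration only: the class name, token counts, embedding width, and head count are all hypothetical choices, not values from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch of bidirectional cross-attention between
    global-branch and local-branch feature tokens. Layer choices and
    dimensions are illustrative, not taken from CR-MSNet itself."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Global tokens query local tokens, and vice versa.
        self.g2l = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.l2g = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_g = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, g: torch.Tensor, l: torch.Tensor):
        # g: (B, Ng, dim) global-branch tokens; l: (B, Nl, dim) local-branch tokens
        g_enh, _ = self.g2l(query=g, key=l, value=l)  # global attends to local detail
        l_enh, _ = self.l2g(query=l, key=g, value=g)  # local attends to global context
        # Residual connection plus normalization keeps each branch's own features.
        return self.norm_g(g + g_enh), self.norm_l(l + l_enh)

if __name__ == "__main__":
    fuse = CrossAttentionFusion(dim=256, heads=4)
    g = torch.randn(2, 49, 256)   # e.g. a 7x7 global feature map, flattened
    l = torch.randn(2, 196, 256)  # e.g. a 14x14 local feature map, flattened
    g2, l2 = fuse(g, l)
    print(g2.shape, l2.shape)
```

Each branch keeps its own token sequence after fusion; a classifier head could then pool both sequences before the final multi-label sigmoid outputs.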
Data availability
The datasets analyzed during the current study are available at the following link: https://www.kaggle.com/datasets/nih-chest-xrays/data
Acknowledgements
The authors gratefully acknowledge all individuals who contributed directly or indirectly to this work.
Author information
Contributions
**Yu Wang**: Conceptualization, Methodology, Writing—original draft. **Caiyin Bao**: Data curation, Visualization. **Zichen Wang**: Validation. **Yupeng Shi**: Investigation, Formal analysis. **Jianlan Yang**: Supervision, Writing—review & editing. All authors have read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, Y., Bao, C., Wang, Z. et al. CR-MSNet: a dual-branch multi-scale attention network for multi-label chest X-ray classification. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44591-5