Abstract
Chest radiography is the most widely used imaging technique for diagnosing lung diseases, but interpreting X-rays can be difficult because abnormalities are often subtle and image quality varies between acquisitions. Deep learning models, especially Convolutional Neural Networks (CNNs), have shown strong performance in detecting pneumonia and other conditions, and Vision Transformers (ViTs) have recently surpassed CNNs on several chest X-ray benchmarks by dividing images into small patches and learning global relationships. However, standard ViTs can focus on irrelevant regions, making their decisions less interpretable. To address this, we propose an enhanced ViT model tailored for chest X-ray analysis that prioritizes both accuracy and explainability. Our model introduces a class-attention pooling technique in which each disease-specific class token learns to highlight the image regions relevant to that class, improving disease-wise focus. Token sparsity and random token dropping further encourage the model to attend only to the most informative patches, enhancing robustness against noise. A convolutional stem is added before patch creation to extract fine local features such as edges and textures, ensuring early capture of lung-specific patterns. Additionally, each X-ray is preprocessed with Contrast Limited Adaptive Histogram Equalization (CLAHE), which enhances local contrast and makes subtle lesions more visible. The model is trained with mixed-precision computation, a warm-up cosine learning rate schedule, and the AdamW optimizer, allowing stable and efficient training on large datasets. Evaluated on the publicly available Tuberculosis Chest X-Rays and Pulmonary Chest X-Rays datasets, the proposed framework achieves 99.19% training accuracy, 97.78% validation accuracy, an F1-score of 0.94, and an AUC of 0.99, outperforming the baseline ViT.
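The class-attention pooling and random token dropping described above can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the authors' implementation: the function and argument names are assumptions, and a real model would use learnable class-token queries inside the transformer rather than a standalone routine.

```python
import numpy as np

def class_attention_pool(patch_tokens, class_tokens, keep_ratio=1.0, rng=None):
    """Pool patch tokens with one query per disease class.

    patch_tokens: (N, D) patch embeddings from the encoder.
    class_tokens: (C, D) one (learnable) query per disease class.
    keep_ratio:   fraction of patch tokens kept (random token dropping).
    Returns (C, D): one pooled feature vector per class.
    """
    n, d = patch_tokens.shape
    if keep_ratio < 1.0:
        # Random token dropping: keep only a subset of patch tokens.
        rng = rng if rng is not None else np.random.default_rng(0)
        keep = max(1, int(n * keep_ratio))
        patch_tokens = patch_tokens[rng.choice(n, size=keep, replace=False)]
    # Scaled dot-product attention: each class token attends over patches.
    scores = class_tokens @ patch_tokens.T / np.sqrt(d)   # (C, N')
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)               # softmax over patches
    return attn @ patch_tokens                            # (C, D)
```

Each row of `attn` is a per-class weighting over patches, which is what gives the model its disease-wise focus; dropping tokens before pooling forces the attention onto the most informative remaining patches.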
We note that these scores were obtained with an image-level split, owing to limitations of the datasets, so performance may be overestimated relative to patient-level validation. Grad-CAM heatmaps further confirm that the model focuses on clinically relevant areas such as opacities and nodules, reinforcing interpretability and trust. Overall, the improved ViT framework offers both high diagnostic accuracy and clear visual explanations, suggesting its potential as an AI assistant for radiologists in efficiently detecting lung diseases.
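For readers unfamiliar with how the Grad-CAM heatmaps mentioned above are formed, the core computation can be sketched as follows. This is a generic sketch of the standard Grad-CAM recipe, assuming the feature maps and gradients of a chosen layer have already been captured; it is not tied to the authors' specific architecture.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one layer's feature maps.

    activations: (K, H, W) feature maps of the chosen layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    Returns (H, W) heatmap scaled to [0, 1].
    """
    # alpha_k: global-average-pool each channel's gradient.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam /= cam.max()  # normalize for visualization
    return cam
```

The resulting map is typically upsampled to the input resolution and overlaid on the X-ray, which is how one checks that high-importance regions coincide with opacities or nodules rather than image borders or annotations.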
Data availability
Dataset 1: Tuberculosis (TB) Chest X-ray Database: https://www.kaggle.com/datasets/tawsifurrahman/tuberculosis-tb-chest-xray-dataset
Dataset 2: Pulmonary Chest X-Ray Abnormalities: https://www.kaggle.com/datasets/kmader/pulmonary-chest-xray-abnormalities
Abbreviations
- ViT: Vision transformer
- CLAHE: Contrast limited adaptive histogram equalization
- EMA: Exponential moving average
- Grad-CAM: Gradient-weighted class activation mapping
- CE: Cross entropy
- LR: Learning rate
- ML: Machine learning
- AI: Artificial intelligence
- AUC: Area under the ROC curve
- F1: F1 score
Funding
Open access funding provided by Vellore Institute of Technology.
Author information
Authors and Affiliations
Contributions
Vaibhav Lokund, Keerthan Sundar & Anuj Khokhar: Conceptualization, Methodology, Formal analysis, Software, Writing – review & editing. Bhawana Tyagi, Naga Priyadarsini R & MohanKumar B: Supervision, Writing – review & editing.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lokunde, V., Sundar, K., Khokhar, A. et al. Class-attention pooling and token sparsity based vision transformers for chest X-ray interpretation. Sci Rep (2026). https://doi.org/10.1038/s41598-026-37109-6
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-37109-6