Abstract
Recent advances in deep learning-based cough sound analysis enable smartphone-based respiratory disease screening suitable for self-managed care in home settings, yet their utility is limited by device heterogeneity, population diversity, and the challenges of multimodal integration. We propose a device-invariant, multimodal deep learning framework that jointly models cough acoustics, demographic data, and symptom descriptions for multi-label classification of adult respiratory diseases. To address device effects, an adversarial branch embedded in the audio encoder enforces device-invariant feature learning, while an invariant risk minimization-augmented loss enhances robustness to non-structural shifts. To evaluate the proposed method, we curated a real-world, multi-center dataset of over 10,000 cases spanning seven major respiratory conditions. On individual disease identification for chronic obstructive pulmonary disease (COPD), lower respiratory tract infection (LRTI), and pulmonary shadows (PS), our method achieves superior performance with areas under the receiver operating characteristic curve (AUROC) of 0.9698, 0.8483, and 0.8720, respectively. It also shows promising results in identifying the presence of comorbidities across the seven respiratory diseases, with an overall AUROC of 0.8907. More importantly, extensive experiments demonstrate that our method mitigates device effects and facilitates cross-device generalization for cough-based respiratory disease diagnosis. This work demonstrates a scalable and transferable AI-based approach to cough-driven respiratory screening, underscoring the importance of multimodal fusion and robust representation learning in advancing clinical applicability.
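The device-invariance mechanism described above follows the standard gradient-reversal recipe for domain-adversarial training. A minimal PyTorch sketch is given below; it is illustrative only — the module names, layer sizes, and the dummy-scale form of the IRM penalty are assumptions for exposition, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients on backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate the gradient so the upstream encoder is pushed to
        # *confuse* the device classifier rather than help it.
        return -ctx.lambd * grad_output, None


class DeviceAdversarialBranch(nn.Module):
    """Device classifier fed through a gradient-reversal layer, so that
    minimizing its loss drives the shared audio encoder toward
    device-invariant features (hypothetical dimensions)."""

    def __init__(self, feat_dim=128, n_devices=4, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.GELU(), nn.Linear(64, n_devices)
        )

    def forward(self, features):
        return self.classifier(GradReverse.apply(features, self.lambd))


def irm_penalty(logits, labels):
    """IRM-style penalty: squared gradient of the per-environment risk
    with respect to a dummy scale on the logits (after Arjovsky et al.)."""
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    (grad,) = torch.autograd.grad(loss, scale, create_graph=True)
    return (grad ** 2).sum()
```

During training, the multi-label task loss, the (gradient-reversed) device-classification loss, and the IRM penalty would be combined into a single objective: the reversal makes the encoder maximize device confusion while the branch itself minimizes it.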
Data availability
The datasets generated and/or analyzed during the current study are not publicly available due to the inclusion of sensitive clinical information collected under institutional and regulatory data-use agreements, as well as proprietary components that cannot be openly released, but are available from the corresponding author upon reasonable request.
Code availability
The code developed in this study is proprietary with substantial commercial potential and, owing to intellectual property protections and ongoing commercialization activities, cannot be made publicly available at this time. All model development, training, and analysis were conducted using Python 3.10 with PyTorch 2.1.0 (available at https://pytorch.org/get-started/previous-versions/). Specific training configurations and parameters used to generate and analyze the datasets are detailed in the “Methods” section.
Acknowledgements
This work was supported by the National Key R&D Program of China (2022YFC2010005).
Author information
Authors and Affiliations
Contributions
M.Y. conceived and designed the study. X.L., W.D., Y.L., W.Z., Z.B., and J.M. collected the data. M.Y., W.Z., and Z.B. analyzed the data. M.Y., Q.W., and Y.L. drafted the manuscript. Q.W., S.C., M.Z., and J.Q. revised the draft. Q.W. supervised the study.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yang, M., Liu, X., Du, W. et al. A device-invariant multi-modal learning framework for respiratory disease classification. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02445-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41746-026-02445-4