Abstract
Depression and anxiety are among the most prevalent mental disorders, and their effective diagnosis and treatment depend on accurate characterization. Multimodal deep learning has emerged as an effective approach to enhancing diagnostic precision by integrating diverse data sources, including electronic health records, physiological signals and neuroimaging. This Review provides an overview of recent advances in multimodal deep learning for depression and anxiety estimation. Key neural network architectures are examined: convolutional neural networks for image analysis, recurrent and transformer models for sequential and textual data, and graph neural networks for capturing complex neuroimaging connectivity patterns. Challenges in data fusion, feature extraction and model interpretability are discussed, alongside strategies for improving generalizability through transfer learning. Future challenges and opportunities are then considered, including large-scale datasets, standardized evaluation protocols and interdisciplinary collaboration to bridge the gap between multimodal deep learning and clinical practice. By summarizing current practices and identifying critical challenges, this Review highlights the transformative potential of multimodal deep learning in advancing the characterization and detection of depression and anxiety.
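The multimodal fusion idea summarized above can be illustrated schematically: per-modality encoders project heterogeneous inputs into a shared embedding space, the embeddings are concatenated, and a single head scores each subject. The following is a minimal numpy sketch, not drawn from any model in this Review; all feature dimensions, weights and the linear-ReLU encoders are hypothetical placeholders for real modality-specific networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy per-modality encoder: linear projection followed by ReLU."""
    return np.maximum(x @ W, 0.0)

# Hypothetical feature matrices for 4 subjects across three modalities.
text_feats = rng.normal(size=(4, 300))   # e.g. text embeddings
audio_feats = rng.normal(size=(4, 40))   # e.g. acoustic descriptors
fmri_feats = rng.normal(size=(4, 100))   # e.g. connectivity features

# Project each modality into a shared 16-dimensional space.
W_text, W_audio, W_fmri = (rng.normal(size=(d, 16)) * 0.1
                           for d in (300, 40, 100))

# Intermediate fusion: concatenate the per-modality embeddings.
fused = np.concatenate([encode(text_feats, W_text),
                        encode(audio_feats, W_audio),
                        encode(fmri_feats, W_fmri)], axis=1)  # shape (4, 48)

# A single linear head plus sigmoid gives one risk score per subject.
w_head = rng.normal(size=(48,)) * 0.1
scores = 1.0 / (1.0 + np.exp(-(fused @ w_head)))
print(scores.shape)  # (4,)
```

In practice the toy encoders would be replaced by the architectures discussed in this Review (for example, a transformer for text, a convolutional network for spectrograms, a graph neural network for connectivity matrices), and the head would be trained end to end with the encoders.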
Author information
Authors and Affiliations
Contributions
T.L., L.C. and Z.Q. contributed equally. T.L.: literature review/collection, drafting and revision, and figures. L.C.: literature collection and figures. Z.Q.: literature collection and tables. Yiran Wang: figures. W.N., Y.W.L., Yuchen Wang and H.Z.: literature collection. W.C.C., M.T. and Z.Z.: supervision and review. L.X. and K.-C.W.: project conception, overall supervision, major revision and responsibility for the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Mental Health thanks Mohsen Sadat Shahabi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Fig. 1 and Table 1.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, T., Cho, L., Qiu, Z. et al. Depression and anxiety characterization and detection with multimodal deep learning. Nat. Mental Health (2026). https://doi.org/10.1038/s44220-026-00632-6