Abstract
The recognition of human physical activities, especially in sports such as boxing, is a complex problem that has been addressed mainly by conventional video-based models that do not account for psychological dynamics. Yet mental states such as anxiety, confidence, and focus are known to affect performance substantially, and they remain underused in current deep learning frameworks. This study proposes a multimodal deep learning framework that combines psychological profiling with video-based boxing action recognition. The approach is designed to overcome a shortcoming of existing visual analysis models, which fail to disambiguate mechanically similar actions that occur in different contexts. The proposed framework combines 3D-ResNet for spatiotemporal feature extraction from boxing videos with a BERT-based encoder for athlete psychological profiles, and the resulting representations are fused at the feature level for classification. Experiments were conducted using the HMDB51-Boxing subset and the newly constructed PsyBox-20 dataset, which links psychological states with action instances through standardized self-report scales. Results demonstrate that the multimodal model achieves an accuracy of 91.2% and an F1-score of 90.9%, outperforming video-only and psychology-only baselines as well as several state-of-the-art unimodal methods. Further analysis shows that psychological features are especially useful for distinguishing between visually similar actions, e.g., jab and hook, where context and cognitive condition play a key role in action execution. Note that the current framework does not support real-time deployment, which is left for future work. Nevertheless, the results support the hypothesis that psychological profiling improves recognition accuracy and provides useful information for AI-driven sports analytics and coaching.
Data availability
The experimental data can be obtained by contacting the corresponding author.
Author information
Contributions
Conceptualization, Lingtao Wen; methodology, Taiping Li; investigation, Taiping Li; resources, Yanqing Yan; writing—original draft preparation, Yanqing Yan; writing—review and editing, Lingtao Wen; supervision, Lingtao Wen. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This study was conducted in accordance with the ethical standards of Guangzhou Huashang College. The research protocol was reviewed and approved by the Institutional Review Board of Guangzhou Huashang College with the approval number 20250116. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
Consent to participate
Informed consent was obtained from all individual participants included in the study. All participants were provided with a full explanation of the study’s purpose, procedures, and potential risks and benefits, and they were assured of the confidentiality of their responses and the voluntary nature of their participation.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A. Detailed specification of the PsyBox-20 dataset
The PsyBox-20 dataset was created to capture psychological conditions associated with boxing actions under controlled training conditions. The dataset comprises profiles of 20 trained athletes who were asked to perform standard boxing actions (jab, hook, uppercut, block, and footwork). Each action set is accompanied by psychological measures taken beforehand. This appendix describes the psychological measures employed, the structure of the participants' profiles, and the scoring methods used to derive model-ready features.
Psychological assessment instruments
Competitive state anxiety inventory-2 (CSAI-2)
CSAI-2 is a 27-item questionnaire widely used in sport psychology to assess three performance-relevant constructs:
Cognitive Anxiety (CA): worries, intrusive thoughts, impaired concentration.
Somatic Anxiety (SA): perceived physiological activation (e.g., muscle tension, heart rate).
Self-Confidence (SC): belief in one’s ability to execute effectively under pressure.
Each subscale has nine items, rated on a 4-point Likert scale. Summing the items within each subscale (range 9–36) is conventional because the individual items are relatively redundant and are meaningful only at the aggregate level. Higher CA/SA scores indicate higher anxiety; higher SC scores indicate greater preparedness and perceived control. The subscale scores were normalized to [0, 1] and included as three distinct features in PsyBox-20.
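The scoring step can be illustrated with a minimal sketch. The text states only that subscale scores were normalized to [0, 1]; the min-max scaling over the theoretical range used here is an assumption, and the function names `score_subscale` and `normalize_subscale` are hypothetical.

```python
# Sketch of CSAI-2 subscale scoring. The min-max normalization over the
# theoretical range is an ASSUMPTION; the source only says "normalized to [0, 1]".
ITEMS_PER_SUBSCALE = 9
LIKERT_MIN, LIKERT_MAX = 1, 4  # 4-point Likert scale, so sums span 9 to 36


def score_subscale(responses):
    """Sum nine item responses into a raw subscale score."""
    if len(responses) != ITEMS_PER_SUBSCALE:
        raise ValueError("each CSAI-2 subscale has nine items")
    if any(not LIKERT_MIN <= r <= LIKERT_MAX for r in responses):
        raise ValueError("responses must lie on the 4-point scale")
    return sum(responses)


def normalize_subscale(raw_sum):
    """Map a raw subscale sum onto [0, 1] by min-max scaling (assumed)."""
    lowest = ITEMS_PER_SUBSCALE * LIKERT_MIN
    highest = ITEMS_PER_SUBSCALE * LIKERT_MAX
    return (raw_sum - lowest) / (highest - lowest)


# Illustrative (made-up) responses for the Cognitive Anxiety subscale.
ca_raw = score_subscale([2, 3, 2, 1, 4, 2, 3, 2, 2])
ca_norm = normalize_subscale(ca_raw)
```

Scaling by the theoretical range rather than the observed sample range keeps scores comparable across participants, which is one plausible reason to prefer it for a 20-athlete dataset.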
Sport confidence inventory (SCI)
The SCI assesses an athlete’s confidence in competitive contexts. It includes items measuring:
Confidence in skills and decision-making.
Confidence in physical conditioning.
Confidence relative to opponents.
Responses were summed to produce a single confidence index, later normalized to [0, 1]. This measure complements CSAI-2 by capturing stable rather than state-dependent confidence.
Aggression questionnaire (AQ)
The AQ evaluates four dimensions of aggression:
Physical Aggression.
Verbal Aggression.
Anger.
Hostility.
These subcomponents are behavioral orientations that can affect an athlete's risk tolerance and responses in combat sports. As with the other scales, the subscale scores were normalized to [0, 1]. Both the aggregated aggression index and the separate subcomponent scores are included in the final PsyBox-20 profile to preserve nuance.
Dataset structure and feature composition
Each PsyBox-20 entry corresponds to one athlete × one action batch, producing a structured profile with the following elements:
Participant ID.
Action Label (jab, hook, uppercut, block, footwork).
CSAI-2 Subscale Scores (CA, SA, SC).
SCI Score.
AQ Subscale Scores (physical, verbal, anger, hostility).
Optional biometric indicators (e.g., heart rate, reaction time), recorded when available.
All features were scaled to the [0, 1] range. The final psychological vector comprises a small number of semantically meaningful scores rather than raw questionnaire items. This reduces redundancy and keeps the vector compatible with feature-level fusion.
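The profile structure listed above could be represented as follows. This is a sketch only: the class name `PsyBoxEntry`, the field names, and the ordering of the vector are hypothetical, and the aggregated aggression index is omitted because the text does not specify how it is computed.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTIONS = ("jab", "hook", "uppercut", "block", "footwork")


@dataclass
class PsyBoxEntry:
    """One athlete x one action batch (field names are illustrative)."""
    participant_id: str
    action_label: str
    csai2_ca: float      # Cognitive Anxiety, normalized to [0, 1]
    csai2_sa: float      # Somatic Anxiety
    csai2_sc: float      # Self-Confidence
    sci: float           # Sport Confidence Inventory index
    aq_physical: float   # AQ subscales
    aq_verbal: float
    aq_anger: float
    aq_hostility: float
    heart_rate: Optional[float] = None      # optional biometric
    reaction_time: Optional[float] = None   # optional biometric

    def __post_init__(self):
        if self.action_label not in ACTIONS:
            raise ValueError(f"unknown action: {self.action_label}")
        if any(not 0.0 <= v <= 1.0 for v in self.psych_vector()):
            raise ValueError("psychological scores must lie in [0, 1]")

    def psych_vector(self) -> List[float]:
        """Return the eight questionnaire-derived scores in a fixed order."""
        return [
            self.csai2_ca, self.csai2_sa, self.csai2_sc, self.sci,
            self.aq_physical, self.aq_verbal, self.aq_anger, self.aq_hostility,
        ]


# Made-up values, for illustration only.
entry = PsyBoxEntry("P01", "jab", 0.44, 0.30, 0.81, 0.75, 0.52, 0.20, 0.35, 0.15)
vector = entry.psych_vector()
```

Keeping the optional biometric indicators out of `psych_vector` avoids missing values in the fused representation; whether the authors handled them this way is not stated.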
Data collection and alignment
Psychological tests were administered directly before the participants completed their assigned action sets. This timing captures the mental state most pertinent to the upcoming physical performance. In model training, each profile was then paired with video sequences of the same action class.
Important clarification
Because HMDB51-Boxing does not contain psychological labels, PsyBox-20 profiles were matched to HMDB51 clips by action class, rather than by the presumed psychological condition of the people in HMDB51. This is a modeling choice that investigates whether recognition performance can be improved by leveraging psychological tendencies associated with the execution of particular actions. It is not claimed that the HMDB51 actors were experiencing the same psychological conditions.
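Class-level matching of this kind can be sketched as below. The text does not say how a profile is chosen when several share a clip's action class; the seeded random choice here is an assumption, and the function name, identifiers, and example data are hypothetical.

```python
import random
from collections import defaultdict


def match_profiles_to_clips(clips, profiles, seed=0):
    """Pair each clip with a profile of the same action class.

    clips:    list of (clip_id, action_label)
    profiles: list of (profile_id, action_label)
    Random selection among same-class profiles is an ASSUMPTION.
    """
    rng = random.Random(seed)
    by_action = defaultdict(list)
    for profile_id, action in profiles:
        by_action[action].append(profile_id)

    pairs = []
    for clip_id, action in clips:
        candidates = by_action.get(action)
        if not candidates:
            continue  # no profile exists for this class, so skip the clip
        pairs.append((clip_id, rng.choice(candidates)))
    return pairs


# Made-up identifiers, for illustration only.
clips = [("clip_001", "jab"), ("clip_002", "hook"), ("clip_003", "kick")]
profiles = [("p01_jab", "jab"), ("p02_jab", "jab"), ("p01_hook", "hook")]
pairs = match_profiles_to_clips(clips, profiles)
```

Under this scheme the pairing carries information about the action class only, not about the individual shown in the clip, which is consistent with the clarification above.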
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, T., Yan, Y. & Wen, L. Integrating psychological profiling with deep learning for enhanced boxing action recognition. Sci Rep (2026). https://doi.org/10.1038/s41598-025-34771-0
DOI: https://doi.org/10.1038/s41598-025-34771-0


