Abstract
The development of effective automated systems to prevent patient self-harm in psychiatric wards is severely hampered by a scarcity of realistic training data. To address this critical gap, this study introduces a new public dataset of 1120 videos simulating cutting actions in a controlled studio environment, alongside a validation set of 118 real-world videos from secure wards that include more diverse behaviors such as picking and scratching. We conducted a comprehensive benchmark of state-of-the-art action recognition models, including both convolution-based and transformer-based architectures, to evaluate their performance and their generalizability from simulated to real-world conditions. Our results reveal a significant “sim-to-real” gap: while the top-performing model, VideoMAEv2, achieved an F1 score of 0.65 on the simulated data under 7-fold leave-one-actor-out (LOAO) cross-validation, its performance degraded to a mean F1 score of 0.61 on the real-world data. We attribute this drop to the models’ inability to generalize from the uniform, simulated actions to the diverse and often occluded behaviors observed in authentic clinical settings. By providing a foundational dataset, a systematic benchmark, and a qualitative analysis of model failure points, this study quantitatively demonstrates the limitations of current approaches. Our findings underscore the urgent need for more diverse training data and more advanced methods to develop robust technologies that can enhance patient safety.
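As a minimal illustration of the evaluation protocol, the following Python sketch scores a model under leave-one-actor-out (LOAO) cross-validation with scikit-learn. The train_and_predict callback and the per-clip actor grouping are hypothetical placeholders; the paper's actual training pipeline is not reproduced here.

    # Minimal LOAO cross-validation sketch (hypothetical placeholders,
    # not the paper's actual pipeline).
    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.metrics import f1_score

    def loao_f1(clips, labels, actors, train_and_predict):
        """Mean F1 over folds, one fold per held-out actor (7 actors -> 7 folds)."""
        fold_scores = []
        for train_idx, test_idx in LeaveOneGroupOut().split(clips, labels, groups=actors):
            # Train on all other actors, then predict on the held-out actor's clips.
            preds = train_and_predict(clips[train_idx], labels[train_idx], clips[test_idx])
            fold_scores.append(f1_score(labels[test_idx], preds))
        return float(np.mean(fold_scores))

Holding out each actor in turn prevents the model from memorizing actor-specific appearance, so the resulting score reflects generalization to unseen individuals.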
Data availability
The studio-based self-harm dataset generated and analyzed during the current study is publicly available in the ZV_Self-harm-Dataset repository, accessible at https://github.com/zv-ai/ZV_Self-harm-Dataset.
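The repository's internal layout is not described here; purely as a hypothetical illustration, the sketch below walks a local clone for .mp4 files with OpenCV and uniformly samples 16 frames per clip, a common input format for the benchmarked action recognition models.

    # Hypothetical loader: assumes .mp4 clips somewhere under a local clone.
    import glob
    import cv2
    import numpy as np

    def sample_frames(path, num_frames=16):
        """Uniformly sample num_frames RGB frames from a video file."""
        cap = cv2.VideoCapture(path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
        frames = []
        for i in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return np.stack(frames)  # (num_frames, H, W, 3)

    for clip_path in glob.glob("ZV_Self-harm-Dataset/**/*.mp4", recursive=True):
        print(clip_path, sample_frames(clip_path).shape)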
Code availability
Code is available from the corresponding author upon reasonable request.
Author information
Contributions
All authors made significant contributions to this study and approved the final manuscript. KL and DL conducted the experiments and drafted the original manuscript. HSH and HCK analyzed the data and revised the manuscript. YL contributed to the collection and extraction of self-harm data and to the revision of the manuscript. HGJ contributed expertise on self-harm behaviors and psychiatric wards, supervised the research, and contributed to the revision of the manuscript. HSC contributed to the design and supervision of the research and to the critical revision of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics approval and consent to participate
This study was conducted in adherence to the Declaration of Helsinki and was approved by the Institutional Review Board of Korea University Hospital (2022GR0511). While the IRB waived the general requirement for participant consent, written informed consent for participation and data use was specifically obtained from all actors involved in the studio dataset prior to any recording.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lee, K., Lee, D., Ham, HS. et al. Benchmarking action recognition models for self-harm detection in studio and real-world datasets. Sci Rep (2026). https://doi.org/10.1038/s41598-026-36999-w
DOI: https://doi.org/10.1038/s41598-026-36999-w


