A multi-context fusion-aware graph modelling for group activity recognition using pose-conditioned spatial encoding and actor relations

Tejonidhi, M. R.; Raghunandan, K. R.; Uma, B.; Madhu, C. K.; Vinod, A. M.

doi:10.1038/s41598-026-46296-1

Article
Open access
Published: 10 April 2026

M. R. Tejonidhi^1,2,
K. R. Raghunandan^1,
B. Uma^2,
C. K. Madhu^1,2 &
…
A. M. Vinod^1,2

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Group activity recognition requires a holistic understanding of individual actions, their spatial relationships, and the surrounding environment. Traditional methods that focus solely on isolated movements often fail to capture the complex inter-player and scene-level dependencies inherent in sports and crowd scenarios. This work develops a model for group activity recognition that integrates the poses of individual actors with pose-aligned spatial scene context for relational reasoning. Pose features of individual actors are extracted using MMPose, while scene-level context is encoded through pose-conditioned spatial feature aggregation rather than explicit semantic segmentation. The extracted pose and scene-context features are combined and used to construct Actor Relation Graphs (ARGs) using Zero-Normalized Cross-Correlation (ZNCC), which improves robustness to variations in appearance and illumination. Graph Convolutional Networks (GCNs) then model the relationships between individual actors in a scene and their group activities. In contrast to previous ARG-GCN approaches, which rely mainly on appearance features, the proposed framework explicitly combines pose-level and scene-level contextual features into a single relational graph. The model is evaluated on two benchmark datasets, the Collective Activity Dataset (CAD) and the Volleyball Dataset (VD), where it achieves classification accuracies of 95.02% and 94.81%, respectively. On a TITAN Xp GPU, the average time per 41-frame video clip is approximately 0.2 s. The results show that combining pose and scene-context features enhances graph-based relational learning and improves recognition accuracy.
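
The abstract names ZNCC as the relation measure for the Actor Relation Graph but does not give the exact formulation, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that pose and scene-context features are fused by simple concatenation, that the ZNCC scores are turned into edge weights with a row-wise softmax, and that one ReLU graph-convolution step followed by max-pooling over actors yields a group-level feature. The function names, feature dimensions, and random inputs are all hypothetical.

```python
import numpy as np


def zncc_matrix(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pairwise zero-normalized cross-correlation between rows of an (N, D) matrix."""
    # Subtract each actor's own mean, then scale to unit length, so the score
    # is unchanged by a per-actor gain or offset (e.g. an illumination change).
    centered = features - features.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    unit = centered / np.maximum(norms, eps)
    return unit @ unit.T  # values in [-1, 1]


def row_softmax(scores: np.ndarray) -> np.ndarray:
    """Normalize each row of relation scores into edge weights that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def gcn_layer(adjacency: np.ndarray, features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph-convolution step: aggregate neighbours, project, apply ReLU."""
    return np.maximum(adjacency @ features @ weight, 0.0)


rng = np.random.default_rng(0)
num_actors = 6

# Hypothetical inputs: random stand-ins for per-actor pose and scene features.
pose_feats = rng.normal(size=(num_actors, 34))   # e.g. 17 keypoints x (x, y)
scene_feats = rng.normal(size=(num_actors, 64))  # pose-conditioned scene context

# Assumption: fusion by concatenation into one node feature per actor.
node_feats = np.concatenate([pose_feats, scene_feats], axis=1)

relation = zncc_matrix(node_feats)   # (N, N) actor-to-actor similarity
adjacency = row_softmax(relation)    # (N, N) edge weights of the relation graph

weight = rng.normal(scale=0.1, size=(node_feats.shape[1], 16))  # untrained weights
hidden = gcn_layer(adjacency, node_feats, weight)               # (N, 16)
group_feat = hidden.max(axis=0)                                 # pooled over actors

print(relation.shape, hidden.shape, group_feat.shape)
```

Because each feature vector is mean-centred and scaled to unit norm before the dot product, replacing an actor's features `x` with `a * x + b` for any `a > 0` leaves every ZNCC score unchanged, which is the kind of robustness to appearance and illumination changes that the abstract attributes to ZNCC.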

Data availability

The Volleyball dataset is publicly available in the “mostafa-saad/deep-activity-rec” repository, https://github.com/mostafa-saad/deep-activity-rec?tab=readme-ov-file#dataset, and the Collective Activity Dataset is publicly available on the Computational Vision and Geometry Lab (CVGL) website at Stanford University, https://cvgl.stanford.edu/projects/collective/collectiveActivity.html. The datasets used and/or analysed during the current study will be available from the corresponding author on reasonable request.

Author information

Authors and Affiliations

Nitte (Deemed to be University), NMAM Institute of Technology (NMAMIT), Department of Computer Science and Engineering, Nitte, 574110, Karnataka, India
M. R. Tejonidhi, K. R. Raghunandan, C. K. Madhu & A. M. Vinod
Department of Computer Science and Engineering, Malnad College of Engineering, Hassan, 573201, Karnataka, India
M. R. Tejonidhi, B. Uma, C. K. Madhu & A. M. Vinod

Authors

M. R. Tejonidhi
K. R. Raghunandan
B. Uma
C. K. Madhu
A. M. Vinod

Contributions

M. R. Tejonidhi contributed to the literature survey, problem identification, design and implementation, and preparation of the article. K. R. Raghunandan guided the work throughout. B. Uma co-guided the work. C. K. Madhu and A. M. Vinod contributed to drafting the article and to reviewing and fine-tuning its content.

Corresponding authors

Correspondence to M. R. Tejonidhi or K. R. Raghunandan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Tejonidhi, M.R., Raghunandan, K.R., Uma, B. et al. A multi-context fusion-aware graph modelling for group activity recognition using pose-conditioned spatial encoding and actor relations. Sci Rep (2026). https://doi.org/10.1038/s41598-026-46296-1

Received: 19 February 2026
Accepted: 25 March 2026
Published: 10 April 2026
DOI: https://doi.org/10.1038/s41598-026-46296-1