Abstract
Action quality assessment (AQA) is an important and challenging computer-vision task that has attracted wide attention in many fields, especially sports video analysis. To address the uneven distribution of quality scores in long sports videos, this paper proposes an AQA model based on transfer neural network quality score decoupling. The model consists of three main components: a dual-stream structure combining dynamic and static streams, a quality score decoupling module, and a pairwise ranking prediction module. Specifically, inspired by action-alignment processing in video understanding, the quality score decoupling module is built on a Transformer decoder that disentangles the input visual features into high- and low-quality score features, while temporally average-pooled features serve as the average quality score representation. The overall skill level of a long video is assessed by attending, in pairwise order, to the skill-relevant parts of the video, and the assessment task is completed by aligning the scores. In addition, the algorithm adopts a twin (Siamese) neural network structure to compare paired input samples, and the dual-stream design extracts video motion information and frame-level information separately, so that the model attends to dynamic temporal cues and instantaneous action cues simultaneously, enabling the feature extraction network to obtain a richer representation. This explicit separation of motion-centric and posture-centric representations avoids early entanglement of heterogeneous quality cues; the two streams are fused only after quality-aware feature disentanglement, a late-fusion strategy particularly well suited to long, untrimmed videos.
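The decoupling idea described above can be illustrated with a minimal PyTorch sketch. This is an illustration only, not the paper's implementation: the class name, feature dimension, and use of exactly two learned queries (one per quality pole) are all assumptions, as is the choice of a standard `nn.TransformerDecoder` for the cross-attention.

```python
import torch
import torch.nn as nn

class QualityScoreDecoupler(nn.Module):
    """Sketch: disentangle clip features into high/low quality-score
    representations via learned queries cross-attending to the video
    through a Transformer decoder; a temporally average-pooled feature
    serves as the average quality score representation."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        # Two learned queries: index 0 -> high-quality, index 1 -> low-quality.
        self.queries = nn.Parameter(torch.randn(2, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)

    def forward(self, feats):
        # feats: (B, T, dim) clip-level visual features from the backbone.
        B = feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, 2, dim)
        out = self.decoder(q, feats)                      # cross-attend to video
        high, low = out[:, 0], out[:, 1]                  # decoupled features
        avg = feats.mean(dim=1)                           # average-quality rep.
        return high, low, avg
```

In this reading, the queries play a role analogous to DETR-style object queries: each query learns to pool the temporal segments most indicative of its quality pole, while the pooled average anchors the overall score.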
Finally, comparison experiments and visual validation against existing methods on several public datasets demonstrate the effectiveness and superiority of the proposed algorithm.
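The pairwise comparison with a weight-shared (Siamese) scorer can likewise be sketched. Again this is an assumed illustration, not the paper's code: the head architecture, margin value, and loss choice (`nn.MarginRankingLoss`) are placeholders for the general pairwise ranking scheme the abstract describes.

```python
import torch
import torch.nn as nn

class PairwiseRanker(nn.Module):
    """Sketch: one scoring head applied to both videos of a pair
    (shared weights = Siamese design), trained with a margin ranking
    loss so the higher-skill video receives the higher score."""

    def __init__(self, dim=512):
        super().__init__()
        self.score_head = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))
        self.loss_fn = nn.MarginRankingLoss(margin=0.5)

    def forward(self, feat_a, feat_b, target):
        # target: +1 where video a outranks video b, -1 otherwise.
        s_a = self.score_head(feat_a).squeeze(-1)  # same head scores both
        s_b = self.score_head(feat_b).squeeze(-1)  # inputs (shared weights)
        return s_a, s_b, self.loss_fn(s_a, s_b, target)
```

Because only relative order is supervised, this formulation sidesteps the uneven absolute-score distribution of long videos: the model need only learn which of two performances is better, and absolute scores can then be recovered by aligning against reference exemplars.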
Data availability
All data used and generated during the current study are available from the corresponding author upon reasonable request.
Author information
Contributions
Lei Gao: conceptualization, data curation, formal analysis, investigation, methodology, supervision, validation, writing-review and editing, and writing-original draft. Yuhong Ma: formal analysis, investigation, data curation, writing-review and editing. Sijuan Bi: formal analysis, investigation, data curation, writing-review and editing. Shuangjun Li: conceptualization, funding acquisition, investigation, validation, visualization, resources, software, writing-review and editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Gao, L., Ma, Y., Bi, S. et al. Athlete action quality assessment based on transfer neural network quality score decoupling in complex sports scenarios. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43987-7