Semantic knowledge enhanced 3D human–object interaction learning

Li, Xuyang; Li, Qiyue; Tan, Xiaopeng; Wang, Yingbin; Yin, Yichuan; Yan, Jiapeng; Li, Yuanqing; Du, Getao

doi:10.1038/s41598-026-49105-x

Article
Open access
Published: 20 April 2026

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Learning 3D human-object interaction (HOI) from 2D images is an important route to understanding how humans and objects interact in 3D space, and it is crucial for advancing embodied AI and interaction modeling. Existing 3D HOI learning methods rely on visual features alone and therefore often fail to model fine-grained interactions in complex scenarios, which leaves human contact, object affordance, and spatial relation ambiguous. To address this, we propose SKE-3DHOI, a semantic knowledge enhanced framework that integrates semantic knowledge derived from large multimodal models into visual 3D HOI reasoning. Our method generates 3D HOI semantic knowledge tensors by querying large multimodal models with HOI-specific text prompts, then fuses the encoded HOI semantics with visual embeddings through cross-attention fusion layers, which explicitly aligns visual patterns with semantic knowledge priors. Extensive experiments show that SKE-3DHOI achieves state-of-the-art performance, significantly outperforming existing methods across all metrics in 3D human-object interaction learning. The framework bridges the gap between geometric plausibility and semantic validity, advancing robust 3D HOI understanding.
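As a rough illustration of the cross-attention fusion idea mentioned in the abstract, the following minimal sketch lets each visual token attend over a set of semantic knowledge tokens and adds the attended context back to the visual token. It is not the SKE-3DHOI implementation: the function names (`softmax`, `cross_attention_fuse`), the single attention head, the absence of learned query/key/value projections, and the residual addition are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(visual, semantic):
    """Single-head scaled dot-product cross-attention (illustrative only).

    visual:   list of n_v visual embeddings, each a list of d floats (queries)
    semantic: list of n_s semantic embeddings, each a list of d floats
              (used as both keys and values; no learned projections here)
    Returns (fused, weights) where fused is n_v x d and weights is n_v x n_s.
    """
    d = len(visual[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in visual:
        # Similarity of this visual token to every semantic token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in semantic]
        w = softmax(scores)
        # Attention-weighted sum of the semantic tokens.
        context = [
            sum(w[j] * semantic[j][i] for j in range(len(semantic)))
            for i in range(d)
        ]
        # Assumed residual connection: keep the visual signal, add semantics.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return fused, weights


if __name__ == "__main__":
    visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy visual embeddings
    semantic_tokens = [[2.0, 0.0], [0.0, 2.0]]             # toy semantic embeddings
    fused_tokens, attn = cross_attention_fuse(visual_tokens, semantic_tokens)
    for row in attn:
        print([round(x, 3) for x in row])
```

In this toy setup each row of the attention matrix sums to one, so every visual token receives a convex combination of the semantic tokens. A practical fusion layer would normally add learned projections, multiple heads, and normalization.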

Data availability

The data supporting the findings of this study are publicly available. The 3DIR dataset is available at https://github.com/yyvhang/lemon_3d. The HAKE dataset is available at http://hake-mvig.cn. The PIAD dataset is available at https://github.com/yyvhang/IAGNet. The V-COCO dataset is available at https://github.com/s-gupta/v-coco.

Funding

This work was supported by the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2025JC-YBQN-942, the National Natural Science Foundation of China under Grants 62071378, 62071379, and 62306235, and the Scientific Research Program Funded by Shaanxi Provincial Education Department under Grant 25JK0683.

Author information

Authors and Affiliations

Guangzhou Institute of Technology, Xidian University, Guangzhou, 510555, China
Xuyang Li & Xiaopeng Tan
School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an, 710121, China
Qiyue Li, Yichuan Yin, Jiapeng Yan, Yuanqing Li & Getao Du
National Key Laboratory of Science and Technology on Space Microwave, Xi’an Institute of Space Radio Technology, Xi’an, 710100, China
Yingbin Wang

Contributions

Xuyang Li: Conceptualization, Formal analysis, Writing-Original manuscript. Qiyue Li: Methodology, Writing-Review & Editing, Supervision. Xiaopeng Tan: Data Curation. Yingbin Wang, Yichuan Yin, Jiapeng Yan, Yuanqing Li and Getao Du: Visualization, Investigation. All authors reviewed the manuscript.

Corresponding author

Correspondence to Qiyue Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, X., Li, Q., Tan, X. et al. Semantic knowledge enhanced 3D human–object interaction learning. Sci Rep (2026). https://doi.org/10.1038/s41598-026-49105-x

Received: 02 October 2025
Accepted: 13 April 2026
Published: 20 April 2026
DOI: https://doi.org/10.1038/s41598-026-49105-x