Abstract
Learning 3D human-object interactions (HOI) from 2D images is an important route to understanding how humans and objects interact in 3D space, and it is crucial for advancing embodied AI and interaction modeling. Existing 3D HOI learning methods rely on visual features alone, so they often fail to model fine-grained interactions in complex scenarios and leave human contact, object affordance, and spatial relations ambiguous. To address this, we propose SKE-3DHOI, a semantic-knowledge-enhanced framework that integrates semantic knowledge derived from large multimodal models into visual 3D HOI reasoning. The method queries large multimodal models with HOI-specific text prompts to generate 3D HOI semantic knowledge tensors, which encode critical HOI semantics, and fuses these tensors with visual embeddings through cross-attention fusion layers. This explicitly aligns visual patterns with semantic knowledge priors. Extensive experiments show that SKE-3DHOI achieves state-of-the-art performance, significantly outperforming existing methods across all metrics in 3D HOI learning. The framework bridges the gap between geometric plausibility and semantic validity, advancing robust 3D HOI understanding.
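The cross-attention fusion mentioned in the abstract can be illustrated with a minimal single-head sketch. It is not the authors' implementation: the abstract does not state the attention direction, the number of heads, or the layer layout, so the sketch assumes that visual embeddings act as queries, semantic knowledge embeddings act as keys and values, and a residual connection preserves the visual features. All names (`cross_attention_fuse`, `w_q`, `w_k`, `w_v`) and the random inputs are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_fuse(visual, semantic, w_q, w_k, w_v):
    """Fuse semantic tokens into visual tokens with one attention head.

    visual:   N_v x d visual embeddings (assumed to be the queries)
    semantic: N_s x d semantic knowledge embeddings (assumed keys and values)
    Returns the fused N_v x d embeddings and the N_v x N_s attention weights.
    """
    q = matmul(visual, w_q)
    k = matmul(semantic, w_k)
    v = matmul(semantic, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    # Each visual token gets a distribution over the semantic tokens.
    weights = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    # Residual connection (an assumption): keep visual evidence, add semantic context.
    fused = [[a + b for a, b in zip(v_row, a_row)]
             for v_row, a_row in zip(visual, attended)]
    return fused, weights

def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or embeddings."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 8                                  # hypothetical embedding width
visual = random_matrix(5, d, rng)      # 5 placeholder visual tokens
semantic = random_matrix(3, d, rng)    # 3 placeholder semantic knowledge tokens
w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
fused, weights = cross_attention_fuse(visual, semantic, w_q, w_k, w_v)
```

In this sketch each row of `weights` sums to one, so every visual token receives a convex combination of the projected semantic tokens. This is one simple way to read "alignment of visual patterns with semantic knowledge priors".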
Data availability
The data supporting the findings of this study is publicly available. The 3DIR dataset is available at https://github.com/yyvhang/lemon_3d. The HAKE dataset is available at http://hake-mvig.cn. The PIAD dataset is available at https://github.com/yyvhang/IAGNet. The V-COCO dataset is available at https://github.com/s-gupta/v-coco.
Funding
This work was supported by the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2025JC-YBQN-942, the National Natural Science Foundation of China under Grants 62071378, 62071379, and 62306235, and the Scientific Research Program funded by the Shaanxi Provincial Education Department under Grant 25JK0683.
Author information
Contributions
Xuyang Li: Conceptualization, Formal analysis, Writing-Original manuscript. Qiyue Li: Methodology, Writing-Review & Editing, Supervision. Xiaopeng Tan: Data Curation. Yingbin Wang, Yichuan Yin, Jiapeng Yan, Yuanqing Li and Getao Du: Visualization, Investigation. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, X., Li, Q., Tan, X. et al. Semantic knowledge enhanced 3D human–object interaction learning. Sci Rep (2026). https://doi.org/10.1038/s41598-026-49105-x
DOI: https://doi.org/10.1038/s41598-026-49105-x


