Scientific Reports
Semantic knowledge enhanced 3D human–object interaction learning
  • Article
  • Open access
  • Published: 20 April 2026

  • Xuyang Li1,
  • Qiyue Li2,
  • Xiaopeng Tan1,
  • Yingbin Wang3,
  • Yichuan Yin2,
  • Jiapeng Yan2,
  • Yuanqing Li2 &
  • …
  • Getao Du2 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

Learning 3D human-object interactions (HOI) from 2D images is an important approach to understanding how humans and objects interact in 3D space, and it is crucial for advancing embodied AI and interaction modeling. Existing 3D HOI learning methods often fail to model fine-grained interactions in complex scenarios because they rely on visual features alone, which leads to ambiguities in human contact, object affordance, and spatial relation. To address this, we propose SKE-3DHOI, a semantic knowledge enhanced framework that integrates semantic knowledge derived from large multimodal models into visual 3D HOI reasoning. Our method generates 3D HOI semantic knowledge tensors by issuing HOI-specific textual queries to large multimodal models, encodes the critical HOI semantics they return, and fuses them with visual embeddings through cross-attention fusion layers. This enables explicit alignment of visual patterns with semantic knowledge priors. Extensive experiments validate that SKE-3DHOI achieves state-of-the-art performance, significantly outperforming existing methods across all metrics in 3D HOI learning. The framework bridges the gap between geometric plausibility and semantic validity, advancing robust 3D HOI understanding.
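
The abstract names cross-attention fusion of semantic knowledge with visual embeddings but gives no layer details. The following is only a minimal sketch of that general idea, assuming single-head scaled dot-product cross-attention in which visual tokens act as queries and semantic knowledge tokens act as keys and values, followed by a residual connection. The function name, tensor shapes, and random weights are all hypothetical and are not taken from SKE-3DHOI.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(visual, semantic, w_q, w_k, w_v):
    """Fuse semantic tokens into visual tokens (hypothetical helper).

    visual:   (n_visual, d) visual embeddings, used as queries
    semantic: (n_semantic, d) semantic knowledge embeddings, used as keys/values
    Returns the fused visual tokens (n_visual, d) and the attention map
    (n_visual, n_semantic).
    """
    q = visual @ w_q
    k = semantic @ w_k
    v = semantic @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product similarity
    attn = softmax(scores, axis=-1)          # each visual token attends over semantic tokens
    fused = visual + attn @ v                # residual connection (an assumption)
    return fused, attn


# Toy inputs: 8 visual tokens and 3 semantic tokens of width 16 (illustrative sizes).
rng = np.random.default_rng(0)
d = 16
visual = rng.standard_normal((8, d))
semantic = rng.standard_normal((3, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = cross_attention_fuse(visual, semantic, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch each row of `attn` is a probability distribution over the semantic tokens, so every visual token receives a weighted mixture of semantic information while keeping its original content through the residual path.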

Data availability

The data supporting the findings of this study are publicly available. The 3DIR dataset is available at https://github.com/yyvhang/lemon_3d. The HAKE dataset is available at http://hake-mvig.cn. The PIAD dataset is available at https://github.com/yyvhang/IAGNet. The V-COCO dataset is available at https://github.com/s-gupta/v-coco.

Funding

This work was supported by the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2025JC-YBQN-942, the National Natural Science Foundation of China under Grants 62071378, 62071379 and 62306235, and the Scientific Research Program Funded by Shaanxi Provincial Education Department under Grant 25JK0683.

Author information

Authors and Affiliations

  1. Guangzhou Institute of Technology, Xidian University, Guangzhou, 510555, China

    Xuyang Li & Xiaopeng Tan

  2. School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an, 710121, China

    Qiyue Li, Yichuan Yin, Jiapeng Yan, Yuanqing Li & Getao Du

  3. National Key Laboratory of Science and Technology on Space Microwave, Xi’an Institute of Space Radio Technology, Xi’an, 710100, China

    Yingbin Wang

Contributions

Xuyang Li: Conceptualization, Formal analysis, Writing-Original manuscript. Qiyue Li: Methodology, Writing-Review & Editing, Supervision. Xiaopeng Tan: Data Curation. Yingbin Wang, Yichuan Yin, Jiapeng Yan, Yuanqing Li and Getao Du: Visualization, Investigation. All authors reviewed the manuscript.

Corresponding author

Correspondence to Qiyue Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, X., Li, Q., Tan, X. et al. Semantic knowledge enhanced 3D human–object interaction learning. Sci Rep (2026). https://doi.org/10.1038/s41598-026-49105-x

  • Received: 02 October 2025

  • Accepted: 13 April 2026

  • Published: 20 April 2026

  • DOI: https://doi.org/10.1038/s41598-026-49105-x

Scientific Reports (Sci Rep)

ISSN 2045-2322 (online)
