
  • Review Article

Extended reality technologies for applications in the metaverse

Abstract

The metaverse has gained increasing attention with advances in artificial intelligence (AI), semiconductor devices and high-speed networks. Although the metaverse has potential across various industries and consumer markets, it remains in the early stages of development, with further progress in extended reality (XR) technologies anticipated. In this Review, we provide an overview of essential XR technologies for immersive metaverse experiences enabling human–digital interactions. Motion sensing, eye tracking, pose estimation and 3D mapping, scene understanding, digital humans, conversational AI for metaverse non-player characters, motion-to-photon latency compensation and optical display systems are important for human–digital interaction in the metaverse, with AI accelerating the evolution of these technologies. Key challenges include the accuracy and robustness of sensing and recognition of users and surrounding environments, real-time content generation reflecting the users’ responses and environments, and high-performance XR head-mounted displays with compact form factors. Realizing this potential will enable people to interact more genuinely with each other and digital objects in healthcare, education, retail, manufacturing and everyday life.

Key points

  • One of the key user values of the metaverse is a sense of immersion and presence.

  • Extended reality (XR) technologies deliver this sense of immersion and presence by enhancing the expression of reality and enabling natural interactions.

  • An XR workflow consists of sensing and recognition, content generation, and output, and each of these stages draws on a range of technologies (a minimal sketch of this flow follows the list of key points).

  • Artificial intelligence (AI) technologies have crucial roles in the sensing and recognition fields, as well as in the XR content generation domain.
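
The key points above describe the XR workflow as a pipeline of sensing and recognition, content generation, and output. As a purely illustrative sketch of how those stages might fit together in a per-frame loop, the Python stub below wires them up; every class and function name here (SensorFrame, WorldState, sense, recognize, generate_content, present) is a hypothetical placeholder invented for this sketch, not an API or implementation described in the Review.

```python
# Illustrative sketch of the XR workflow named in the key points:
# sensing and recognition -> content generation -> output.
# All names are hypothetical placeholders, not an API from the Review.
from dataclasses import dataclass, field


@dataclass
class SensorFrame:
    """Raw per-frame inputs gathered by the headset (assumed fields)."""
    imu: tuple = (0.0, 0.0, 0.0)   # e.g. angular velocity from a gyroscope
    camera: object = None          # camera image placeholder
    gaze: tuple = (0.0, 0.0)       # eye-tracking sample


@dataclass
class WorldState:
    """Recognition output: where the user is and what surrounds them."""
    head_pose: tuple = (0.0, 0.0, 0.0)
    scene_objects: list = field(default_factory=list)


def sense() -> SensorFrame:
    # Sensing stage: motion sensing, eye tracking, cameras (stubbed here).
    return SensorFrame()


def recognize(frame: SensorFrame) -> WorldState:
    # Recognition stage: pose estimation, 3D mapping, scene understanding.
    return WorldState(head_pose=frame.imu)


def generate_content(state: WorldState) -> list:
    # Content generation stage: digital humans, NPC dialogue, rendered objects.
    return [f"render object anchored at head pose {state.head_pose}"]


def present(content: list) -> None:
    # Output stage: latency compensation and the optical display system.
    for item in content:
        print(item)


def xr_frame_loop(num_frames: int = 3) -> None:
    """One iteration per displayed frame; AI can assist every stage."""
    for _ in range(num_frames):
        frame = sense()
        state = recognize(frame)
        content = generate_content(state)
        present(content)


if __name__ == "__main__":
    xr_frame_loop()
```

In a real headset these stages run concurrently at different rates, with AI models assisting recognition and content generation as outlined in the key points above.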


Fig. 1: The metaverse and extended reality (XR) workflow.
Fig. 2: Timeline of motion sensing for professional and mid-level users.
Fig. 3: Eye tracking methods.
Fig. 4: Comparison between the physically based and image-based approaches.
Fig. 5: Asynchronous time warp (ATW) for rendering during rotational head motion (a reprojection sketch follows this figure list).
Fig. 6: Optical architectures for extended reality (XR) head-mounted displays (HMDs).
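
Fig. 5 concerns asynchronous time warp (ATW). As a minimal sketch of the idea behind it, the Python snippet below computes the rotation-only homography that reprojects an already rendered frame to the latest head orientation instead of re-rendering it; the pinhole intrinsics, the camera-to-world rotation convention and all function names are assumptions made for this illustration, not details taken from the Review or from any particular runtime.

```python
# Illustrative sketch (assumptions noted below) of the rotation-only
# reprojection behind asynchronous time warp (ATW, Fig. 5): a frame rendered
# at head orientation R_render is re-displayed at a newer orientation
# R_latest by warping it with a homography rather than re-rendering.
# Assumed convention: rotation matrices map camera coordinates to world
# coordinates, and K is a simple pinhole intrinsic matrix.
import numpy as np


def rotation_about_y(angle_rad: float) -> np.ndarray:
    """Rotation matrix for a toy head yaw about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def atw_homography(K: np.ndarray, R_render: np.ndarray,
                   R_latest: np.ndarray) -> np.ndarray:
    """Backward-warp homography: maps a display pixel (at the latest head
    orientation) to the pixel of the already rendered frame to sample."""
    return K @ R_render.T @ R_latest @ np.linalg.inv(K)


if __name__ == "__main__":
    # Toy pinhole intrinsics (focal length and principal point in pixels).
    K = np.array([[800.0, 0.0, 640.0],
                  [0.0, 800.0, 360.0],
                  [0.0, 0.0, 1.0]])
    R_render = np.eye(3)                        # orientation at render time
    R_latest = rotation_about_y(np.radians(2))  # head turned 2 degrees since

    H = atw_homography(K, R_render, R_latest)

    # Warp the display-centre pixel back into the rendered frame.
    p_display = np.array([640.0, 360.0, 1.0])
    p_render = H @ p_display
    p_render /= p_render[2]
    print("sample rendered frame at pixel:", p_render[:2])
```

A warp of this kind is typically applied per eye just before display scan-out, which is why ATW can hide latency caused by head rotation but not by translation or moving objects, which require depth-aware reprojection.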

Acknowledgements

The authors thank J. Tanaka, Y. Fukumoto, T. Kitao and K. Akutsu for providing advice regarding digital humans, motion capture, pose estimation and 3D mapping, and eye tracking, respectively.

Author information

Authors and Affiliations

Authors

Contributions

H. Mukawa devised the overall structure of the manuscript; contributed to writing ‘Introduction’, ‘Overview of the metaverse’, ‘Extended reality workflow and technologies’, ‘Sensing and recognition technologies’, ‘Content generation technologies’, ‘Output technologies for optical displays’ and ‘Outlook’; and is also responsible for reviewing the entire article. Y.H. and H. Mizuno contributed to writing the ‘Digital replication of humans’ section. M.M. and F.H. contributed to writing ‘Conversational AI for metaverse NPCs’. K.M. contributed to writing the ‘Motion sensing’ section. H.A. and M.F. contributed to writing ‘Pose estimation and 3D mapping’. H.A. contributed to writing ‘Motion-to-photon latency compensation’. R.O. and Y.M. contributed to writing ‘Eye tracking’. J.Y. and D.S. contributed to writing ‘Scene understanding’.

Corresponding author

Correspondence to Hiroshi Mukawa.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Electrical Engineering thanks Frank Seto, Jeff Stafford and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Game Boy Advance Architecture: https://www.copetti.org/writings/consoles/game-boy-advance/

Game Graphics: Racing the Beam: https://hackaday.com/2023/10/24/game-graphics-racing-the-beam/

GPT-4o mini: advancing cost-efficient intelligence: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

HXR Technology: https://swave.io/nanopixel-holography/

Introducing ChatGPT: https://openai.com/blog/chatgpt

Introducing the Lightship Visual Positioning System and Niantic AR Map: https://nianticlabs.com/news/lightshipsummit?hl=en

Microsoft Mesh overview: https://learn.microsoft.com/en-us/mesh/overview

mocopi: https://electronics.sony.com/more/mocopi/all-mocopi/p/qmss1-uscx

Reducing latency in mobile VR by using single buffered strip rendering: https://blog.imaginationtech.com/reducing-latency-in-vr-by-using-single-buffered-strip-rendering/

Sony Interactive Entertainment Inc., PlayStation VR2: https://www.playstation.com/en-us/ps-vr2/

Time-of-Flight (ToF) Cameras vs. other 3D Depth Mapping Cameras: https://www.e-consystems.com/blog/camera/technology/how-time-of-flight-tof-compares-with-other-3d-depth-mapping-technologies/

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mukawa, H., Hirota, Y., Mizuno, H. et al. Extended reality technologies for applications in the metaverse. Nat Rev Electr Eng (2025). https://doi.org/10.1038/s44287-025-00211-4

