Abstract
The metaverse has gained increasing attention with advances in artificial intelligence (AI), semiconductor devices and high-speed networks. Although the metaverse has potential across various industries and consumer markets, it remains in the early stages of development, with further progress in extended reality (XR) technologies anticipated. In this Review, we provide an overview of essential XR technologies for immersive metaverse experiences enabling human–digital interactions. Motion sensing, eye tracking, pose estimation and 3D mapping, scene understanding, digital humans, conversational AI for metaverse non-player characters (NPCs), motion-to-photon latency compensation and optical display systems are important for human–digital interaction in the metaverse, with AI accelerating the evolution of these technologies. Key challenges include the accuracy and robustness of sensing and recognition of users and surrounding environments, real-time content generation reflecting the users’ responses and environments, and high-performance XR head-mounted displays with compact form factors. Realizing this potential will enable people to interact more genuinely with each other and digital objects in healthcare, education, retail, manufacturing and everyday life.
Key points
- One of the key user values of the metaverse is a sense of immersion and presence.
- Extended reality (XR) technologies deliver this sense of immersion and presence by enhancing reality expressions and enabling natural interactions.
- An XR workflow consists of sensing and recognition, content generation and output, with a range of technologies underpinning each stage (a minimal pipeline sketch follows this list).
- Artificial intelligence (AI) technologies have crucial roles both in sensing and recognition and in XR content generation.
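The following minimal sketch illustrates the three-stage workflow named in the key points as a single per-frame loop. It is written purely for illustration and is not taken from the Review or from any XR runtime or SDK; all names, such as SensorFrame, recognize and generate_content, are hypothetical placeholders, and each stage stands in for the far richer processing (SLAM, machine-learning models, rendering and latency compensation) discussed in the article.

```python
"""Minimal, hypothetical sketch of an XR workflow:
sensing and recognition -> content generation -> output.
All class and function names are illustrative placeholders."""

from dataclasses import dataclass, field
import time


@dataclass
class SensorFrame:
    """Raw per-frame sensor data captured by an HMD (illustrative fields)."""
    timestamp: float
    imu_orientation: tuple[float, float, float]  # roll, pitch, yaw in degrees
    eye_gaze: tuple[float, float] = (0.0, 0.0)   # normalized gaze point


@dataclass
class WorldState:
    """Result of the recognition stage: user pose plus recognized scene labels."""
    head_pose: tuple[float, float, float]
    scene_objects: list[str] = field(default_factory=list)


def sense() -> SensorFrame:
    # Sensing stage: in a real system, IMUs, cameras and eye trackers are polled here.
    return SensorFrame(timestamp=time.monotonic(),
                       imu_orientation=(0.0, 12.5, -3.0),
                       eye_gaze=(0.48, 0.52))


def recognize(frame: SensorFrame) -> WorldState:
    # Recognition stage: pose estimation, 3D mapping and scene understanding
    # would run here; this placeholder simply passes the pose through.
    return WorldState(head_pose=frame.imu_orientation,
                      scene_objects=["table", "window"])


def generate_content(state: WorldState) -> str:
    # Content generation stage: virtual content is placed relative to the
    # recognized scene (here reduced to a descriptive string).
    anchors = ", ".join(state.scene_objects) or "free space"
    return f"virtual scene anchored to: {anchors}"


def output(content: str, frame: SensorFrame) -> None:
    # Output stage: rendering and display; the elapsed time since sensing
    # approximates the motion-to-photon budget consumed by the pipeline.
    latency_ms = (time.monotonic() - frame.timestamp) * 1000.0
    print(f"{content} | motion-to-photon budget used: {latency_ms:.2f} ms")


if __name__ == "__main__":
    frame = sense()
    state = recognize(frame)
    content = generate_content(state)
    output(content, frame)
```

In a real head-mounted display pipeline, the output stage would additionally apply late-stage reprojection against the newest head pose so that motion-to-photon latency stays within perceptual limits, in line with the latency compensation techniques covered in the Review.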
Acknowledgements
The authors thank J. Tanaka, Y. Fukumoto, T. Kitao and K. Akutsu for providing advice regarding digital humans, motion capture, pose estimation and 3D mapping, and eye tracking, respectively.
Author information
Contributions
H. Mukawa devised the overall structure of the manuscript; contributed to writing ‘Introduction’, ‘Overview of the metaverse’, ‘Extended reality workflow and technologies’, ‘Sensing and recognition technologies’, ‘Content generation technologies’, ‘Output technologies for optical displays’ and ‘Outlook’; and is also responsible for reviewing the entire article. Y.H. and H. Mizuno contributed to writing the ‘Digital replication of humans’ section. M.M. and F.H. contributed to writing ‘Conversational AI for metaverse NPCs’. K.M. contributed to writing the ‘Motion sensing’ section. H.A. and M.F. contributed to writing ‘Pose estimation and 3D mapping’. H.A. contributed to writing ‘Motion-to-photon latency compensation’. R.O. and Y.M. contributed to writing ‘Eye tracking’. J.Y. and D.S. contributed to writing ‘Scene understanding’.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Reviews Electrical Engineering thanks Frank Seto, Jeff Stafford and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
Game Boy Advance Architecture: https://www.copetti.org/writings/consoles/game-boy-advance/
Game Graphics: Racing the Beam: https://hackaday.com/2023/10/24/game-graphics-racing-the-beam/
GPT-4o mini: advancing cost-efficient intelligence: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
HXR Technology: https://swave.io/nanopixel-holography/
Introducing ChatGPT: https://openai.com/blog/chatgpt
Introducing the Lightship Visual Positioning System and Niantic AR Map: https://nianticlabs.com/news/lightshipsummit?hl=en
Microsoft Mesh overview: https://learn.microsoft.com/en-us/mesh/overview
mocopi: https://electronics.sony.com/more/mocopi/all-mocopi/p/qmss1-uscx
Reducing latency in mobile VR by using single buffered strip rendering: https://blog.imaginationtech.com/reducing-latency-in-vr-by-using-single-buffered-strip-rendering/
Sony Interactive Entertainment Inc., PlayStation VR2: https://www.playstation.com/en-us/ps-vr2/
Time-of-Flight (ToF) Cameras vs. other 3D Depth Mapping Cameras: https://www.e-consystems.com/blog/camera/technology/how-time-of-flight-tof-compares-with-other-3d-depth-mapping-technologies/
About this article
Cite this article
Mukawa, H., Hirota, Y., Mizuno, H. et al. Extended reality technologies for applications in the metaverse. Nat Rev Electr Eng (2025). https://doi.org/10.1038/s44287-025-00211-4