Scientific Reports
Identity-consistent and high-fidelity audio-driven portrait animation with enhanced latent diffusion
  • Article
  • Open access
  • Published: 05 April 2026

  • Xiangwen Ma1,
  • Jiaxin Zhao1,
  • Xiaoyu Huang1,
  • San Li2,
  • Yang Li1 &
  • …
  • Junping Yin1,3,4 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

Speech-driven portrait animation models have made significant progress in generating realistic and dynamic portrait animations. End-to-end latent diffusion paradigms, represented by Hallo, achieve impressive alignment accuracy between audio inputs and visual outputs, including lip movements, expressions, and head poses. However, because the interaction between the reference portrait information and the denoising U-Net is suboptimally designed in such architectures, certain frames in the output video sequences fail to preserve identity and background consistently. Moreover, the temporal attention within the temporal module incorporates information across all frames within each generation unit to capture overall motion trends but ignores shorter frame subsequences within the unit, and consequently loses fine-grained details between adjacent frames. To address these problems, we take the end-to-end latent diffusion paradigm Hallo as the backbone and construct a Multi-Source Self Attention (MSSA) to optimize the interaction between the reference portrait identity information and the denoising U-Net. In addition, we propose a plug-and-play, training-free method, Unit-wise Spectral-Blend Temporal Attention (U-SBTA), which simultaneously captures local high-frequency facial details from shorter frame subsequences within each generation unit, thereby improving facial fidelity in synthesized portrait videos. We evaluate our method qualitatively and quantitatively on a public dataset and on datasets that we collected. The results demonstrate that the portrait animation videos generated by our method better preserve identity and background consistency with the reference portrait and exhibit superior facial detail fidelity.
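The spectral-blend idea named in U-SBTA can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that features computed over the whole generation unit ("global") and features computed over shorter frame subsequences ("local") are already available as arrays of shape (frames, channels), and it blends them with a one-dimensional FFT along the frame axis. The function name `spectral_blend` and the parameter `cutoff_ratio` are hypothetical and chosen only for illustration.

```python
import numpy as np


def spectral_blend(global_feat, local_feat, cutoff_ratio=0.25):
    """Blend two (frames, channels) arrays along the frame axis.

    Low temporal frequencies are taken from global_feat (overall motion
    trend) and high temporal frequencies from local_feat (fine detail).
    cutoff_ratio is a hypothetical knob: the fraction of the Nyquist
    frequency below which the global features are kept.
    """
    if global_feat.shape != local_feat.shape:
        raise ValueError("global_feat and local_feat must have the same shape")

    num_frames = global_feat.shape[0]
    # Absolute temporal frequency of each FFT bin, in cycles per frame (0 to 0.5).
    freqs = np.abs(np.fft.fftfreq(num_frames))
    # 1.0 for low-frequency bins, 0.0 for high-frequency bins; broadcast over channels.
    low_pass = (freqs <= cutoff_ratio * 0.5).astype(np.float64)[:, None]

    global_spec = np.fft.fft(global_feat, axis=0)
    local_spec = np.fft.fft(local_feat, axis=0)

    # Keep the low band of the global branch and the high band of the local branch.
    blended = global_spec * low_pass + local_spec * (1.0 - low_pass)
    # The mask is symmetric in frequency, so real inputs give a real result.
    return np.fft.ifft(blended, axis=0).real


# Usage: a constant global trend combined with a frame-to-frame alternating detail.
frames, channels = 16, 4
global_feat = np.full((frames, channels), 3.0)
local_feat = ((-1.0) ** np.arange(frames))[:, None] * np.ones((1, channels))
blended = spectral_blend(global_feat, local_feat)
```

In this toy case the blended output keeps the constant level of the global branch and the alternating detail of the local branch. The method described above operates inside the temporal attention of a diffusion U-Net, which this sketch does not model.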

Data availability

The HDTF dataset \(^{40}\) can be obtained from https://github.com/MRzzm/HDTF. This dataset \(^{40}\) was collected from the publicly available YouTube website and made publicly available by its original creators. The HDTF dataset \(^{40}\) is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). It should be noted that the “Wild” dataset consists of publicly available talking head videos collected from the internet. No personally identifiable images, videos, or identity-related information from this dataset are displayed, analyzed individually, or shared in any form within this study. All displayed figures and qualitative results of the real portrait images in this paper are based exclusively on the HDTF dataset \(^{40}\). Therefore, this study complies with all ethical and licensing requirements.

References

  1. Dhariwal, P. & Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 34, 8780–8794 (2021).

  2. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).

  3. Peebles, W. & Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4195–4205 (2023).

  4. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695 (2022).

  5. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. 2256–2265 (2015).

  6. Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).

  7. Zhou, Y., Liu, Y., Shao, Y. & Chen, J. Fine-tuning diffusion model to generate new kite designs for the revitalization and innovation of intangible cultural heritage. Sci. Rep. 15, 7519 (2025).

  8. Yue, Z., Wang, J. & Loy, C. C. Efficient diffusion model for image restoration by residual shifting. IEEE Trans. Pattern Anal. Mach. Intell. 47, 116–130 (2024).

  9. Loc, I. & Unlu, M. B. Accelerating photoacoustic microscopy by reconstructing undersampled images using diffusion models. Sci. Rep. 14 (2024).

  10. Wang, J., Zhang, O. & Jiang, Y. Multimodal diffusion framework for collaborative text image audio generation and applications. Sci. Rep. 15 (2025).

  11. Guo, Y. et al. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv:2307.04725 (2023).

  12. Hu, L. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8153–8163 (2024).

  13. Tian, L., Wang, Q., Zhang, B. & Bo, L. Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In European Conference on Computer Vision. 244–260 (2024).

  14. Xu, M. et al. Hallo: Hierarchical audio-driven visual synthesis for portrait image animation. arXiv:2406.08801 (2024).

  15. Zhang, Y. et al. Meta talk: Learning to data-efficiently generate audio-driven lip-synchronized talking face with high definition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 4848–4852 (2022).

  16. Yao, S., Zhong, R., Yan, Y., Zhai, G. & Yang, X. Dfa-nerf: Personalized talking head generation via disentangled face attributes neural rendering. arXiv:2201.00791 (2022).

  17. Yu, Z. et al. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7645–7655 (2023).

  18. Wang, D., Deng, Y., Yin, Z., Shum, H.-Y. & Wang, B. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17979–17989 (2023).

  19. Chen, L., Maddox, R. K., Duan, Z. & Xu, C. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7832–7841 (2019).

  20. Blanz, V. & Vetter, T. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1063–1074 (2003).

  21. Blattmann, A. et al. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22563–22575 (2023).

  22. Chang, D. et al. Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffusion. In International Conference on Machine Learning (2024).

  23. Lu, Y., Liang, Y., Zhu, L. & Yang, Y. Freelong: Training-free long video generation with spectral blend temporal attention. In Advances in Neural Information Processing Systems (2024).

  24. Zhang, L., Rao, A. & Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3836–3847 (2023).

  25. Gu, J. et al. Reuse and diffuse: Iterative denoising for text-to-video generation. arXiv:2309.03549 (2023).

  26. Prajwal, K., Mukhopadhyay, R., Namboodiri, V. P. & Jawahar, C. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia. 484–492 (2020).

  27. Wen, X., Wang, M., Richardt, C., Chen, Z.-Y. & Hu, S.-M. Photorealistic audio-driven video portraits. IEEE Trans. Vis. Comput. Graph. 26, 3457–3466 (2020).

  28. Zhang, C. et al. Facial: Synthesizing dynamic talking face with implicit attribute learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3867–3876 (2021).

  29. Blanz, V. & Vetter, T. A morphable model for the synthesis of 3D faces. Semin. Graph. Pap. Push. Bound. 2, 157–164 (2023).

  30. Zhang, W. et al. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8652–8661 (2023).

  31. Sun, X. et al. Vividtalk: One-shot audio-driven talking head generation based on 3D hybrid prior. arXiv:2312.01841 (2023).

  32. Ma, Y. et al. Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. Vol. 2. arXiv:2312.09767 (2023).

  33. Shen, S. et al. Difftalk: Crafting diffusion models for generalized audio-driven portraits animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1982–1991 (2023).

  34. Wang, C. et al. V-express: Conditional dropout for progressive training of portrait video generation. arXiv:2406.02511 (2024).

  35. Van Den Oord, A., Vinyals, O. et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems. Vol. 30 (2017).

  36. Song, J., Meng, C. & Ermon, S. Denoising diffusion implicit models. arXiv:2010.02502 (2020).

  37. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. 8748–8763 (2021).

  38. Cao, M. et al. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22560–22570 (2023).

  39. Schneider, S., Baevski, A., Collobert, R. & Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv:1904.05862 (2019).

  40. Zhang, Z., Li, L., Ding, Y. & Fan, C. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3661–3670 (2021).

  41. Lugaresi, C. et al. Mediapipe: A framework for building perception pipelines. arXiv:1906.08172 (2019).

  42. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems. Vol. 30 (2017).

  43. Unterthiner, T. et al. Fvd: A new metric for video generation. In International Conference on Learning Representations (2019).

  44. Chung, J. S. & Zisserman, A. Out of time: automated lip sync in the wild. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops. 251–263 (Springer, 2017).

  45. Kim, M., Jain, A. K. & Liu, X. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18750–18759 (2022).

  46. Wei, H., Yang, Z. & Wang, Z. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv:2403.17694 (2024).

Acknowledgements

This work was supported by the Major Program of the National Natural Science Foundation of China (Nos. 12292980, 12292984), the National Key R&D Program of China (Nos. 2023YFA1009000, 2023YFA1009004, 2020YFA0712203, 2020YFA0712201), the Key Program of the National Natural Science Foundation of China (NSFC12031016), the Tianyuan Foundation of the National Natural Science Foundation of China (NSFC12426529), the Beijing Natural Science Foundation (BNSF-Z210003), and the Department of Science, Technology and Information of the Ministry of Education (No. 8091B042240).

Author information

Authors and Affiliations

  1. Academy for Advanced Interdisciplinary Studies, Northeast Normal University, Changchun, 130024, China

    Xiangwen Ma, Jiaxin Zhao, Xiaoyu Huang, Yang Li & Junping Yin

  2. School of Mathematical Sciences, Heilongjiang University, Harbin, 150080, China

    San Li

  3. Institute of Applied Physics and Computational Mathematics, Beijing, 100094, China

    Junping Yin

  4. Shanghai Zhangjiang Institute of Mathematics, Shanghai, 201210, China

    Junping Yin

Contributions

X.M. and X.H. developed the model and designed the architecture. Y.L. provided guidance on model construction. J.Z. and S.L. were responsible for collecting and cleaning the training data. X.M. wrote the original draft, and J.Y. supervised this research. All authors reviewed and edited the manuscript.

Corresponding authors

Correspondence to Yang Li or Junping Yin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ma, X., Zhao, J., Huang, X. et al. Identity-consistent and high-fidelity audio-driven portrait animation with enhanced latent diffusion. Sci Rep (2026). https://doi.org/10.1038/s41598-026-46445-6

  • Received: 09 June 2025

  • Accepted: 25 March 2026

  • Published: 05 April 2026

  • DOI: https://doi.org/10.1038/s41598-026-46445-6
