Abstract
Speech-driven portrait animation models have made significant progress in generating realistic and dynamic portrait animations. The class of end-to-end latent diffusion paradigms represented by Hallo achieves impressive alignment accuracy between audio inputs and visual outputs, encompassing lip movements, expressions, and head poses. However, constrained by the suboptimal interaction design between the reference portrait information and the denoising U-Net in such architectures, certain frames in the output video sequences suffer from inconsistencies in identity and background preservation. Moreover, the temporal attention within the temporal module incorporates information across all frames of each generation unit to capture overall motion trends, but ignores shorter frame subsequences within the unit and consequently loses fine-grained details between adjacent frames. To address these problems, we take the end-to-end latent diffusion paradigm Hallo as the backbone and construct a Multi-Source Self Attention (MSSA) module to optimize the interaction between reference portrait identity information and the denoising U-Net. In addition, we propose a plug-and-play, training-free method, Unit-wise Spectral-Blend Temporal Attention (U-SBTA), which simultaneously captures local high-frequency facial details from shorter frame subsequences within each generation unit, thereby improving facial fidelity in the synthesized portrait videos. Our method is comprehensively evaluated on a public dataset and our collected datasets through qualitative and quantitative analyses. The results demonstrate that the portrait animation videos generated by our method better preserve identity and background consistency with the reference portrait and exhibit superior facial detail fidelity.
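The frequency-domain blending idea behind U-SBTA follows the spectral-blend temporal attention of FreeLong \(^{28}\). The sketch below is only an illustrative approximation of that idea, not the paper's implementation: the tensor layout (frames, tokens, channels), the low-pass cutoff, the subsequence window length, and the `temporal_attn` callable are all assumptions introduced for illustration. The global branch attends over the whole generation unit and supplies the low temporal frequencies (overall motion), while the local branch attends within short subsequences and supplies the high frequencies (fine facial detail).

```python
import torch
import torch.fft as fft


def unit_local_attention(x, temporal_attn, window=4):
    # Run a (hypothetical) temporal attention module independently on short,
    # non-overlapping subsequences of one generation unit, so attention only
    # mixes information between nearby frames.
    chunks = [temporal_attn(x[i:i + window]) for i in range(0, x.shape[0], window)]
    return torch.cat(chunks, dim=0)


def spectral_blend(global_feat, local_feat, cutoff=0.25):
    """Blend two temporal-attention outputs in the temporal frequency domain.

    global_feat, local_feat: (frames, tokens, channels) tensors, from attention
    over the full generation unit and over short subsequences, respectively.
    Low frequencies are taken from the global branch, high frequencies from
    the local branch; the cutoff value here is an assumed placeholder.
    """
    f = global_feat.shape[0]
    g_freq = fft.fft(global_feat, dim=0)   # FFT along the frame axis
    l_freq = fft.fft(local_feat, dim=0)

    # Low-pass mask over temporal frequencies (fftfreq is symmetric around DC).
    freqs = fft.fftfreq(f, device=global_feat.device)
    low_pass = (freqs.abs() <= cutoff).view(f, 1, 1).to(global_feat.dtype)

    blended = g_freq * low_pass + l_freq * (1.0 - low_pass)
    return fft.ifft(blended, dim=0).real


# Usage sketch (shapes and the attention callable are placeholders):
# unit = torch.randn(16, 1024, 320)                      # (frames, tokens, channels)
# g = temporal_attn(unit)                                # attention over the whole unit
# l = unit_local_attention(unit, temporal_attn, window=4)
# out = spectral_blend(g, l, cutoff=0.25)
```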
Data availability
The HDTF dataset \(^{40}\) can be obtained from https://github.com/MRzzm/HDTF. This dataset \(^{40}\) was collected from publicly available YouTube videos and made publicly available by its original creators. The HDTF dataset \(^{40}\) is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). It should be noted that the “Wild” dataset consists of publicly available talking-head videos collected from the internet. No personally identifiable images, videos, or identity-related information from this dataset are displayed, analyzed individually, or shared in any form in this study. All displayed figures and qualitative results involving real portrait images in this paper are based exclusively on the HDTF dataset \(^{40}\). Therefore, this study complies with all ethical and licensing requirements.
References
Dhariwal, P. & Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 34, 8780–8794 (2021).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Peebles, W. & Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4195–4205 (2023).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695 (2022).
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. 2256–2265 (2015).
Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).
Zhou, Y., Liu, Y., Shao, Y. & Chen, J. Fine-tuning diffusion model to generate new kite designs for the revitalization and innovation of intangible cultural heritage. Sci. Rep. 15, 7519 (2025).
Yue, Z., Wang, J. & Loy, C. C. Efficient diffusion model for image restoration by residual shifting. IEEE Trans. Pattern Anal. Mach. Intell. 47, 116–130 (2024).
Loc, I. & Unlu, M. B. Accelerating photoacoustic microscopy by reconstructing undersampled images using diffusion models. Sci. Rep. 14 (2024).
Wang, J., Zhang, O. & Jiang, Y. Multimodal diffusion framework for collaborative text image audio generation and applications. Sci. Rep. 15 (2025).
Guo, Y. et al. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv:2307.04725 (2023).
Hu, L. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8153–8163 (2024).
Tian, L., Wang, Q., Zhang, B. & Bo, L. Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In European Conference on Computer Vision. 244–260 (2024).
Xu, M. et al. Hallo: Hierarchical audio-driven visual synthesis for portrait image animation. arXiv:2406.08801 (2024).
Zhang, Y. et al. Meta talk: Learning to data-efficiently generate audio-driven lip-synchronized talking face with high definition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 4848–4852 (2022).
Yao, S., Zhong, R., Yan, Y., Zhai, G. & Yang, X. Dfa-nerf: Personalized talking head generation via disentangled face attributes neural rendering. arXiv:2201.00791 (2022).
Yu, Z. et al. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 7645–7655 (2023).
Wang, D., Deng, Y., Yin, Z., Shum, H.-Y. & Wang, B. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17979–17989 (2023).
Chen, L., Maddox, R. K., Duan, Z. & Xu, C. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7832–7841 (2019).
Blanz, V. & Vetter, T. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Mach. Intell. 25, 1063–1074 (2003).
Blattmann, A. et al. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22563–22575 (2023).
Chang, D. et al. Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffusion. In International Conference on Machine Learning (2024).
Lu, Y., Liang, Y., Zhu, L. & Yang, Y. Freelong: Training-free long video generation with spectral blend temporal attention. In Advances in Neural Information Processing Systems (2024).
Zhang, L., Rao, A. & Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3836–3847 (2023).
Gu, J. et al. Reuse and diffuse: Iterative denoising for text-to-video generation. arXiv:2309.03549 (2023).
Prajwal, K., Mukhopadhyay, R., Namboodiri, V. P. & Jawahar, C. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia. 484–492 (2020).
Wen, X., Wang, M., Richardt, C., Chen, Z.-Y. & Hu, S.-M. Photorealistic audio-driven video portraits. IEEE Trans. Vis. Comput. Graph. 26, 3457–3466 (2020).
Zhang, C. et al. Facial: Synthesizing dynamic talking face with implicit attribute learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3867–3876 (2021).
Blanz, V. & Vetter, T. A morphable model for the synthesis of 3D faces. Semin. Graph. Pap. Push. Bound. 2, 157–164 (2023).
Zhang, W. et al. Sadtalker: Learning realistic 3D motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8652–8661 (2023).
Sun, X. et al. Vividtalk: One-shot audio-driven talking head generation based on 3D hybrid prior. arXiv:2312.01841 (2023).
Ma, Y. et al. Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv:2312.09767 (2023).
Shen, S. et al. Difftalk: Crafting diffusion models for generalized audio-driven portraits animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1982–1991 (2023).
Wang, C. et al. V-express: Conditional dropout for progressive training of portrait video generation. arXiv:2406.02511 (2024).
Van Den Oord, A., Vinyals, O. et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems. Vol. 30 (2017).
Song, J., Meng, C. & Ermon, S. Denoising diffusion implicit models. arXiv:2010.02502 (2020).
Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. 8748–8763 (2021).
Cao, M. et al. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22560–22570 (2023).
Schneider, S., Baevski, A., Collobert, R. & Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv:1904.05862 (2019).
Zhang, Z., Li, L., Ding, Y. & Fan, C. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3661–3670 (2021).
Lugaresi, C. et al. Mediapipe: A framework for building perception pipelines. arXiv:1906.08172 (2019).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. Gans trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems. Vol. 30 (2017).
Unterthiner, T. et al. Fvd: A new metric for video generation. In International Conference on Learning Representations (2019).
Chung, J. S. & Zisserman, A. Out of time: Automated lip sync in the wild. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops. 251–263 (Springer, 2017).
Kim, M., Jain, A. K. & Liu, X. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18750–18759 (2022).
Wei, H., Yang, Z. & Wang, Z. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv:2403.17694 (2024).
Acknowledgements
This work was supported by the Major Program of the National Natural Science Foundation of China (Nos. 12292980 and 12292984), the National Key R&D Program of China (Nos. 2023YFA1009000, 2023YFA1009004, 2020YFA0712203, and 2020YFA0712201), the Key Program of the National Natural Science Foundation of China (NSFC12031016), the Tianyuan Foundation of the National Natural Science Foundation of China (NSFC12426529), the Beijing Natural Science Foundation (BNSF-Z210003), and the Department of Science, Technology and Information of the Ministry of Education (No. 8091B042240).
Author information
Authors and Affiliations
Contributions
X.M. and X.H. developed the model and designed the architecture. Y.L. provided guidance on model construction. J.Z. and S.L. were responsible for collecting and cleaning the training data. X.M. wrote the original draft, and J.Y. supervised this research. All authors reviewed and edited the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ma, X., Zhao, J., Huang, X. et al. Identity-consistent and high-fidelity audio-driven portrait animation with enhanced latent diffusion. Sci Rep (2026). https://doi.org/10.1038/s41598-026-46445-6
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-46445-6