Abstract
The rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies challenges fake speech detection with an ever-evolving diversity of spoofed audio. Current approaches, which adopt a classification-based perspective, depend heavily on large amounts of training data and generalize poorly to unseen attack types. To address these limitations, this paper introduces a brain-inspired, multi-clue detection paradigm. We propose a perception-decision machine composed of two core components. The perception module employs multiple independent detectors, each optimized for Maximum Detection Precision (MaxDP) to identify a specific forgery clue. Standardizing their outputs into binary Boolean values allows for flexible computational models. The decision-making module then renders a final judgment: it first evaluates learned combinations of the detected clues through a logical reasoning process, and then aggregates the outcomes of this reasoning with a variable-length OR operation, a mechanism that enables seamless incremental learning of new forgery clues without retraining the entire system. Our results validate the effectiveness of the multi-clue detection perspective, demonstrating the framework's potential for enhanced explainability and practical adaptability to new threats.
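The decision mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation; the detector names, thresholds, and clue combinations below are hypothetical stand-ins, shown only to make the "AND over learned clue combinations, variable-length OR over combinations" structure concrete.

```python
# Illustrative sketch (hypothetical detectors and combinations, not the
# paper's actual system): binary clue detectors feed learned AND-combinations,
# which are aggregated by a variable-length OR to yield the final verdict.
from typing import Callable, List, Set

# Each perception-module detector maps a sample (here, a dict of
# precomputed feature scores) to a Boolean clue. Names are invented.
def clue_vocoder(x: dict) -> bool:
    return x.get("vocoder_artifact", 0.0) > 0.5

def clue_prosody(x: dict) -> bool:
    return x.get("prosody_anomaly", 0.0) > 0.5

def clue_silence(x: dict) -> bool:
    return x.get("silence_pattern", 0.0) > 0.5

DETECTORS: List[Callable[[dict], bool]] = [clue_vocoder, clue_prosody, clue_silence]

# Learned combinations: each set lists detector indices that must all
# fire (logical AND) for that combination to indicate a forgery.
COMBINATIONS: List[Set[int]] = [{0}, {1, 2}]

def is_fake(x: dict, combinations: List[Set[int]] = COMBINATIONS) -> bool:
    clues = [d(x) for d in DETECTORS]
    # Variable-length OR: any satisfied combination flags the sample.
    # Appending a new detector and a new combination extends the system
    # without retraining the existing clues.
    return any(all(clues[i] for i in combo) for combo in combinations)

print(is_fake({"vocoder_artifact": 0.9}))                          # True
print(is_fake({"prosody_anomaly": 0.8, "silence_pattern": 0.7}))   # True
print(is_fake({"prosody_anomaly": 0.8}))                           # False (AND unmet)
```

Note how incremental learning falls out of the aggregation: because the final OR is variable-length, adding a clue only appends entries to `DETECTORS` and `COMBINATIONS` and leaves existing components untouched.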
Data availability
The fake speech data in this study is open source. ASVspoof2019 LA can be accessed at https://datashare.ed.ac.uk/handle/10283/3336. ASVspoof2021 LA can be accessed at https://zenodo.org/record/4837263. CFAD can be accessed at https://zenodo.org/records/8122764.
References
Kaur, N. & Singh, P. Conventional and contemporary approaches used in text to speech synthesis: A review. Artif. Intell. Rev. 56, 5837–5880 (2023).
Walczyna, T. & Piotrowski, Z. Overview of voice conversion methods based on deep learning. Appl. Sci. 13, 3100 (2023).
Shah, A. J. & Patil, H. A. Significance of lower frequency regions for audio deepfake detection. In Proceedings of 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 1–6 (IEEE, 2024).
Kirchhübel, C. & Brown, G. Spoofed speech from the perspective of a forensic phonetician. In Proceedings of the 2022 Interspeech. 1308–1312 (Incheon, 2022).
Sun, C., Jia, S., Hou, S. & Lyu, S. Ai-synthesized voice detection using neural vocoder artifacts. In Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 904–912 (IEEE, 2023).
Yang, J., Das, R. K. & Li, H. Significance of subband features for synthetic speech detection. IEEE Trans. Inf. For. Secur. 15, 2160–2170 (2020).
Zhang, Y., Wang, W. & Zhang, P. The effect of silence and dual-band fusion in anti-spoofing system. In Proceedings of 2021 Interspeech. 4279–4283 (International Speech Communication Association, 2021).
Todisco, M., Delgado, H. & Evans, N. W. A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients. In Proceedings of the 9th Speaker and Language Recognition Workshop Odyssey. Vol. 2016. 283–290. 10.21437/Odyssey.2016-4 (Bilbao, 2016).
Wu, Z., Das, R. K., Yang, J. & Li, H. Light convolutional neural network with feature genuinization for detection of synthetic speech attacks. In Proceedings of 2020 Interspeech. 1101–1105 (International Speech Communication Association, 2020).
Wang, C. et al. Detection of cross-dataset fake audio based on prosodic and pronunciation features. In Proceedings of 2023 Interspeech. 3844–3848 (International Speech Communication Association, 2023).
Tak, H., Jung, J.-w., Patino, J., Todisco, M. & Evans, N. Graph attention networks for anti-spoofing. In Proceedings of the 22nd Interspeech Conference. 2356–2360. 10.21437/Interspeech.2021-993 (ISCA, 2021).
Tak, H. et al. End-to-end anti-spoofing with rawnet2. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech, and Signal Processing. 6369–6373. 10.1109/ICASSP39728.2021.9414234 (IEEE, 2021).
Chen, Y. et al. Rawbmamba: End-to-end bidirectional state space model for audio deepfake detection. In Proceedings of the 2024 Interspeech. 2720–2724 (International Speech Communication Association, 2024).
Chettri, B. et al. Ensemble models for spoofing detection in automatic speaker verification. In Proceedings of the 20th Interspeech Conference. 1018–1022. 10.21437/Interspeech.2019-2505 (ISCA, 2019).
Lavrentyeva, G. et al. STC antispoofing systems for the ASVspoof2019 challenge. In Proceedings of the 20th Interspeech Conference. 1033–1037. 10.21437/Interspeech.2019-1768 (ISCA, 2019).
Tak, H., Patino, J., Nautsch, A., Evans, N. & Todisco, M. Spoofing attack detection using the non-linear fusion of sub-band classifiers. In Proceedings of the 21st Interspeech Conference. 1106–1110. 10.21437/Interspeech.2020-1844 (ISCA, 2020).
Li, M., Ahmadiadli, Y. & Zhang, X.-P. A survey on speech deepfake detection. ACM Comput. Surv. 57, 1–38 (2025).
Dhamyal, H., Ali, A., Qazi, I. A. & Raza, A. A. Using self attention dnns to discover phonemic features for audio deep fake detection. In Proceedings of 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 1178–1184 (IEEE, 2021).
Li, K., Lu, X., Akagi, M. & Unoki, M. Contributions of jitter and shimmer in the voice for fake audio detection. IEEE Access 11, 84689–84698 (2023).
Parise, C. V. & Ernst, M. O. Correlation detection as a general mechanism for multisensory integration. Nat. Commun. 7, 11543 (2016).
Pesnot Lerousseau, J., Parise, C. V., Ernst, M. O. & van Wassenhove, V. Multisensory correlation computations in the human brain identified by a time-resolved encoding model. Nat. Commun. 13, 2489 (2022).
Rohlf, S., Li, L., Bruns, P. & Röder, B. Multisensory integration develops prior to crossmodal recalibration. Curr. Biol. 30, 1726–1732 (2020).
Wang, X. et al. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Comput. Speech Lang. 64, 101114. https://doi.org/10.1016/j.csl.2020.101114 (2020).
Liu, X. et al. Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 2507–2522 (2023).
Ma, H. et al. Cfad: A Chinese dataset for fake audio detection. Speech Commun. 164, 103122 (2024).
Jung, J.-W. et al. Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In Proceedings of the 47th IEEE International Conference on Acoustics, Speech, and Signal Processing. 6367–6371. 10.1109/ICASSP43922.2022.9747766 (IEEE, 2022).
Tak, H. et al. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proceedings of the 12th Speaker and Language Recognition Workshop Odyssey. 112–119 (2022).
Liu, X. et al. Leveraging positional-related local-global dependency for synthetic speech detection. In Proceedings of the 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 1–5 (IEEE, 2023).
Wang, S. et al. Memristor-based adaptive neuromorphic perception in unstructured environments. Nat. Commun. 15, 4671 (2024).
Yu, F. et al. Brain-inspired multimodal hybrid neural network for robot place recognition. Sci. Robot. 8, eabm6996 (2023).
Lin, X. et al. A brain-inspired computational model for spatio-temporal information processing. Neural Netw. 143, 74–87 (2021).
Grinberg, P., Kumar, A., Koppisetti, S. & Bharaj, G. What does an audio deepfake detector focus on? a study in the time domain. In ICASSP 2025–2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1–5 (IEEE, 2025).
Jung, H. & Oh, Y. Towards better explanations of class activation mapping. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1336–1344 (2021).
Ravanelli, M. & Bengio, Y. Speaker recognition from raw waveform with SincNet. In Proceedings of 2018 IEEE Spoken Language Technology Workshop. 1021–1028 (IEEE, 2018).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. 770–778 (IEEE, 2016).
Babu, A. et al. Xls-r: Self-supervised cross-lingual speech representation learning at scale. In Proceedings of the 23rd Interspeech Conference. 2278–2282. 10.21437/Interspeech.2022-143 (ISCA, 2022).
Acknowledgements
We thank the Beijing Municipal Science and Technology Commission for funding and supporting Project Z221100001222005 under the Beijing Science and Technology Plan, and the Tianshan Talents Cultivation Program - Leading Talents for Scientific and Technological Innovation (No. 2024TSYCLJ0002) for its support.
Funding
This work was funded by Beijing Science and Technology Financial Innovation Support Project (Z221100001222005).
Author information
Authors and Affiliations
Contributions
Chang Feng analyzed the data, performed the experiments and wrote the initial manuscript. Xiaolong Wu edited the figures, provided feedback and revised the manuscript. Hamdulla Askar and Mingxing Xu supervised the research. Lihong Cao reviewed and edited the final manuscript. Thomas Fang Zheng conceived the project, designed the study, reviewed and edited the final manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
See Table 3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Feng, C., Wu, X., Askar, H. et al. Brain-inspired perception-decision machine for fake speech detection. Sci Rep (2026). https://doi.org/10.1038/s41598-026-41859-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-41859-8