Abstract
The rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies challenges fake speech detection with an ever-evolving diversity of spoofed audio. Current approaches, which adopt a classification-based perspective, depend heavily on large amounts of training data and generalize poorly to unseen attack types. To address these limitations, this paper introduces a brain-inspired, multi-clue detection paradigm. We propose a perception-decision machine composed of two core components. The perception module employs multiple independent detectors, each optimized for Maximum Detection Precision (MaxDP) to identify a specific forgery clue. Standardizing their outputs into binary Boolean values allows for flexible computational models. The decision-making module then renders a final judgment: it first evaluates learned combinations of the detected clues through a logical reasoning process, and then aggregates the outcomes of this reasoning with a variable-length OR operation, a mechanism that enables seamless incremental learning of new forgery clues without retraining the entire system. Our results validate the effectiveness of the multi-clue detection perspective, demonstrating the framework's potential for enhanced explainability and practical adaptability to new threats.
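The decision mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation; the detector names, thresholds, and clue combinations below are hypothetical stand-ins, shown only to make the "AND over learned clue combinations, variable-length OR over combinations" structure concrete.

```python
# Illustrative sketch (hypothetical detectors and combinations, not the
# paper's actual system): binary clue detectors feed learned AND-combinations,
# which are aggregated by a variable-length OR to yield the final verdict.
from typing import Callable, List, Set

# Each perception-module detector maps a sample (here, a dict of
# precomputed feature scores) to a Boolean clue. Names are invented.
def clue_vocoder(x: dict) -> bool:
    return x.get("vocoder_artifact", 0.0) > 0.5

def clue_prosody(x: dict) -> bool:
    return x.get("prosody_anomaly", 0.0) > 0.5

def clue_silence(x: dict) -> bool:
    return x.get("silence_pattern", 0.0) > 0.5

DETECTORS: List[Callable[[dict], bool]] = [clue_vocoder, clue_prosody, clue_silence]

# Learned combinations: each set lists detector indices that must all
# fire (logical AND) for that combination to indicate a forgery.
COMBINATIONS: List[Set[int]] = [{0}, {1, 2}]

def is_fake(x: dict, combinations: List[Set[int]] = COMBINATIONS) -> bool:
    clues = [d(x) for d in DETECTORS]
    # Variable-length OR: any satisfied combination flags the sample.
    # Appending a new detector and a new combination extends the system
    # without retraining the existing clues.
    return any(all(clues[i] for i in combo) for combo in combinations)

print(is_fake({"vocoder_artifact": 0.9}))                          # True
print(is_fake({"prosody_anomaly": 0.8, "silence_pattern": 0.7}))   # True
print(is_fake({"prosody_anomaly": 0.8}))                           # False (AND unmet)
```

Note how incremental learning falls out of the aggregation: because the final OR is variable-length, adding a clue only appends entries to `DETECTORS` and `COMBINATIONS` and leaves existing components untouched.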
Data availability
The fake speech data in this study is open source. ASVspoof2019 LA can be accessed at https://datashare.ed.ac.uk/handle/10283/3336. ASVspoof2021 LA can be accessed at https://zenodo.org/record/4837263. CFAD can be accessed at https://zenodo.org/records/8122764.
References
Kaur, N. & Singh, P. Conventional and contemporary approaches used in text to speech synthesis: A review. Artif. Intell. Rev. 56, 5837–5880 (2023).
Walczyna, T. & Piotrowski, Z. Overview of voice conversion methods based on deep learning. Appl. Sci. 13, 3100 (2023).
Shah, A. J. & Patil, H. A. Significance of lower frequency regions for audio deepfake detection. In Proceedings of 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 1–6 (IEEE, 2024).
Kirchhübel, C. & Brown, G. Spoofed speech from the perspective of a forensic phonetician. In Proceedings of the 2022 Interspeech. 1308–1312 (Incheon, 2022).
Sun, C., Jia, S., Hou, S. & Lyu, S. Ai-synthesized voice detection using neural vocoder artifacts. In Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 904–912 (IEEE, 2023).
Yang, J., Das, R. K. & Li, H. Significance of subband features for synthetic speech detection. IEEE Trans. Inf. For. Secur. 15, 2160–2170 (2020).
Zhang, Y., Wang, W. & Zhang, P. The effect of silence and dual-band fusion in anti-spoofing system. In Proceedings of 2021 Interspeech. 4279–4283 (International Speech Communication Association, 2021).
Todisco, M., Delgado, H. & Evans, N. W. A new feature for automatic speaker verification anti-spoofing: Constant q cepstral coefficients. In Proceedings of the 9th Speaker and Language Recognition Workshop Odyssey. Vol. 2016. 283–290. 10.21437/Odyssey.2016-4 (Bilbao, 2016).
Wu, Z., Das, R. K., Yang, J. & Li, H. Light convolutional neural network with feature genuinization for detection of synthetic speech attacks. In Proceedings of 2020 Interspeech. 1101–1105 (International Speech Communication Association, 2020).
Wang, C. et al. Detection of cross-dataset fake audio based on prosodic and pronunciation features. In Proceedings of 2023 Interspeech. 3844–3848 (International Speech Communication Association, 2023).
Tak, H., Jung, J.-w., Patino, J., Todisco, M. & Evans, N. Graph attention networks for anti-spoofing. In Proceedings of the 22nd Interspeech Conference. 2356–2360. 10.21437/Interspeech.2021-993 (ISCA, 2021).
Tak, H. et al. End-to-end anti-spoofing with rawnet2. In Proceedings of the 46th IEEE International Conference on Acoustics, Speech, and Signal Processing. 6369–6373. 10.1109/ICASSP39728.2021.9414234 (IEEE, 2021).
Chen, Y. et al. Rawbmamba: End-to-end bidirectional state space model for audio deepfake detection. In Proceedings of the 2024 Interspeech. 2720–2724 (International Speech Communication Association, 2024).
Chettri, B. et al. Ensemble models for spoofing detection in automatic speaker verification. In Proceedings of the 20th Interspeech Conference. 1018–1022. 10.21437/Interspeech.2019-2505 (ISCA, 2019).
Lavrentyeva, G. et al. STC antispoofing systems for the ASVspoof2019 challenge. In Proceedings of the 20th Interspeech Conference. 1033–1037. 10.21437/Interspeech.2019-1768 (ISCA, 2019).
Tak, H., Patino, J., Nautsch, A., Evans, N. & Todisco, M. Spoofing attack detection using the non-linear fusion of sub-band classifiers. In Proceedings of the 21st Interspeech Conference. 1106–1110. 10.21437/Interspeech.2020-1844 (ISCA, 2020).
Li, M., Ahmadiadli, Y. & Zhang, X.-P. A survey on speech deepfake detection. ACM Comput. Surv. 57, 1–38 (2025).
Dhamyal, H., Ali, A., Qazi, I. A. & Raza, A. A. Using self attention dnns to discover phonemic features for audio deep fake detection. In Proceedings of 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 1178–1184 (IEEE, 2021).
Li, K., Lu, X., Akagi, M. & Unoki, M. Contributions of jitter and shimmer in the voice for fake audio detection. IEEE Access 11, 84689–84698 (2023).
Parise, C. V. & Ernst, M. O. Correlation detection as a general mechanism for multisensory integration. Nat. Commun. 7, 11543 (2016).
Pesnot Lerousseau, J., Parise, C. V., Ernst, M. O. & van Wassenhove, V. Multisensory correlation computations in the human brain identified by a time-resolved encoding model. Nat. Commun. 13, 2489 (2022).
Rohlf, S., Li, L., Bruns, P. & Röder, B. Multisensory integration develops prior to crossmodal recalibration. Curr. Biol. 30, 1726–1732 (2020).
Wang, X. et al. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Comput. Speech Lang. 64, 101114. https://doi.org/10.1016/j.csl.2020.101114 (2020).
Liu, X. et al. Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 2507–2522 (2023).
Ma, H. et al. Cfad: A Chinese dataset for fake audio detection. Speech Commun. 164, 103122 (2024).
Jung, J.-W. et al. Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks. In Proceedings of the 47th IEEE International Conference on Acoustics, Speech, and Signal Processing. 6367–6371. 10.1109/ICASSP43922.2022.9747766 (IEEE, 2022).
Tak, H. et al. Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation. In Proceedings of the 12th Speaker and Language Recognition Workshop Odyssey. 112–119 (2022).
Liu, X. et al. Leveraging positional-related local-global dependency for synthetic speech detection. In Proceedings of the 2023 International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 1–5 (IEEE, 2023).
Wang, S. et al. Memristor-based adaptive neuromorphic perception in unstructured environments. Nat. Commun. 15, 4671 (2024).
Yu, F. et al. Brain-inspired multimodal hybrid neural network for robot place recognition. Sci. Robot. 8, eabm6996 (2023).
Lin, X. et al. A brain-inspired computational model for spatio-temporal information processing. Neural Netw. 143, 74–87 (2021).
Grinberg, P., Kumar, A., Koppisetti, S. & Bharaj, G. What does an audio deepfake detector focus on? a study in the time domain. In ICASSP 2025–2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1–5 (IEEE, 2025).
Jung, H. & Oh, Y. Towards better explanations of class activation mapping. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1336–1344 (2021).
Ravanelli, M. & Bengio, Y. Speaker recognition from raw waveform with SincNet. In Proceedings of 2018 IEEE Spoken Language Technology Workshop. 1021–1028 (IEEE, 2018).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. 770–778 (IEEE, 2016).
Babu, A. et al. Xls-r: Self-supervised cross-lingual speech representation learning at scale. In Proceedings of the 23rd Interspeech Conference. 2278–2282. 10.21437/Interspeech.2022-143 (ISCA, 2022).
Acknowledgements
We thank the Beijing Municipal Science and Technology Commission for funding and supporting Project Z221100001222005 under the Beijing Science and Technology Plan, and the Tianshan Talents Cultivation Program - Leading Talents for Scientific and Technological Innovation (No. 2024TSYCLJ0002) for its support.
Funding
This work was funded by Beijing Science and Technology Financial Innovation Support Project (Z221100001222005).
Author information
Authors and Affiliations
Contributions
Chang Feng analyzed the data, performed the experiments and wrote the initial manuscript. Xiaolong Wu edited the figures, provided feedback and revised the manuscript. Hamdulla Askar and Mingxing Xu supervised the research. Lihong Cao reviewed and edited the final manuscript. Thomas Fang Zheng conceived the project, designed the study, reviewed and edited the final manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
See Table 3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Feng, C., Wu, X., Askar, H. et al. Brain-inspired perception-decision machine for fake speech detection. Sci Rep (2026). https://doi.org/10.1038/s41598-026-41859-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-41859-8