Abstract
As language models (LMs) increasingly permeate high-stakes domains such as law, medicine, journalism and science, their ability to distinguish belief from knowledge, and fact from fiction, becomes imperative. Failure to make such distinctions can mislead medical diagnoses, distort judicial judgments and amplify misinformation. Here we evaluate 24 cutting-edge LMs on KaBLE, a new benchmark of 13,000 questions spanning 13 epistemic tasks. Our findings reveal crucial limitations. In particular, all tested models systematically fail to acknowledge first-person false beliefs, with GPT-4o dropping from 98.2% to 64.4% accuracy and DeepSeek R1 plummeting from over 90% to 14.4%. Further, models process third-person false beliefs with substantially higher accuracy (95% for newer models; 79% for older ones) than first-person false beliefs (62.6% for newer; 52.5% for older), revealing a troubling attribution bias. We also find that, although recent models show competence in recursive knowledge tasks, they still rely on inconsistent reasoning strategies, suggesting superficial pattern matching rather than robust epistemic understanding. Most models also lack a robust understanding of the factive nature of knowledge, namely that knowledge inherently requires truth. These limitations call for urgent improvements before LMs are deployed in high-stakes domains where epistemic distinctions are crucial.
Data availability
The KaBLE dataset introduced in this study is publicly available via Hugging Face Datasets at https://huggingface.co/datasets/turingmachine/kable (ref. 11). An online leaderboard tracking model performance on the dataset is available at https://huggingface.co/spaces/vinid/kable-leaderboard.
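For convenience, the dataset can also be loaded programmatically from the Hugging Face Hub. The following is a minimal sketch (not part of the released codebase) using the datasets library; it reads the split and column names from the loaded object rather than assuming a particular schema, and the exact configuration should be checked against the dataset card.

    # Minimal sketch: load the public KaBLE dataset from the Hugging Face Hub.
    # The call below assumes the default configuration; consult the dataset card
    # at https://huggingface.co/datasets/turingmachine/kable for the full schema.
    from datasets import load_dataset

    kable = load_dataset("turingmachine/kable")  # returns a DatasetDict of splits

    for split_name, split in kable.items():
        # Inspect each split's size and column names instead of assuming them.
        print(split_name, len(split), split.column_names)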
Code availability
The full code for reproducing our results is available via Zenodo at https://doi.org/10.5281/zenodo.15249480 (ref. 10). It is also available via GitHub at https://github.com/suzgunmirac/belief-in-the-machine.
References
User clip: nicotine is not addictive. C-SPAN https://www.c-span.org/clip/house-committee/user-clip-nicotine-is-not-addictive/4527554 (1994).
Tobacco CEOs’ statement to Congress 1994 news clip ‘Nicotine is not addictive.’ UCSF Academic Senate https://senate.ucsf.edu/tobacco-ceo-statement-to-congress (1994).
Sap, M., Le Bras, R., Fried, D. & Choi, Y. Neural theory-of-mind? On the limits of social intelligence in large LMs. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y., Kozareva, Z. & Zhang, Y.) 3762–3780 (Association for Computational Linguistics, 2022); https://aclanthology.org/2022.emnlp-main.248
Gandhi, K., Fränken, J.-P., Gerstenberg, T. & Goodman, N. Understanding social reasoning in language models with language models. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 13518–13529 (Curran Associates, Inc., 2023).
Kosinski, M. Theory of mind might have spontaneously emerged in large language models. Preprint at https://doi.org/10.48550/arXiv.2302.02083 (2023).
Ullman, T. Large language models fail on trivial alterations to theory-of-mind tasks. Preprint at https://doi.org/10.48550/arXiv.2302.08399 (2023).
Shapira, N. et al. Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Graham, Y. & Purver, M.) 2257–2273 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.eacl-long.138
Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).
Sharma, M. et al. Towards understanding sycophancy in language models. In International Conference on Learning Representations (eds Kim, B. et al.) 110–144 (2024).
Suzgun, M. et al. KaBLE Dataset (v1.0). Zenodo https://doi.org/10.5281/zenodo.15249480 (2025).
KaBLE Dataset. Hugging Face https://huggingface.co/datasets/turingmachine/kable (2025).
Suzgun, M., Shieber, S. & Jurafsky, D. string2string: a modern Python library for string-to-string algorithms. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (eds Cao, Y., Feng, Y. & Xiong, D.) 278–285 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-demos.26
Trott, S., Jones, C., Chang, T., Michaelov, J. & Bergen, B. Do large language models know what humans know? Cogn. Sci. 47, e13309 (2023).
Aru, J., Labash, A., Corcoll, O. & Vicente, R. Mind the gap: challenges of deep learning approaches to theory of mind. Artif. Intell. Rev. 56, 9141–9156 (2023).
Mahowald, K. et al. Dissociating language and thought in large language models. Trends Cogn. Sci. 28, 517–540 (2024).
Le, M., Boureau, Y.-L. & Nickel, M. Revisiting the evaluation of theory of mind through question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds Inui, K. et al.) 5872–5877 (Association for Computational Linguistics, 2019).
Ma, X., Gao, L. & Xu, Q. ToMChallenges: a principle-guided dataset and diverse evaluation tasks for exploring theory of mind. In Proc. 27th Conference on Computational Natural Language Learning (CoNLL) (eds Jiang, J. et al.) 15–26 (Association for Computational Linguistics, 2023).
Gandhi, K., Franken, J.-P., Gerstenberg, T. & Goodman, N. D. Understanding social reasoning in language models with language models. Preprint at https://doi.org/10.48550/arXiv.2306.15448 (2023).
Wu, Y. et al. Hi-ToM: a benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 10691–10706 (Association for Computational Linguistics, 2023); https://aclanthology.org/2023.findings-emnlp.717
Jones, C. R., Trott, S. & Bergen, B. EPITOME: experimental protocol inventory for theory of mind evaluation. In First Workshop on Theory of Mind in Communicating Agents (2023); https://openreview.net/forum?id=e5Yky8Fnvj
Zhou, P. et al. How FaR are large language models from agents with theory-of-mind? Preprint at https://doi.org/10.48550/arXiv.2310.03051 (2023).
Xu, H., Zhao, R., Zhu, L., Du, J. & He, Y. OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Ku, L.-W., Martins, A. & Srikumar, V.) 8593–8623 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-long.466
Wu, Z. et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K. et al.) 1819–1862 (Association for Computational Linguistics, 2024).
Basmov, V., Goldberg, Y. & Tsarfaty, R. LLMs’ reading comprehension is affected by parametric knowledge and struggles with hypothetical statements. Preprint at https://doi.org/10.48550/arXiv.2404.06283 (2024).
Basmov, V., Goldberg, Y. & Tsarfaty, R. Simple linguistic inferences of large language models (LLMs): blind spots and blinds. Preprint at https://doi.org/10.48550/arXiv.2305.14785 (2023).
Holliday, W. H. & Mandelkern, M. Conditional and modal reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2401.17169 (2024).
Acknowledgements
We thank W. Held, W. H. Holliday, A. T. Kalai, J. Tagliabue, M. Tekgürler, M. Tuncer, S. Sarkar, E. Shen, K. Swanson, A. Wang and M. Yüksekgönül for their helpful comments and suggestions. We also thank the members of the James Zou Lab and the participants at the IX. CSLI Workshop on Logic, Rationality, and Intelligent Interaction at Stanford University. M.S. gratefully acknowledges the support of a Stanford Law School Fellowship.
Author information
Authors and Affiliations
Contributions
M.S., T.G. and F.B. conceptualized the research. M.S. led the overall project. M.S., T.G. and F.B. created the KaBLE dataset, performed the main benchmarking experiments and analysed the results—with support from all authors. M.S. and F.B. developed the primary codebase. D.E.H., T.I., D.J. and J.Z. contributed to the experimental design of the benchmark, interpretation of the results and revision of the paper. All the authors contributed to writing the paper and approved the final version. D.E.H. and J.Z. supervised the project throughout.
Corresponding author
Ethics declarations
Competing interests
M.S. previously held research internship positions at Google Brain, Microsoft Research and Meta GenAI; none of these organizations had any role in the conception, design, execution, evaluation or writing of this paper. The other authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Kristian Kersting and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Discussion (epistemology (§A), related work (§B), additional experimental details (§C), language model release and knowledge-cutoff dates (§D), limitations and future directions (§E) and extended results (§F)), Figs. 1 and 2 and Tables 1–4.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Suzgun, M., Gur, T., Bianchi, F. et al. Language models cannot reliably distinguish belief from knowledge and fact. Nat Mach Intell 7, 1780–1790 (2025). https://doi.org/10.1038/s42256-025-01113-8
DOI: https://doi.org/10.1038/s42256-025-01113-8
This article is cited by
- Large language models still struggle with false beliefs. Nature Machine Intelligence (2025).