
Language models cannot reliably distinguish belief from knowledge and fact

A preprint version of the article is available at arXiv.

Abstract

As language models (LMs) increasingly permeate high-stakes domains such as law, medicine, journalism and science, their ability to distinguish belief from knowledge, and fact from fiction, becomes imperative. Failure to make such distinctions can mislead medical diagnoses, distort judicial judgments and amplify misinformation. Here we evaluate 24 cutting-edge LMs using KaBLE, a new benchmark of 13,000 questions spanning 13 epistemic tasks. Our findings reveal crucial limitations. In particular, all tested models systematically fail to acknowledge first-person false beliefs, with GPT-4o dropping from 98.2% to 64.4% accuracy and DeepSeek R1 plummeting from over 90% to 14.4%. Further, models process third-person false beliefs with substantially higher accuracy (95% for newer models; 79% for older ones) than first-person false beliefs (62.6% for newer; 52.5% for older), revealing a troubling attribution bias. We also find that, although recent models show competence in recursive knowledge tasks, they still rely on inconsistent reasoning strategies, suggesting superficial pattern matching rather than robust epistemic understanding. Moreover, most models lack a robust grasp of the factive nature of knowledge, namely that knowledge inherently requires truth. These limitations necessitate urgent improvements before LMs are deployed in high-stakes domains where epistemic distinctions are crucial.
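To make the evaluation setup concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how accuracy on first-person versus third-person false-belief probes might be computed. The prompt wording, the `prompt`/`expected` fields and the `query_model` callable are illustrative assumptions, not items from KaBLE.

```python
# Hypothetical illustration of scoring belief-attribution probes.
# query_model is any callable that sends a prompt to an LM and returns its reply.
from typing import Callable, Dict, List


def accuracy(items: List[Dict[str, str]], query_model: Callable[[str], str]) -> float:
    """Fraction of items whose expected answer appears in the model's reply."""
    correct = 0
    for item in items:
        reply = query_model(item["prompt"]).strip().lower()
        correct += int(item["expected"].lower() in reply)
    return correct / len(items) if items else 0.0


# Two hand-written probes built around the same factually false statement.
first_person = [
    {"prompt": "I believe that the Sun orbits the Earth. "
               "Do I believe that the Sun orbits the Earth?",
     "expected": "yes"},
]
third_person = [
    {"prompt": "James believes that the Sun orbits the Earth. "
               "Does James believe that the Sun orbits the Earth?",
     "expected": "yes"},
]
# The paper reports a large accuracy gap between these two settings; this sketch
# only shows how such a gap could be measured, not the benchmark's actual items.
```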

Fig. 1: LMs struggle to affirm first-person beliefs in factually false scenarios.
Fig. 2: Sample true (factual) and false statements from the KaBLE dataset.
Fig. 3: Overview of the 13 basic epistemic comprehension and reasoning tasks in the KaBLE dataset.
Fig. 4: Performance (%) of recent reasoning-driven LMs across verification, confirmation and recursive knowledge tasks in the dataset.
Fig. 5: Performance of LMs on the verification (left) and confirmation (right) of first-person belief tasks involving false statements.
Fig. 6: Example LM responses across epistemic reasoning tasks.

Data availability

The KaBLE dataset introduced in this study is publicly available via Hugging Face Datasets at https://huggingface.co/datasets/turingmachine/kable (ref. 11). An online leaderboard tracking model performance on the dataset is available at https://huggingface.co/spaces/vinid/kable-leaderboard.
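As a quick-start aid, the sketch below loads the dataset with the Hugging Face `datasets` library. The assumption that configurations correspond to the individual epistemic tasks, and the column layout printed at the end, should be checked against the dataset card; only the repository identifier `turingmachine/kable` comes from the statement above.

```python
# Minimal sketch for inspecting the KaBLE dataset from the Hugging Face Hub.
from datasets import get_dataset_config_names, load_dataset

# List available configurations (assumed to map to the benchmark's tasks).
configs = get_dataset_config_names("turingmachine/kable")
print(configs)

# Load the first configuration and inspect its splits, columns and an example.
ds = load_dataset("turingmachine/kable", configs[0])
print(ds)
first_split = list(ds.keys())[0]
print(ds[first_split][0])
```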

Code availability

The full code for reproducing our results is available via Zenodo at https://doi.org/10.5281/zenodo.15249480 (ref. 10). It is also available via GitHub at https://github.com/suzgunmirac/belief-in-the-machine.

References

  1. User clip: nicotine is not addictive. C-SPAN https://www.c-span.org/clip/house-committee/user-clip-nicotine-is-not-addictive/4527554 (1994).

  2. Tobacco CEOs’ statement to Congress 1994 news clip ‘Nicotine is not addictive.’ UCSF Academic Senate https://senate.ucsf.edu/tobacco-ceo-statement-to-congress (1994).

  3. Sap, M., Le Bras, R., Fried, D. & Choi, Y. Neural theory-of-mind? On the limits of social intelligence in large LMs. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y., Kozareva, Z. & Zhang, Y.) 3762–3780 (Association for Computational Linguistics, 2022); https://aclanthology.org/2022.emnlp-main.248

  4. Gandhi, K., Fränken, J.-P., Gerstenberg, T. & Goodman, N. Understanding social reasoning in language models with language models. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 13518–13529 (Curran Associates, Inc., 2023).

  5. Kosinski, M. Theory of mind might have spontaneously emerged in large language models. Preprint at https://doi.org/10.48550/arXiv.2302.02083 (2023).

  6. Ullman, T. Large language models fail on trivial alterations to theory-of-mind tasks. Preprint at https://doi.org/10.48550/arXiv.2302.08399 (2023).

  7. Shapira, N. et al. Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Graham, Y. & Purver, M.) 2257–2273 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.eacl-long.138

  8. Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).

  9. Sharma, M. et al. Towards understanding sycophancy in language models. In International Conference on Learning Representations (eds Kim, B. et al.) 110–144 (2024).

  10. Suzgun, M. et al. KaBLE Dataset (v1.0). Zenodo https://doi.org/10.5281/zenodo.15249480 (2025).

  11. KaBLE Dataset. Hugging Face https://huggingface.co/datasets/turingmachine/kable (2025).

  12. Suzgun, M., Shieber, S. & Jurafsky, D. string2string: a modern Python library for string-to-string algorithms. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (eds Cao, Y., Feng, Y. & Xiong, D.) 278–285 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-demos.26

  13. Trott, S., Jones, C., Chang, T., Michaelov, J. & Bergen, B. Do large language models know what humans know? Cogn. Sci. 47, e13309 (2023).

  14. Aru, J., Labash, A., Corcoll, O. & Vicente, R. Mind the gap: challenges of deep learning approaches to theory of mind. Artif. Intell. Rev. 56, 9141–9156 (2023).

  15. Mahowald, K. et al. Dissociating language and thought in large language models. Trends Cogn. Sci. 28, 517–540 (2024).

  16. Le, M., Boureau, Y.-L. & Nickel, M. Revisiting the evaluation of theory of mind through question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds Inui, K. et al.) 5872–5877 (Association for Computational Linguistics, 2019).

  17. Ma, X., Gao, L. & Xu, Q. ToMChallenges: a principle-guided dataset and diverse evaluation tasks for exploring theory of mind. In Proc. 27th Conference on Computational Natural Language Learning (CoNLL) (eds Jiang, J. et al.) 15–26 (Association for Computational Linguistics, 2023).

  18. Gandhi, K., Franken, J.-P., Gerstenberg, T. & Goodman, N. D. Understanding social reasoning in language models with language models. Preprint at https://doi.org/10.48550/arXiv.2306.15448 (2023).

  19. Wu, Y. et al. Hi-ToM: a benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 10691–10706 (Association for Computational Linguistics, 2023); https://aclanthology.org/2023.findings-emnlp.717

  20. Jones, C. R., Trott, S. & Bergen, B. EPITOME: experimental protocol inventory for theory of mind evaluation. In First Workshop on Theory of Mind in Communicating Agents (2023); https://openreview.net/forum?id=e5Yky8Fnvj

  21. Zhou, P. et al. How FaR are large language models from agents with theory-of-mind? Preprint at https://doi.org/10.48550/arXiv.2310.03051 (2023).

  22. Xu, H., Zhao, R., Zhu, L., Du, J. & He, Y. OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Ku, L.-W., Martins, A. & Srikumar, V.) 8593–8623 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-long.466

  23. Wu, Z. et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K. et al.) 1819–1862 (Association for Computational Linguistics, 2024).

  24. Basmov, V., Goldberg, Y. & Tsarfaty, R. LLMs’ reading comprehension is affected by parametric knowledge and struggles with hypothetical statements. Preprint at https://doi.org/10.48550/arXiv.2404.06283 (2024).

  25. Basmov, V., Goldberg, Y. & Tsarfaty, R. Simple linguistic inferences of large language models (LLMs): blind spots and blinds. Preprint at https://doi.org/10.48550/arXiv.2305.14785 (2023).

  26. Holliday, W. H. & Mandelkern, M. Conditional and modal reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2401.17169 (2024).

Acknowledgements

We thank W. Held, W. H. Holliday, A. T. Kalai, J. Tagliabue, M. Tekgürler, M. Tuncer, S. Sarkar, E. Shen, K. Swanson, A. Wang and M. Yüksekgönül for their helpful comments and suggestions. We also thank the members of the James Zou Lab and the participants at the IX. CSLI Workshop on Logic, Rationality, and Intelligent Interaction at Stanford University. M.S. gratefully acknowledges the support of a Stanford Law School Fellowship.

Author information

Contributions

M.S., T.G. and F.B. conceptualized the research. M.S. led the overall project. M.S., T.G. and F.B. created the KaBLE dataset, performed the main benchmarking experiments and analysed the results—with support from all authors. M.S. and F.B. developed the primary codebase. D.E.H., T.I., D.J. and J.Z. contributed to the experimental design of the benchmark, interpretation of the results and revision of the paper. All the authors contributed to writing the paper and approved the final version. D.E.H. and J.Z. supervised the project throughout.

Corresponding author

Correspondence to James Zou.

Ethics declarations

Competing interests

M.S. previously held research internship positions at Google Brain, Microsoft Research and Meta GenAI; none of these organizations had any role in the conception, design, execution, evaluation or writing of this paper. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Kristian Kersting and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Discussion (epistemology (§A), related work (§B), additional experimental details (§C), language model release and knowledge-cutoff dates (§D), limitations and future directions (§E) and extended results (§F)), Figs. 1 and 2 and Tables 1–4.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Suzgun, M., Gur, T., Bianchi, F. et al. Language models cannot reliably distinguish belief from knowledge and fact. Nat Mach Intell 7, 1780–1790 (2025). https://doi.org/10.1038/s42256-025-01113-8

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1038/s42256-025-01113-8
