
Language models cannot reliably distinguish belief from knowledge and fact

A preprint version of the article is available at arXiv.

Abstract

As language models (LMs) increasingly permeate high-stakes domains such as law, medicine, journalism and science, their ability to distinguish belief from knowledge, and fact from fiction, becomes imperative. Failure to make such distinctions can mislead medical diagnoses, distort judicial judgments and amplify misinformation. Here we evaluate 24 cutting-edge LMs using KaBLE, a new benchmark of 13,000 questions spanning 13 epistemic tasks. Our findings reveal crucial limitations. In particular, all tested models systematically fail to acknowledge first-person false beliefs, with GPT-4o dropping from 98.2% to 64.4% accuracy and DeepSeek R1 plummeting from over 90% to 14.4%. Further, models process third-person false beliefs with substantially higher accuracy (95% for newer models; 79% for older ones) than first-person false beliefs (62.6% for newer; 52.5% for older), revealing a troubling attribution bias. We also find that, although recent models show competence in recursive knowledge tasks, they still rely on inconsistent reasoning strategies, suggesting superficial pattern matching rather than robust epistemic understanding. Moreover, most models lack a robust grasp of the factive nature of knowledge, namely that knowledge inherently requires truth. These limitations necessitate urgent improvements before LMs are deployed in high-stakes domains where epistemic distinctions are crucial.
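To make the evaluation setup concrete, the following is a minimal, hypothetical sketch (not the authors' code) of how accuracy on first-person versus third-person false-belief probes might be computed. The prompt wording, the `prompt`/`expected` fields and the `query_model` callable are illustrative assumptions, not items from KaBLE.

```python
# Hypothetical illustration of scoring belief-attribution probes.
# query_model is any callable that sends a prompt to an LM and returns its reply.
from typing import Callable, Dict, List


def accuracy(items: List[Dict[str, str]], query_model: Callable[[str], str]) -> float:
    """Fraction of items whose expected answer appears in the model's reply."""
    correct = 0
    for item in items:
        reply = query_model(item["prompt"]).strip().lower()
        correct += int(item["expected"].lower() in reply)
    return correct / len(items) if items else 0.0


# Two hand-written probes built around the same factually false statement.
first_person = [
    {"prompt": "I believe that the Sun orbits the Earth. "
               "Do I believe that the Sun orbits the Earth?",
     "expected": "yes"},
]
third_person = [
    {"prompt": "James believes that the Sun orbits the Earth. "
               "Does James believe that the Sun orbits the Earth?",
     "expected": "yes"},
]
# The paper reports a large accuracy gap between these two settings; this sketch
# only shows how such a gap could be measured, not the benchmark's actual items.
```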

Fig. 1: LMs struggle to affirm first-person beliefs in factually false scenarios.
Fig. 2: Sample true (factual) and false statements from the KaBLE dataset.
Fig. 3: Overview of the 13 basic epistemic comprehension and reasoning tasks in the KaBLE dataset.
Fig. 4: Performance (%) of recent reasoning-driven LMs across verification, confirmation and recursive knowledge tasks in the dataset.
Fig. 5: Performance of LMs on the verification (left) and confirmation (right) of first-person belief tasks involving false statements.
Fig. 6: Example LM responses across epistemic reasoning tasks.

Data availability

The KaBLE dataset introduced in this study is publicly available via Hugging Face Datasets at https://huggingface.co/datasets/turingmachine/kable (ref. 11). An online leaderboard tracking model performance on the dataset is available at https://huggingface.co/spaces/vinid/kable-leaderboard.
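As a quick-start aid, the sketch below loads the dataset with the Hugging Face `datasets` library. The assumption that configurations correspond to the individual epistemic tasks, and the column layout printed at the end, should be checked against the dataset card; only the repository identifier `turingmachine/kable` comes from the statement above.

```python
# Minimal sketch for inspecting the KaBLE dataset from the Hugging Face Hub.
from datasets import get_dataset_config_names, load_dataset

# List available configurations (assumed to map to the benchmark's tasks).
configs = get_dataset_config_names("turingmachine/kable")
print(configs)

# Load the first configuration and inspect its splits, columns and an example.
ds = load_dataset("turingmachine/kable", configs[0])
print(ds)
first_split = list(ds.keys())[0]
print(ds[first_split][0])
```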

Code availability

The full code for reproducing our results is available via Zenodo at https://doi.org/10.5281/zenodo.15249480 (ref. 10). It is also available via GitHub at https://github.com/suzgunmirac/belief-in-the-machine.

References

  1. User clip: nicotine is not addictive. C-SPAN https://www.c-span.org/clip/house-committee/user-clip-nicotine-is-not-addictive/4527554 (1994).

  2. Tobacco CEOs’ statement to Congress 1994 news clip ‘Nicotine is not addictive.’ UCSF Academic Senate https://senate.ucsf.edu/tobacco-ceo-statement-to-congress (1994).

  3. Sap, M., Le Bras, R., Fried, D. & Choi, Y. Neural theory-of-mind? On the limits of social intelligence in large LMs. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y., Kozareva, Z. & Zhang, Y.) 3762–3780 (Association for Computational Linguistics, 2022); https://aclanthology.org/2022.emnlp-main.248

  4. Gandhi, K., Fränken, J.-P., Gerstenberg, T. & Goodman, N. Understanding social reasoning in language models with language models. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 13518–13529 (Curran Associates, Inc., 2023).

  5. Kosinski, M. Theory of mind might have spontaneously emerged in large language models. Preprint at https://doi.org/10.48550/arXiv.2302.02083 (2023).

  6. Ullman, T. Large language models fail on trivial alterations to theory-of-mind tasks. Preprint at https://doi.org/10.48550/arXiv.2302.08399 (2023).

  7. Shapira, N. et al. Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Graham, Y. & Purver, M.) 2257–2273 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.eacl-long.138

  8. Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://doi.org/10.48550/arXiv.2303.12712 (2023).

  9. Sharma, M. et al. Towards understanding sycophancy in language models. In International Conference on Learning Representations (eds Kim, B. et al.) 110–144 (2024).

  10. Suzgun, M. et al. KaBLE Dataset (v1.0). Zenodo https://doi.org/10.5281/zenodo.15249480 (2025).

  11. KaBLE Dataset. Hugging Face https://huggingface.co/datasets/turingmachine/kable (2025).

  12. Suzgun, M., Shieber, S. & Jurafsky, D. string2string: a modern Python library for string-to-string algorithms. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) (eds Cao, Y., Feng, Y. & Xiong, D.) 278–285 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-demos.26

  13. Trott, S., Jones, C., Chang, T., Michaelov, J. & Bergen, B. Do large language models know what humans know? Cogn. Sci. 47, e13309 (2023).

  14. Aru, J., Labash, A., Corcoll, O. & Vicente, R. Mind the gap: challenges of deep learning approaches to theory of mind. Artif. Intell. Rev. 56, 9141–9156 (2023).

  15. Mahowald, K. et al. Dissociating language and thought in large language models. Trends Cogn. Sci. 28, 517–540 (2024).

  16. Le, M., Boureau, Y.-L. & Nickel, M. Revisiting the evaluation of theory of mind through question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds Inui, K. et al.) 5872–5877 (Association for Computational Linguistics, 2019).

  17. Ma, X., Gao, L. & Xu, Q. ToMChallenges: a principle-guided dataset and diverse evaluation tasks for exploring theory of mind. In Proc. 27th Conference on Computational Natural Language Learning (CoNLL) (eds Jiang, J. et al.) 15–26 (Association for Computational Linguistics, 2023).

  18. Gandhi, K., Franken, J.-P., Gerstenberg, T. & Goodman, N. D. Understanding social reasoning in language models with language models. Preprint at https://doi.org/10.48550/arXiv.2306.15448 (2023).

  19. Wu, Y. et al. Hi-ToM: a benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 10691–10706 (Association for Computational Linguistics, 2023); https://aclanthology.org/2023.findings-emnlp.717

  20. Jones, C. R., Trott, S. & Bergen, B. EPITOME: experimental protocol inventory for theory of mind evaluation. In First Workshop on Theory of Mind in Communicating Agents (2023); https://openreview.net/forum?id=e5Yky8Fnvj

  21. Zhou, P. et al. How FaR are large language models from agents with theory-of-mind? Preprint at https://doi.org/10.48550/arXiv.2310.03051 (2023).

  22. Xu, H., Zhao, R., Zhu, L., Du, J. & He, Y. OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Ku, L.-W., Martins, A. & Srikumar, V.) 8593–8623 (Association for Computational Linguistics, 2024); https://aclanthology.org/2024.acl-long.466

  23. Wu, Z. et al. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (eds Duh, K. et al.) 1819–1862 (Association for Computational Linguistics, 2024).

  24. Basmov, V., Goldberg, Y. & Tsarfaty, R. LLMs’ reading comprehension is affected by parametric knowledge and struggles with hypothetical statements. Preprint at https://doi.org/10.48550/arXiv.2404.06283 (2024).

  25. Basmov, V., Goldberg, Y. & Tsarfaty, R. Simple linguistic inferences of large language models (LLMs): blind spots and blinds. Preprint at https://doi.org/10.48550/arXiv.2305.14785 (2023).

  26. Holliday, W. H. & Mandelkern, M. Conditional and modal reasoning in large language models. Preprint at https://doi.org/10.48550/arXiv.2401.17169 (2024).

Acknowledgements

We thank W. Held, W. H. Holliday, A. T. Kalai, J. Tagliabue, M. Tekgürler, M. Tuncer, S. Sarkar, E. Shen, K. Swanson, A. Wang and M. Yüksekgönül for their helpful comments and suggestions. We also thank the members of the James Zou Lab and the participants at the IX. CSLI Workshop on Logic, Rationality, and Intelligent Interaction at Stanford University. M.S. gratefully acknowledges the support of a Stanford Law School Fellowship.

Author information

Contributions

M.S., T.G. and F.B. conceptualized the research. M.S. led the overall project. M.S., T.G. and F.B. created the KaBLE dataset, performed the main benchmarking experiments and analysed the results—with support from all authors. M.S. and F.B. developed the primary codebase. D.E.H., T.I., D.J. and J.Z. contributed to the experimental design of the benchmark, interpretation of the results and revision of the paper. All the authors contributed to writing the paper and approved the final version. D.E.H. and J.Z. supervised the project throughout.

Corresponding author

Correspondence to James Zou.

Ethics declarations

Competing interests

M.S. previously held research internship positions at Google Brain, Microsoft Research and Meta GenAI; none of these organizations had any role in the conception, design, execution, evaluation or writing of this paper. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Kristian Kersting and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Discussion (epistemology (§A), related work (§B), additional experimental details (§C), language model release and knowledge-cutoff dates (§D), limitations and future directions (§E) and extended results (§F)), Figs. 1 and 2 and Tables 1–4.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Suzgun, M., Gur, T., Bianchi, F. et al. Language models cannot reliably distinguish belief from knowledge and fact. Nat Mach Intell 7, 1780–1790 (2025). https://doi.org/10.1038/s42256-025-01113-8

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1038/s42256-025-01113-8
