  • Comment
How to evaluate the cognitive abilities of LLMs

Language models have become an essential part of the burgeoning field of artificial intelligence (AI) psychology. I discuss 14 methodological considerations that can be used to design more robust, generalizable studies that evaluate the cognitive abilities of language-based AI systems, as well as to accurately interpret the results of these studies.

Acknowledgements

I gratefully acknowledge funding support from the University System of Georgia. I thank A. Sathe, B. Lipkin, C. Kauf, G. Tuckute and K. Mahowald for their constructive comments.

Author information

Corresponding author

Correspondence to Anna A. Ivanova.

Ethics declarations

Competing interests

The author declares no competing interests.

Peer review

Peer review information

Nature Human Behaviour thanks Michael Frank, Adele Goldberg and Yoshua Bengio for their contribution to the peer review of this work.

About this article

Cite this article

Ivanova, A.A. How to evaluate the cognitive abilities of LLMs. Nat Hum Behav 9, 230–233 (2025). https://doi.org/10.1038/s41562-024-02096-z

  • DOI: https://doi.org/10.1038/s41562-024-02096-z
