Language models have become an essential part of the burgeoning field of artificial intelligence (AI) psychology. I discuss 14 methodological considerations that can help researchers design more robust, generalizable studies of the cognitive abilities of language-based AI systems and interpret the results of these studies accurately.
References
Frank, M. C. Nat. Rev. Psychol. 2, 451–452 (2023).
Mitchell, M. Science 381, eadj5957 (2023).
McCoy, R. T., Yao, S., Friedman, D., Hardy, M. D. & Griffiths, T. L. Proc. Natl Acad. Sci. USA 121, e2322420121 (2024).
Lampinen, A. K. Preprint at https://doi.org/10.48550/arXiv.2210.15303 (2022).
Atari, M., Xue, M. J., Park, P. S., Blasi, D. & Henrich, J. Preprint at PsyArXiv https://doi.org/10.31234/osf.io/5b26t (2023).
Levesque, H., Davis, E. & Morgenstern, L. The Winograd schema challenge. In Proc. Thirteenth International Conf. on the Principles of Knowledge Representation and Reasoning (eds Brewka, G. et al.) 552–561 (AAAI Press, 2012).
Sakaguchi, K., Bras, R. L., Bhagavatula, C. & Choi, Y. Commun. ACM 64, 99–106 (2021).
Trichelair, P., Emami, A., Trischler, A., Suleman, K. & Cheung, J. C. K. How reasonable are common-sense reasoning tasks: a case-study on the Winograd schema challenge and SWAG. In Proc. 2019 Conf. on Empirical Methods in Natural Language Processing and the 9th International Joint Conf. on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K. et al.) 3382–3387 (ACL, 2019).
Elazar, Y., Zhang, H., Goldberg, Y. & Roth, D. Back to square one: artifact detection, training and commonsense disentanglement in the Winograd schema. In Proc. 2021 Conf. on Empirical Methods in Natural Language Processing (eds Moens, M.-F. et al.) 10486–10500 (ACL, 2021).
Kocijan, V., Davis, E., Lukasiewicz, T., Marcus, G. & Morgenstern, L. Artif. Intell. 325, 103971 (2023).
Acknowledgements
I gratefully acknowledge funding support from the University System of Georgia. I thank A. Sathe, B. Lipkin, C. Kauf, G. Tuckute and K. Mahowald for their constructive comments.
Ethics declarations
Competing interests
The author declares no competing interests.
Peer review
Peer review information
Nature Human Behaviour thanks Michael Frank, Adele Goldberg and Yoshua Bengio for their contribution to the peer review of this work.
Cite this article
Ivanova, A.A. How to evaluate the cognitive abilities of LLMs. Nat Hum Behav 9, 230–233 (2025). https://doi.org/10.1038/s41562-024-02096-z
DOI: https://doi.org/10.1038/s41562-024-02096-z