Abstract
To promote the scientific use of large language models (LLMs), we suggest that researchers in the political and social sciences refocus on the scientific goal of inference. We argue that this refocus will improve the accumulation of shared scientific knowledge about these tools and their uses in the social sciences. We discuss the challenges and opportunities of scientific inference with LLMs, using validation of model output as an illustrative case. We then propose a set of guidelines for establishing the failure and success of LLMs when completing particular tasks, and discuss how to make inferences from these observations.
Author information
Contributions
All authors contributed to the writing and editing of this paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Argyle, L.P., Busby, E.C., Gubler, J.R. et al. Arti-‘fickle’ intelligence: using LLMs as a tool for inference in the political and social sciences. Nat Comput Sci 5, 737–744 (2025). https://doi.org/10.1038/s43588-025-00843-4