Abstract
To promote the scientific use of large language models (LLMs), we suggest that researchers in the political and social sciences refocus on the scientific goal of inference. We argue that this refocus will improve the accumulation of shared scientific knowledge about these tools and their uses in the social sciences. We discuss the challenges and opportunities of scientific inference with LLMs, using validation of model output as an illustrative case. We then propose a set of guidelines for establishing the failure and success of LLMs when completing particular tasks, and discuss how to make inferences from these observations.
Author information
Contributions
All authors contributed to the writing and editing of this paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Argyle, L.P., Busby, E.C., Gubler, J.R. et al. Arti-‘fickle’ intelligence: using LLMs as a tool for inference in the political and social sciences. Nat Comput Sci 5, 737–744 (2025). https://doi.org/10.1038/s43588-025-00843-4