
  • Perspective

Arti-‘fickle’ intelligence: using LLMs as a tool for inference in the political and social sciences

Abstract

To promote the scientific use of large language models (LLMs), we suggest that researchers in the political and social sciences refocus on the scientific goal of inference. We argue that this refocus will improve the accumulation of shared scientific knowledge about these tools and their uses in the social sciences. We discuss the challenges and opportunities of scientific inference with LLMs, using validation of model output as an illustrative case. We then propose a set of guidelines for establishing the failure and success of LLMs when completing particular tasks, and discuss how to make inferences from these observations.
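To make the validation case concrete, the short sketch below (our illustration, not a procedure described in the article) shows one common check before LLM annotations are used for downstream inference: comparing LLM-generated labels against a human-coded validation subset using raw agreement and chance-corrected agreement (Cohen's kappa). The label lists are hypothetical placeholders.

# Minimal validation sketch (illustrative only, not the authors' method):
# compare LLM-generated annotations with a human-coded validation subset.
from collections import Counter

# Hypothetical labels; in practice these come from a hand-coded validation
# sample and the corresponding LLM outputs for the same documents.
human_labels = ["pro", "anti", "neutral", "pro", "anti", "pro", "neutral", "anti"]
llm_labels   = ["pro", "anti", "pro", "pro", "anti", "neutral", "neutral", "anti"]

n = len(human_labels)
assert n == len(llm_labels)

# Raw agreement: share of documents where the LLM matches the human coder.
observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n

# Expected agreement by chance, from each coder's marginal label frequencies.
human_freq = Counter(human_labels)
llm_freq = Counter(llm_labels)
expected = sum((human_freq[c] / n) * (llm_freq[c] / n)
               for c in set(human_labels) | set(llm_labels))

# Cohen's kappa: observed agreement corrected for chance agreement.
kappa = (observed - expected) / (1 - expected)

print(f"raw agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")

In practice, the human-coded subset would also be used to report class-specific error rates, since aggregate agreement can mask systematic failures on rare categories.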



Author information

Contributions

All authors contributed to the writing and editing of this paper.

Corresponding author

Correspondence to Lisa P. Argyle.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Argyle, L.P., Busby, E.C., Gubler, J.R. et al. Arti-‘fickle’ intelligence: using LLMs as a tool for inference in the political and social sciences. Nat Comput Sci 5, 737–744 (2025). https://doi.org/10.1038/s43588-025-00843-4

