Abstract
Large language models (LLMs) are increasing in capability and popularity, propelling their application in new domains—including as replacements for human participants in computational social science, user testing, annotation tasks and so on. In many settings, researchers seek to distribute their surveys to a sample of participants that are representative of the underlying human population of interest. This means that to be a suitable replacement, LLMs will need to be able to capture the influence of positionality (that is, the relevance of social identities like gender and race). However, we show that there are two inherent limitations in the way current LLMs are trained that prevent this. We argue analytically for why LLMs are likely to both misportray and flatten the representations of demographic groups, and then empirically show this on four LLMs through a series of human studies with 3,200 participants across 16 demographic identities. We also discuss a third limitation about how identity prompts can essentialize identities. Throughout, we connect each limitation to a pernicious history of epistemic injustice against the value of lived experiences that explains why replacement is harmful for marginalized demographic groups. Overall, we urge caution in use cases in which LLMs are intended to replace human participants whose identities are relevant to the task at hand. At the same time, in cases where the benefits of LLM replacement are determined to outweigh the harms (for example, engaging human participants may cause them harm, or the goal is to supplement rather than fully replace), we empirically demonstrate that our inference-time techniques reduce—but do not remove—these harms.
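As an illustration of how the flattening limitation could be probed in practice, the sketch below is a minimal, hypothetical example rather than the study's actual protocol: the model name, persona wording, sample count and similarity measure are placeholder assumptions. It repeatedly queries an identity-prompted LLM and scores how similar the sampled responses are to one another using sentence embeddings; consistently high pairwise similarity is one signal that a demographic group is being portrayed as more homogeneous than it is.

```python
# Minimal sketch (not the paper's released pipeline): probe "flattening" by
# sampling several identity-prompted responses and measuring how similar
# they are to one another. Model name, prompt and sample count are
# illustrative assumptions.
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
IDENTITY = "a woman"  # placeholder; the study covered 16 demographic identities
PROMPT = f"Take on the identity of {IDENTITY}. What does a typical weekday look like for you?"

responses = []
for _ in range(20):  # 20 independent samples at default temperature
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    responses.append(completion.choices[0].message.content)

# Embed the responses and compute mean pairwise cosine similarity; values
# close to 1 mean the sampled "participants" are nearly interchangeable.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(responses)
sims = cosine_similarity(embeddings)
mean_offdiag = (sims.sum() - np.trace(sims)) / (sims.size - len(sims))
print(f"Mean pairwise similarity across {len(responses)} samples: {mean_offdiag:.3f}")
```

The same statistic computed over responses from human participants who hold that identity would provide the baseline against which any flattening is judged.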
Data availability
Under the conditions of our Institutional Review Board exemption and the consent form we provided, we cannot release the human participant data as they are sensitive and personal. To enquire about potential access to these confidential data, please contact the corresponding author with your research interest. Our LLM-generated data are available via OSF (https://doi.org/10.17605/OSF.IO/7GMZQ)90.
Code availability
Our code is available via OSF (https://doi.org/10.17605/OSF.IO/7GMZQ)90. We used the Hugging Face, OpenAI, NumPy, scikit-learn and SciPy Python packages.
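For readers less familiar with the listed packages, the sketch below shows one way the Hugging Face portion of such a pipeline might look: persona-prompting an open-weights chat model and drawing multiple samples so that response diversity can be inspected. It is a hedged illustration, not the released OSF code; the model identifier, prompt and generation settings are assumptions.

```python
# Illustrative only (not the released code): draw several samples from an
# open-weights model via the Hugging Face pipeline API so that the diversity
# of identity-prompted outputs can be inspected. Model id, prompt and
# generation settings are placeholder assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder; a gated model requiring access approval
    device_map="auto",
)

prompt = (
    "Imagine you are a first-generation college student. "
    "How do you feel about standardized testing?"
)

samples = generator(
    prompt,
    max_new_tokens=150,
    do_sample=True,           # sampling rather than greedy decoding, to surface variation
    num_return_sequences=5,
    temperature=1.0,
)
for i, s in enumerate(samples):
    print(f"--- sample {i} ---\n{s['generated_text']}\n")
```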
References
Hämäläinen, P., Tavast, M. & Kunnari, A. Evaluating large language models in generating synthetic HCI research data: a case study. In Proc. CHI Conference on Human Factors in Computing Systems (CHI) 433 (Association for Computing Machinery, 2023).
Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl Acad. Sci. USA 120, e2305016120 (2023).
Ziems, C. et al. Can large language models transform computational social science? Comput. Linguist. 50, 237–291 (2024).
Argyle, L. P. et al. Out of one, many: using language models to simulate human samples. Political. Anal. 31, 337–351 (2023).
Lohr, S. L. Sampling: Design and Analysis (Routledge, 2022).
Harding, S. Whose Science? Whose Knowledge? (Cornell Univ. Press, 1991).
Wylie, A. Why standpoint matters. In Science and Other Cultures: Issues in Philosophies of Science and Technology (Routledge, 2003).
Grossmann, I. et al. AI and the transformation of social science research. Science 380, 1108–1109 (2023).
Combahee River Collective. The Combahee River Collective Statement (Routledge, 1977).
Crenshaw, K. Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics (Routledge, 1989).
Korbak, T. et al. Pretraining language models with human preferences. In International Conference on Machine Learning (ICML) 17506–17533 (PMLR, 2023).
Chiang, C.-H. & Lee, H.-Y. Can large language models be an alternative to human evaluation? In Annual Meeting of the Association for Computational Linguistics 15607–15631 (Association for Computational Linguistics, 2023).
He, X. et al. AnnoLLM: making large language models to be better crowdsourced annotators. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics (eds Yang, Y. et al.) 165–190 (2024).
Wu, T. et al. LLMs as workers in human-computational algorithms? Replicating crowdsourcing pipelines with LLMs. In CHI Case Studies of HCI in Practice (Association for Computing Machinery, 2025).
Cegin, J., Simko, J. & Brusilovsky, P. ChatGPT to replace crowdsourcing of paraphrases for intent classification: higher diversity and comparable model robustness. In The 2023 Conference on Empirical Methods in Natural Language Processing (2023).
Hewitt, L., Ashokkumar, A., Ghezae, I. & Willer, R. Predicting results of social science experiments using large language models. Preprint at https://samim.io/dl/Predicting%20results%20of%20social%20science%20experiments%20using%20large%20language%20models.pdf (2024).
Rodriguez, S., Seetharaman, D. & Tilley, A. Meta to push for younger users with new AI chatbot characters. The Wall Street Journal https://www.wsj.com/tech/ai/meta-ai-chatbot-younger-users-dab6cb32 (2023).
Marr, B. The amazing ways Duolingo is using AI and GPT-4. Forbes https://www.forbes.com/sites/bernardmarr/2023/04/28/the-amazing-ways-duolingo-is-using-ai-and-gpt-4/ (2023).
Gupta, S. et al. Bias runs deep: implicit reasoning biases in persona-assigned LLMs. In The Twelfth International Conference on Learning Representations (ICLR, 2024).
Sheng, E., Arnold, J., Yu, Z., Chang, K.-W. & Peng, N. Revealing persona biases in dialogue systems. Preprint at https://arxiv.org/abs/2104.08728 (2021).
Wan, Y., Zhao, J., Chadha, A., Peng, N. & Chang, K.-W. Are personalized stochastic parrots more dangerous? Evaluating persona biases in dialogue systems. In Findings of the Association for Computational Linguistics: EMNLP 2023 9677–9705 (Association for Computational Linguistics, 2023).
Cheng, M., Durmus, E. & Jurafsky, D. Marked personas: using natural language prompts to measure stereotypes in language models. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 1504–1532 (Association for Computational Linguistics, 2023).
Cheng, M., Durmus, E. & Jurafsky, D. CoMPosT: characterizing and evaluating caricature in LLM simulations. In The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP, 2023).
Sun, H., Pei, J., Choi, M. & Jurgens, D. Aligning with whom? Large language models have gender and racial biases in subjective NLP tasks. Preprint at https://arxiv.org/abs/2311.09730 (2023).
Beck, T., Schuff, H., Lauscher, A. & Gurevych, I. Sensitivity, performance, robustness: deconstructing the effect of sociodemographic prompting. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) 2589–2615 (Association for Computational Linguistics, 2024).
Agnew, W. et al. The illusion of artificial inclusion. In Proc. 2024 CHI Conference on Human Factors in Computing Systems 286 (Association for Computing Machinery, 2024).
Kinder, D. R. & Winter, N. Exploring the racial divide: Blacks, whites, and opinion on national policy. Am. J. Political Sci. 45, 439–456 (2001).
Sap, M. et al. Annotators with attitudes: how annotator beliefs and identities bias toxic language detection. In Proc. 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 5884–5906 (Association for Computational Linguistics, 2022).
Denton, R., Díaz, M., Kivlichan, I., Prabhakaran, V. & Rosen, R. Whose ground truth? Accounting for individual and collective identities underlying dataset annotation. Preprint at https://arxiv.org/abs/2112.04554 (2021).
Díaz, M. et al. Crowdworksheets: accounting for individual and collective identities underlying crowdsourced dataset annotation. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 2342–2351 (Association for Computing Machinery, 2022).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
Xu, C. et al. WizardLM: empowering large language models to follow complex instructions. In Twelfth International Conference on Learning Representations (ICLR, 2024).
ehartford. Wizard-vicuna-7b-uncensored. Hugging Face https://huggingface.co/ehartford/Wizard-Vicuna-7B-Uncensored (2023).
OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Tam, Z. R. et al. Let me speak freely? A study on the impact of format restrictions on performance of large language models. In Proc. Conference on Empirical Methods in Natural Language Processing (eds Dernoncourt, F. et al.) 1218–1236 (Association for Computational Linguistics, 2024).
Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 3982–3992 (Association for Computational Linguistics, 2019).
Sue, D. W. Whiteness and ethnocentric monoculturalism: making the ‘invisible’ visible. Am. Psychol. 59, 761 (2004).
Kambhatla, G., Stewart, I. & Mihalcea, R. Surfacing racial stereotypes through identity portrayal. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 1604–1615 (Association for Computing Machinery, 2022).
Alcoff, L. The problem of speaking for others. Cult. Crit. 5–32 (1991).
Spivak, G. C. Can the subaltern speak? In Marxism and the Interpretation of Culture 24–28 (Macmillan, 1988).
Arnaud, S. First-person perspectives and scientific inquiry of autism: towards an integrative approach. Synthese 202, 147 (2023).
Benjamin, E., Ziss, B. E. & George, B. R. Representation is never perfect, but are parents even representatives? Am. J. Bioeth. 20, 51–53 (2020).
Nario-Redmond, M. R., Gospodinov, D. & Cobb, A. Crip for a day: the unintended negative consequences of disability simulations. Rehabil. Psychol. 62, 324 (2017).
Sears, A. & Hanson, V. L. Representing users in accessibility research. In ACM Transactions on Accessible Computing 7 (Association for Computing Machinery, 2012).
Du Bois, W. E. B. The Souls of Black Folk (A. C. McClurg & Company, 1903).
Collins, P. H. Black Feminist Thought (Hyman, 1990).
Ymous, A., Spiel, K., Keyes, O., Williams, R. M. & Good, J. ‘I am just terrified of my future’—epistemic violence in disability related technology research. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems 1–16 (Association for Computing Machinery, 2020).
Fricker, M. Epistemic Injustice: Power and the Ethics of Knowing (Oxford Univ. Press, 2007).
Hellman, D. When is Discrimination Wrong? (Harvard Univ. Press, 2011).
Durmus, E. et al. Towards measuring the representation of subjective global opinions in language models. In First Conference on Language Modeling (2024).
Ferguson, R. A. One-Dimensional Queer (John Wiley & Sons, 2018).
Lahoti, P. et al. Improving diversity of demographic representation in large language models via collective-critiques and self-voting. In The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP, 2023).
Hayati, S. A., Lee, M., Rajagopal, D. & Kang, D. How far can we extract diverse perspectives from large language models? Criteria-based diversity prompting! In Proc. Conference on Empirical Methods in Natural Language Processing (eds Al-Onaizan, Y. et al.) 5336–5366 (Association for Computational Linguistics, 2024).
Park, J. S. et al. Social simulacra: creating populated prototypes for social computing systems. In Proc. 35th Annual ACM Symposium on User Interface Software and Technology 74 (Association for Computing Machinery, 2022).
Buçinca, Z. et al. AHA!: facilitating AI impact assessment by generating examples of harms. Preprint at https://arxiv.org/abs/2306.03280 (2023).
Myers, I. B. The Myers-Briggs Type Indicator: Manual (Consulting Psychologists Press, 1962).
Zhang, S. et al. Personalizing dialogue agents: I have a dog, do you have pets too? In Proc. 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2204–2213 (Association for Computational Linguistics, 2018).
Park, J. S. et al. Generative agent simulations of 1,000 people. Preprint at https://arxiv.org/abs/2411.10109 (2024).
Phillips, A. What’s wrong with essentialism? Distinktion: J. Social Theory 11, 47–60 (2011).
Grudin, J. The Persona Lifecycle: Keeping People in Mind (Morgan Kaufmann, 2006).
Chapman, C. N. & Milham, R. P. The personas’ new clothes: methodological and practical arguments against a popular method. In Proc. Human Factors and Ergonomics Society Annual Meeting 50, 634–636 (2006).
Marsden, N. & Haag, M. Stereotypes and politics: reflections on personas. In Proc. 2016 CHI Conference on Human Factors in Computing Systems 4017–4031 (Association for Computing Machinery, 2016).
Young, I. Describing personas. Inclusive Software https://medium.com/inclusive-software/describing-personas-af992e3fc527 (2016).
Dillion, D., Tandon, N., Gu, Y. & Gray, K. Can AI language models replace human participants? Trends Cogn. Sci. 27, 597–600 (2023).
Harding, J., D’Alessandro, W., Laskowski, N. G. & Long, R. AI language models cannot replace human research participants. AI Soc. 39, 2603–2605 (2023).
Crockett, M. J. & Messeri, L. Should large language models replace human participants? Preprint at https://doi.org/10.31234/osf.io/4zdx9 (2023).
Messeri, L. & Crockett, M. J. Artificial intelligence and illusions of understanding in scientific research. Nature 627, 49–58 (2024).
Geddes, K. Will you have autonomy in the metaverse? Denver Law Rev. 101 (2023).
Measuring digital development: facts and figures 2021. International Telecommunication Union (2021); https://www.itu.int/itu-d/reports/statistics/facts-figures-2021/index/
Wang, A., Ramaswamy, V. V. & Russakovsky, O. Towards intersectionality in machine learning: including more identities, handling underrepresentation and performing evaluation. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 336–349 (Association for Computing Machinery, 2022).
Sweeney, L. Discrimination in online ad delivery. Commun. ACM 56, 44–54 (2013).
Fryer Jr, R. G. & Levitt, S. D. The causes and consequences of distinctively Black names. Q. J. Econ. 119, 767–805 (2004).
Most common last names in the United States (with meanings) (Name Census, 2023); https://namecensus.com/last-names/
Aher, G., Arriaga, R. I. & Kalai, A. T. Using large language models to simulate multiple humans and replicate human subject studies. In Proc. 40th International Conference on Machine Learning 202, 337–371 (PMLR, 2023).
Park, P. S., Schoenegger, P. & Zhu, C. Diminished diversity-of-thought in a standard large language model. Behav. Res. Methods 56, 5754–5770 (2024).
Santurkar, S. et al. Whose opinions do language models reflect? In Proc. 40th International Conference on Machine Learning 202, 29971–30004 (PMLR, 2023).
Park, J. S. et al. Generative agents: interactive simulacra of human behavior. In Proc. 36th Annual ACM Symposium on User Interface Software and Technology 2 (Association for Computing Machinery, 2023).
Horton, J. J. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? Report No. 31122 (National Bureau of Economic Research, 2023).
Jiang, H., Beeferman, D., Roy, B. & Roy, D. CommunityLM: probing partisan worldviews from language models. In Proc. 29th International Conference on Computational Linguistics 6818–6826 (International Committee on Computational Linguistics, 2022).
Markel, J. M., Opferman, S. G., Landay, J. A. & Piech, C. GPTeach: interactive TA training with GPT-based students. In Proc. Tenth ACM Conference on Learning @ Scale 226–236 (Association for Computing Machinery, 2023).
CCES Dataverse (Harvard University, 2024); https://dataverse.harvard.edu/dataverse/cces
Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010).
Ziems, C., Li, M., Zhang, A. & Yang, D. Inducing positive perspectives with text reframing. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 3682–3700 (Association for Computational Linguistics, 2022).
Bagozzi, R. P., Wong, N. & Yi, Y. The role of culture and gender in the relationship between positive and negative affect. Cogn. Emot. 13, 641–672 (1999).
Goldstein, H. & Healy, M. J. R. The graphical presentation of a collection of means. J. R. Stat. Soc. A 158, 175–177 (1995).
Austin, P. C. & Hux, J. E. A brief note on overlapping confidence intervals. J. Vasc. Surg. 36, 194–195 (2002).
Payton, M. E., Greenstone, M. H. & Schenker, N. Overlapping confidence intervals or standard error intervals: what do they mean in terms of statistical significance? J. Insect Sci. 3, 34 (2003).
Greene, T., Dhurandhar, A. & Shmueli, G. Atomist or holist? A diagnosis and vision for more productive interdisciplinary AI ethics dialogue. Patterns 4, 100652 (2023).
Friedman, D. & Dieng, A. B. The Vendi score: a diversity evaluation metric for machine learning. Trans. Mach. Learn. Res. 2835–8856 (2023).
Wang, A., Morgenstern, J. & Dickerson, J. P. Large language models that replace human participants can harmfully misportray and flatten identity groups. OSF https://doi.org/10.17605/OSF.IO/7GMZQ (2024).
Acknowledgements
We thank X. Bai, R. Kamikubo, B. Stewart and H. Wallach for relevant discussions; A. Chen, T. Datta, N. Mukhija and D. Nissani for helping to pilot the human study; and T. Datta, E. Redmiles and T. Zhu for feedback on the draft. This material is based on work supported by a National Science Foundation Graduate Research Fellowship to A.W. and on work initiated during A.W.'s internship at Arthur.
Author information
Authors and Affiliations
Contributions
A.W. developed the idea and ran the experiments and analysis. J.P.D. supervised and advised the project. A.W., J.M. and J.P.D. collectively discussed the results and contributed to the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Travis Greene, Anna Strasser and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Sections 1–7, Figs. 1–9 and Table 1.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, A., Morgenstern, J. & Dickerson, J.P. Large language models that replace human participants can harmfully misportray and flatten identity groups. Nat Mach Intell 7, 400–411 (2025). https://doi.org/10.1038/s42256-025-00986-z
This article is cited by
- Using LLMs to advance the cognitive science of collectives. Nature Computational Science (2025)
- Participant Interactions with Artificial Intelligence: Using Large Language Models to Generate Research Materials for Surveys and Experiments. Journal of Business and Psychology (2025)