Abstract
Large language models (LLMs) are increasing in capability and popularity, propelling their application in new domains—including as replacements for human participants in computational social science, user testing, annotation tasks and so on. In many settings, researchers seek to distribute their surveys to a sample of participants that are representative of the underlying human population of interest. This means that to be a suitable replacement, LLMs will need to be able to capture the influence of positionality (that is, the relevance of social identities like gender and race). However, we show that there are two inherent limitations in the way current LLMs are trained that prevent this. We argue analytically for why LLMs are likely to both misportray and flatten the representations of demographic groups, and then empirically show this on four LLMs through a series of human studies with 3,200 participants across 16 demographic identities. We also discuss a third limitation about how identity prompts can essentialize identities. Throughout, we connect each limitation to a pernicious history of epistemic injustice against the value of lived experiences that explains why replacement is harmful for marginalized demographic groups. Overall, we urge caution in use cases in which LLMs are intended to replace human participants whose identities are relevant to the task at hand. At the same time, in cases where the benefits of LLM replacement are determined to outweigh the harms (for example, engaging human participants may cause them harm, or the goal is to supplement rather than fully replace), we empirically demonstrate that our inference-time techniques reduce—but do not remove—these harms.
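As an illustration of how the flattening limitation could be probed in practice, the sketch below is a minimal, hypothetical example rather than the study's actual protocol: the model name, persona wording, sample count and similarity measure are placeholder assumptions. It repeatedly queries an identity-prompted LLM and scores how similar the sampled responses are to one another using sentence embeddings; consistently high pairwise similarity is one signal that a demographic group is being portrayed as more homogeneous than it is.

```python
# Minimal sketch (not the paper's released pipeline): probe "flattening" by
# sampling several identity-prompted responses and measuring how similar
# they are to one another. Model name, prompt and sample count are
# illustrative assumptions.
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
IDENTITY = "a woman"  # placeholder; the study covered 16 demographic identities
PROMPT = f"Take on the identity of {IDENTITY}. What does a typical weekday look like for you?"

responses = []
for _ in range(20):  # 20 independent samples at default temperature
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    responses.append(completion.choices[0].message.content)

# Embed the responses and compute mean pairwise cosine similarity; values
# close to 1 mean the sampled "participants" are nearly interchangeable.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(responses)
sims = cosine_similarity(embeddings)
mean_offdiag = (sims.sum() - np.trace(sims)) / (sims.size - len(sims))
print(f"Mean pairwise similarity across {len(responses)} samples: {mean_offdiag:.3f}")
```

The same statistic computed over responses from human participants who hold that identity would provide the baseline against which any flattening is judged.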
Data availability
Under the conditions of our Institutional Review Board exemption and the consent form we provided, we cannot release the human participant data as they are sensitive and personal. To enquire about potential access to these confidential data, please contact the corresponding author with your research interest. Our LLM-generated data are available via OSF (https://doi.org/10.17605/OSF.IO/7GMZQ)90.
Code availability
Our code is available via OSF (https://doi.org/10.17605/OSF.IO/7GMZQ)90. We used the Hugging Face, OpenAI, NumPy, scikit-learn and SciPy Python packages.
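For readers less familiar with the listed packages, the sketch below shows one way the Hugging Face portion of such a pipeline might look: persona-prompting an open-weights chat model and drawing multiple samples so that response diversity can be inspected. It is a hedged illustration, not the released OSF code; the model identifier, prompt and generation settings are assumptions.

```python
# Illustrative only (not the released code): draw several samples from an
# open-weights model via the Hugging Face pipeline API so that the diversity
# of identity-prompted outputs can be inspected. Model id, prompt and
# generation settings are placeholder assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder; a gated model requiring access approval
    device_map="auto",
)

prompt = (
    "Imagine you are a first-generation college student. "
    "How do you feel about standardized testing?"
)

samples = generator(
    prompt,
    max_new_tokens=150,
    do_sample=True,           # sampling rather than greedy decoding, to surface variation
    num_return_sequences=5,
    temperature=1.0,
)
for i, s in enumerate(samples):
    print(f"--- sample {i} ---\n{s['generated_text']}\n")
```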
References
Hämäläinen, P., Tavast, M. & Kunnari, A. Evaluating large language models in generating synthetic HCI research data: a case study. In Proc. CHI Conference on Human Factors in Computing Systems (CHI) 433 (Association for Computing Machinery, 2023).
Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc. Natl Acad. Sci. USA 120, e2305016120 (2023).
Ziems, C. et al. Can large language models transform computational social science? Comput. Linguist. 50, 237–291 (2024).
Argyle, L. P. et al. Out of one, many: using language models to simulate human samples. Political. Anal. 31, 337–351 (2023).
Lohr, S. L. Sampling: Design and Analysis (Routledge, 2022).
Harding, S. Whose Science? Whose Knowledge? (Cornell Univ. Press, 1991).
Wylie, A. Why standpoint matters. In Science and Other Cultures: Issues in Philosophies of Science and Technology (Routledge, 2003).
Grossmann, I. et al. AI and the transformation of social science research. Science 380, 1108–1109 (2023).
Combahee River Collective. The Combahee River Collective Statement (Routledge, 1977).
Crenshaw, K. Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics (Routledge, 1989).
Korbak, T. et al. Pretraining language models with human preferences. In International Conference on Machine Learning (ICML) 17506–17533 (PMLR, 2023).
Chiang, C.-H. & Lee, H.-Y. Can large language models be an alternative to human evaluation? In Annual Meeting of the Association for Computational Linguistics 15607–15631 (Association for Computational Linguistics, 2023).
He, X. et al. AnnoLLM: making large language models to be better crowdsourced annotators. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics (eds Yang, Y. et al.) 165–190 (2024).
Wu, T. et al. LLMs as workers in human-computational algorithms? Replicating crowdsourcing pipelines with LLMs. In CHI Case Studies of HCI in Practice (Association for Computing Machinery, 2025).
Cegin, J., Simko, J. & Brusilovsky, P. ChatGPT to replace crowdsourcing of paraphrases for intent classification: higher diversity and comparable model robustness. In The 2023 Conference on Empirical Methods in Natural Language Processing (2023).
Hewitt, L., Ashokkumar, A., Ghezae, I. & Willer, R. Predicting results of social science experiments using large language models. Preprint at https://samim.io/dl/Predicting%20results%20of%20social%20science%20experiments%20using%20large%20language%20models.pdf (2024).
Rodriguez, S., Seetharaman, D. & Tilley, A. Meta to push for younger users with new AI chatbot characters. The Wall Street Journal https://www.wsj.com/tech/ai/meta-ai-chatbot-younger-users-dab6cb32 (2023).
Marr, B. The amazing ways Duolingo is using AI and GPT-4. Forbes https://www.forbes.com/sites/bernardmarr/2023/04/28/the-amazing-ways-duolingo-is-using-ai-and-gpt-4/ (2023).
Gupta, S. et al. Bias runs deep: implicit reasoning biases in persona-assigned LLMs. In The Twelfth International Conference on Learning Representations (ICLR, 2024).
Sheng, E., Arnold, J., Yu, Z., Chang, K.-W. & Peng, N. Revealing persona biases in dialogue systems. Preprint at https://arxiv.org/abs/2104.08728 (2021).
Wan, Y., Zhao, J., Chadha, A., Peng, N. & Chang, K.-W. Are personalized stochastic parrots more dangerous? Evaluating persona biases in dialogue systems. In Findings of the Association for Computational Linguistics: EMNLP 2023 9677–9705 (Association for Computational Linguistics, 2023).
Cheng, M., Durmus, E. & Jurafsky, D. Marked personas: using natural language prompts to measure stereotypes in language models. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 1504–1532 (Association for Computational Linguistics, 2023).
Cheng, M., Durmus, E. & Jurafsky, D. CoMPosT: characterizing and evaluating caricature in LLM simulations. In The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP, 2023).
Sun, H., Pei, J., Choi, M. & Jurgens, D. Aligning with whom? Large language models have gender and racial biases in subjective NLP tasks. Preprint at https://arxiv.org/abs/2311.09730 (2023).
Beck, T., Schuff, H., Lauscher, A. & Gurevych, I. Sensitivity, performance, robustness: deconstructing the effect of sociodemographic prompting. In Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) 2589–2615 (Association for Computational Linguistics, 2024).
Agnew, W. et al. The illusion of artificial inclusion. In Proc. 2024 CHI Conference on Human Factors in Computing Systems 286 (Association for Computing Machinery, 2024).
Kinder, D. R. & Winter, N. Exploring the racial divide: Blacks, whites, and opinion on national policy. Am. J. Political Sci. 45, 439–456 (2001).
Sap, M. et al. Annotators with attitudes: how annotator beliefs and identities bias toxic language detection. In Proc. 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 5884–5906 (Association for Computational Linguistics, 2022).
Denton, R., Díaz, M., Kivlichan, I., Prabhakaran, V. & Rosen, R. Whose ground truth? Accounting for individual and collective identities underlying dataset annotation. Preprint at https://arxiv.org/abs/2112.04554 (2021).
Díaz, M. et al. Crowdworksheets: accounting for individual and collective identities underlying crowdsourced dataset annotation. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 2342–2351 (Association for Computing Machinery, 2022).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
Xu, C. et al. WizardLM: empowering large language models to follow complex instructions. In Twelfth International Conference on Learning Representations (ICLR, 2024).
ehartford. Wizard-vicuna-7b-uncensored. Hugging Face https://huggingface.co/ehartford/Wizard-Vicuna-7B-Uncensored (2023).
OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Tam, Z. R. et al. Let me speak freely? A study on the impact of format restrictions on performance of large language models. In Proc. Conference on Empirical Methods in Natural Language Processing (eds Dernoncourt, F. et al.) 1218–1236 (Association for Computational Linguistics, 2024).
Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) 3982–3992 (Association for Computational Linguistics, 2019).
Sue, D. W. Whiteness and ethnocentric monoculturalism: making the ‘invisible’ visible. Am. Psychol. 59, 761 (2004).
Kambhatla, G., Stewart, I. & Mihalcea, R. Surfacing racial stereotypes through identity portrayal. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 1604–1615 (Association for Computing Machinery, 2022).
Alcoff, L. The problem of speaking for others. Cult. Crit. 5–32 (1991).
Spivak, G. C. Can the subaltern speak? In Marxism and the Interpretation of Culture 24–28 (Macmillan, 1988).
Arnaud, S. First-person perspectives and scientific inquiry of autism: towards an integrative approach. Synthese 202, 147 (2023).
Benjamin, E., Ziss, B. E. & George, B. R. Representation is never perfect, but are parents even representatives? Am. J. Bioeth. 20, 51–53 (2020).
Nario-Redmond, M. R., Gospodinov, D. & Cobb, A. Crip for a day: the unintended negative consequences of disability simulations. Rehabil. Psychol. 62, 324 (2017).
Sears, A. & Hanson, V. L. Representing users in accessibility research. In ACM Transactions on Accessible Computing 7 (Association for Computing Machinery, 2012).
Du Bois, W. E. B. The Souls of Black Folk (A. C. McClurg & Company, 1903).
Collins, P. H. Black Feminist Thought (Hyman, 1990).
Ymous, A., Spiel, K., Keyes, O., Williams, R. M. & Good, J. ‘I am just terrified of my future’—epistemic violence in disability related technology research. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems 1–16 (Association for Computing Machinery, 2020).
Fricker, M. Epistemic Injustice: Power and the Ethics of Knowing (Oxford Univ. Press, 2007).
Hellman, D. When is Discrimination Wrong? (Harvard Univ. Press, 2011).
Durmus, E. et al. Towards measuring the representation of subjective global opinions in language models. In First Conference on Language Modeling (2024).
Ferguson, R. A. One-Dimensional Queer (John Wiley & Sons, 2018).
Lahoti, P. et al. Improving diversity of demographic representation in large language models via collective-critiques and self-voting. In The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP, 2023).
Hayati, S. A., Lee, M., Rajagopal, D. & Kang, D. How far can we extract diverse perspectives from large language models? Criteria-based diversity prompting! In Proc. Conference on Empirical Methods in Natural Language Processing (eds Al-Onaizan, Y. et al.) 5336–5366 (Association for Computational Linguistics, 2024).
Park, J. S. et al. Social simulacra: creating populated prototypes for social computing systems. In Proc. 35th Annual ACM Symposium on User Interface Software and Technology 74 (Association for Computing Machinery, 2022).
Buçinca, Z. et al. AHA!: facilitating AI impact assessment by generating examples of harms. Preprint at https://arxiv.org/abs/2306.03280 (2023).
Myers, I. B. The Myers-Briggs Type Indicator: Manual (Consulting Psychologists Press, 1962).
Zhang, S. et al. Personalizing dialogue agents: I have a dog, do you have pets too? In Proc. 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2204–2213 (Association for Computational Linguistics, 2018).
Park, J. S. et al. Generative agent simulations of 1,000 people. Preprint at https://arxiv.org/abs/2411.10109 (2024).
Phillips, A. What’s wrong with essentialism? Distinktion: J. Social Theory 11, 47–60 (2011).
Grudin, J. The Persona Lifecycle: Keeping People in Mind (Morgan Kaufmann, 2006).
Chapman, C. N. & Milham, R. P. The personas’ new clothes: methodological and practical arguments against a popular method. In Proc. Human Factors and Ergonomics Society Annual Meeting 50, 634–636 (2006).
Marsden, N. & Haag, M. Stereotypes and politics: reflections on personas. In Proc. 2016 CHI Conference on Human Factors in Computing Systems 4017–4031 (Association for Computing Machinery, 2016).
Young, I. Describing personas. Inclusive Software https://medium.com/inclusive-software/describing-personas-af992e3fc527 (2016).
Dillion, D., Tandon, N., Gu, Y. & Gray, K. Can AI language models replace human participants? Trends Cogn. Sci. 27, 597–600 (2023).
Harding, J., D’Alessandro, W., Laskowski, N. G. & Long, R. AI language models cannot replace human research participants. AI Soc. 39, 2603–2605 (2023).
Crockett, M. J. & Messeri, L. Should large language models replace human participants? Preprint at https://doi.org/10.31234/osf.io/4zdx9 (2023).
Messeri, L. & Crockett, M. J. Artificial intelligence and illusions of understanding in scientific research. Nature 627, 49–58 (2024).
Geddes, K. Will you have autonomy in the metaverse? Denver Law Rev. 101 (2023).
Measuring digital development: facts and figures 2021. International Telecommunication Union (2021); https://www.itu.int/itu-d/reports/statistics/facts-figures-2021/index/
Wang, A., Ramaswamy, V. V. & Russakovsky, O. Towards intersectionality in machine learning: including more identities, handling underrepresentation and performing evaluation. In Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency 336–349 (Association for Computing Machinery, 2022).
Sweeney, L. Discrimination in online ad delivery. Commun. ACM 56, 44–54 (2013).
Fryer Jr, R. G. & Levitt, S. D. The causes and consequences of distinctively Black names. Q. J. Econ. 119, 767–805 (2004).
Most common last names in the United States (with meanings) (Name Census, 2023); https://namecensus.com/last-names/
Aher, G., Arriaga, R. I. & Kalai, A. T. Using large language models to simulate multiple humans and replicate human subject studies. In Proc. 40th International Conference on Machine Learning 202, 337–371 (PMLR, 2023).
Park, P. S., Schoenegger, P. & Zhu, C. Diminished diversity-of-thought in a standard large language model. Behav. Res. Methods 56, 5754–5770 (2024).
Santurkar, S. et al. Whose opinions do language models reflect? In Proc. 40th International Conference on Machine Learning 202, 29971–30004 (PMLR, 2023).
Park, J. S. et al. Generative agents: interactive simulacra of human behavior. In Proc. 36th Annual ACM Symposium on User Interface Software and Technology 2 (Association for Computing Machinery, 2023).
Horton, J. J. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? Report No. 31122 (National Bureau of Economic Research, 2023).
Jiang, H., Beeferman, D., Roy, B. & Roy, D. CommunityLM: probing partisan worldviews from language models. In Proc. 29th International Conference on Computational Linguistics 6818–6826 (International Committee on Computational Linguistics, 2022).
Markel, J. M., Opferman, S. G., Landay, J. A. & Piech, C. GPTeach: interactive TA training with GPT-based students. In Proc. Tenth ACM Conference on Learning @ Scale 226–236 (Association for Computing Machinery, 2023).
CCES Dataverse (Harvard University, 2024); https://dataverse.harvard.edu/dataverse/cces
Vinh, N. X., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010).
Ziems, C., Li, M., Zhang, A. & Yang, D. Inducing positive perspectives with text reframing. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 3682–3700 (Association for Computational Linguistics, 2022).
Bagozzi, R. P., Wong, N. & Yi, Y. The role of culture and gender in the relationship between positive and negative affect. Cogn. Emot. 13, 641–672 (1999).
Goldstein, H. & Healy, M. J. R. The graphical presentation of a collection of means. J. R. Stat. Soc. A 158, 175–177 (1995).
Austin, P. C. & Hux, J. E. A brief note on overlapping confidence intervals. J. Vasc. Surg. 36, 194–195 (2002).
Payton, M. E., Greenstone, M. H. & Schenker, N. Overlapping confidence intervals or standard error intervals: what do they mean in terms of statistical significance? J. Insect Sci. 3, 34 (2003).
Greene, T., Dhurandhar, A. & Shmueli, G. Atomist or holist? A diagnosis and vision for more productive interdisciplinary AI ethics dialogue. Patterns 4, 100652 (2023).
Friedman, D. & Dieng, A. B. The Vendi score: a diversity evaluation metric for machine learning. Trans. Mach. Learn. Res. 2835–8856 (2023).
Wang, A., Morgenstern, J. & Dickerson, J. P. Large language models that replace human participants can harmfully misportray and flatten identity groups. OSF https://doi.org/10.17605/OSF.IO/7GMZQ (2024).
Acknowledgements
We thank X. Bai, R. Kamikubo, B. Stewart and H. Wallach for relevant discussions; A. Chen, T. Datta, N. Mukhija and D. Nissani for helping to pilot the human study; and T. Datta, E. Redmiles and T. Zhu for feedback on the draft. This material is based on work supported by a National Science Foundation Graduate Research Fellowship to A.W. and on work initiated during A.W.'s internship at Arthur.
Author information
Authors and Affiliations
Contributions
A.W. developed the idea and ran the experiments and analysis. J.P.D. supervised and advised the project. A.W., J.M. and J.P.D. collectively discussed the results and contributed to the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Travis Greene, Anna Strasser and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Sections 1–7, Figs. 1–9 and Table 1.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, A., Morgenstern, J. & Dickerson, J.P. Large language models that replace human participants can harmfully misportray and flatten identity groups. Nat Mach Intell 7, 400–411 (2025). https://doi.org/10.1038/s42256-025-00986-z
This article is cited by
- Using LLMs to advance the cognitive science of collectives. Nature Computational Science (2025)
- Participant Interactions with Artificial Intelligence: Using Large Language Models to Generate Research Materials for Surveys and Experiments. Journal of Business and Psychology (2025)