Extended Data Fig. 2: Differences in the effects of slurs by identity across prompts (closed models).

This figure shows, for each slur, the difference in marginal means between users with an identity cue and anonymous users, and how these differences vary across prompts. Results are shown for each of the closed models (Nposts = 60,000 for each model). The top row shows results for human subjects (Nposts = 55,620, evaluated by Nsubjects = 1,854). Each point represents the estimated difference in marginal means and is colored according to the identity depicted; the shape of each point denotes the prompt variant. Error bars are 95% confidence intervals: the MLLM results use bootstrap confidence intervals, and the human-experiment intervals are based on standard errors clustered at the subject level.
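
As an illustration of how a bootstrap interval for such a difference can be computed, the following is a minimal sketch, not the authors' procedure. It assumes the marginal means reduce to simple group means of a per-post outcome, that the two groups are resampled independently, and that a percentile interval is used; the caption specifies none of these details. The function name `bootstrap_diff_ci`, its parameters, and the simulated data are hypothetical.

```python
import numpy as np

def bootstrap_diff_ci(cue, anon, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(cue) - mean(anon).

    Assumption: simple group means stand in for marginal means, and
    posts are resampled independently within each group.
    """
    rng = np.random.default_rng(seed)
    cue = np.asarray(cue, dtype=float)
    anon = np.asarray(anon, dtype=float)
    estimate = cue.mean() - anon.mean()
    # Draw n_boot resamples (with replacement) of each group at once.
    cue_idx = rng.integers(0, cue.size, size=(n_boot, cue.size))
    anon_idx = rng.integers(0, anon.size, size=(n_boot, anon.size))
    diffs = cue[cue_idx].mean(axis=1) - anon[anon_idx].mean(axis=1)
    # The alpha/2 and 1 - alpha/2 quantiles bound the interval.
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return estimate, lower, upper

# Hypothetical binary outcomes, simulated for illustration only.
sim = np.random.default_rng(1)
cue_outcomes = sim.binomial(1, 0.35, size=500)
anon_outcomes = sim.binomial(1, 0.50, size=500)
estimate, lower, upper = bootstrap_diff_ci(cue_outcomes, anon_outcomes)
print(f"difference = {estimate:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

For the human-subject row, this kind of post-level resampling would not be appropriate, because each subject rates many posts; the caption indicates that those intervals instead rely on standard errors clustered by subject.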