Automated content moderation systems designed to detect prohibited content on social media often struggle to account for contextual information, which sometimes leads to the erroneous flagging of innocuous content. An experimental study on hate speech reveals how multimodal large language models can facilitate more context-sensitive content moderation on social media.
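The study described here evaluates whether a multimodal large language model, given a post's text, an attached image and contextual cues about the author and target, produces hate-speech judgements that track human ratings. A minimal sketch of that kind of evaluation pipeline is shown below; the prompt wording, the attribute names and the `query_multimodal_llm` helper are illustrative assumptions, not the study's actual materials or model interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SimulatedPost:
    """A synthetic post whose contextual attributes can be varied, conjoint-style."""
    text: str                 # the message itself
    image_path: str           # attached image or profile picture
    author_description: str   # stated identity of the poster
    target_group: str         # group referenced by the message

def build_prompt(post: SimulatedPost) -> str:
    """Assemble the post text and contextual cues into one evaluation prompt."""
    return (
        "You are a content moderator. Considering the author, the target group "
        "and the attached image, rate how hateful this post is on a 1-7 scale "
        "and reply with the number only.\n"
        f"Author: {post.author_description}\n"
        f"Target group: {post.target_group}\n"
        f"Post text: {post.text}"
    )

def rate_post(post: SimulatedPost,
              query_multimodal_llm: Callable[[str, str], str]) -> int:
    """Send the prompt and image path to a (hypothetical) multimodal LLM client
    and parse its numeric rating; a full study would randomize attribute levels
    and aggregate ratings over many posts and human raters."""
    reply = query_multimodal_llm(build_prompt(post), post.image_path)
    return int(reply.strip())
```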
This is a summary of: Davidson, T. Multimodal large language models can make context-sensitive hate speech evaluations aligned with human judgement. Nat. Hum. Behav. https://doi.org/10.1038/s41562-025-02360-w (2025).
Cite this article
Measuring context sensitivity in artificial intelligence content moderation. Nat Hum Behav (2025). https://doi.org/10.1038/s41562-025-02363-7