replying to: K. Warren et al.; Scientific Reports https://doi.org/10.1038/s41598-025-25046-9 (2025).
In their response to Aharoni et al. (2024) on the modified Moral Turing Test1, Warren and colleagues make two core claims: (1) the semantic content of our stimuli was confounded with linguistic differences, and (2) the differences observed in our study can be fully explained by these low-level linguistic features rather than by the perceived quality of moral reasoning2. They make these claims on the basis of a novel linguistic analysis of our stimuli.
We agree with Warren et al. that linguistic differences may account for at least some of the variation in participants’ ratings of the quality of moral responses—a limitation we highlighted in our article. Their novel analysis helps to support that interpretation. However, their analysis does not show that these linguistic differences in fact explain our results. Further research would be needed to test this possibility, and their suggestions for future research complement our own. Below we offer some other possibilities, while also questioning whether linguistic and semantic properties can be fully distinguished.
Warren et al. do not dispute our finding that participants regarded the AI’s evaluations as higher in quality than the human evaluations. Rather, their concern is that our task design was not sufficient to distinguish between the two competing explanations for our effects, semantic and linguistic2. Indeed, we were explicit about this possibility. Most notably, we reported that response length and word choice were identified by our participants as attributes that could have helped them identify the source of the computer passages (p. 6)1. We also speculated that such structural attributes could help explain how people were able to accurately identify the source of the computer responses (p. 8)1.
While we clearly recognize this potential confound, our decision to allow linguistic and semantic properties to covary was a deliberate design choice with known advantages and disadvantages. Taking response length as an example, we purposely chose not to match response lengths in order to preserve ecological realism. In the real world, we explained, LLMs are rarely expected to match a user’s word count, and imposing such a requirement might have produced responses that are not representative of the LLM’s ordinary tone. Our compromise was to impose an upper limit on the computer’s response length (p. 8)1. A similar strategy was taken by other researchers in an independent replication and extension of our study3. But as we noted, “future research should attempt to reproduce the results of our hypothesis tests after more closely matching attributes like response length, or perhaps bypassing stylistic factors by collecting non-linguistic representations of their moral responses such as illustrations” (p. 8)1. However, evidence of stylistic differences does not rule out the possibility that participants were also moved by the content of the evaluations. Hence, we agree that AI favoritism could be influenced by the AI’s characteristic linguistic features, but this remains an open question for future research.
Warren et al. also seem to agree with our suggestion that another approach could be to prompt the AI to use imitation or roleplay to deceive the human judge into believing its output was human-generated (p. 8)1. However, even if the roleplay method succeeded in matching the human’s style, there would be no assurance that, in changing its style, the LLM did not also subtly change its meaning, because people may rely on cues such as word choice to infer meaning. After all, how many manipulations of the LLM’s output would be allowable before researchers are no longer testing an AI model of a human but rather the researcher’s model of a human? These questions have no straightforward answers, so investigators are forced to make tradeoffs, such as the well-known tradeoff between experimental control and ecological validity. Warren et al. suggest modifying the LLM’s output to perfectly match linguistic characteristics, including vocabulary sophistication and grade-level readability, but as we have argued, doing so could inadvertently change the LLM’s meaning or trivialize the set of real-world cases to which the results generalize. That approach should therefore be pursued, but only as one part of a more diverse set of tests.
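To make the matching question concrete, the brief sketch below shows one way future studies might quantify simple surface features, such as response length and grade-level readability, for human- and AI-authored passages before deciding how (or whether) to match them. This is an illustrative sketch only: the Python textstat package and the sample passages are our own assumptions, not materials or methods from either study.

```python
# Illustrative sketch only (not from either study): quantify the surface features
# discussed above (response length, grade-level readability) for human- and
# AI-authored passages before testing whether they are matched.
# The textstat package and the sample passages are our own assumptions.

import statistics
import textstat  # third-party readability library: pip install textstat


def surface_features(text: str) -> dict:
    """Compute simple stylistic measures for one passage."""
    return {
        "word_count": len(text.split()),
        "fk_grade": textstat.flesch_kincaid_grade(text),  # U.S. grade-level readability
    }


def summarize(passages: list[str]) -> dict:
    """Average each surface feature across a set of passages."""
    features = [surface_features(p) for p in passages]
    return {key: statistics.mean(f[key] for f in features) for key in features[0]}


# Hypothetical example passages, for illustration only.
human_passages = ["It was wrong because it betrayed a friend's trust."]
ai_passages = ["The act was impermissible because it violated a duty of fidelity owed to a friend."]

print("human:", summarize(human_passages))
print("AI:   ", summarize(ai_passages))
```

Measures of this kind could be reported descriptively, used as covariates, or used as matching criteria, although, as argued above, any manipulation based on them risks altering meaning along with form.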
An alternative approach might be to collect higher-quality responses from humans, an approach later taken by Dillion and colleagues3. Effectively replicating our 2024 finding, they found that “Americans rate ethical advice from GPT-4o as slightly more moral, trustworthy, thoughtful, and correct” than that of a professional ethicist, Kwame Anthony Appiah of the New York Times. Still, their participants were better than chance at identifying the source of the AI content. The authors concluded that the linguistic differences they measured might have contributed to, but could not fully explain, participants’ accuracy in identifying the AI author. Another approach could be to filter laypersons’ moral evaluations through an LLM to standardize their language. Again, these approaches all have drawbacks, which is why a diversity of approaches is needed to tease apart content from form. On the other hand, part of what it means to make a good moral argument could include the ability to express it well, with relevant high-level vocabulary. If so, it could be impossible to fully standardize a speaker’s language without also changing the meaning of their argument; on such a view, full standardization would also be unnecessary, since such a theory could itself account for some of the linguistic differences.
Understanding the factors that drive human perceptions of AI is crucial because, if chatbots can influence moral attitudes on the basis of superficial language rather than substantive arguments, people risk being persuaded by them for the wrong reasons. On this point, we expressed skepticism about the genuine moral properties of LLMs, noting that these systems often generate “bullshit”, a technical term referring to persuasion without any regard for, or even understanding of, what is true or false4. And yet our research suggests that some people may sometimes regard LLMs’ moral evaluations as equal to or higher in quality than those of humans, even though that inference is probably unwarranted, at least at this early stage of LLM development.
We appreciate Warren et al.’s linguistic analysis and their call to extend research on perceptions of LLMs’ moral language. Although they contributed additional evidence that the linguistic properties of AI-generated text may covary with its semantic properties, it remains to be shown that people’s evaluations of AI moral responses are in fact explained by those linguistic features. The inherent difficulties of this question are exactly why critical and constructive dialogues like this one are so important for scientific progress on perceptions of AI moral commentary.
References
1. Aharoni, E., Fernandes, S., Brady, D. J., Alexander, C., Criner, M., Queen, K. & Crespo, V. Attributions toward artificial agents in a modified Moral Turing Test. Sci. Rep. 14, 8458. https://doi.org/10.1038/s41598-024-58087-7 (2024).
2. Warren, K., Nichols, C., Petersen, D., Shalin, V. L. & Almor, A. Stylistic language drives perceived moral superiority of LLMs. Sci. Rep. https://doi.org/10.1038/s41598-025-25046-9 (2025).
3. Dillion, D., Mondal, D., Tandon, N. & Gray, K. AI language model rivals expert ethicist in perceived moral expertise. Sci. Rep. 15, 4084. https://doi.org/10.1038/s41598-025-86510-0 (2025).
4. Frankfurt, H. On Bullshit (Princeton University Press, 2005).
Author information
Contributions
Both authors wrote and reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.