Introduction


arising from: E. Aharoni et al.; Scientific Reports https://doi.org/10.1038/s41598-024-58087-7 (2024).

In this paper, we respond to a recent article published in Scientific Reports by Aharoni et al.1, titled "Attributions toward artificial agents in a modified Moral Turing Test." Aharoni et al. tested how humans evaluate the quality of moral reasoning in human-generated and LLM-generated responses to moral questions. The human responses were sourced from university undergraduates, while the LLM responses were generated using OpenAI's ChatGPT-4. The prompts used to elicit the responses asked whether and why certain actions were morally wrong or acceptable. Ten pairs of human-generated and LLM-generated responses were then used as stimuli in a modified Moral Turing Test (m-MTT), in which a different set of human participants rated the quality of these responses. Participants rated the LLM-generated responses as higher in virtuousness, trustworthiness, and intelligence. Nevertheless, participants were able to distinguish the human-generated from the LLM-generated responses.

Aharoni et al.1 claimed that "participants' aptitude at identifying the computer [was due] not to its failures in moral reasoning, but potentially to its perceived superiority—not necessarily in the form of conscious attitudes about its general moral capabilities but at least in the form of implicit attitudes about the quality of the moral responses observed" (p. 8). We argue that their findings do not yet merit this conclusion. While we appreciate Aharoni et al.'s contribution to the ongoing discourse on AI and moral reasoning, we propose an alternative interpretation of their results: the observed ratings primarily reflect participants' perceptions of the LLM's use of specialist language, not of its moral reasoning. Specifically, we argue that the perceived superiority of the LLM-generated responses was driven by uncontrolled psycholinguistic features, namely word frequency, age of acquisition, word length, and overall readability. None of these features is specific to moral reasoning. Participants' explicit judgments of intelligence, virtuousness, and trustworthiness are therefore likely driven by well-known implicit, domain-independent (psycho)linguistic effects.

Indeed, such psycholinguistic features are well known to influence perceived credibility, trustworthiness, intelligence, and persuasiveness (e.g.,2,3). Seminal research on human intelligence has demonstrated a positive correlation between vocabulary size and intelligence4. Oppenheimer5 demonstrated that experimentally manipulating psycholinguistic features such as word length can significantly influence participants' perception of an author's intelligence, even when texts have identical semantic content. This finding underscores that perceived intelligence, and by extension other evaluative judgments, can be shaped by surface-level linguistic features alone, independent of the actual substance of the argument. We argue that the differences observed in Aharoni et al.'s1 study can be fully explained by these uncontrolled low-level psycholinguistic features, which is the simpler explanation, rather than by the perceived quality of moral reasoning. We recommend that future evaluations of AI control for such confounding psycholinguistic variables in order to disentangle the effects of language complexity from genuine perceptions of AI's capabilities.

To test for linguistic differences between the LLM- and human-generated responses used as rating stimuli by Aharoni et al.1, we examined both sets of responses for mean word length (measured in number of letters, phonemes, and syllables), mean word frequency, and mean age of acquisition. We also calculated their overall Flesch-Kincaid readability scores using the Text Ease and Readability Assessor (T.E.R.A.)6,7. If, as we suspected, the two response types differ significantly on these measures, then the rating differences reported by Aharoni et al. are not informative about moral reasoning.
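For reference, the Flesch-Kincaid grade level reported by T.E.R.A. depends only on mean sentence length and mean syllables per word; in its standard formulation (which we assume here) it is

\[
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59,
\]

so longer sentences and more multisyllabic words both raise the estimated grade level, independently of content.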

We used the South Carolina psycholinguistic metabase (SCOPE)8 to extract, for each word in Aharoni et al.'s stimuli, the SUBTLEXus corpus word frequency of Brysbaert and New9, the Living Word Vocabulary age of acquisition (AoA) of Dale and O'Rourke10, and word length (measured in letters, phonemes, and syllables). Of the 444 distinct words in the stimuli, data were missing for 19 words on frequency, 7 on letter count, 13 on phoneme and syllable count, and 173 on AoA; for the analysis of each measure, we excluded words with missing values on that measure. For readability, each LLM- and human-generated response was analyzed with T.E.R.A., which returns a Flesch-Kincaid reading grade level6,7. We then computed the mean of each variable for each text and compared the human-generated texts to the LLM-generated texts using two-tailed paired-samples t tests. All statistical testing was done in R 4.4.111.
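As an illustration, the core of this analysis reduces to a paired t test on per-text means. The minimal R sketch below uses random placeholder values in place of the actual per-text means from SCOPE and T.E.R.A.; the variable names and numbers are illustrative only and do not reproduce our results.

```r
## Minimal sketch of the paired comparison (run in R 4.4.1).
## The vectors below are random placeholders standing in for the actual
## per-text means derived from SCOPE (frequency, AoA, length) and
## T.E.R.A. (Flesch-Kincaid grade level).
set.seed(1)
n_pairs <- 10                                  # ten human/LLM response pairs

fk_human <- rnorm(n_pairs, mean = 10, sd = 2)  # placeholder per-text means
fk_llm   <- rnorm(n_pairs, mean = 10, sd = 2)

## Two-tailed paired-samples t test, as used for each measure in Table 1
t.test(fk_llm, fk_human, paired = TRUE, alternative = "two.sided")
```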

The results are shown in Table 1.

Table 1 Comparison of linguistic measures of LLM and human responses.

As expected, the LLM-generated responses were significantly more complex than the human-generated responses in terms of word frequency, age of acquisition, word length, and readability. Although the numerical differences may appear modest, they fall within the range known to influence perceptions of an author's intelligence, expertise, and clarity2,5. We therefore argue that the findings of Aharoni et al.1 reflect the persuasive effect of linguistic style rather than genuine differences in the perceived quality of moral reasoning.

To evaluate differences in the perception of moral reasoning between LLMs and humans more accurately, it is essential to control for psycholinguistic features. Prior work has shown that ChatGPT can adopt different expository styles depending on the prompt12,13. LLM prompts could therefore be crafted to match the psycholinguistic style of the human comparison group. For instance, ChatGPT-4 could be explicitly instructed to respond in the style of undergraduate students, with subsequent verification that the LLM- and human-generated texts are comparable on key linguistic dimensions (see the sketch below). Alternatively, researchers could directly revise LLM outputs to match the language form of the human population while preserving the underlying content.
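As one possible form such a verification could take, the following R sketch compares the mean word length of style-matched LLM texts against their paired human texts. The helper function and the example strings are hypothetical and serve only to illustrate the check, which would of course extend to word frequency, AoA, and readability as well.

```r
## Hypothetical comparability check for style-matched stimuli: mean word length
## (letters per word) of each LLM text versus its paired human-written text.
## The texts below are placeholders, not items from the actual stimulus set.
mean_word_length <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nchar(words) > 0]
  mean(nchar(words))
}

human_texts <- c("I think taking the money is wrong because it is not yours.",
                 "Lying to your friend is bad since trust matters a lot.",
                 "Helping someone cheat is wrong because it is unfair.")
llm_texts   <- c("Appropriating the funds is wrong because they belong to another.",
                 "Deceiving a friend is wrong because it undermines trust.",
                 "Facilitating dishonesty is wrong because it is inequitable.")

human_len <- sapply(human_texts, mean_word_length)
llm_len   <- sapply(llm_texts, mean_word_length)

## A significant difference here would flag that the two sets of texts
## are not yet matched on word length.
t.test(llm_len, human_len, paired = TRUE)
```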

In this paper, we have shown that Aharoni et al.'s1 results can be explained by low-level psycholinguistic features and thus do not merit conclusions about the perceived quality of LLM moral reasoning. Accounting for basic, well-known psycholinguistic features is critical for any study that gauges LLMs' performance in any domain on the basis of human evaluations of verbal responses14,15,16.