Introduction

The rapid progress of artificial intelligence, especially in natural language processing, has ignited intense debate about how closely large language models (LLMs), including chatbots like ChatGPT, emulate human linguistic cognition and use (Chomsky et al., 2023; Piantadosi, 2024; Binz and Schulz, 2023; Kosinski, 2024; Qiu et al., 2023; Cai et al., 2024). With each technological leap, distinguishing between human linguistic cognition and the capabilities of AI-driven language models becomes more difficult (Wilcox et al., 2022; Van Schijndel and Linzen, 2018; Futrell et al., 2019), leading scholars to ask whether these LLMs genuinely capture human linguistic nuances or merely reproduce them at a surface level (e.g., Duan et al., 2024, 2025; Wang et al., 2024). This research examines the congruencies and disparities between LLMs and humans, focusing on their intuitive knowledge of grammar. In three preregistered experiments, ChatGPT was asked to provide grammaticality judgments in different formats for over two thousand sentences with diverse structural configurations. We compared ChatGPT’s judgments with judgments from laypeople and linguists to map out any parallels or deviations.

The ascent of LLMs has been remarkable: they have displayed adeptness in a range of linguistic tasks, including resolving ambiguities (Ortega-Martín, 2023), answering questions (Brown et al., 2020), and translating across languages (Jiao et al., 2023). Interestingly, although these models were not explicitly designed around the hierarchical syntactic structure of human languages, they have shown the capability to handle complex filler-gap dependencies and to develop incremental syntactic interpretations (Wilcox et al., 2022; Van Schijndel and Linzen, 2018; Futrell et al., 2019). But the overarching question lingers: Do LLMs genuinely mirror humans in terms of linguistic cognition? Chomsky, Roberts, and Watumull (2023) have been vocal about the inherent discrepancies between how LLMs and humans perceive and communicate. Yet other scholars, such as Piantadosi (2024), hold a contrasting view, positioning LLMs as genuine reflections of human linguistic cognition.

Empirical studies have emerged as a crucial tool for addressing this debate. Pioneering work by Binz and Schulz (2023) subjected GPT-3 to a battery of psychological tests originally crafted to probe facets of human thought, from decision-making to reasoning. The outcomes were intriguing, with GPT-3 not just mirroring but at times outperforming human benchmarks in specific scenarios. On a similar trajectory, Kosinski (2024) assessed the capacity of LLMs to understand and respond to false-belief scenarios, which are often used to gauge theory-of-mind abilities in humans. Here, the responses from ChatGPT echoed the patterns seen in school-age children, though subsequent research by Brunet-Gouet and colleagues (2023) raised concerns about the consistency of such responses. Delving into ChatGPT’s language processing abilities, Cai et al. (2024) subjected ChatGPT to a series of psycholinguistic experiments and showed an impressive alignment between the models and humans in language use in a majority of the tests, ranging from sounds to syntax, all the way to dialog. Zhou et al. (2025) examined the difference in internal neuronal activation in response to the grammatical and ungrammatical sentences of a minimal pair (so that the activation difference reflects the grammatical property on which the two sentences differ). They showed that activation differences were more similar for minimal pairs from the same grammatical category (e.g., binding) than for pairs from different categories (e.g., binding and agreement). These findings suggest that LLMs represent linguistic knowledge in a way similar to humans. However, it is noteworthy that ChatGPT can diverge from humans in language use, for example, in the preference for shorter words to convey less information (e.g., Mahowald et al., 2013).

When examining LLM-human similarities, it is crucial to assess the extent to which ChatGPT’s representations of linguistic knowledge align with those of humans. Contemporary linguistic theories often distinguish between the inherent mental systems that enable language comprehension and production and the actual use of language, as illustrated by distinctions such as “Langue vs. Parole” (Saussure, 1916) and “Competence vs. Performance” (Chomsky, 1965). Grammaticality judgment is a central method for assessing linguistic competence. Chomsky (1986) highlighted that evidence for linguistic theorizing largely depends on “the judgments of native speakers”. While there are other sources of evidence, such as speech corpora or acquisition sequences (Devitt, 2006), formal linguists typically favor native speakers’ grammaticality intuitions. The prevailing assumption is that our knowledge of language comprises abstract rules and principles, which give rise to intuitions about sentence well-formedness (Graves et al., 1973; Chomsky, 1980; Fodor, 1981). However, relying on grammaticality judgments to frame linguistic theories is not without dispute. Hill (1961) noted that such judgments often disregard acoustic properties like intonation, potentially compromising informant reliability. The dual role of formal linguists, as both theory developers and data providers, might also compromise objectivity (Lyons, 1968; Ferreira, 2005). Advancements have nevertheless been made, including better practices for eliciting judgment data (Schütze, 1996), improving the reliability and validity of grammaticality judgments. Our study does not evaluate these methods; rather, we adopt grammaticality judgment as a practical tool for studying LLM knowledge representation. This decision rests on several considerations. First, a more objective or direct measure of linguistic competence is not available for a comparative study between LLMs and human participants. Furthermore, the explanatory and predictive power of generative linguistics attests to the value of metalinguistic judgments (Riemer, 2009). Empirical studies also affirm the reliability of controlled grammaticality judgment tasks (Langsford et al., 2018).

Formal surveys assessing sentence grammaticality often take the form of acceptability judgment tasks. Rather than asking participants to determine whether a sentence is “grammatical”, researchers frequently ask whether sentences under consideration are “acceptable” (Sprouse et al., 2013), “sound good” (Davies and Kaplan, 1998; van der Lely et al., 2011), or are “possible” (Mandell, 1999). Chomsky (1965) elucidated the conceptual differences between grammaticality and acceptability: “grammaticality” pertains to linguistic competence, whereas “acceptability” concerns the actual use of language. Acceptability is influenced not only by grammaticality but also by factors such as “memory limitations, intonational patterns, and stylistic considerations” (p. 11). As an illustration, slang, though grammatically correct, might be unacceptable in formal settings. In such instances, the demarcation between grammaticality and acceptability, based on social norms, is evident. In other scenarios, distinguishing between the two becomes ambiguous. For instance, multiple center-embedding sentences like “The rat, the cat, the dog chased, killed, ate the malt” are generally perceived as complex or challenging to interpret. Yet the debate persists as to whether they should be categorized as “ungrammatical and unacceptable” or “grammatical but unacceptable” (Chomsky, 1965; Bever, 1968). A consensus among researchers is that the generative concept of grammaticality is a subconscious mental representation. Therefore, in their pursuit of formulating a theory of mental grammar, they operationalize grammaticality as acceptability (Riemer, 2009). Echoing Schütze (1996), we treat grammaticality judgment and acceptability judgment synonymously, both gauging informants’ intuition of sentence “goodness” from a grammatical perspective, as opposed to alignment with socio-cultural norms.

Sprouse et al. (2013) surveyed 936 participants to obtain their judgments on the grammatical acceptability of various English sentences across three tasks. These sentences, exemplifying 148 pairwise syntactic phenomena, were sampled from the journal Linguistic Inquiry, with eight sentences representing each phenomenon. Though linguists had previously classified these sentences as grammatical, ungrammatical, or marginally grammatical, the aim of Sprouse and colleagues was to determine the degree of convergence between laypeople’s formal judgments and those of linguistic experts. They recruited native English speakers online for three distinct judgment tasks. In the Magnitude Estimation Task (ME task), participants were given a reference sentence with a pre-assigned acceptability rating. They were then asked to rate target sentences using multiples of the reference rating. In the 7-point Likert Scale Judgment Task (LS task), participants rated the grammatical acceptability of target sentences on a 7-point scale from least to most acceptable. Finally, in the two-alternative forced-choice task (FC task), participants were shown a pair of sentences (one deemed more grammatical than the other by linguists) and were asked to select the more grammatically acceptable option.

The collected data was analyzed using four statistical tests for each pairwise phenomenon. In particular, Sprouse et al. (2013) evaluated if laypeople rated grammatical sentences more favorably than their ungrammatical counterparts, as predicted by linguists. After summarizing the results of all 148 phenomena, the team calculated the convergence rate between expert informal ratings and laypeople’s formal ratings. They identified a 95% convergence rate, implying that both linguists and laypeople generally agreed on sentence grammaticality 95% of the time. As one of the pioneering large-scale surveys on the influence of research paradigms on grammaticality judgment, Sprouse et al. (2013) not only affirmed the legitimacy of expert ratings but also endorsed the three judgment tasks, all later corroborated as reliable grammatical knowledge measures by subsequent studies (e.g., Langsford et al., 2018).

The crux of our study revolves around the representation of grammatical well-formedness in LLMs like ChatGPT. Learners who utilize LLMs for writing assistance often assume that these models possess expert-level grammatical knowledge of the target languages (Wu et al., 2023). Similarly, researchers conducting experiments initially designed for humans (as seen in Cai et al., 2024; Binz and Schulz, 2023) expect LLMs to interpret written instructions similarly to native speakers. Yet, there is limited research into the grammatical intuitions of LLMs like ChatGPT, especially in comparison to the judgments of both linguists and laypeople across a broad range of grammatical phenomena. In this paper, we present a comprehensive exploration of ChatGPT’s grammatical intuition. Using the acceptability judgment tasks from Sprouse et al. (2013), ChatGPT evaluated the grammaticality of 2355 English sentences across three preregistered experiments (https://osf.io/t5nes/?view_only=07c7590306624eb7a6510d5c69e26c02). Its judgment patterns were juxtaposed against those of both laypeople and linguists. Our findings indicate a substantial agreement between ChatGPT and humans regarding grammatical intuition, though certain distinctions were also evident.

Experiment 1

In this experiment, we presented ChatGPT with a reference sentence that had a pre-assigned acceptability rating of 100. We then asked ChatGPT to assign a rating, in multiples of this reference rating, to target sentences. This approach is a replication of the ME task from Sprouse et al. (2013), but with two key modifications. First, rather than involving human participants, we sourced judgment data directly from ChatGPT. Second, our data collection adopted a “one trial per run” procedure, meaning each interaction session (or run) with ChatGPT encompassed only the instructions and a single experimental sentence (together with several practice sentences; see below). This procedure was chosen to mitigate any influence previous trials might have on ChatGPT’s subsequent judgments. We merged the human data from the ME task (available at https://www.jonsprouse.com/) with ChatGPT’s judgment data, subsequently examining both convergences and divergences using a variety of statistical tests.

Method

Experimental items were adopted from the stimuli of Sprouse et al. (2013). These consisted of 2355 English sentences representing 148 pairwise syntactic phenomena, sampled from the journal Linguistic Inquiry. Each phenomenon pair featured grammatically correct sentences and their less grammatical counterparts, which could be either outright ungrammatical or marginally so. Table 1 provides examples of these experimental items.

Table 1 Examples of experimental items from Sprouse et al. (2013) that were used in all three experiments reported in this paper.

We prompted ChatGPT to rate the grammatical acceptability of these items relative to that of a benchmark sentence. Following Sprouse, Wagers, and Phillips (2013), we used the sentence “Who said my brother was kept tabs on by the FBI?” as the reference sentence, assigning it an acceptability rating of 100. Adhering to our data collection procedure preregistered with the Open Science Framework (https://osf.io/t5nes/?view_only=07c7590306624eb7a6510d5c69e26c02), we procured responses from the Feb 13 version of ChatGPT. In each run or session, a Python script mimicked human interaction with ChatGPT, prompting it to act as a linguist and evaluate the grammatical acceptability of sentences against the reference. Before rating experimental items, ChatGPT was exposed to six practice sentences, which served as anchors for various degrees of grammaticality and corresponded to the anchoring items in Sprouse et al. (2013). ChatGPT’s responses were limited to numerical rating scores, without any supplementary comments or explanations, and were logged for analysis.

Our data collection approach emphasized the “one trial per run” paradigm, in which each ChatGPT interaction contained only a single experimental trial. Unlike the original procedure, where each participant was given a 50-item survey, this method minimized potential biases from preceding trials on ChatGPT’s immediate judgment. It also circumvented an issue observed in prior projects, where ChatGPT would occasionally lose track of the instructions midway. Additionally, the shorter sessions characteristic of the “one trial per run” design were less vulnerable to server or connectivity problems. In total, the experiment comprised 2368 items in line with Sprouse et al. (2013). An example of one run of an experimental trial is shown in Fig. 1. We conducted 50 experimental runs for each item.

Fig. 1: Example of a run of an experimental trial.

Note. In this trial, sentences 1 to 6 are practice sentences, while sentence 7 is the experimental item.
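To make the “one trial per run” procedure concrete, the sketch below shows how one such session could be scripted in R. It assumes a hypothetical helper `ask_chatgpt()` that wraps the chat API and returns the model’s text reply; the prompt wording is illustrative rather than the exact preregistered prompt.

```r
# Minimal sketch of one "one trial per run" ME session (illustrative only).
# ask_chatgpt() is a hypothetical wrapper around the chat API; the exact
# prompt wording and anchoring items follow the preregistered materials.
library(stringr)

run_me_trial <- function(target_sentence, practice_sentences, ask_chatgpt) {
  sentences <- c(practice_sentences, target_sentence)
  prompt <- paste0(
    "You are a linguist rating the grammatical acceptability of sentences. ",
    "The reference sentence 'Who said my brother was kept tabs on by the FBI?' ",
    "has an acceptability rating of 100. Rate each sentence below relative to ",
    "this reference, replying with numbers only.\n",
    paste(seq_along(sentences), sentences, sep = ". ", collapse = "\n")
  )
  reply   <- ask_chatgpt(prompt)                       # one fresh session per trial
  ratings <- as.numeric(str_extract_all(reply, "-?\\d+\\.?\\d*")[[1]])
  tail(ratings, 1)                                     # rating of the experimental item
}
```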

While the LLM judgment data were collected by our research team, we adopted the human judgment data published by Sprouse et al. (2013) as a proxy for human knowledge of sentence grammaticality. Their data were gathered from 304 native English speakers (312 were recruited, 8 of whom were removed by Sprouse et al. according to their exclusion criteria), who were recruited via Amazon Mechanical Turk (AMT) and performed the corresponding ME task online. Given the large number of experimental items, each human participant was assigned only a subset of 100 experimental items, covering 50 pairwise syntactic phenomena in both grammatical and ungrammatical conditions. There were 8 lists (versions) of the experimental items, and each participant was randomly assigned to one of them. Each list started with six practice items anchoring a range of grammaticality from acceptable to unacceptable; these anchoring items were identical across lists.

We conducted two sets of statistical analyses to address two research questions. The first concerns the degree to which ChatGPT demonstrates grammatical intuition comparable to that of human participants who are not necessarily linguists. To address this question, we integrated the data sourced from ChatGPT with the human data. Following Sprouse et al. (2013), ratings were standardized within each participant using a z-score transformation. By-item mean ratings were then calculated for each experimental item from both ChatGPT and human responses. A correlation analysis was conducted on these by-item means, with the coefficient indicating the degree of agreement between ChatGPT’s grammatical intuition and that of the human participants. According to Cohen (1988, 1992), a correlation coefficient of 0.5 or higher indicates a strong correlation.
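A minimal sketch of this pipeline is given below. It assumes a long-format data frame `judgments` with hypothetical columns `participant` (rater or ChatGPT run ID), `source` (“human” or “ChatGPT”), `item`, and `rating`; the actual analysis scripts are available in the project’s OSF repository.

```r
# Sketch of the first analysis: within-participant z-scoring, by-item means,
# and the human-ChatGPT correlation. Column names are assumptions.
library(dplyr)

by_item <- judgments %>%
  group_by(participant) %>%
  mutate(z_rating = as.numeric(scale(rating))) %>%   # z-score within each rater/run
  ungroup() %>%
  group_by(item, source) %>%
  summarise(mean_z = mean(z_rating), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = source, values_from = mean_z)

# Agreement between ChatGPT's and humans' by-item means
cor.test(by_item$human, by_item$ChatGPT)
```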

However, it is important to note that a strong correlation does not imply perfect equivalence. To determine whether ChatGPT’s grammatical knowledge differs from that of humans, we devised a Bayesian linear mixed-effects model using the R package brms (Bürkner, 2017). In this model, the acceptability rating score is a function of grammaticality (grammatical vs. ungrammatical), participant type (human vs. ChatGPT), and their interaction. Predictors were dummy-coded, with ChatGPT’s judgment of grammatical sentences as the baseline. The model incorporated random-effects structures for by-item intercepts and slopes, as illustrated below:

$${\rm{score}} \sim {\rm{grammaticality}} \ast {\rm{participant\; type}}+(1+{\rm{grammaticality}} \ast {\rm{participant\; type}} \mid {\rm{item}})$$

For this model, a main effect of grammaticality would be expected, given that grammatical sentences should generally have a higher acceptability rating than ungrammatical ones. If a significant main effect for participant type or an interaction effect emerges, this would suggest that ChatGPT and human participants have different grammatical knowledge. Conversely, the absence of such effects would imply comparable grammatical competence between ChatGPT and humans.
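A minimal brms sketch of this model is shown below. The sampler settings are illustrative defaults rather than necessarily those used in the study, and `dat` is an assumed data frame of z-scored ratings with dummy-coded predictors.

```r
# Sketch of the Bayesian mixed-effects model for Exp1, assuming `dat` contains
# z-scored ratings and dummy-coded predictors (baseline: ChatGPT, grammatical).
library(brms)

fit_exp1 <- brm(
  score ~ grammaticality * participant_type +
    (1 + grammaticality * participant_type | item),
  data   = dat,
  family = gaussian(),
  chains = 4, cores = 4, iter = 4000,   # illustrative sampler settings
  seed   = 1
)
summary(fit_exp1)   # inspect the 95% credible intervals for each effect
```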

Our second objective was to determine the extent to which ChatGPT’s grammatical knowledge aligns with that of expert linguists. To this end, we computed the convergence rate between ChatGPT’s judgments and experts’ assessments, employing the same techniques used in Sprouse et al. (2013). For each pairwise phenomenon, five distinct analyses were performed on ChatGPT’s rating data to ascertain whether grammatical sentences were rated higher than their ungrammatical counterparts, as judged by linguists. The outcomes of these analyses for all 148 pairwise phenomena were then summarized. The percentage of phenomena in which grammatical sentences achieved higher ratings than ungrammatical ones was treated as the convergence rate between expert assessments and ChatGPT’s evaluations. The five analyses for each pairwise phenomenon consisted of: (1) descriptive directionality, (2) a one-tailed t-test, (3) a two-tailed t-test, (4) a mixed-effects model, and (5) a Bayes factor analysis.

In the descriptive directionality analysis, the average rating scores of the grammatical sentences were compared with those of the ungrammatical ones. If the average for the grammatical sentences surpassed that for the ungrammatical ones in a particular pairwise phenomenon, this was interpreted as a convergence between ChatGPT and linguists in their grammatical judgments for that phenomenon. Both the one-tailed and two-tailed t-tests examined the statistical significance of the difference in means between grammatical and ungrammatical sentences; a conclusion of convergence between ChatGPT and linguists for a given phenomenon was made only if the average rating for grammatical sentences significantly exceeded that for ungrammatical ones. The mixed-effects models were constructed using the R package lme4 (Bates et al., 2020), modeling the rating score as a function of grammaticality with items treated as random effects. The Bayes factor analysis utilized a Bayesian version of the t-test (Rouder et al., 2009) implemented in the R package BayesFactor (Morey et al., 2022). Data and scripts for these analyses in Experiments 1, 2, and 3 are accessible via the Open Science Framework (https://osf.io/crftu/?view_only=c8b338fba2504285bf271849af7863ae).
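The sketch below illustrates how these five checks could be run for a single pairwise phenomenon, assuming a data frame `phen` of ChatGPT’s z-scored ratings with columns `rating`, `grammatical`, and `item`; details such as the Welch correction are our assumptions rather than documented choices.

```r
# Sketch of the per-phenomenon convergence checks on ChatGPT's ratings.
library(lme4)
library(BayesFactor)

g  <- phen$rating[phen$grammatical == "yes"]
ug <- phen$rating[phen$grammatical == "no"]

# (1) Descriptive directionality
mean(g) > mean(ug)

# (2) and (3) One- and two-tailed t-tests (Welch by default)
t.test(g, ug, alternative = "greater")
t.test(g, ug, alternative = "two.sided")

# (4) Mixed-effects model with by-item random intercepts
summary(lmer(rating ~ grammatical + (1 | item), data = phen))

# (5) Bayesian t-test (Rouder et al., 2009)
ttestBF(x = g, y = ug)
```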

Results

We observed a robust correlation between the by-item rating scores of ChatGPT and humans (r = 0.69, p < 0.001). As illustrated in Fig. 2, sentences deemed more grammatical by humans similarly received higher acceptability ratings from ChatGPT, and vice versa.

Fig. 2: Correlation of acceptability ratings between human participants and ChatGPT in Exp1.

Note. Each point in the figure represents the mean rating score of a sentence averaged across multiple human participants or across multiple independent trials of ChatGPT judgment.

To discern whether ChatGPT’s ratings could be differentiated from those of human participants, we constructed a Bayesian linear mixed-effects model in which the acceptability ratings were modeled as a function of grammaticality, participant type, and their interaction. The findings showed a pronounced main effect of grammaticality: both ChatGPT and human participants rated ungrammatical sentences lower than their grammatical counterparts (see Fig. 3 and Table 2). Interestingly, an interaction between participant type and grammaticality also emerged. For grammatical sentences, human participants gave higher rating scores (0.07, CI = [0.04, 0.10]) than ChatGPT; conversely, for ungrammatical sentences, humans gave lower acceptability ratings (−0.15, CI = [−0.20, −0.10]) than ChatGPT.

Fig. 3: Comparison of average rating scores across participant types and grammaticality manipulation in Exp1.

Note. Error bars indicate standard errors.

Table 2 Summary of outputs from the Bayesian linear mixed-effects model in Exp1.

Regarding the congruence between ChatGPT’s ratings and linguists’ judgments, 139 out of 148 pairwise phenomena showed the same direction. That is, for 139 of the 148 paired sets of sentences crafted by linguists, ChatGPT rated the grammatical sentences as more acceptable than their ungrammatical counterparts when assessed solely by mean scores. The statistical significance of these mean differences was then evaluated using the methodologies outlined in “Method”, with the results summarized below (Table 3).

Table 3 Results of statistical tests assessing the convergence between ChatGPT and linguists in Exp1, using criteria from Sprouse et al. (2013).

The convergence rate estimates varied with the test applied. Both the classic null-hypothesis significance t-tests and the Bayesian t-test (Rouder et al., 2009) indicated higher convergence rates, ranging from 89% (131/148) to 91% (134/148). In contrast, the linear mixed-effects models (LME) yielded a lower convergence estimate of ~73% (108/148). Notably, while differences in mean ratings between grammatical and ungrammatical sentences were evident in ChatGPT’s data, not all of these differences were statistically significant. In certain instances, even when grammatical sentences held a higher average rating than ungrammatical ones, the difference lacked statistical significance and therefore could not be counted as a case where ChatGPT’s judgment aligned with that of the linguists. These discrepancies across statistical tests are not unique to ChatGPT’s dataset: as shown in Table 4, the convergence rate between laypeople and linguists also ranged from 86% (127/148) to 92% (136/148), contingent on the applied test.

Table 4 Results of various statistical tests examining the convergence between laypeople and linguists, based on a re-analysis of human data from the ME task in Sprouse et al. (2013).

Discussion

In this experiment, we assessed the degree to which ChatGPT’s grammatical intuition mirrors that of humans in the ME task. The outcomes speak directly to the research questions posed in “Method”. First, a pronounced correlation emerged between ChatGPT’s acceptability ratings and those of human participants who were not necessarily linguistic experts. This correlation suggests that ChatGPT’s capacity to discern grammatical acceptability resonates closely with judgments from human participants, lending weight to the idea that ChatGPT, in spite of its AI origins, has linguistic intuitions akin to those of human language users.

The Bayesian linear mixed-effects model gave us a deeper understanding of the congruencies and disparities in the grammatical knowledge of ChatGPT and human participants, allowing us to ascertain whether the two could be distinguished by their judgment patterns. Both groups consistently rated grammatical sentences as more acceptable than their ungrammatical counterparts, yielding a main effect of grammaticality. An interaction between participant type and grammaticality revealed subtle discrepancies in their acceptability ratings: humans tended to rate grammatical sentences higher than ChatGPT, while ChatGPT gave ungrammatical sentences higher ratings than humans did. These outcomes suggest that, compared with humans, ChatGPT was more conservative in its ratings in the ME task.

The convergence analysis juxtaposing ChatGPT and linguistic experts sheds light on the model’s agreement with established linguistic judgments. Depending on the statistical method applied, the estimated convergence rate for the ME task ranged between 73% and 91%, implying that ChatGPT’s grammatical determinations align substantially with expert linguistic judgments. The ChatGPT-linguist convergence rate also exhibited a broader range than the laypeople-linguist estimates (86% to 92%). However, given the unknown distribution of the convergence rate, the statistical significance of this difference remains ambiguous. This topic is elaborated in the General Discussion, where the estimates from all three experiments are considered together.

In our first experiment, ChatGPT’s grammatical intuition was evaluated using the ME task. As Langsford et al. (2018) noted, ME scores can be influenced by individual differences in response style, so the between-participant reliability of the ME task is notably lower than its within-participant reliability. This discrepancy is not observed in other grammaticality measures such as the Likert scale and forced-choice tasks, where potential variation in response styles is minimized. To broaden the picture, Experiments 2 and 3 therefore used the Likert scale task and the forced-choice task to compare ChatGPT’s grammatical intuitions with those of both linguistic experts and laypeople.

Experiment 2

In this experiment, we further explored ChatGPT’s grammatical intuition by comparing it with laypeople and linguists using the Likert scale task. This task offers greater within-participant and between-participant reliability than the ME task. The selection of sentence stimuli and the data-analysis procedures remained consistent with those of the first experiment.

Method

Experimental items were sourced from the same inventory of stimuli as in Experiment 1. However, instead of prompting ChatGPT to rate the grammatical acceptability of the experimental items relative to a reference sentence, we directly asked ChatGPT to rate the grammatical acceptability of each sentence on a 7-point Likert scale (1 = “least acceptable”, 7 = “most acceptable”). We employed the same “one trial per run” data collection method as in Experiment 1. In total, 2368 experimental items were tested, with each item undergoing 50 experimental runs. As in the first experiment, we compared ChatGPT’s judgment data with the human judgment data from the Likert scale task in Sprouse et al. (2013). The human data were elicited from 304 native English speakers who did not participate in the ME task. The stimulus randomization and counterbalancing in the Likert scale task were almost identical to those of the ME task, except that the question prompted participants’ judgments on a 7-point Likert scale rather than relative to a reference sentence.

Results

Figure 4 illustrates a strong correlation between by-item rating scores of ChatGPT and humans (r = 0.72, p < 0.001). Figure 5 shows that when comparing average rating scores across participant types and grammaticality manipulations, grammatical sentences consistently received notably higher acceptability ratings than ungrammatical ones. Notably, there was minimal difference between human ratings and those provided by ChatGPT.

Fig. 4: Correlation of acceptability ratings between human participants and ChatGPT in Exp2.

Note. Each point in the figure represents the mean rating score of a sentence, averaged across multiple human participants or across multiple independent trials of ChatGPT judgment.

Fig. 5: Comparison of average rating scores across participant types and grammaticality manipulation in Exp2.

Note. Error bars represent standard errors.

This observed trend was further corroborated by a Bayesian linear mixed-effects model in which acceptability ratings were modeled as a function of grammaticality, participant type, and their interaction. The main effect of grammaticality was apparent (−1.12, CI = [−1.18, −1.05]), but neither the effect of participant type (0.00, CI = [−0.03, 0.04]) nor the interaction (−0.01, CI = [−0.06, 0.05]) was significant.

Examining the convergence rate between ChatGPT’s ratings and the judgments of linguists revealed a remarkable alignment. Specifically, for 141 out of the 148 pairwise phenomena, ChatGPT’s ratings paralleled the direction of linguists’ judgments. This indicates that in 141 out of 148 instances where linguists constructed pairs of grammatical and ungrammatical sentences, ChatGPT’s acceptability ratings for grammatical sentences surpassed those of ungrammatical sentences—this observation holds when considering mean ratings alone. To further evaluate the significance of these mean differences, we employed methodologies delineated in “Method,” and the results are summarized below.

Echoing the findings of Experiment 1, the choice of significance test affected the estimated convergence rates. Classic null-hypothesis significance t-tests and Bayesian t-tests both indicated a higher convergence rate, ranging from 94% (139/148) to 95% (140/148). In contrast, the linear mixed-effects models suggested a more conservative estimate of ~75% (111/148). For comparison, the estimated convergence rate between laypeople and linguists varied between 87% (129/148) and 93% (137/148), contingent on the chosen statistical test.

Discussion

In this experiment, we further probed ChatGPT’s ability to evaluate the grammatical acceptability of English sentences using a seven-point Likert scale task. The results once again revealed a robust correlation between the acceptability ratings of ChatGPT and human participants, indicating strong agreement in their evaluations of the sentences. Intriguingly, our statistical analysis found no significant effect of participant type (human vs. ChatGPT) and no interaction between participant type and grammaticality. This implies that, at least in the Likert scale task, ChatGPT’s grammatical judgments align closely with those of human participants, underscoring the model’s capacity to discern subtle differences in sentence acceptability in parallel with human linguistic judgments.

Beyond comparing ChatGPT with human evaluations, we examined its alignment with the judgments of linguistic experts. Across the 148 pairwise linguistic phenomena, the convergence rate indicated considerable alignment between ChatGPT’s ratings and those of linguists, ranging from 75% to 95%. This suggests that, in 75% to 95% of the phenomena, ChatGPT rated grammatical sentences as more acceptable than their ungrammatical counterparts, echoing the intuitions of linguistic professionals. While the range of convergence estimates across statistical methods is noteworthy, it likely stems from differences in the sensitivity and assumptions of those methods. Regardless, the consistently high convergence rate across tests underscores that ChatGPT’s grammatical intuition is broadly in line with expert judgments.

In our earlier experiments, we evaluated ChatGPT’s grammatical knowledge using two distinct rating tasks: one asked ChatGPT to evaluate the acceptability of target sentences against a reference sentence, and the other used a 7-point Likert scale. While rating tasks give participants the latitude to distinguish varying degrees of grammaticality, they also raise questions about how consistently participants map numerical scores onto perceived acceptability. For instance, does a sentence scored five genuinely seem less acceptable than a sentence rated six later in the same experiment? Given this, it is important to supplement the rating tasks with a forced-choice task (FC task), a format that affords participants less interpretative flexibility and thereby offers greater within-participant reliability.

Experiment 3

In this experiment, we replicated the FC task in Sprouse et al. (2013) with ChatGPT serving as the participant. Instead of presenting a sentence for an acceptability rating, we prompted ChatGPT to select the more grammatically acceptable sentence from a pair. We then compared ChatGPT’s choices to those made by humans to assess similarities in their representation of grammatical knowledge.

Method

The experimental items were sourced from the same inventory of pairwise phenomena as used in Experiments 1 and 2. However, in this experiment, rather than presenting individual sentences separately from their paired counterparts, we displayed one grammatically correct sentence and its less grammatical counterpart as a vertically arranged pair. ChatGPT was prompted to indicate which of the two sentences in each pair was more acceptable (Fig. 6). To counterbalance the order of grammaticality, for each phenomenon (such as Pairwise Syntactic Phenomenon 1 in Table 1), the grammatical sentence appeared above its less grammatical counterpart in four of the vertical pairs; in the other four pairs, the less grammatical sentence was positioned above its grammatical mate. In total, Experiment 3 comprised 1184 trials; an example trial is shown in Fig. 6.

Fig. 6: Example of a trial in Experiment 3.

In a manner consistent with Experiment 1, we adhered to the preregistered data collection procedure (https://osf.io/t5nes/?view_only=07c7590306624eb7a6510d5c69e26c02) to obtain responses from ChatGPT (Feb 13 version). For each run (or experimental session), a Python script was used to simulate a human user interacting with ChatGPT. We instructed ChatGPT to assume the role of a linguist and choose the more grammatically acceptable sentence from each pair. Each experimental session presented only one experimental item, and we executed 50 runs for each item.

Although we instructed ChatGPT to reply with either “sentence 1” or “sentence 2” exclusively, ChatGPT occasionally deviated from this guideline. In some instances, ChatGPT either judged both sentences to be equally grammatical or provided extraneous information. In line with the preregistered data exclusion criteria, we excluded all responses not adhering to the instructions. The remaining responses were coded as “1” if ChatGPT correctly identified the more acceptable sentence in the pair, and “0” otherwise (Tables 5–7).
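A sketch of this coding step is shown below, assuming a data frame `raw` with hypothetical columns `reply` (ChatGPT’s text response) and `gram_position` (the position of the grammatical sentence in the pair).

```r
# Sketch of Exp3 response coding: keep only replies that exactly name a
# sentence, then score a choice as correct if it matches the grammatical one.
library(dplyr)

coded <- raw %>%
  mutate(reply_clean = trimws(tolower(reply)),
         choice = case_when(
           reply_clean == "sentence 1" ~ 1L,
           reply_clean == "sentence 2" ~ 2L,
           TRUE                        ~ NA_integer_   # deviating replies -> excluded
         )) %>%
  filter(!is.na(choice)) %>%
  mutate(correct = as.integer(choice == gram_position))
```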

Table 5 Summary of outputs from the Bayesian linear mixed-effects model in Exp2.
Table 6 Results of various statistical tests assessing the convergence between ChatGPT and linguists in Exp2, using criteria from Sprouse et al. (2013).
Table 7 Results of various statistical tests examining the convergence between laypeople and linguists, based on a re-analysis of human data from the LS task in Sprouse et al. (2013).

The first suite of statistical analyses aimed to discern the degree to which ChatGPT’s grammatical intuition mirrored that of laypeople. We merged the data obtained from ChatGPT with the human data from the FC task in Sprouse et al. (2013). The human dataset consisted of responses from 307 native English speakers who performed only the FC task, without taking part in the first two tasks. Responses in the human dataset were re-coded as “1” or “0”, akin to the ChatGPT data. For the integrated dataset, we modeled the logit (log-odds) of selecting the correct answer as a function of participant type (human vs. ChatGPT) using a Bayesian mixed-effects model. With ChatGPT as the baseline, the predictor was dummy-coded, and the random-effects structure incorporated by-item intercepts and slopes, as illustrated in the following formula:

$${\rm{accuracy}} \sim {\rm{participant\; type}}+(1+{\rm{participant\; type}} \mid {\rm{item}})$$
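A sketch of this model in brms is given below, assuming a combined data frame `fc` with columns `accuracy`, `participant_type`, and `item`; a Bernoulli family with a logit link yields coefficients on the log-odds scale. The sampler settings are illustrative.

```r
# Sketch of the Exp3 model on the combined forced-choice data.
library(brms)

fit_exp3 <- brm(
  accuracy ~ participant_type + (1 + participant_type | item),
  data   = fc,
  family = bernoulli(link = "logit"),
  chains = 4, cores = 4, iter = 4000,   # illustrative sampler settings
  seed   = 1
)
summary(fit_exp3)
```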

Furthermore, we estimated the probability of choosing the grammatical sentence for each experimental item for both ChatGPT and human participants, subsequently conducting a correlational analysis between the two sets.

Echoing the methodologies of Experiments 1 and 2, we examined the extent of alignment between ChatGPT’s grammatical knowledge and that of the linguists. For each pairwise phenomenon in the ChatGPT dataset, we employed five distinct analyses, summarizing the instances where grammatical sentences were preferred over their less grammatical counterparts. The five analyses for each pairwise phenomenon were: (1) descriptive directionality, (2) a one-tailed sign test, (3) a two-tailed sign test, (4) a mixed-effects model, and (5) a Bayes factor analysis. These assessments mirrored those of Experiments 1 and 2 but were tailored for binary data. For instance, whereas Experiments 1 and 2 assessed descriptive directionality by comparing the mean rating scores of grammatical and ungrammatical sentences, this experiment did so by comparing the proportion of selections of grammatical versus less grammatical sentences. The mixed-effects models were formulated as logistic regressions using the R package lme4; we calculated the logit of selecting grammatical sentences via an intercept-only model incorporating by-item random intercepts. Lastly, the Bayes factor analyses, the Bayesian counterparts of the sign tests, were executed using the R package BayesFactor (Morey et al., 2022).
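The sketch below illustrates these binary-data checks for one pairwise phenomenon, assuming a data frame `phen_fc` with columns `correct` (1 = grammatical sentence preferred) and `item`, pooled over ChatGPT runs. The use of `proportionBF()` as the Bayesian counterpart of the sign test is our assumption; the original analysis may have used a different Bayes factor test.

```r
# Sketch of the per-phenomenon convergence checks on the binary FC data.
library(lme4)
library(BayesFactor)

k <- sum(phen_fc$correct)   # choices of the grammatical sentence
n <- nrow(phen_fc)          # total valid choices

# (1) Descriptive directionality
k / n > 0.5

# (2) and (3) One- and two-tailed sign (exact binomial) tests
binom.test(k, n, p = 0.5, alternative = "greater")
binom.test(k, n, p = 0.5, alternative = "two.sided")

# (4) Intercept-only logistic mixed model with by-item random intercepts
summary(glmer(correct ~ 1 + (1 | item), data = phen_fc, family = binomial))

# (5) Bayesian test of the proportion against chance (assumed counterpart of the sign test)
proportionBF(y = k, N = n, p = 0.5)
```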

Results

Out of the 59,200 observations recorded for ChatGPT, 709 (1%) were excluded because they did not conform to the instructions, leaving 58,491 data points (99%) for subsequent analysis. As illustrated in Fig. 7, ChatGPT exhibited high accuracy in the FC task, comparable to human participants. The proportion of correct responses for human participants was 89%, slightly surpassing ChatGPT’s 88%. However, when the random effects of trials were taken into account, the mixed-effects model indicated that human participants (beta = −18.12, CI = [−21.89, −15.03]; see Table 8) were somewhat less accurate than ChatGPT.

Fig. 7: Comparison of the proportion of correct responses between ChatGPT and human participants across all trials in Experiment 3.

Note. Error bars represent confidence intervals.

Table 8 Summary of the outputs from the Bayesian linear mixed-effects model in Exp3.

We performed a correlation analysis between human participants and ChatGPT concerning the probability of choosing grammatical sentences in each trial. The results revealed a modest correlation between the two groups (r = 0.39, p < 0.001, see Fig. 8).

Fig. 8: Probability of selecting grammatical sentences in the FC task for each experimental item, comparing human participants and ChatGPT.

Note. Each point in the figure represents the probability of selecting the grammatical sentence for an experimental item, averaged across multiple human participants or across multiple independent trials of ChatGPT judgment. A diagonal reference line is added for ease of comparison: data points above the line indicate items where ChatGPT outperformed humans, and those below indicate items where humans had higher accuracy. Circled points refer to items where ChatGPT’s accuracy was much lower than that of humans; see Table 11 for examples.

In evaluating the alignment between ChatGPT’s choices and linguists’ judgments, we observed that for 133 of the 148 pairwise phenomena, ChatGPT’s judgments mirrored those from the linguists. This indicates that in 133 of the 148 sets of grammatical and ungrammatical sentence pairs designed by linguists, ChatGPT accurately deemed the grammatical sentences as more acceptable than their less grammatical counterparts. We further probed the significance of the preference for grammatical sentences in each pairwise phenomenon using the methodologies outlined in “Method”, and the outcomes are summarized below (Tables 9, 10).

Table 9 Results of various statistical tests assessing the convergence between ChatGPT and linguists in Exp3, using criteria from Sprouse et al. (2013).
Table 10 Results of various statistical tests examining the convergence between laypeople and linguists, based on a re-analysis of human data from the FC task in Sprouse et al. (2013).

The convergence rates inferred from different statistical methods were largely consistent. The mixed-effects logistic regression model indicated a convergence rate of 89% (131/148), while other methods suggested a rate of 88% (130/148). For comparison, the convergence rate between laypeople and linguists ranged from 91% (135/148) to 94% (139/148), contingent upon the applied statistical tests.

Discussion

In this experiment, we employed the FC task to assess how adeptly ChatGPT differentiates between grammatical sentences and their less-grammatical counterparts. While the raw percentages for correctly selecting the grammatical sentence across all trials were comparable between ChatGPT and human participants, the mixed model revealed that ChatGPT had a higher logit of selecting the correct answer than the human participants when accounting for the unique characteristics of experimental items. Figure 8 illustrates this: the diagonal line indicates identical probabilities for both human participants and ChatGPT in selecting the correct sentence. Data points above this diagonal indicate trials where ChatGPT was more accurate than human participants, and vice versa. Notably, more data points lie above the diagonal (457) than below it (168), signifying a higher accuracy rate for ChatGPT in many trials.

Interestingly, the correlation between the judgments of ChatGPT and human participants was noticeably weaker in the FC task than in the prior two tasks. This reduced correlation might be attributed to outliers where human participants vastly outperformed ChatGPT. As Fig. 8 shows, while both ChatGPT and human participants achieved high accuracy in a large number of trials (located in the upper right corner), there was a cluster of trials in which humans far outpaced ChatGPT. Such outliers also explain why the overall accuracy rate for humans across trials slightly surpassed that of ChatGPT, yet the trend inverted when random trial effects were accounted for. Table 11 provides examples of such outliers: in these trials, humans predominantly favored the grammatical sentences (with accuracy rates >90%), whereas ChatGPT overwhelmingly opted for the ungrammatical ones (>90%).

Table 11 A selection of trials in Experiment 3 for which human participants had a much higher accuracy rate than ChatGPT (90% vs 10%).

These outliers point to specific grammatical constructs that ChatGPT struggles to grasp. However, it is difficult to identify any overarching pattern or generalization in these constructs. For instance, while the first sentence pair in Table 11 involves the syntactic intricacies of the parenthetical element “the hell”, the last pair revolves around the use of a complementizer in conjunct clauses. Each outlier seems to spotlight a distinct grammatical phenomenon, unrelated to the others.

Regarding the alignment between ChatGPT’s responses and linguists’ judgments, different statistical methods yielded estimates of ~89%. This indicates that in 89% of the pairwise phenomena, ChatGPT’s assessments concurred with linguists’ judgments, whereas in the remaining 11%, ChatGPT did not reliably prefer the sentences that linguists deemed grammatical.

General discussion

Our study subjected ChatGPT to three grammaticality judgment tasks, seeking to determine the extent to which ChatGPT aligns with laypeople and linguists in their assessments of English sentence acceptability. In doing so, we contribute to the ongoing discourse surrounding the linguistic capabilities of cutting-edge large language models.

Alignment of grammatical knowledge between ChatGPT and laypeople

Regarding the alignment of grammatical knowledge between ChatGPT and laypeople, we found significant correlations between the two groups across all judgment tasks, though the strength of the correlation varied as a function of the task. Strong correlations were observed in the ME and LS tasks, echoing human-LLM similarities observed in other aspects of sentence processing, such as computing pragmatic implicatures (Qiu et al., 2023; Duan et al., 2025) and predicting semantic content (Cai et al., 2024; Duan et al., 2024; Wang et al., 2024). Although only a weak correlation was observed between humans and ChatGPT in the FC task, we believe that this discrepancy resulted from task-specific idiosyncrasies.

In the ME and LS tasks, sentences of varying degrees of grammaticality were presented in a random order. The correlation between ChatGPT and human participants thus reflected their agreement in the “rankings” assigned to sentences on the basis of their grammatical intuition. In the FC task, sentences were presented in pairs according to their membership in specific grammatical phenomena, and responses were coded as “correct” or “incorrect” based on the pre-assigned grammaticality of the sentences. It is important to note that the notion of a grammatical phenomenon and the categorical distinction between grammatical and ungrammatical sentences are predefined by linguists. Thus, the correlation between ChatGPT and human participants in the FC task reflects not so much their alignment in judging the grammaticality of a spectrum of structures as the degree to which their judgments fit these predefined categories.

Statistically speaking, for the ME and LS tasks, a high correlation between two groups implies that sentences receiving higher ratings in one group also receive higher ratings in the other, and vice versa. However, a general agreement in sentence ranking does not guarantee a high correlation in the FC task, because the FC task does not assess the grammaticality ranking across a range of sentences but rather the relative grammaticality within sentence pairs. Even if both sentences in a pair were deemed highly acceptable, participants in the FC task would still have to choose between the two. The discrepancy between ChatGPT and human participants in this choice process led to the lower correlation in the FC task compared with the ME and LS tasks. Table 11 in “Discussion” provides some examples to illustrate this point. Consider, for instance:

1a They knew and we saw that Mark would skip work.

1b They knew and we saw Mark would skip work.

It is likely that most readers would find the pair of sentences 1a and 1b generally more acceptable than the pair 2a and 2b:

2a Who the hell asked who out?

2b Who asked who the hell out?

The perceived naturalness of sentences 1a and 1b (or the perceived awkwardness of 2a and 2b) could stem from structural properties or from their frequency of use. However, our discussion centers primarily on the factors contributing to the correlation estimates across tasks, rather than on what influences the grammaticality of the stimuli. In the rating tasks (ME and LS), a high correlation between ChatGPT and laypeople is observed if both groups generally agree that sentences like 1a and 1b sound better than 2a and 2b. A high correlation in the FC task, however, requires that the two groups show the same preference between 1a and 1b, or between 2a and 2b, and in this study we did not observe such agreement between ChatGPT and human participants. While laypeople consistently preferred the first sentence over the second in both pairs, ChatGPT tended to show the opposite preference.

When shifting attention from correlation to equivalence, we found that although ChatGPT and laypeople correlated in the ME task, their response patterns were noticeably distinct. Ratings from ChatGPT were more conservative than those from human participants, in that grammatical sentences were rated lower by ChatGPT than by humans, while ungrammatical sentences showed the reverse pattern. Outside the ME task, the differences between human participants and ChatGPT were minimal: in the LS task, the effect of participant type on rating scores was not statistically significant, and in the FC task, the difference between ChatGPT and humans, although present, was almost negligible (89% vs. 88%).

We believe that the wider range of human rating scores in the ME task results from the task’s open-ended response scale and from differences in processing style between humans and ChatGPT. Compared with the LS and FC tasks, where the response options form a limited set, the rating score that a participant can assign to a stimulus in the ME task is potentially unlimited (as long as it is proportional to the reference sentence). This feature renders the ME task more susceptible to individual differences in response style. Langsford et al. (2018) surveyed a series of tasks commonly used for grammaticality judgments, including ME, LS, and FC, and found that the between-participant reliability of the ME task was notably lower than its within-participant reliability. In addition, differences in language processing styles between humans and ChatGPT may contribute to the narrower range of ChatGPT’s ratings in the ME task. Table 12 shows the top 10 sentences in the ME task for which the human and ChatGPT rating scores diverged to the greatest extent.

As can be seen from Table 12, half of the items exhibiting the greatest human-ChatGPT differences involved reflexive pronouns. In general, ChatGPT judged reflexives to be highly ungrammatical, even in cases that are grammatical to native English speakers, such as “Robert told Lucy when to humiliate herself.” In other cases where the use of a reflexive pronoun was only slightly ungrammatical to humans, such as “Patty told Mike when to insult herself”, ChatGPT deemed it utterly ungrammatical.

Table 12 Top 10 sentences with the greatest difference between human and ChatGPT ratings. Ratings were z-score transformed and rounded to 2 decimal places.

Reflexive pronouns have been one of the most studied topics in linguistic theory since the 1980s, when the principles of binding theory (Chomsky, 1981) stipulated the grammatical conditions on reflexive pronouns. As simplified by Adger (2003: 94), a reflexive pronoun must be coreferential with a noun phrase in a specific syntactic configuration known as “c-command”. However, cases that violate these proposed principles are not uncommon in English (Kuno, 1987; Keenan, 1988; Zribi-Hertz, 1989). For example, in the following excerpt from The Years by Virginia Woolf, “Suddenly he said aloud… Maggie looked at him. Did he mean herself and the baby?”, the reflexive pronoun “herself” is coreferential with “Maggie” rather than “he”, which clearly violates the proposal of Chomsky (1981). Zribi-Hertz (1989) pointed out that human language competence lies in speakers’ understanding of sentence grammaticality within a larger discourse context rather than in knowledge of grammar that is sentence-internal and context-independent. It is likely that in the ME task, when human participants saw sentences with reflexive pronouns, they tended to think of possible discourse contexts that could justify the use of the reflexives. By contrast, ChatGPT’s grammaticality judgments of the reflexive pronouns were based on the distribution of its training data; if particular uses of reflexive pronouns are underrepresented in the training data, the grammaticality ratings of these cases will be low. This is probably why human participants provided much higher grammaticality ratings for sentences with reflexive pronouns than ChatGPT did.

Alignment of grammatical knowledge between ChatGPT and linguists

In terms of the alignment in grammatical intuition between ChatGPT and linguists, the convergence estimates vary depending on the grammaticality judgment task and the analytical method employed. As can be seen from Table 13, the highest convergence estimates were based on directionality, which concerns only the difference in means between conditions (grammatical vs. ungrammatical) and not the significance of that difference. The linear mixed-effects models, by contrast, provided the most conservative estimates for the ME and LS tasks, whereas for the FC task the mixed-effects estimates were comparable to those from the other statistical tests. As a result, the estimated convergence rate between ChatGPT and linguists ranges from 73% to 95% (Table 13).

Table 13 A summary of convergence rates (in percentage) between ChatGPT and linguists across three grammaticality judgment tasks estimated by different analytical methods. In cells with slashes (/), the percentage on the left assumes that marginal results are non-significant; the percentage on the right assumes that marginal results are significant.

Sprouse et al. (2013) adopted the estimate from the linear mixed-effects model in the FC task as the point estimate of convergence. We agree with this decision for two reasons. First, research on the design sensitivity and statistical power of grammaticality judgment tasks has shown that, among the various experimental paradigms, the FC task is the most sensitive for detecting differences between two conditions and has the highest within- and between-participant reliability (Sprouse and Almeida, 2017; Langsford et al., 2018). Second, the linear mixed-effects model treats items as random effects, reducing the false-positive rate (Jaeger, 2008). Adopting the estimate from the linear mixed-effects model in the FC task, the convergence rate in grammatical intuition between ChatGPT and linguists is 89%. Given that the distribution of the convergence rate is unknown, determining whether a rate of 89% is notably high or low remains challenging. Nevertheless, considering that our re-analysis of the human data from Sprouse et al. (2013), following their original procedure with up-to-date statistical packages (scripts are available from the OSF repository of this project: https://osf.io/crftu/), yielded a 91% convergence rate between laypeople and linguists, we interpret the 89% ChatGPT-linguist rate as comparatively high.

This high convergence in grammaticality judgment between ChatGPT and linguists is consistent with the findings of Zhou et al. (2025), who recorded the activation patterns of LLMs for inputs from different syntactic categories. They found that when the syntactic category of the inputs was narrowly defined by linguists (such as “anaphor gender agreement” and “adjunct island”), the LLMs’ activation patterns remained similar across input sentences of the same syntactic category but differed across categories. Our research adds to Zhou et al. (2025) by suggesting that LLMs such as ChatGPT not only align with linguists in recognizing sentences of different syntactic categories but also judge the grammaticality of these sentences in the same direction as linguists. These findings have significant implications for natural language understanding, with ChatGPT demonstrating the potential to contribute to linguistic studies, particularly those that require judgment data.

However, as an anonymous reviewer correctly pointed out, all three experiments in this project tasked ChatGPT with judging sentence grammaticality in a relative and gradient fashion, which differs from the real-life practice of spotting grammatical anomalies among otherwise well-constructed sentences. Given that linguists can detect grammatical errors according to their theoretical orientations, it is an intriguing question whether large language models like ChatGPT can detect and correct grammatical errors the way linguists do. Addressing these questions in future research can help enhance the linguistic capabilities of large language models like ChatGPT, paving the way for more effective and nuanced natural language processing systems.

Conclusion

In conclusion, this research has undertaken a comprehensive investigation into the alignment of grammatical knowledge among ChatGPT, laypeople, and linguists, shedding light on the capabilities and limitations of AI-driven language models in approximating human linguistic intuitions. The findings indicate significant correlations between ChatGPT and both laypeople and linguists across the grammaticality judgment tasks, while also revealing nuanced differences in response patterns that depend on the specific task paradigm employed. This study thereby contributes to the ongoing discourse surrounding the linguistic capabilities of artificial intelligence and calls for further exploration of the evolving relationship between linguistic cognition in humans and in artificial intelligence.