Introduction

Language comprehension requires listeners to predict upcoming inputs based on previous knowledge and context1,2,3,4. Linguistic prediction can reduce computational load in the brain5, enabling listeners to instantaneously process highly dynamic speech flow (2–5 words per second)6,7. Previous research has primarily focused on predicting linguistic units at shorter timescales. Neuroimaging and electrophysiological findings have shown that phoneme prediction primarily engages the bilateral primary auditory cortices1,8,9, while word prediction involves a more distributed network, including the bilateral superior temporal gyrus (STG), left inferior parietal lobule (IPL), bilateral inferior frontal gyrus (IFG), and bilateral dorsolateral prefrontal cortex (dlPFC)1,9,10,11,12. Moreover, leveraging neural encoding models (e.g., general linear model, GLM) and various language models (e.g., recurrent neural networks, RNNs), recent studies have shown that the STG and PFC are largely involved in predicting part-of-speech (POS) tags1,12,13,14,15, indicating that grammatical structure can also be predicted at the word level (i.e., syntactic prediction).

However, natural language is not confined to smaller units; much of its complexity arises from larger units (e.g., sentences) that convey nuanced meanings and implicit messages16,17. Such units also enable individuals to navigate complicated situations18,19, such as interpreting social-emotional cues20 or inferring underlying communicative intentions21. Further, converging evidence has shown that the brain integrates past context across multiple timescales (i.e., the temporal receptive window, TRW)22,23,24, ranging from early sensory regions (e.g., the STG) operating on shorter timescales to higher-order areas (e.g., the PFC) on longer timescales. These findings raise the question of whether and how the brain implements the multilevel prediction of future linguistic units, particularly beyond phonemes and words.

Recent studies, though relatively limited, have examined how the brain predicts upcoming information over varying timescales. For instance, researchers found that the activity patterns in multiple brain regions shifted progressively during repeated movie viewing, following a posterior-to-anterior gradient across the cortex25. This finding suggests that the brain actively anticipates upcoming movie plots after prior exposure. Relatedly, a study leveraged large language model (LLM)-based methods to test whether neural encoding performance improves when incorporating a “forecast window”, providing evidence for an anticipation hierarchy during language comprehension26. Nonetheless, since timescales in the language system can be defined in different ways (e.g., a syntax-driven hierarchy27 or a semantics-based hierarchy28), it remains unclear which specific linguistic level(s) within the prediction hierarchy these studies capture. To bridge this gap, we focused on how the brain conducts semantic prediction of incoming words and sentences. We selected words and sentences because they are well-recognized linguistic levels in most language hierarchy frameworks22,27,28,29. Moreover, both serve as natural semantic units that can convey meaning independently, yet at different levels of complexity, thereby providing a framework to investigate a semantic prediction hierarchy during natural language comprehension. Together, in the present study, we investigated multilevel linguistic prediction by probing neural predictive representations of longer-timescale units such as sentences and shorter-timescale units such as words.

Additionally, it is crucial to understand how information is updated between levels within the prediction hierarchy. The neural representation of the prediction hierarchy remains poorly characterized; only a few studies have investigated this question, and they have primarily focused on lower levels1,8,29,30. A key debate emerging from these studies centers on how information is updated along the hierarchy. One perspective suggests that higher levels are updated continuously as inputs from lower levels unfold over time (i.e., the continuous updating hypothesis). For example, studies on auditory perception31 and narrative comprehension32 have shown that neural responses at higher levels increase gradually when new inputs are introduced. Moreover, computational models built on the continuous updating hypothesis can capture the neural dynamics during context construction and forgetting32. In contrast, another perspective suggests that higher-level updates occur only at the end of their preferred timescales, leading to abrupt rather than gradual changes in neural responses (i.e., the sparse updating hypothesis). For instance, evidence shows that neural activity in the precuneus changed sharply at event boundaries corresponding to its preferred timescales33. Similarly, a study using an RNN model demonstrated that the sparse updating model, but not the continuous model, identified the processing architecture in the human brain along the temporo-parietal axis30. Further, a study proposed that during discourse comprehension, the brain instantiates a single conscious representation of the input (e.g., a word) that remains stable unless perturbed by new inputs28. Achieving such representational stability requires cortical circuits to reach steady states sparsely across multiple intermediate levels. Based on these findings, we tested which hypothesis (continuous updating or sparse updating) better explains information updating from regions supporting word prediction to regions supporting sentence prediction.

To investigate the prediction hierarchy and examine the information-updating modes in the human brain, we combined natural language processing (NLP) and neural computational modeling approaches to analyze brain signals from individuals engaged in a narrative comprehension task, recorded using functional magnetic resonance imaging (fMRI). In this task, 31 participants listened to three stories presented either forward or backward. The forward condition intrinsically involves a linguistic prediction hierarchy27,29, whereas the backward condition serves as a control for acoustic features. Next, we aimed to quantify the predictive relationship between preceding context and upcoming linguistic units at both the word and sentence levels (see “Methods” section). While decoder-only transformer architectures (e.g., generative pre-trained transformer, GPT) are widely used for language prediction1,10, they typically predict linguistic units at shorter timescales (primarily words)1. Therefore, we applied a multiple ridge regression approach to derive predictive representations at both the word and sentence levels34,35. Further, using the group-based GLM (gGLM), we identified neural correlates associated with the predictive representations of upcoming linguistic units before their appearance (i.e., the neural pre-activation3). Finally, we applied the computational models to differentiate between the continuous and sparse updating hypotheses within the predictive coding (PC) framework. The PC framework posits that the brain processes inputs through a multilevel cascade, generating top-down prediction signals and bottom-up error signals that iteratively update the internal model36,37,38,39. This theoretical account is supported by a growing body of empirical evidence from both computational40,41,42 and neuroscience studies43,44. In the present work, we simplified the PC architecture to two levels, with words and sentences corresponding to the lower and higher levels, respectively. We implemented two variants, one based on the continuous updating hypothesis and the other on the sparse updating hypothesis, which we compared and evaluated by simulating fMRI responses at the word and sentence levels.

In line with prior findings, we first predicted that word prediction would primarily engage lower-order brain regions such as the STG10,11. Additionally, given recent advances showing that the default mode network (DMN) is largely recruited in the anticipation of future events (also known as prospective memory)25,45,46,47 and is especially engaged in narrative understanding at longer timescales22,24,47,48,49, we expected to observe neural representations of sentence prediction within the DMN regions. Furthermore, considering the chunking property of the language system27,29, we postulated that the functional interactions between the word and sentence levels would occur in a sparse manner, which has also been shown to be more computationally efficient and able to accelerate updating30,50. Overall, we provide evidence for the linguistic prediction hierarchy and the cross-level information updating in the brain.

Results

Behavioral performance in narrative comprehension

In the narrative comprehension task, participants were instructed to passively listen to three stories while fixating on a central cross presented on a black screen. The sequence of the six audio clips (three in the forward condition and three in the backward condition; see “Methods” section) was counterbalanced across participants. Details of the stimuli (e.g., the length of each story) are provided in Supplementary Table 1.

At the end of each forward story, participants rated how well they perceived and comprehended the content (see “Methods” section). We first assessed the clarity of story perception and found that clarity scores for all stories were significantly above the chance level (chance level = 2.5; one-sample t-test, p < 0.05; Supplementary Table 2). Additionally, we found no significant differences in perception scores across the three stories, including clarity (one-way ANOVA, F(2, 90) = 0.203, p = 0.817, f = 0.081), familiarity (F(2, 90) = 0.594, p = 0.554, f = 0.138), and complexity (F(2, 90) = 3.000, p = 0.055, f = 0.311). These results support the reliability of the comprehension scores reported below.

Next, we assessed how well participants comprehended the forward stories. We performed non-parametric tests due to the non-normal distribution of the data (see “Methods” section). Results showed that the comprehension scores for each forward story were significantly above the chance level (Wilcoxon signed rank test; story 1 chance level = 2.5; story 2 chance level = 1.5; story 3 chance level = 1.5; p < 0.05; Supplementary Table 2) but did not significantly differ among the three stories (Kruskal-Wallis test, H(2) = 5.524, p = 0.063). Thus, the comprehension scores were summed across the three stories to represent overall performance (mean across participants = 10.548, S.D. = 0.850), which was also significantly higher than the chance level (chance level = 5.5; Wilcoxon signed rank test, T(31) = 496.000, p < 0.001). Although the differences across stories in complexity ratings and comprehension scores were marginally significant, we analyzed each story separately in subsequent analyses and trained encoding models with a leave-one-subject-out (LOSO) approach. Therefore, differences across stories were unlikely to influence the results.

The predictive representations of words and sentences

We employed a two-stage procedure to obtain predictive embeddings at both the word and sentence levels. At the first stage, we used the Robustly Optimized Bidirectional Encoder Representations from Transformers (BERT) with Whole Word Masking (WWM-RoBERTa) to obtain the vector representations of language information51. A BERT-based model was selected because it is trained on both preceding and following contexts, enabling the model to generate more comprehensive and context-rich representations than causal models52. Specifically, WWM-RoBERTa is a variant of the BERT model, featuring a larger architecture, a larger batch size, and an expanded training dataset52. It is trained to predict whole words rather than individual characters, and therefore shows greater generalizability and adaptability for Mandarin53. In practice, word representations were obtained by feeding each word individually into the WWM-RoBERTa model (without context). Sentence and context representations were acquired by feeding the entire texts into the WWM-RoBERTa model and then averaging embeddings across all words (see “Methods” section).

At the second stage, a multiple ridge regression approach was used to model the predictive relationship of embeddings between the prior linguistic context and upcoming linguistic units (Fig. 1a). We employed multiple ridge regression to enable comparability across the two levels of linguistic units (i.e., words and sentences). Note that the regression model is independent of the brain data, serving solely to capture the predictive relationship. This approach operated on the vector representations obtained from the WWM-RoBERTa model34,35. The multiple ridge regression approach assumes that the predictive relationship is approximately linear in the space of semantic vectors extracted from the WWM-RoBERTa model, an assumption supported by evidence that embeddings from large language models exhibit analogical relations (e.g., queen – woman ≈ king – man)54,55,56. Moreover, the ridge regression model effectively mitigates the overfitting problem. In practice, each dimension of the upcoming target vectors was predicted using a different ridge regression model, with parameters estimated from training data (80%) and validated on testing data (20%; see “Methods” section). Separate models were constructed for words and sentences.
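As a concrete illustration of this second stage, the following minimal sketch fits such a model with scikit-learn; the variable names (e.g., context_vecs, target_vecs) and the ridge penalty are placeholders rather than the exact settings used in the study.

```python
# Minimal sketch of the second-stage predictive model (illustrative values only).
# context_vecs / target_vecs: (n_samples, 1024) WWM-RoBERTa embeddings of the
# prior context and the upcoming linguistic unit (word or sentence).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
context_vecs = rng.standard_normal((1000, 1024))   # placeholder data
target_vecs = rng.standard_normal((1000, 1024))    # placeholder data

X_train, X_test, y_train, y_test = train_test_split(
    context_vecs, target_vecs, test_size=0.2, random_state=0)

# With a shared penalty, a multi-output ridge fit is equivalent to fitting a
# separate ridge regression model for each target dimension.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
predicted = model.predict(X_test)                  # (n_test, 1024) predicted vectors
```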

Fig. 1: Schematic demonstration of the analytic approach.

a Training and testing the multiple ridge regression models. The dataset (~0.2 million samples) was generated from Chinese Wikipedia by randomly selecting the prior linguistic context and upcoming linguistic units (word or sentence). The context and linguistic unit were transformed into fixed-length vectors via the WWM-RoBERTa model. Then, 80% of samples were used to train the multiple ridge regression model to capture the predictive relationship between the context and linguistic unit, and the remaining 20% were used for model evaluation. Word and sentence prediction models were trained separately. b Processing of the experimental materials. The story audios were transcribed, segmented, and aligned at both the word level (via the “jieba” toolbox implemented in Python) and the sentence level (via a sentence boundary segmentation task); the aligned transcripts were further used to generate the predictive representations using the ridge regression models. These representations were reduced to 50 dimensions, resampled, and convolved with the hemodynamic response function (HRF) for encoding model analyses. c Roadmap of the group-based general linear model (gGLM). BOLD signals were collected while participants listened to stories. The BOLD signals were then preprocessed and grouped into 400 parcels according to ref. 63. For each parcel, leave-one-subject-out (LOSO) cross-validation was employed to obtain the explained variance (R2) across participants.

To evaluate the performance of the ridge regression models, we first calculated cosine distances between the vectors predicted by the models and the actual target vectors (denoted as D1). D1 was compared with the cosine distances between the predicted vectors and the vectors randomly selected from the test set (denoted as D2). D2 served as a baseline, representing a scenario without a predictive relationship, as its distribution was centered around 1 (Fig. 2a, b, gray histograms). We randomly sampled 1000 instances from the testing set and found that D1 was significantly lower than D2 at both word and sentence levels (paired t-test; word level: t(999) = 19.18, p < 0.001, d = 0.876; sentence level: t(999) = 43.870, p < 0.001, d = 1.439; Fig. 2a, b). Additionally, we calculated and compared the Pearson correlation between predicted and real targets (r1) or randomly generated targets (r2) as validation. As expected, results showed significant differences between these two conditions at both levels (paired t-test after applying a Fisher-z transformation to the r values; word level: r1 = 0.078 ± 0.112; r2 = −0.001 ± 0.065; t(999) = −18.264, p < 0.001, d = 0.876; sentence level: r1 = 0.113 ± 0.112; r2 = 0.004 ± 0.065; t(999) = −43.812, p < 0.001, d = 1.439; Supplementary Fig. 1a).

Fig. 2: Performance of the representational prediction models.

a, b Cosine distances between the predicted and actual target vectors for the word (green) and sentence (orange) prediction models, compared with the cosine distances between the predicted and randomly selected vectors (gray). c, d Classification accuracies of both word (73.234 ± 4.265%, green) and sentence predictions (81.471 ± 3.478%, orange) exceed those for random data (gray). e, f Both models exhibit an incremental context effect. The cosine distances were max–min normalized, and the vertical line in (e) indicates the average sentence length in words. Shaded areas represent the standard error of the mean (SEM). g Model performance on the experimental materials.

Furthermore, a pairwise classification task was employed to compare D1 and D2 (Fig. 1a, right panel)57, where an instance was classified as correct if D1 was smaller than D2. Otherwise, it was classified as incorrect. We repeated the procedure 1000 times to ensure robustness. The resulting prediction accuracy was significantly above the chance level (i.e., 50%) for both word (73.234 ± 4.265%, p < 0.001; Fig. 2c) and sentence (81.471 ± 3.478%, p < 0.001; Fig. 2d) models. To validate these findings, we generated a randomized dataset by shuffling the pairwise correspondence between the prediction targets and the preceding linguistic context. Applying the same pairwise classification analysis to this randomized data yielded accuracy that did not significantly differ from the chance level (permutation test; word model: 50.132 ± 3.458%, p = 0.351; sentence model: 49.812 ± 4.719%, p = 0.503; Fig. 2c, d). Moreover, classification accuracy in the original dataset was significantly higher than that in the randomized dataset (two-sample t-test; word model: t(1998) = 128.800, p < 0.001, d = 5.760; sentence model: t(1998) = 170.788, p < 0.001, d = 7.638; Fig. 2c, d). Finally, the word- and sentence-level ridge regression models were evaluated on the narrative stimuli used in this study. The word-level model achieved a classification accuracy of 68.241% (S.D. = 0.873%), and the sentence-level model reached 83.731% (S.D. = 2.811%), both significantly above the chance level (chance level = 50%; permutation test; word model: p < 0.001; sentence model: p < 0.001; Fig. 2g).
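To illustrate this evaluation procedure, the sketch below computes D1, D2, and the pairwise classification accuracy for a set of predicted and actual target vectors; the way the random foil is drawn is an assumption for demonstration purposes.

```python
# Sketch of the pairwise classification evaluation (illustrative implementation).
# predicted / targets: (n, 1024) arrays of predicted and actual target vectors.
import numpy as np
from scipy.spatial.distance import cosine

def pairwise_accuracy(predicted, targets, seed=0):
    rng = np.random.default_rng(seed)
    n, correct = len(predicted), 0
    for i in range(n):
        j = rng.integers(n - 1)
        j = j + 1 if j >= i else j                 # random foil index, j != i
        d1 = cosine(predicted[i], targets[i])      # D1: predicted vs. actual target
        d2 = cosine(predicted[i], targets[j])      # D2: predicted vs. random target
        correct += d1 < d2                         # classified correct if D1 < D2
    return correct / n                             # chance level = 0.5
```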

Together, these results indicated that our models reliably captured the predictive relationship between prior context and upcoming linguistic units. Notably, the sentence model consistently outperformed the word model on both the corpus and the experimental materials. This advantage may stem from the BERT-derived sentence representations, which encode richer and more context-dependent information. Furthermore, computing sentence embeddings by averaging word vectors likely improves the signal-to-noise ratio. However, we suggest that the absolute accuracy of our models is not a direct indicator of prediction quality. Instead, the statistical significance offers a better measure of the model’s capability in capturing the predictive relationship.

Model prediction performance increases with context length

Previous evidence suggests that predictions of upcoming linguistic units are incrementally shaped by the preceding context. Accordingly, our models are expected to demonstrate improved performance as the length of the prior context increases, i.e., the incremental context effect10,58. To test this, we systematically varied the number of words or sentences in the prior context and assessed the impact of context length on model performance.

For the word-level model, the cosine distance between the predicted and actual vectors decreased as more preceding words were included (Fig. 2e). We identified the knee point using the Kneed Python toolbox, which detects maximum curvature via a rotation-based algorithm59. The knee point corresponded closely to the sentence boundary (Fig. 2e), based on the sentence length derived from Chinese Wikipedia (across approximately 3.9 million sentences, the median length was 15 words; Supplementary Fig. 1b). Similarly, for the sentence-level model, the cosine distance decreased as more preceding sentences were provided, with a notable knee point observed when the number of prior sentences reached 4 (Fig. 2f). Together, these results support the capacity of our models to capture the predictive relationship in natural language.
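A minimal example of this knee-point detection with the Kneed toolbox is sketched below; the curve values are synthetic placeholders standing in for the context-length results.

```python
# Sketch of knee-point detection on the context-length curve (synthetic data).
import numpy as np
from kneed import KneeLocator

context_lengths = np.arange(1, 51)                         # number of prior words
mean_distances = 0.6 + 0.4 * np.exp(-context_lengths / 8)  # decreasing convex curve

kl = KneeLocator(context_lengths, mean_distances,
                 curve="convex", direction="decreasing")
print("knee point at context length:", kl.knee)
```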

The neural underpinnings of multilevel prediction

We employed encoding models to identify the neural correlates associated with the word- and sentence-level predictions. This method has been widely recognized for its reliability and validity in producing robust results26,58,60,61. Specifically, we applied the gGLM to associate BOLD signals with the predicted vectors derived from the ridge regression models (Fig. 1b, c, see “Methods” section). The gGLM was performed separately for the word and sentence levels. We further employed the leave-one-subject-out (LOSO) cross-validation approach to avoid overfitting and reduce the non-independence error in the secondary test62. Additionally, to improve computational efficiency, a template with 400 cortical parcels was used for the gGLM analysis63. Moreover, to test the concept of “neural pre-activation” in language prediction3, we related the predicted vectors of linguistic unit N to the BOLD signals of unit N-1. A series of potential confounding factors—including the temporal delays of words and sentences, word and sentence usage frequencies, and the effect of prior linguistic context—were ruled out (see “Methods” section). A paired t-test was performed between the forward and backward conditions on the explained variance (R2) of the gGLM. The results were corrected for multiple comparisons using the false discovery rate (FDR) method, with a significance threshold of p < 0.0164.
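The sketch below shows one plausible reading of this encoding analysis for a single parcel: predictive regressors are convolved with a canonical HRF, the GLM is fit on the remaining subjects, and R2 is computed on the held-out subject. The HRF form, the group-averaging step, and the variable names are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch of the LOSO group encoding analysis for one parcel (illustrative).
# features: (n_TRs, 50) predictive regressors; bold: (n_subjects, n_TRs) signals.
import numpy as np
from scipy.stats import gamma
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def canonical_hrf(tr, duration=32.0):
    """Common double-gamma approximation of the canonical HRF, sampled at the TR."""
    t = np.arange(0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6
    return hrf / hrf.sum()

def loso_r2(features, bold, tr=2.0):
    hrf = canonical_hrf(tr)
    # Convolve each regressor with the HRF and truncate to the scan length.
    X = np.apply_along_axis(lambda f: np.convolve(f, hrf)[:len(f)], 0, features)
    r2 = []
    for s in range(bold.shape[0]):                        # leave one subject out
        train = np.delete(np.arange(bold.shape[0]), s)
        glm = LinearRegression().fit(X, bold[train].mean(axis=0))  # group-level fit
        r2.append(r2_score(bold[s], glm.predict(X)))      # test on the held-out subject
    return np.array(r2)
```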

At the word level, results showed that the predictive representations of words were associated with significant activations in the bilateral STG and the upper part of the middle temporal gyrus (MTG; Fig. 3a, c; Supplementary Fig. 3a; Supplementary Table 3). To validate this result, we conducted a permutation test on the significant regions of interest (ROIs) including the STG and MTG, where the word features were shuffled to remove the contextual predictive relationship. This procedure was repeated 1000 times to generate a null distribution. Results showed that the real value was significantly higher than the null distribution (p < 0.01, FDR corrected; Fig. 3b upper panel; Supplementary Fig. 4), confirming an association between word-level predictive representations and activity in the bilateral STG and MTG.

Fig. 3: Brain responses associated with predictive representations.

a Brain regions sensitive to predictive representations across different timescales. The brain map exhibits the R2 difference between forward and backward conditions, with only significant results plotted. b Results of the permutation test for significant ROIs at the word and sentence levels. In each panel, the x-axis represents the R2 of each permutation, which has been z-scored for display purposes. Gray histograms are the null distributions, and vertical lines indicate the positions of the real values, green for word level and orange for sentence level. c R2 differences between forward and backward conditions for each ROI, where each dot represents one subject. Significance levels are indicated as p < 0.001 (***), p < 0.01 (**), p < 0.05 (*), and p ≥ 0.05 (n.s.).

At the sentence level, results showed significant activation in the right TPJ, medial PFC (mPFC), and precuneus (Fig. 3a, c; Supplementary Fig. 3b; Supplementary Table 3). The same permutation test was performed, confirming significant activation in these brain regions (TPJ: p = 0.03; mPFC: p = 0.01; precuneus: p = 0.01; overall: p = 0.01; FDR corrected; Fig. 3b lower panel; Supplementary Fig. 4).

To differentiate the prediction effect from the context effect, we applied the gGLM to examine neural representations of past context at both word and sentence levels. We found that prior contextual information was broadly represented across frontal, temporal, and parietal regions (Supplementary Fig. 5a, b), consistent with previous findings on neural encoding of past linguistic context22,24,65. Further, we performed a variance partitioning (VP) analysis to isolate the unique contribution of preceding context from the prediction effect. We observed significant representations at both word and sentence levels in the bilateral STG, whereas word-level representations were more prominent in the prefrontal cortex (Supplementary Fig. 5c, d). Please note that the predictive and context features fed into the gGLM are not linearly related due to the nonlinear operations during feature extraction (i.e., the Isomap method; see “Methods” section).
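The variance partitioning step follows the standard logic of comparing full and reduced encoding models; the sketch below is a generic illustration with random placeholder data, not the exact implementation used here.

```python
# Generic variance partitioning sketch (placeholder data; illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def explained_variance(X, y):
    """R2 of an ordinary least-squares fit (stand-in for the gGLM step)."""
    return r2_score(y, LinearRegression().fit(X, y).predict(X))

rng = np.random.default_rng(0)
pred_feats = rng.standard_normal((300, 50))    # predictive regressors
ctx_feats = rng.standard_normal((300, 50))     # past-context regressors
bold = rng.standard_normal(300)                # one parcel's time series

r2_full = explained_variance(np.hstack([pred_feats, ctx_feats]), bold)
r2_ctx = explained_variance(ctx_feats, bold)
r2_pred = explained_variance(pred_feats, bold)

unique_pred = r2_full - r2_ctx        # variance explained only by prediction features
unique_ctx = r2_full - r2_pred        # variance explained only by context features
shared = r2_pred + r2_ctx - r2_full   # variance shared between the two feature sets
```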

Together, these findings suggested that the brain predicts upcoming words and sentences in a hierarchical manner during language comprehension. This neural pattern of prediction hierarchy differed from that of past context representations.

Examining the information updating mode of the prediction hierarchy

Next, we aimed to investigate how information is updated across the two levels in the prediction hierarchy. To test the sparse and continuous updating models, we employed a series of computational modeling approaches grounded in the PC framework42,43, which could characterize the dynamic interactions between neural regions associated with word- and sentence-level predictions.

Specifically, we implemented a two-level PC architecture, in which the word and sentence levels corresponded to the lower and higher levels, respectively (Fig. 4a). According to the PC framework, the higher level would generate a top-down prediction (\({Z}_{s}\)) that guides the lower level in updating its representation. Then, the higher-level prediction error (PE, \({x}_{s}\)) was calculated as the difference between the top-down predictions and the upcoming signals. Next, \({x}_{s}\) propagated back to the higher level to optimize the next top-down prediction. At the lower level, the PE (\({x}_{w}\)) was computed as the cosine distance between the predicted and actual word vectors, a dissimilarity measure that is less sensitive to vector magnitude than other metrics (see “Methods” section).

Fig. 4: Results of predictive coding (PC) neural modeling.

a Two computational models were constructed based on the continuous and sparse updating hypotheses. Left panel: the continuous updating hypothesis assumes that the higher-level representations are updated continuously as inputs change over time. Right panel: the sparse updating hypothesis assumes that the higher level predicts and updates only at its preferred timescales (i.e., at sentence boundaries). b, c Simulated data generated by the PC models were converted into the putative BOLD signals via the hemodynamic model, and further compared with the real fMRI responses to evaluate model performance. d Model performance in the forward and backward conditions. e The sparse PC model outperforms the continuous PC model only in the forward condition. f Model performance without word-level prediction error. g Comparison of MSE values for sparse and continuous models by leveraging PE for all subjects across stories. h, i Examples of simulated signals from the PC models at the word (Subject 10, story 2) and sentence levels (Subject 04, story 2), shown alongside the corresponding real BOLD signals. Significance levels are indicated as p ≤ 0.001 (***), p ≤ 0.01 (**), p ≤ 0.05 (*), and p > 0.05 (n.s.).

In the continuous updating PC model, predictions and PEs were allowed to transmit between the lower and higher levels instantaneously (Fig. 4a, left). By contrast, in the sparse updating PC model, information transfer between levels was delayed by \(\Delta t\) (Fig. 4a, right), ensuring that predictions and PEs were exchanged only at sentence boundaries33,66,67,68. Neural activity was simulated using these PC models, and then converted into BOLD signals (Fig. 4b)69,70. In practice, neural signals at the word (\({Z}_{w}\)) and sentence (\({Z}_{s}\)) levels were calculated as the averaged BOLD signals within the corresponding significant ROIs (as shown in Fig. 3a; Supplementary Table 3). We used a gradient descent algorithm to estimate the model parameters, with performance quantified by the mean square error (MSE) between simulated and actual BOLD signals (Fig. 4c). Lower MSE values indicate better model performance (see “Methods” section).
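To make the two variants concrete, the following hedged sketch simulates the sentence-level signal as a discretized leaky integrator driven by the word-level PE, with inter-level transfer either at every TR (continuous) or only at sentence boundaries (sparse). The dynamics, time constant, and boundary handling are illustrative simplifications, not the authors' exact equations.

```python
# Hedged sketch of the two updating variants (illustrative dynamics).
# x_w: word-level prediction error per TR (NumPy array); boundaries: TR indices
# of sentence endings; bold_s: measured sentence-level parcel signal.
import numpy as np
from scipy.stats import gamma

def simulate_sentence_level(x_w, boundaries, sparse, tau=2.0, dt=1.0):
    """Leaky-integrator sketch of the sentence-level signal Z_s."""
    z_s = np.zeros(len(x_w))
    last_b, drive = 0, 0.0
    for t in range(1, len(x_w)):
        if sparse and t in boundaries:            # sparse: transfer only at boundaries
            drive = x_w[last_b:t].mean()          # error accumulated since last boundary
            last_b = t
        elif not sparse:
            drive = x_w[t]                        # continuous: transfer at every TR
        z_s[t] = z_s[t - 1] + dt / tau * (-z_s[t - 1] + drive)
    return z_s

def to_bold(neural, tr=2.0):
    """Convolve a simulated neural signal with a canonical double-gamma HRF."""
    t = np.arange(0, 32, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6
    return np.convolve(neural, hrf / hrf.sum())[:len(neural)]

# Model comparison: lower MSE against the measured BOLD signal is better, e.g.
# mse = np.mean((to_bold(simulate_sentence_level(x_w, boundaries, True)) - bold_s) ** 2)
```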

First, we compared MSE values between the forward and backward conditions. Both PC models performed significantly better in the forward condition than in the backward condition (paired t-test; t(185) = −12.760, p < 0.001, d = 1.086; Fig. 4d). Moreover, the sparse PC model significantly outperformed the continuous PC model in the forward condition (paired t-test; t(92) = −17.438, p < 0.001, d = 1.110; Fig. 4e), while no significant difference was observed in the backward condition (paired t-test; t(92) = 0.990, p = 0.325, d = 0.137; Fig. 4e). These effects remained consistent across all individual stories (Supplementary Fig. 6a). Further, we trained a control sparse model that preserved temporal sparsity but removed the sentence boundary information by shuffling the delay variable (\(\Delta t\)). This control sparse model outperformed the continuous model (paired t-test; t(92) = −8.885, p < 0.001, d = 0.581; FDR corrected; Supplementary Fig. 6b), but underperformed relative to the original sparse model (paired t-test; t(92) = 6.728, p < 0.001, d = 0.552; FDR corrected; Supplementary Fig. 6b). These results suggested that both general temporal sparsity and specific sentence boundaries enhanced the performance of the sparse model.

In addition, previous studies have shown that the lower-level PE (\({x}_{w}\)) plays an important role in linguistic processing10 and event delineation71. To test this account, we replaced the lower-level PE (\({x}_{w}\)) with white-noise signals. Under the forward condition, the results supported this hypothesis for the sparse model (paired t-test; t(92) = −12.246, p < 0.001, d = 0.777) but not for the continuous model (paired t-test; t(92) = 0.787, p = 0.433, d = 0.060; Fig. 4f, g).

Together, these findings support the sparse updating hypothesis, suggesting that sentence boundaries serve as key drivers of information flow within the word-to-sentence prediction hierarchy.

Sparse updating revealed from an autocorrelation analysis

In contrast to the continuous updating hypothesis, the sparse hypothesis posits that brain responses associated with sentence prediction remain stable until the sentence boundary is reached33. Therefore, we expected brain activity to exhibit a periodic pattern if the linguistic prediction hierarchy is sparsely updated. To this end, we examined autocorrelation in brain regions associated with word- and sentence-level predictions. Specifically, we temporally shifted the BOLD signals, without pre-whitening, over time lags from 1 TR to 50 TRs. For each lag, we computed the correlation between the shifted and original signals before comparing autocorrelations between the forward and backward conditions.
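A minimal sketch of this lagged autocorrelation computation is shown below; it assumes a one-dimensional parcel-averaged time series and follows the procedure just described.

```python
# Sketch of the lagged autocorrelation analysis (no pre-whitening; lags 1-50 TRs).
import numpy as np

def lagged_autocorrelation(bold, max_lag=50):
    """Correlation between the original and lag-shifted signal at each lag (in TRs)."""
    return np.array([np.corrcoef(bold[:-lag], bold[lag:])[0, 1]
                     for lag in range(1, max_lag + 1)])

# The forward vs. backward comparison is then run on these per-lag values across
# subjects and corrected with the Bonferroni method.
```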

Our results revealed significantly stronger autocorrelation in the forward condition than in the backward condition for brain regions associated with sentence prediction, at time lags of 8–11 TRs (p < 0.01, Bonferroni corrected; Fig. 5c, d; Supplementary Fig. 7). This range corresponds to approximately twice the sentence length (median: 4 TRs, Supplementary Fig. 1c). These findings support our prediction, as updating information at sentence boundaries requires the brain to simultaneously maintain pre- and post-boundary sentence information, potentially leading to a periodic pattern repeating every two sentences. In comparison, this effect was absent in brain regions associated with word-level prediction (Fig. 5a, b). These findings provide additional support for the sparse updating hypothesis.

Fig. 5: Results of the autocorrelation analysis.

a, c Autocorrelation results at the word and sentence levels, respectively. Colored lines indicate the forward condition and gray lines represent the backward condition. b, d Difference in autocorrelation between forward and backward conditions. The vertical dashed line represents twice the sentence length (8 TRs). Shaded areas represent the standard error of the mean (SEM). Multiple comparisons were corrected using the Bonferroni method with a significance threshold of p < 0.01.

Discussion

We characterized hierarchical linguistic prediction at both the word and sentence levels and examined how these two levels interact during narrative comprehension. We observed that the predictive representations of upcoming words are associated with brain responses in the STG and MTG, while those of upcoming sentences are associated with responses in the TPJ, mPFC, and precuneus. In addition, our computational modeling results supported the sparse updating strategy, rather than the continuous strategy, for cross-level interaction within the prediction hierarchy. These results highlight the brain’s capacity to anticipate future information over both shorter and longer timescales, suggesting that sentence boundaries may serve as potential markers for updating semantic information during naturalistic language comprehension.

Our findings of linguistic prediction at the word and sentence levels are reminiscent of the research on the temporal receptive window (TRW), which examines how past context at multiple timescales influences processing of ongoing inputs22,23,24. These studies have proposed a temporal representational hierarchy of context in the cerebral cortex, ranging from the early sensory regions responding to shorter timescales (i.e., small TRW) to higher-level brain regions responding to longer timescales (i.e., large TRW). These findings suggest a retrospective timescale focusing on the prior context, which is closely related to cortical tracking of the linguistic units27,72,73,74. In contrast, we focus on the brain’s ability to anticipate future inputs across varying timescales, i.e., a prospective timescale of the future input. To our knowledge, this line of research is still understudied, with only a few studies beginning to explore it recently25,26. Consequently, it remains unclear how the prospective timescale hierarchy can be interpreted in a neurolinguistic sense. Inspired by these studies, the present study aims to address this gap and further investigate how different levels within the hierarchy interact computationally and algorithmically. We believe this prospective hierarchy deserves greater attention in future research.

Another contribution of our work is the incremental context effect observed in the multiple ridge regression models at both word and sentence levels (Fig. 2e, f), supporting the biological plausibility of our approach. While previous studies have reported a similar effect in GPT-210 and BERT models58 by manipulating context window size, these findings are largely limited to the word level. Here, we extend these results to the sentence level and demonstrate a comparable incremental pattern. Importantly, our approach offers a potential avenue for investigating how retrospective and prospective timescale hierarchies relate in both LLMs and the human brain.

Several studies on word prediction, however, have reported findings that differ from ours. For instance, some have identified associations between word prediction and widespread regions in the frontal and parietal lobes, in addition to the bilateral STG11,75,76,77. Although other studies investigating word-level syntactic prediction (i.e., how grammatical structure within a sentence influences next-word prediction) have also emphasized the roles of the STG and MTG1,12,14,15,78, some divergent results also indicate additional involvement of the lateral prefrontal cortex1,14. One possible explanation for this discrepancy is that these studies indexed word prediction via measures of lexical processing difficulty, such as cloze probability79 or entropy80, rather than via the predictive representation itself. We postulated that the involvement of the frontal cortex may reflect processing difficulty and the associated cognitive control functions81,82. However, Goldstein et al.10 employed an encoding model and found that the IFG was also significantly involved in word prediction10. Although the authors ruled out the potential context effect, their control analysis was conducted on the averaged signal across all significant electrodes, including both IFG and STG electrodes, making it difficult to disentangle potential differences between the IFG and STG. The present study investigated the neural underpinnings of linguistic prediction per se rather than processing difficulty, while controlling for potential confounding effects arising from the past linguistic context. Therefore, our results provide more direct evidence for the anatomical architecture supporting hierarchical linguistic prediction.

Furthermore, the DMN (especially the mPFC, precuneus, and TPJ) has been proposed to play a key role in processing naturalistic stimuli47, such as written or spoken stories22,48 and movies83. To investigate its functions, previous studies have, on the one hand, scrambled real-life stories at different timescales (e.g., word, sentence, paragraph, etc.)22,32 or shuffled parts of the stories to create different versions47,84. By comparing neural responses across different versions, researchers found that the DMN is largely involved in integrating external information over relatively long prior context (ranging from seconds to minutes)23,47,84. On the other hand, another possible cognitive process associated with the DMN during narrative processing is using stored information to simulate possible future events and plan ahead (i.e., prospective memory)46,85. For example, evidence indicates that imagining a plausible event that had not occurred previously engages DMN regions such as the mPFC and precuneus46,86. Interestingly, this “future envisioning” network largely overlaps with regions involved in episodic memory, supporting the constructive episodic simulation hypothesis46. These findings suggest that a key function of the DMN is to enable simulation of future events based on past experiences, a perspective closely aligned with the concept of pre-activation in linguistic prediction85,87. While anticipatory signals in DMN regions have been extensively observed25,45, the timescales underlying prospective prediction remain unclear. In the current study, we identified involvement of the TPJ and DMN midline core areas in sentence prediction, providing further evidence for the DMN’s role in predicting linguistic units over longer timescales. Additionally, our study revealed strong right-hemisphere lateralization for sentence prediction. Although recent studies have challenged the traditional view that natural language comprehension is left-lateralized, showing instead bilateral involvement88,89,90, the specific function of the right hemisphere is still poorly understood. Our results suggest that the right DMN plays a dominant role in sentence prediction, consistent with recent evidence highlighting the importance of the right hemisphere in perceptual segmentation and coarse-grained event boundaries in music91. Collectively, these findings support the notion that the right hemisphere may have a distinct role in processing longer-timescale information.

Our computational modeling results support the sparse, rather than continuous, updating strategy for cross-level interactions within the prediction hierarchy. Previous research supporting the continuous updating hypothesis typically relied on correlation-based approaches, such as inter-subject pattern correlation (ISPC)32 or cross-context correlation31. ISPC examines spatial similarities in brain responses across subjects at each moment, while cross-context correlation calculates neural similarities across trials for each time point. These approaches, however, may conflate the effects of information updating and accumulation, limiting their ability to disentangle the two. Although Chien and Honey32 employed computational models to study multilevel interactions, their models were constructed solely under the continuous updating assumption, leaving an open question of how the two rival hypotheses compare32. Most importantly, no studies have tested the two hypotheses at the sentence level. In the present study, we uncovered the hidden neural states (i.e., information updating at the sentence level) and directly compared the continuous and sparse updating models. Our results underscore the sparse updating hypothesis, consistent with previous evidence that sparse updating is more computationally efficient and resource-saving than continuous updating50. These findings further support the brain’s economy principle; that is, the human brain is organized to carefully manage the inputs in the service of delivering robust and efficient performance92,93.

In addition, although emerging models have incorporated the PC framework to study language comprehension, explicit computational accounts of multilevel interactions within a timescale hierarchy remain limited. For instance, Eddine et al.94 provided an elegant PC account of the N400, modeling lexical-semantic integration across four layers (orthographic, lexical, semantic, and conceptual)94. While their model successfully captures sentence-level context effects on N400 amplitude, it does not specifically address the communication between word and sentence levels. A potentially more biologically grounded account was proposed by Bornkessel-Schlesewsky et al.29, which posits a predictive sequence processing framework situated in the postero-dorsal auditory stream29. However, this model also lacks detailed mechanisms describing how different linguistic levels interact computationally or algorithmically.

Our work builds on these theoretical frameworks and proposes a possible mechanism for the implementation. Our results are consistent with a semantic-based framework of discourse comprehension. In this framework, Baggio28 proposed that the brain instantiates a single conscious representation of the input (e.g., a word) that remains stable unless perturbed by new information28. To implement this representational stability during discourse comprehension, the author further posited a cortical steady-state organization which could be achieved sparsely at four intermediate levels: 1) individual word; 2) content word; 3) referring expression; and 4) utterance or proposition. Our sparse updating model conceptually aligns with this cortical steady-state account and provides a possible algorithmic implementation within the PC framework. Mathematically, we formulated a first-order linear ordinary differential equation (ODE) in which the sentence-level neural signal can be maintained at the steady-state by leveraging a delay term \(\varDelta t\) that fixes the “input” within a sentence. This formulation allows the model to generate signals whose sparsity is bounded by sentence boundaries.
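One illustrative way to write such a delayed first-order linear ODE (a sketch consistent with the description above, not necessarily the exact parameterization used) is \(\tau \,\frac{d{Z}_{s}(t)}{dt}=-{Z}_{s}(t)+{x}_{s}(t-\Delta t)\), where \(\tau\) is a time constant and the delayed input \({x}_{s}(t-\Delta t)\) is held fixed within a sentence; \({Z}_{s}\) therefore relaxes toward the steady state \({Z}_{s}^{*}={x}_{s}(t-\Delta t)\) and changes only when the input advances at the next sentence boundary.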

However, recent eye-tracking and electroencephalogram studies provide evidence for the incremental nature of language processing79,95,96, which seemingly contradicts the sparse updating strategy97. Incrementality generally refers to the process by which linguistic information underlying the message-level representation accumulates gradually as context builds. Empirical support for incremental comprehension includes the modulation of the N400 amplitude across different word positions in a sentence97, or the slow drift of neural signals during continuous sentence processing98. The converging evidence suggests that incremental language processing involves an ongoing construction of meaning with each incoming word. Within this scope, we propose that this incremental semantic construction is not incompatible with our sparse updating model. First, in our model, the “word-level” does not refer solely to the brain regions encoding lexical information, but rather to the regions integrating the current sentential context to make predictions about the upcoming words. Second, sparse updating in our model is restricted to the interactions between the word and sentence levels exclusively, rather than imposing sparsity at the word level. Therefore, word-level processing can still operate in an incremental and predictive manner. Input enters the word level, and lexical information is allowed to accrue immediately within a sentence. Then, the accumulated sentence context would be further used to compute the sentence-level prediction error, which is transferred to the sentence level at sentence boundaries for updating. In fact, as Ryskin & Nieuwland99 stated, the incremental effect during sentence comprehension is also inherently aligned with the PC framework, assuming that the brain needs to employ the inputs to infer the internal model99. The realization of internal model inference largely relies upon the local prediction error (PE) at each level94, which plays an integral role in the optimization algorithm that the brain uses to approximate inference. This account aligns with our models, as the word-level integral can be viewed as the accumulation of the sentence context and thus shows an incremental effect.

While the incremental effect has been extensively documented in sentence processing, it remains unclear how such an effect manifests during naturalistic language comprehension. Intuitively, for instance, when listening to a two-hour audiobook, it is unlikely that neural activity would continuously increase from beginning to end. No study, to the best of our knowledge, has demonstrated such a pattern. One possibility is that the brain engages in event-based processing during naturalistic comprehension–an idea supported by event segmentation theory100. According to this framework, the brain parses continuous input into discrete events at multiple timescales (distinct from the “event” in ERP), which are processed separately and then integrated hierarchically at event boundaries via memory systems33,100. In other words, the incremental effect may occur within a single level but not across levels of linguistic information. This perspective aligns with our finding that sentence boundaries (sentence viewed as “event” in this sense) serve as important anchors for narrative processing, supporting the sparse updating hypothesis.

Our results also underscore the importance of sentence boundaries during narrative comprehension, consistent with recent evidence of additional processing at sentence-final positions. This “sentence wrap-up” effect may reflect either the reconstruction of grammatical structure within a sentence (syntactic effect), or the resolution of meaning inconsistencies that cannot be addressed in a sentence (semantic effect)101. We propose that, in the present study, sentence boundaries serve as semantic markers for message-level updating during naturalistic language comprehension for two primary reasons. First, we recruited multiple raters to delineate boundaries between sentences. This empirical method produces more semantically-driven segmentation. Second, the sentence embeddings we used were obtained by averaging word embeddings, an approach that emphasizes semantic content over syntactic structure. Consequently, the sentence boundary effect observed in our sparse model likely reflects the reconciliation of semantic inconsistencies, potentially corresponding to the accumulated prediction errors generated at the word level. In this view, a sentence can be considered as functionally analogous to a narrative chunk, serving as a semantic segment within the broader narrative structure. Our findings also complement prior event segmentation research by highlighting the neural signatures of updating at sentence (or narrative chunk) boundaries, in line with models of hierarchical event processing100.

However, it is important to note that the current study does not examine the multiple timescales within sentences. Many timescales could be defined for the language system, for example, the semantics-based temporal hierarchy (e.g., ranging from individual words and content words to referring expressions and entire utterances)28, or the syntactic hierarchy extending from words to noun/verb phrases, and further to sentences27. Therefore, we believe that investigating the properties of these intra-sentence timescales from a neurolinguistic perspective will be a valuable addition to the present findings.

This study has several limitations. First, we could not assess real-time attentional states of participants, as such measurements would disrupt continuous speech processing and linguistic predictions. Second, the relatively low temporal resolution of fMRI limits precise characterization of linguistic prediction at finer timescales (e.g., phonemes). Third, our averaging-based approach to sentence or context representations may overlook critical structural or sequential features essential for sentence-level processing. Future work could benefit from more advanced models (e.g., Sentence-BERT102) that better capture longer-range dependencies in text.

In conclusion, by directly examining the multiscale prediction hierarchy in the brain, we demonstrated a cortical architecture spanning from the temporal cortices involved in word prediction to the DMN regions engaged in sentence prediction. Most significantly, our results highlight the role of sparse updating in facilitating cross-level interactions within this prediction hierarchy. Together, these findings advance the understanding of the cortical organization underlying hierarchical linguistic prediction and the neurocomputational mechanisms of information updating during narrative comprehension.

Methods

Participants

Before the formal experiment, the sample size was estimated based on a pilot study with four participants listening to the story stimuli103. Using Neuropower104, we assessed whether STG voxels exhibited higher BOLD responses during forward narratives compared to backward speech (i.e., a linguistic effect in the auditory cortex). A sample size of twenty-eight participants was recommended to achieve a statistical power greater than 0.8. The pilot data were not included in the formal analysis.

Thirty-eight healthy native Chinese speakers participated in the main study. All participants were right-handed105 and self-reported no hearing, psychiatric, or neurological problems. Six participants were excluded due to excessive head motion (greater than 3 mm or 3 degrees) and one was excluded for falling asleep during the task, leaving thirty-one participants with valid data (mean age: 23 years, ranging from 19 to 26; 19 females).

The study protocol was approved by the Institutional Review Board of the State Key Laboratory of Cognitive Neuroscience and Learning at Beijing Normal University. Written informed consent was obtained from all participants. All ethical regulations relevant to human research participants were followed.

Stimuli

In narrative listening studies, it is common to include multiple runs to increase the reliability of statistical tests, reduce participant fatigue, and minimize the impact of technical issues (for example, scanner overheating)11,60,83. Therefore, three stories were employed in the present study. Stories 1 and 2 were produced by asking two female speakers to freely recount “an unforgettable experience in your college life”, while story 3 was recorded by a female speaker reading a text adapted from The Kite Runner. All stories were recorded using the FOMRI III system (Optoacoustics Ltd.) and subsequently denoised using Audacity106. These stories were matched for perceptual features such as clarity, familiarity, and complexity (see “Task and procedures”; Supplementary Table 2). Additionally, each audio was temporally inverted for the backward condition to control for acoustic features.

Task and procedures

Before the experiment, sound volume was adjusted to a comfortable level based on participants’ subjective reports. During the experiment, participants were instructed to passively listen to the three stories (i.e., forward condition; Supplementary Table 1) and the corresponding control audios (i.e., the temporally inverted audio, backward condition) while fMRI data were collected. Participants were asked to fixate on a cross at the center of the black screen during listening. The sequence of the six audios (three forward and three backward) was counterbalanced across participants, with flexible intervals inserted between runs to allow rest. All audios were preceded by a 10-s silence with a black screen to control for T1 equilibration effects11,25. Audios were played via the OptoACTIVE headset, which actively eliminates MRI scanner noise in real time and has been widely used in previous auditory studies107,108,109. E-prime (v2.0.10) was used to control stimulus presentation.

The participants were tested on both perception and comprehension at the end of each story. For perceptual evaluation, participants rated clarity, familiarity, and complexity on a 5-point Likert scale (1 was the lowest and 5 was the highest). For comprehension, participants answered several true-or-false questions based on the story contents (3 questions each for stories 1 and 2; 5 questions for story 3). These questions targeted either details (mentioned only once) or gist-level information (mentioned multiple times)110, with both types included for each story. Statistical analyses were performed on both perceptual ratings and comprehension scores to assess how well participants perceived and comprehended the stories.

Additionally, to validate these questions, we recruited an independent cohort of 21 participants who were not part of the main experiment and were unaware of the experimental purpose. They were asked to rate “How well do you think these questions could reflect the listener’s comprehension of the story?” on a 7-point Likert scale (1 was strongly disagree, and 7 was strongly agree). A one-sample t-test was performed on the scores against the scale midpoint (i.e., 3.5), and the FDR method was applied to correct for multiple comparisons64. Results showed that scores for all three stories were significantly above the midpoint (story 1: 5.71 ± 0.78, t(20) = 10.02, p < 0.05; story 2: 5.43 ± 1.08, t(20) = 6.09, p < 0.05; story 3: 5.81 ± 0.80, t(20) = 9.60, p < 0.05), indicating that these questions reliably reflected story comprehension.

Statistics and reproducibility

To assess the robustness of our findings, the following procedures were applied to both behavioral and neural data. First, the D’Agostino test was used to evaluate the normality of the data distribution. If data followed a normal distribution, parametric tests were used (e.g., paired t-test); otherwise, nonparametric tests were used (e.g., Wilcoxon signed-rank test). Unless otherwise noted, all statistical tests were two-tailed with a threshold of p < 0.05. The false discovery rate (FDR) correction was applied when multiple comparisons were conducted unless stated otherwise.

Data acquisition and preprocessing

The fMRI data were acquired with a Siemens TRIO 3-Tesla scanner at the Imaging Center for Brain Research, Beijing Normal University. The functional images were acquired using an echo planar imaging (EPI) sequence (TR = 2000 ms, TE = 30 ms, flip angle = 90°, FOV = 200 mm, voxel size = 3.1 × 3.1 × 3.5 mm3, interleaved). The structural T1-weighted images were collected using magnetization-prepared rapid gradient-echo sequence (TR = 2530 ms, TE = 3.39 ms, flip angle = 7°, FOV = 256 mm, 144 sagittal slices, voxel size = 1.3 × 1.0 × 1.3 mm).

The DPABI toolbox was used for data preprocessing111. After removing the first 5 volumes corresponding to the silent period (10 s), the images were slice-timing corrected, spatially realigned to the first image in a run using rigid-body registration, and co-registered to their corresponding anatomical images. Next, both functional and anatomical images were normalized to the standard Montreal Neurological Institute (MNI) space, with functional images resampled to 2 × 2 × 2 mm3 voxel size. Then, the data were spatially smoothed with a 6 mm full-width at half maximum (FWHM) Gaussian kernel. Finally, all data were detrended, temporally high-pass filtered (128 s cutoff), and denoised by regressing out nuisance variables (including Friston’s 24 motion parameters and five principal components of the white matter and cerebrospinal fluid signals)112.

Obtaining the predictive representations of words and sentences

Dataset generation

Chinese Wikipedia, derived from the Large Scale Chinese Corpus for NLP project (https://github.com/brightmart/nlp_chinese_corpus), was used as the corpus. During preprocessing, symbols and tokens unrelated to content were first removed. Then, the corpus was segmented into words using the jieba toolbox (https://github.com/fxsjy/jieba) and parsed into sentences based on end-of-sentence punctuation marks (i.e., period, question mark, exclamation mark, and ellipsis). Next, a document was randomly sampled from the corpus, and a linguistic unit (a word or a sentence) within it was randomly selected as the to-be-predicted target. All preceding text was treated as the prior linguistic context. Following this procedure, we constructed two datasets—one for words and another for sentences—each containing approximately 0.2 million items, with each item comprising a target and its corresponding linguistic context.
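The sampling procedure can be sketched as follows; corpus cleaning is omitted for brevity, and the `make_item` helper is a hypothetical name:

```python
import random
import re
import jieba

END_PUNCT = "。？！…"  # period, question mark, exclamation mark, ellipsis

def make_item(document, unit="word"):
    """Sample one (context, target) pair from a cleaned document.

    Assumes the document contains at least two units of the requested level.
    """
    if unit == "sentence":
        # Parse into sentences at end-of-sentence punctuation
        units = [s for s in re.split(f"(?<=[{END_PUNCT}])", document) if s.strip()]
    else:
        units = list(jieba.cut(document))  # word segmentation
    idx = random.randrange(1, len(units))  # target must have preceding context
    target = units[idx]
    context = "".join(units[:idx])         # all preceding text
    return context, target
```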

We did not remove any functional words. Intuitively, removing functional words can reduce non-informative content, allowing NLP algorithms to focus more on content words. However, this approach overlooks the fact that functional words, such as the negation words “not”, “nor”, and “never”, also carry semantic content and syntactic information crucial for understanding natural language. In fact, there is an ongoing debate on whether functional words should be removed when applying BERT-based models. The original BERT model, for example, did not recommend removing any stop words52. Moreover, Qiao et al.113 found that removing functional words does not affect BERT model performance, and Alzahrani and Jololian114 showed that removing functional words can even impair performance in a gender classification task, reducing accuracy from 86.67% to 78.86%. Therefore, we followed the original BERT practice and included both functional and content words to preserve semantic and syntactic information in the vector representations.

Vector representations

WWM-RoBERTa, a variant of the BERT model52, was applied to vectorize the prior linguistic context and the prediction target51. BERT is a pre-trained language representation model built on a multi-layer bidirectional transformer encoder that conditions on both left and right context52. Its core mechanism is multi-head self-attention, which fundamentally computes weighted sums over all input vectors115. The WWM-RoBERTa model has a larger architecture (24 layers, a hidden size of 1024, 16 self-attention heads, and 340 M total parameters) and is trained with a larger batch size. Importantly, it is trained to predict whole words rather than individual characters, providing high generalizability and adaptability for Mandarin53.

The WWM-RoBERTa model was implemented in Python (v3.7) with the bert-as-service module (https://github.com/jina-ai/clip-as-service), which maps variable-length text to a fixed-length vector (1024 dimensions). Here, a “sentence” refers to a text span from the corpus, which may extend beyond a single grammatical sentence52. To obtain comprehensive text embeddings, the bert-as-service module averages the penultimate hidden-layer vectors across all tokens in the input text, as the final-layer representations are sensitive to the model’s training tasks (i.e., masked language modeling and next-sentence prediction). Alternatively, the text vector can be derived from the [CLS] token, a special symbol added to the beginning of sentence inputs that is frequently used to represent the overall information of the input52. However, previous studies have indicated that the [CLS] embedding is less effective than the averaging approach116,117,118. Therefore, the default bert-as-service setting (i.e., the averaging approach) was used in the present study. Specifically, for word units, we obtained the vector representation with the target word as the only input (without context). For sentences and contexts, we computed the average embedding across all words in the text.

In addition, due to the quadratic relationship between text length and computational cost52, the WWM-RoBERTa model is constrained to a maximum input length of 512 characters. In practice, two additional tokens are inserted at the beginning ([CLS]) and the end ([SEP]) of the input, reducing the effective length to 510. We therefore applied a “split-and-average” method to circumvent this input length restriction: the text was divided into equal segments (e.g., 2 segments if the length was between 511 and 1020 characters) and their embeddings were averaged to produce the final representation. This method generalizes to texts of up to 4080 characters (8 segments) and can be considered an extension of the averaging approach. Consequently, each dataset item was represented by a 1024-dimensional vector for the prior linguistic context and a 1024-dimensional vector for the prediction target (word or sentence).
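A minimal sketch of the split-and-average method, assuming a running bert-serving server that hosts the WWM-RoBERTa model (the `embed_long_text` helper is hypothetical):

```python
import numpy as np
from bert_serving.client import BertClient  # bert-as-service client API

MAX_LEN = 510  # 512 minus the [CLS] and [SEP] tokens

def embed_long_text(text, bc):
    """Split-and-average embedding for texts exceeding the input limit.

    The text is divided into the smallest number of equal-sized segments
    that each fit within MAX_LEN; each segment is encoded (bert-as-service
    averages penultimate-layer token vectors by default), and the segment
    embeddings are averaged into the final 1024-dimensional representation.
    """
    n_seg = -(-len(text) // MAX_LEN)      # ceiling division: 1 to 8 segments
    seg_len = -(-len(text) // n_seg)      # equal-sized segments
    segments = [text[i:i + seg_len] for i in range(0, len(text), seg_len)]
    return bc.encode(segments).mean(axis=0)

# bc = BertClient()  # assumes a bert-serving server hosting WWM-RoBERTa
```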

To validate the split-and-average method, we randomly selected 1000 documents with text lengths ranging from 50 to 510 characters. First, each text was converted into a vector using the WWM-RoBERTa model to obtain the Whole Text Vector (WTV). The same texts were also split into N segments (\(N\in \{2,\,3,\,4,\ldots ,8\}\)), converted into vectors, and averaged to obtain the Segment Text Vector (STV). The cosine distance between each WTV and its corresponding STV was calculated as Dorig. Next, WTVs and STVs were randomly re-paired 1000 times, and the cosine distance was computed for each permutation to generate a null distribution. Results showed that Dorig was significantly smaller than the null distribution in all segment conditions (all conditions p < 0.001, FDR corrected), supporting the validity of our method for deriving context embeddings.
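The validation logic can be sketched as follows (hypothetical function names; `wtv` and `stv` are arrays of matched whole-text and segment-averaged vectors):

```python
import numpy as np

def cosine_distance(a, b):
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def validate_split_and_average(wtv, stv, n_perm=1000, seed=0):
    """wtv, stv: (n_docs, 1024) matched whole-text and segment-averaged vectors."""
    rng = np.random.default_rng(seed)
    d_orig = np.mean([cosine_distance(w, s) for w, s in zip(wtv, stv)])
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(len(stv))   # random re-pairing of WTVs and STVs
        null[k] = np.mean([cosine_distance(w, stv[j]) for w, j in zip(wtv, perm)])
    # Matched pairs should be closer (smaller distance) than random pairs
    p = (np.sum(null <= d_orig) + 1) / (n_perm + 1)
    return d_orig, null, p
```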

Model building

A multiple ridge regression approach was used to delineate the predictive relationship between the prior context and upcoming linguistic inputs. The model consisted of 1024 independent ridge regressions, with each dimension of the upcoming input vector predicted from all dimensions of the linguistic context embedding. This method potentially decorrelates the feature space, in line with recent findings that an embedding whitening procedure can enhance model performance117. Mathematically, for each ridge regression model, given n samples of one dimension of the upcoming input vector \(\boldsymbol{Y}\) (n × 1) and all dimensions of the context matrix \(\boldsymbol{X}\) (n × 1024), we estimated the coefficients \(\boldsymbol{\beta}\) (1024 × 1) and the intercept \(\beta_0\) by minimizing the following cost function:

$$\|\boldsymbol{X}\boldsymbol{\beta}+\beta_0-\boldsymbol{Y}\|_2^2+\lambda\|\boldsymbol{\beta}\|_2^2$$

where λ is the regularization term that prevents overfitting by shrinking the coefficients \(\boldsymbol{\beta}\). To estimate the parameters (\(\boldsymbol{\beta}\) and \(\beta_0\)), the dataset (see “Dataset generation”) was split into training (80%) and test (20%) sets (Fig. 1a). The optimal λ was selected using 4-fold cross-validation within the training set. The input vector \(\boldsymbol{Y}\) and the context matrix \(\boldsymbol{X}\) were normalized by column (i.e., across training samples) in advance. The model training and testing processes were implemented with the sklearn toolbox119.
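A minimal sketch of the training procedure with sklearn, using placeholder data (the alpha grid is an assumption; with a multi-output target, `RidgeCV` fits all 1024 regressions and selects a shared λ by 4-fold cross-validation, consistent with the description above):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 1024))  # placeholder context embeddings
Y = rng.standard_normal((2000, 1024))  # placeholder target embeddings

# 80/20 train-test split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

# Column-wise normalization, fit on the training set only
sx, sy = StandardScaler(), StandardScaler()
X_train, X_test = sx.fit_transform(X_train), sx.transform(X_test)
Y_train, Y_test = sy.fit_transform(Y_train), sy.transform(Y_test)

# One ridge regression per target dimension, with lambda (alpha) selected
# by 4-fold cross-validation within the training set
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=4)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)  # predicted target embeddings
```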

Model validation

A pairwise classification task was used to evaluate model performance57. First, 1000 samples were randomly selected from the test set, each containing a vector Vreal-target for the prediction target and a vector Vreal-context for the prior linguistic context. Next, a predicted vector Vpred-target was generated by the trained models from Vreal-context. The cosine distance between Vpred-target and Vreal-target was calculated as D1, and the distance between Vpred-target and a randomly selected Vrand-target was calculated as D2. If D1 < D2 (i.e., the predicted vector was closer to the actual target than to the random target), the sample was labeled “right”, and “wrong” otherwise. Accuracy was then computed across all samples, and the procedure was repeated 1000 times for robustness. We also computed the Pearson correlation between the actual and predicted targets to supplement the classification results.
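A minimal sketch of the pairwise classification procedure (hypothetical function name; `pred` and `real` are the predicted and actual target vectors from the test set, with at least `n_items` rows):

```python
import numpy as np

def cosine_distance(a, b):
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pairwise_accuracy(pred, real, n_items=1000, n_repeats=1000, seed=0):
    """pred, real: (n_test, 1024) predicted and actual target vectors."""
    rng = np.random.default_rng(seed)
    accs = np.empty(n_repeats)
    for r in range(n_repeats):
        idx = rng.choice(len(real), size=n_items, replace=False)
        correct = 0
        for i in idx:
            j = rng.choice(np.delete(np.arange(len(real)), i))  # random distractor
            d1 = cosine_distance(pred[i], real[i])  # predicted vs. actual target
            d2 = cosine_distance(pred[i], real[j])  # predicted vs. random target
            correct += d1 < d2  # closer to the actual target counts as "right"
        accs[r] = correct / n_items
    return accs.mean()
```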

Obtaining the predictive representations of the experimental stimuli

The texts of the three stories were segmented at both the word and sentence levels (Fig. 1b). Word segmentation was performed using the jieba toolbox in Python. A sentence is typically defined as a string of words expressing a complete thought, containing at least a subject and a predicate. In spoken language, however, subjects may be omitted for simplicity, speech errors may occur, and oral language can diverge from formal grammar. Thus, to partition sentences appropriately, we recruited 10 raters to mark the text wherever they judged a sentence to end (a sentence boundary annotation task). A sentence boundary was established if at least 5 raters marked the same location33. A trained experimenter then reviewed and refined the marked positions. Praat was then used to align the segmented text (words and sentences) to the audio recordings120. Finally, the processed experimental materials were converted into vector representations using the same procedures described above.

Relating BOLD signals with the predictive representations using gGLM analysis

The gGLM analysis was conducted to identify the neural underpinnings of the predictive representations at the word and sentence levels. Specifically, we implemented a leave-one-subject-out (LOSO) cross-validation procedure, in which the model performance for each participant was evaluated using the data from all other participants. This approach effectively avoids overfitting and suppresses the non-independence error62. Additionally, to reduce the risk of overfitting due to the high dimensionality of the vector representations (1024 dimensions), we applied a feature reduction procedure combining Isomap and PCA121,122. Prior studies have shown that concatenating Isomap and PCA components can achieve performance comparable to the full feature space121. To meet the minimum criteria (i.e., PCA cumulative variance explained ≥50% and Isomap residual variance at its minimum), we retained 15 Isomap components and 35 PCA components (Supplementary Tables 4 and 5).
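A minimal sketch of the feature reduction step, assuming the 1024-dimensional representations are stored row-wise (function name hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

def reduce_features(embeddings, n_isomap=15, n_pca=35):
    """Concatenate Isomap and PCA components of the (n_units, 1024) embeddings."""
    iso = Isomap(n_components=n_isomap).fit_transform(embeddings)
    pca = PCA(n_components=n_pca).fit_transform(embeddings)
    return np.hstack([iso, pca])  # (n_units, 50) reduced feature space
```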

Furthermore, the design matrices were generated using the function “make_first_level_design_matrix” from the Python toolbox nilearn123. Specifically, the following steps were conducted within this function (Supplementary Fig. 2): (1) Oversampling. Based on the timing information of the language units in the stories (e.g., the offsets of words or sentences), a time course was generated and oversampled at 50 Hz; (2) HRF convolution. The oversampled time course was convolved with the hemodynamic response function (HRF); (3) Downsampling. The convolved time course was downsampled to 0.5 Hz, matching the fMRI sampling rate (i.e., TR = 2 s). The downsampled time course was then used to fit the fMRI signals using the gGLM.
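A minimal sketch of the design-matrix construction with nilearn, with one parametrically modulated regressor per reduced feature dimension (the wrapper function, column names, and scan parameters are hypothetical):

```python
import numpy as np
import pandas as pd
from nilearn.glm.first_level import make_first_level_design_matrix

TR, n_scans = 2.0, 300                     # illustrative scan parameters
frame_times = np.arange(n_scans) * TR

def feature_design_matrix(onsets, durations, features):
    """One parametrically modulated regressor per reduced feature dimension.

    `onsets` are the offsets of units N-1 (pre-activation of unit N);
    nilearn internally oversamples each regressor, convolves it with the
    HRF, and resamples it to the TR.
    """
    columns = []
    for k in range(features.shape[1]):
        events = pd.DataFrame({
            "onset": onsets,
            "duration": durations,
            "trial_type": f"feat_{k}",
            "modulation": features[:, k],
        })
        dm = make_first_level_design_matrix(
            frame_times, events, hrf_model="spm", drift_model=None)
        columns.append(dm[f"feat_{k}"])
    return pd.concat(columns, axis=1)
```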

The variance partitioning (VP) approach was employed to identify the prediction effects. Specifically, a model including only the context representations (MC) was used to estimate the context effect (Supplementary Tables 4 and 5). Then, a full model including both the context and predictive representations (MF) was used to capture both effects. The difference in explained variance (R²) between the two models (i.e., MF − MC) quantified the unique effect of the predictive representations. The unique effect of the prior context was estimated analogously (Supplementary Fig. 5), by training a model including only the predictive representations (MP) and subtracting its performance from that of the full model (MF). In addition, following the concept of pre-activation3, predictive representations of linguistic unit N (i.e., a word or sentence) were aligned to the offset of linguistic unit N − 1 (Supplementary Fig. 2). For each model, participants and stories were dummy-coded and included as covariates to control for individual- and story-level differences. Regressors modeling the word and sentence boundaries were included to account for the temporal delay in BOLD signals with respect to the stimuli. The log-transformed word or sentence frequencies (i.e., the average frequency of all words in a sentence) were also regressed out from the corresponding models to control for the statistical influence of everyday language usage11. Word frequencies were obtained from Cai and Brysbaert124, who derived them from film subtitles that approximate everyday language exposure. Frequencies were log-transformed due to their inherently skewed distribution.
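The core of the variance partitioning logic can be sketched as follows; for brevity, this in-sample sketch omits the LOSO cross-validation and the covariates described above (function name hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def unique_r2(X_context, X_pred, y):
    """Variance uniquely explained by the predictive representations.

    Fits the context-only model (MC) and the full model (MF) and returns
    R2(MF) - R2(MC), the unique contribution of the predictive features.
    """
    X_full = np.hstack([X_context, X_pred])
    r2_context = LinearRegression().fit(X_context, y).score(X_context, y)
    r2_full = LinearRegression().fit(X_full, y).score(X_full, y)
    return r2_full - r2_context
```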

The gGLM analysis was conducted using a parcellation approach with 400 non-overlapping parcels63. BOLD signals within each parcel were pre-whitened using an AR(1) noise model implemented in nistats125 and then averaged. A paired t-test was performed between the forward and backward conditions to identify significant parcels. Multiple comparisons were controlled using an FDR threshold of q < 0.0164. Significant parcels were visualized by projecting them onto a cortical surface using BrainNet Viewer126.

To validate the neural underpinnings of the predictive representations, a permutation test was performed. In each of 1000 iterations, the features associated with each linguistic unit were shuffled to remove the prediction effect, and the same pipeline described above was repeated, yielding a null distribution of R². Finally, p-values were obtained from the position of the original R² value within this null distribution.

PC-based computational modeling

To directly test the sparse and continuous updating hypotheses, we constructed two computational models satisfying the minimal assumptions of the PC framework43,127. The PC framework posits that the brain processes upcoming information hierarchically, with level N generating prediction signals for level N − 1. Prediction errors (PEs), defined as the differences between the predicted and actual neural responses at level N − 1, are sent back to level N to update subsequent predictions42. In our models, the architecture corresponding to the continuous updating hypothesis is formalized by the differential equations below (the continuous updating PC model):

$$\frac{dZ_w(t)}{dt}=w_0\cdot x_w(t+dt)+w_1\cdot\left(Z_s(t)-Z_w(t)\right)$$
$$x_s(t+dt)=Z_s(t)-Z_w(t+dt)$$
$$\frac{dZ_s(t)}{dt}=s_0\cdot x_s(t+dt)+s_1\cdot\left(\mathrm{Prior}(t)-Z_s(t)\right)$$

where \(dt\) was set to the repetition time (TR = 2 s) during model estimation. The word-level PE, \(x_w(t)\), was defined as the min-max normalized cosine distance between the predicted and actual word vectors, resampled to the fMRI acquisition rate (TR = 2 s). Cosine distance was used because it provides a robust measure of dissimilarity and is less sensitive to vector magnitude than alternative metrics128,129,130. Min-max normalization constrained the PEs to positive values within the range [0, 1]. \(Z_w\) and \(Z_s\) denote the neural signals associated with word- and sentence-level predictions, computed as the average signal across the parcels identified in the gGLM analyses (i.e., \(Z_w\) is the average of the bilateral STC and MTC; \(Z_s\) is the average of the right TPJ, medial PFC, and precuneus). The sentence-level PE, \(x_s(t)\), was modeled as the difference between \(Z_s(t)\) and the upcoming \(Z_w(t)\). \(\mathrm{Prior}(t)\) represents the higher-level top-down input to the sentence level \(Z_s(t)\). Following previous research, it was set to 0, under the assumption that top-down priors exert minimal influence on the information updating strategy at the levels under study43. In addition, \(w_0\), \(w_1\), \(s_0\), and \(s_1\) are parameters to be estimated. Mathematically, \(w_0\) and \(s_0\) determine how strongly the word- and sentence-level PEs drive the neural signals, while \(w_1\) and \(s_1\) primarily regulate the decay rates of \(Z_w(t)\) and \(Z_s(t)\), respectively.
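A minimal Euler-integration sketch of the continuous updating PC model, with \(dt\) equal to the TR (function name hypothetical); the sparse updating variant would instead read the delayed signals \(Z_s(t-\Delta t)\) and \(Z_w(t-\Delta t+dt)\) at the preceding sentence boundary:

```python
import numpy as np

def simulate_continuous_pc(x_w, prior, w0, w1, s0, s1, dt=2.0):
    """Euler integration of the continuous updating PC model (dt = TR = 2 s).

    x_w : word-level PEs per TR (min-max normalized cosine distances).
    prior : top-down input to the sentence level (zeros, as in the text).
    """
    n = len(x_w)
    Z_w, Z_s = np.zeros(n), np.zeros(n)
    for t in range(n - 1):
        # Word level: driven by the upcoming word-level PE and by Z_s
        Z_w[t + 1] = Z_w[t] + dt * (w0 * x_w[t + 1] + w1 * (Z_s[t] - Z_w[t]))
        # Sentence-level PE: discrepancy between Z_s and the upcoming Z_w
        x_s = Z_s[t] - Z_w[t + 1]
        Z_s[t + 1] = Z_s[t] + dt * (s0 * x_s + s1 * (prior[t] - Z_s[t]))
    return Z_w, Z_s  # to be passed through the hemodynamic model (see below)
```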

In contrast, the sparse updating hypothesis posits a discretized information exchange between adjacent levels. The corresponding model can be described by the following delay differential equations (the sparse updating PC model):

$$\frac{dZ_w(t)}{dt}=w_0\cdot x_w(t+dt)+w_1\cdot\left(Z_s(t-\Delta t)-Z_w(t)\right)$$
$$x_s(t+dt)=Z_s(t-\Delta t)-Z_w(t-\Delta t+dt)$$
$$\frac{dZ_s(t)}{dt}=s_0\cdot x_s(t+dt)+s_1\cdot\left(\mathrm{Prior}(t)-Z_s(t)\right)$$

where \(\Delta t\) quantifies the time lag between the current moment and the boundary of the preceding sentence. All other variables and parameters are identical to those in the continuous updating PC model.

Neural signals simulated by the PC models were subsequently transformed into BOLD responses using a hemodynamic model, enabling comparison with the actual BOLD signals. The hemodynamic model comprises the Balloon model and the BOLD model69,70,131. Specifically, the Balloon model describes how neural activity induces changes in blood volume and deoxy-hemoglobin (dHb), and is formulated as follows:

$$\frac{ds(t)}{dt}=Z(t)-\kappa\cdot s(t)-\gamma\left(f(t)-1\right)$$
$$\frac{df(t)}{dt}=s(t)$$
$$\tau\frac{dv(t)}{dt}=f(t)-v(t)^{1/\alpha}$$
$$\tau\frac{dq(t)}{dt}=f(t)\cdot\frac{1-\left(1-E_0\right)^{1/f(t)}}{E_0}-v(t)^{1/\alpha}\cdot\frac{q(t)}{v(t)}$$

where \(Z(t)\) is the neural response derived from the PC models; \(s(t)\) represents the vasodilatory signal; \(f(t)\) is the blood inflow; \(v(t)\) corresponds to the local change in blood volume; and \(q(t)\) indicates the proportion of dHb.

Further, the BOLD model characterizes how blood volume and dHb synergistically contribute to changes in the BOLD signal, expressed by the following non-linear equation:

$$\frac{\Delta S(t)}{S_0}\approx V_0\left[k_1\left(1-q(t)\right)+k_2\left(1-\frac{q(t)}{v(t)}\right)+k_3\left(1-v(t)\right)\right]$$

where parameters \({k}_{1}\), \({k}_{2}\), and \({k}_{3}\) are calculated through the following equations:

$$k_1=4.3\cdot\vartheta_0\cdot E_0\cdot TE$$
$$k_2=\varepsilon_h\cdot r_0\cdot E_0\cdot TE$$
$$k_3=1-\varepsilon_h$$

in which \({S}_{0}\) is the BOLD signal at rest, and \(\Delta S\) is the BOLD signal change induced by task performance. Details of all the parameters are listed in Supplementary Table 6. The simulation data for all hidden variables are visualized in Supplementary Fig. 8.
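A minimal sketch of the hemodynamic forward model (Balloon + BOLD), integrated with simple Euler steps; the parameter values below are common literature defaults and are illustrative only (see Supplementary Table 6 for the values used in the study):

```python
import numpy as np

def balloon_bold(Z, dt=0.1, kappa=0.65, gamma=0.41, tau=0.98, alpha=0.32,
                 E0=0.34, V0=0.02, TE=0.03, theta0=40.3, r0=25.0, eps=0.54):
    """Balloon + BOLD forward model, integrated with Euler steps.

    Z is the neural signal from the PC models, resampled to resolution dt.
    """
    k1 = 4.3 * theta0 * E0 * TE
    k2 = eps * r0 * E0 * TE
    k3 = 1 - eps
    s, f, v, q = 0.0, 1.0, 1.0, 1.0        # resting-state initial conditions
    bold = np.zeros(len(Z))
    for t in range(len(Z)):
        ds = Z[t] - kappa * s - gamma * (f - 1)        # vasodilatory signal
        df = s                                          # blood inflow
        dv = (f - v ** (1 / alpha)) / tau               # blood volume
        dq = (f * (1 - (1 - E0) ** (1 / f)) / E0
              - v ** (1 / alpha) * q / v) / tau         # deoxy-hemoglobin
        s, f, v, q = s + dt * ds, f + dt * df, v + dt * dv, q + dt * dq
        bold[t] = V0 * (k1 * (1 - q) + k2 * (1 - q / v) + k3 * (1 - v))
    return bold
```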

The gradient descent method was employed to estimate the parameters \(w_0\), \(w_1\), \(s_0\), and \(s_1\). Gradient descent is an iterative optimization algorithm that seeks a local minimum of the cost function. Specifically, the four parameters (\(\boldsymbol{\theta}=\{w_0,w_1,s_0,s_1\}\)) were updated simultaneously:

$$\boldsymbol{\theta}=\boldsymbol{\theta}-\alpha\frac{d}{d\boldsymbol{\theta}}J(\boldsymbol{\theta})$$

where \(\alpha\) denotes the learning rate. The cost function \(J(\boldsymbol{\theta})\) was defined as:

$$J(\boldsymbol{\theta})=\frac{1}{2n}\sum_{i=1}^{n}\frac{\left(\hat{Z}_{w_i}-Z_{w_i}\right)^2+\left(\hat{Z}_{s_i}-Z_{s_i}\right)^2}{2}$$

where \(Z_w\) and \(Z_s\) are the fMRI signals associated with the predictions of words and sentences, derived from the gGLM analysis at each level before averaging and z-scoring; \(\hat{Z}_w\) and \(\hat{Z}_s\) are the corresponding estimated signals; and n is the signal length in TRs. During model training, the learning rate \(\alpha\) was set to \(1\times{10}^{-5}\), and the convergence threshold was defined as a change in the cost function of \(dJ<1\times{10}^{-4}\). Because the cost function is not guaranteed to be convex, we randomly initialized the parameters \(\boldsymbol{\theta}\) 10,000 times to identify the best initial condition. A leave-one-subject-out cross-validation approach was applied to estimate \(J(\boldsymbol{\theta})\). Model performance was quantified using the mean squared error (MSE), which equals twice \(J(\boldsymbol{\theta})\).
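A minimal sketch of the estimation loop; since the text does not specify how the gradients were computed, central finite differences are used here, and `forward_model` is a hypothetical callable wrapping the PC and hemodynamic models:

```python
import numpy as np

def fit_pc_parameters(forward_model, theta0, Z_w_obs, Z_s_obs,
                      lr=1e-5, tol=1e-4, max_iter=100_000):
    """Gradient descent on J(theta) with finite-difference gradients.

    `forward_model(theta)` should return the simulated (Z_w_hat, Z_s_hat)
    after the PC and hemodynamic models (hypothetical callable).
    """
    def cost(theta):
        Z_w_hat, Z_s_hat = forward_model(theta)
        n = len(Z_w_obs)
        # J = (1/2n) * sum of the mean of the two squared-error terms
        return ((Z_w_hat - Z_w_obs) ** 2
                + (Z_s_hat - Z_s_obs) ** 2).sum() / (4 * n)

    theta = np.asarray(theta0, dtype=float)
    J_prev = cost(theta)
    for _ in range(max_iter):
        grad = np.zeros_like(theta)
        for i in range(theta.size):          # central finite differences
            e = np.zeros_like(theta)
            e[i] = 1e-6
            grad[i] = (cost(theta + e) - cost(theta - e)) / 2e-6
        theta -= lr * grad                   # simultaneous update of all parameters
        J = cost(theta)
        if abs(J_prev - J) < tol:            # convergence: dJ < 1e-4
            break
        J_prev = J
    return theta, J                          # MSE = 2 * J
```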

Autocorrelation analysis

The BOLD signals from significant parcels were used to calculate the autocorrelation effect. The time courses of the signals were temporally shifted forward from 1 to 50 TRs. Then, Pearson correlation was calculated between the original and shifted signals for each participant using the tsa.acf() function from the statsmodels toolbox132.
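A minimal sketch of this computation (function name hypothetical):

```python
from statsmodels.tsa.stattools import acf

def parcel_autocorrelation(bold, n_lags=50):
    """Autocorrelation of a parcel's BOLD time course at lags of 1-50 TRs."""
    return acf(bold, nlags=n_lags)[1:]  # statsmodels' acf; drop the lag-0 term
```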

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.