Introduction

Language comprehension requires listeners to predict upcoming inputs based on previous knowledge and context1,2,3,4. Linguistic prediction can reduce computational load in the brain5, enabling listeners to instantaneously process highly dynamic speech flow (2–5 words per second)6,7. Previous research has primarily focused on predicting linguistic units at shorter timescales. Neuroimaging and electrophysiological findings have shown that phoneme prediction primarily engages the bilateral primary auditory cortices1,8,9, while word prediction involves a more distributed network, including the bilateral superior temporal gyrus (STG), left inferior parietal lobule (IPL), bilateral inferior frontal gyrus (IFG), and bilateral dorsolateral prefrontal cortex (dlPFC)1,9,10,11,12. Moreover, leveraging neural encoding models (e.g., general linear model, GLM) and various language models (e.g., recurrent neural networks, RNNs), recent studies have shown that the STG and PFC are largely involved in predicting part-of-speech (POS) tags1,12,13,14,15, indicating that grammatical structure can also be predicted at the word level (i.e., syntactic prediction).

However, natural language is not confined to smaller units; much of its complexity arises from larger units (e.g., sentences) that convey nuanced meanings and implicit messages16,17. Such units also enable individuals to navigate complicated situations18,19, such as interpreting social-emotional cues20 or inferring underlying communicative intentions21. Further, converging evidence has shown that the brain integrates past context across multiple timescales (i.e., the temporal receptive window, TRW)22,23,24, ranging from early sensory regions (e.g., the STG) operating on shorter timescales to higher-order areas (e.g., the PFC) on longer timescales. These findings raise the question of whether and how the brain implements the multilevel prediction of future linguistic units, particularly beyond phonemes and words.

Recent studies, though relatively limited, have examined how the brain predicts upcoming information over varying timescales. For instance, researchers found that the activity patterns in multiple brain regions shifted progressively during repeated movie viewing, following a posterior-to-anterior gradient across the cortex25. This finding suggests that the brain actively anticipates upcoming movie plots after prior exposure. Relatedly, a study leveraged large language model (LLM)-based methods to test whether neural encoding performance improves when incorporating a “forecast window”, providing evidence for an anticipation hierarchy during language comprehension26. Nonetheless, since timescales in the language system can be defined in different ways (e.g., a syntax-driven hierarchy27 or a semantics-based hierarchy28), it remains unclear which specific linguistic level(s) within the prediction hierarchy these studies capture. To bridge this gap, we focused on how the brain conducts semantic prediction of incoming words and sentences. We selected words and sentences because they are well-recognized linguistic levels in most language hierarchy frameworks22,27,28,29. Moreover, both serve as natural semantic units that can convey meaning independently, yet at different levels of complexity, thereby providing a framework to investigate a semantic prediction hierarchy during natural language comprehension. Together, in the present study, we investigated multilevel linguistic prediction by probing neural predictive representations of longer-timescale units such as sentences and shorter-timescale units such as words.

Additionally, it is crucial to understand how information is updated between levels within the prediction hierarchy. The neural representation of the prediction hierarchy remains poorly characterized; only a few studies have investigated this question, and they have primarily focused on lower levels1,8,29,30. A key debate emerging from these studies centers on how information is updated along the hierarchy. One perspective suggests that higher levels are updated continuously as inputs from lower levels unfold over time (i.e., the continuous updating hypothesis). For example, studies on auditory perception31 and narrative comprehension32 have shown that neural responses at higher levels increase gradually when new inputs are introduced. Moreover, computational models built on the continuous updating hypothesis can capture the neural dynamics during context construction and forgetting32. In contrast, another perspective suggests that higher-level updates occur only at the end of their preferred timescales, leading to abrupt rather than gradual changes in neural responses (i.e., the sparse updating hypothesis). For instance, evidence shows that neural activity in the precuneus changed sharply at event boundaries corresponding to its preferred timescales33. Similarly, a study using an RNN model demonstrated that the sparse updating model, but not the continuous model, identified the processing architecture in the human brain along the temporo-parietal axis30. Further, a study proposed that during discourse comprehension, the brain instantiates a single conscious representation of the input (e.g., a word) that remains stable unless perturbed by new inputs28. Achieving such representational stability requires cortical circuits to reach steady states sparsely across multiple intermediate levels. Based on these findings, we tested which hypothesis (continuous updating or sparse updating) better explains information updating from regions supporting word prediction to regions supporting sentence prediction.

To investigate the prediction hierarchy and examine the information-updating modes in the human brain, we combined natural language processing (NLP) and neural computational modeling approaches to analyze brain signals from individuals engaged in a narrative comprehension task, recorded using functional magnetic resonance imaging (fMRI). In this task, 31 participants listened to three stories presented either forward or backward. The forward condition intrinsically involves a linguistic prediction hierarchy27,29, whereas the backward condition serves as a control for acoustic features. Next, we aimed to quantify the predictive relationship between preceding context and upcoming linguistic units at both the word and sentence levels (see “Methods” section). While decoder-only transformer architectures (e.g., generative pre-trained transformer, GPT) are widely used for language prediction1,10, they typically predict linguistic units at shorter timescales (primarily words)1. Therefore, we applied a multiple ridge regression approach to derive predictive representations at both the word and sentence levels34,35. Further, using the group-based GLM (gGLM), we identified neural correlates associated with the predictive representations of upcoming linguistic units before their appearance (i.e., the neural pre-activation3). Finally, we applied the computational models to differentiate between the continuous and sparse updating hypotheses within the predictive coding (PC) framework. The PC framework posits that the brain processes inputs through a multilevel cascade, generating top-down prediction signals and bottom-up error signals that iteratively update the internal model36,37,38,39. This theoretical account is supported by a growing body of empirical evidence from both computational40,41,42 and neuroscience studies43,44. In the present work, we simplified the PC architecture to two levels, with words and sentences corresponding to the lower and higher levels, respectively. We implemented two variants, one based on the continuous updating hypothesis and the other on the sparse updating hypothesis, which we compared and evaluated by simulating fMRI responses at the word and sentence levels.

In line with prior findings, we first predicted that word prediction would primarily engage lower-order brain regions such as the STG10,11. Additionally, given recent advances showing that the default mode network (DMN) is largely recruited in the anticipation of future events (also known as prospective memory)25,45,46,47 and is especially engaged in narrative understanding at longer timescales22,24,47,48,49, we expected to observe neural representations of sentence prediction within the DMN regions. Furthermore, considering the chunking property of the language system27,29, we postulated that the functional interactions between the word and sentence levels would occur in a sparse manner, which has also been shown to be more computationally efficient and able to accelerate updating30,50. Overall, we provide evidence for the linguistic prediction hierarchy and the cross-level information updating in the brain.

Results

Behavioral performance in narrative comprehension

In the narrative comprehension task, participants were instructed to passively listen to three stories while fixating on a central cross presented on a black screen. The sequence of the six audio clips (three in the forward condition and three in the backward condition; see “Methods” section) was counterbalanced across participants. Details of the stimuli (e.g., the length of each story) are provided in Supplementary Table 1.

At the end of each forward story, participants rated how well they perceived and comprehended the content (see “Methods” section). We first assessed the clarity of story perception and found that clarity scores for all stories were significantly above the chance level (chance level = 2.5; one-sample t-test, p < 0.05; Supplementary Table 2). Additionally, we found no significant differences in perception scores across the three stories, including clarity (one-way ANOVA, F(2, 90) = 0.203, p = 0.817, f = 0.081), familiarity (F(2, 90) = 0.594, p = 0.554, f = 0.138), and complexity (F(2, 90) = 3.000, p = 0.055, f = 0.311). These results support the reliability of the comprehension scores reported below.

Next, we assessed how well participants comprehended the forward stories. We performed non-parametric tests due to the non-normal distribution of the data (see “Methods” section). Results showed that the comprehension scores for each forward story were significantly above the chance level (Wilcoxon signed rank test; story 1 chance level = 2.5; story 2 chance level = 1.5; story 3 chance level = 1.5; p < 0.05; Supplementary Table 2) but did not significantly differ among the three stories (Kruskal-Wallis test, H(2) = 5.524, p = 0.063). Thus, the comprehension scores were summed across the three stories to represent overall performance (mean across participants = 10.548, S.D. = 0.850), which was also significantly higher than the chance level (chance level = 5.5; Wilcoxon signed rank test, T(31) = 496.000, p < 0.001). Although the differences across stories in complexity ratings and comprehension scores were marginally significant, we analyzed each story separately in subsequent analyses and trained encoding models with a leave-one-subject-out (LOSO) approach. Therefore, differences across stories were unlikely to influence the results.

The predictive representations of words and sentences

We employed a two-stage procedure to obtain predictive embeddings at both the word and sentence levels. At the first stage, we used the Robustly Optimized Bidirectional Encoder Representations from Transformers (BERT) with Whole Word Masking (WWM-RoBERTa) to obtain the vector representations of language information51. A BERT-based model was selected because it is trained on both preceding and following contexts, enabling the model to generate more comprehensive and context-rich representations than causal models52. Specifically, WWM-RoBERTa is a variant of the BERT model, featuring a larger architecture, a larger batch size, and an expanded training dataset52. It is trained to predict whole words rather than individual characters, and therefore shows greater generalizability and adaptability for Mandarin53. In practice, word representations were obtained by feeding each word individually into the WWM-RoBERTa model (without context). Sentence and context representations were acquired by feeding the entire texts into the WWM-RoBERTa model and then averaging embeddings across all words (see “Methods” section).

At the second stage, a multiple ridge regression approach was used to model the predictive relationship of embeddings between the prior linguistic context and upcoming linguistic units (Fig. 1a). We employed multiple ridge regression to enable comparability across the two levels of linguistic units (i.e., words and sentences). Note that the regression model is independent of the brain data, serving solely to capture the predictive relationship. This approach operated on the vector representations obtained from the WWM-RoBERTa model34,35. The multiple ridge regression approach assumes that the predictive relationship is approximately linear in the space of semantic vectors extracted from the WWM-RoBERTa model, an assumption supported by evidence that embeddings from large language models exhibit analogical relations (e.g., queen – woman ≈ king – man)54,55,56. Moreover, the ridge regression model effectively mitigates the overfitting problem. In practice, each dimension of the upcoming target vectors was predicted using a different ridge regression model, with parameters estimated from training data (80%) and validated on testing data (20%; see “Methods” section). Separate models were constructed for words and sentences.
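As a concrete illustration of this second stage, the following minimal sketch fits such a model with scikit-learn; the variable names (e.g., context_vecs, target_vecs) and the ridge penalty are placeholders rather than the exact settings used in the study.

```python
# Minimal sketch of the second-stage predictive model (illustrative values only).
# context_vecs / target_vecs: (n_samples, 1024) WWM-RoBERTa embeddings of the
# prior context and the upcoming linguistic unit (word or sentence).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
context_vecs = rng.standard_normal((1000, 1024))   # placeholder data
target_vecs = rng.standard_normal((1000, 1024))    # placeholder data

X_train, X_test, y_train, y_test = train_test_split(
    context_vecs, target_vecs, test_size=0.2, random_state=0)

# With a shared penalty, a multi-output ridge fit is equivalent to fitting a
# separate ridge regression model for each target dimension.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
predicted = model.predict(X_test)                  # (n_test, 1024) predicted vectors
```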

Fig. 1: Schematic demonstration of the analytic approach.

a Training and testing the multiple ridge regression models. The dataset (~0.2 million samples) was generated from Chinese Wikipedia by randomly selecting the prior linguistic context and upcoming linguistic units (word or sentence). The context and linguistic unit were transformed into fixed-length vectors via the WWM-RoBERTa model. Then, 80% of samples were used to train the multiple ridge regression model to capture the predictive relationship between the context and linguistic unit, and the remaining 20% were used for model evaluation. Word and sentence prediction models were trained separately. b Processing of the experimental materials. The story audios were transcribed, segmented, and aligned at both the word level (via the “jieba” toolbox implemented in Python) and the sentence level (via a sentence boundary segmentation task); the aligned transcripts were further used to generate the predictive representations using the ridge regression models. These representations were reduced to 50 dimensions, resampled, and convolved with the hemodynamic response function (HRF) for encoding model analyses. c Roadmap of the group-based general linear model (gGLM). BOLD signals were collected while participants listened to stories. The BOLD signals were then preprocessed and grouped into 400 parcels according to ref. 63. For each parcel, leave-one-subject-out (LOSO) cross-validation was employed to obtain the explained variance (R2) across participants.

To evaluate the performance of the ridge regression models, we first calculated cosine distances between the vectors predicted by the models and the actual target vectors (denoted as D1). D1 was compared with the cosine distances between the predicted vectors and the vectors randomly selected from the test set (denoted as D2). D2 served as a baseline, representing a scenario without a predictive relationship, as its distribution was centered around 1 (Fig. 2a, b, gray histograms). We randomly sampled 1000 instances from the testing set and found that D1 was significantly lower than D2 at both word and sentence levels (paired t-test; word level: t(999) = 19.18, p < 0.001, d = 0.876; sentence level: t(999) = 43.870, p < 0.001, d = 1.439; Fig. 2a, b). Additionally, we calculated and compared the Pearson correlation between predicted and real targets (r1) or randomly generated targets (r2) as validation. As expected, results showed significant differences between these two conditions at both levels (paired t-test after applying a Fisher-z transformation to the r values; word level: r1 = 0.078 ± 0.112; r2 = −0.001 ± 0.065; t(999) = −18.264, p < 0.001, d = 0.876; sentence level: r1 = 0.113 ± 0.112; r2 = 0.004 ± 0.065; t(999) = −43.812, p < 0.001, d = 1.439; Supplementary Fig. 1a).

Fig. 2: Performance of the representational prediction models.

a, b Cosine distances between the predicted and actual target vectors for the word (green) and sentence (orange) prediction models, compared with the cosine distances between the predicted and randomly selected vectors (gray). c, d Classification accuracies of both word (73.234 ± 4.265%, green) and sentence predictions (81.471 ± 3.478%, orange) exceed those for random data (gray). e, f Both models exhibit an incremental context effect. The cosine distances were max–min normalized, and the vertical line in (e) indicates the average sentence length in words. Shaded areas represent the standard error of the mean (SEM). g Model performance on the experimental materials.

Furthermore, a pairwise classification task was employed to compare D1 and D2 (Fig. 1a, right panel)57, where an instance was classified as correct if D1 was smaller than D2. Otherwise, it was classified as incorrect. We repeated the procedure 1000 times to ensure robustness. The resulting prediction accuracy was significantly above the chance level (i.e., 50%) for both word (73.234 ± 4.265%, p < 0.001; Fig. 2c) and sentence (81.471 ± 3.478%, p < 0.001; Fig. 2d) models. To validate these findings, we generated a randomized dataset by shuffling the pairwise correspondence between the prediction targets and the preceding linguistic context. Applying the same pairwise classification analysis to this randomized data yielded accuracy that did not significantly differ from the chance level (permutation test; word model: 50.132 ± 3.458%, p = 0.351; sentence model: 49.812 ± 4.719%, p = 0.503; Fig. 2c, d). Moreover, classification accuracy in the original dataset was significantly higher than that in the randomized dataset (two-sample t-test; word model: t(1998) = 128.800, p < 0.001, d = 5.760; sentence model: t(1998) = 170.788, p < 0.001, d = 7.638; Fig. 2c, d). Finally, the word- and sentence-level ridge regression models were evaluated on the narrative stimuli used in this study. The word-level model achieved a classification accuracy of 68.241% (S.D. = 0.873%), and the sentence-level model reached 83.731% (S.D. = 2.811%), both significantly above the chance level (chance level = 50%; permutation test; word model: p < 0.001; sentence model: p < 0.001; Fig. 2g).
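To illustrate this evaluation procedure, the sketch below computes D1, D2, and the pairwise classification accuracy for a set of predicted and actual target vectors; the way the random foil is drawn is an assumption for demonstration purposes.

```python
# Sketch of the pairwise classification evaluation (illustrative implementation).
# predicted / targets: (n, 1024) arrays of predicted and actual target vectors.
import numpy as np
from scipy.spatial.distance import cosine

def pairwise_accuracy(predicted, targets, seed=0):
    rng = np.random.default_rng(seed)
    n, correct = len(predicted), 0
    for i in range(n):
        j = rng.integers(n - 1)
        j = j + 1 if j >= i else j                 # random foil index, j != i
        d1 = cosine(predicted[i], targets[i])      # D1: predicted vs. actual target
        d2 = cosine(predicted[i], targets[j])      # D2: predicted vs. random target
        correct += d1 < d2                         # classified correct if D1 < D2
    return correct / n                             # chance level = 0.5
```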

Together, these results indicated that our models reliably captured the predictive relationship between prior context and upcoming linguistic units. Notably, the sentence model consistently outperformed the word model on both the corpus and the experimental materials. This advantage may stem from the BERT-derived sentence representations, which encode richer and more context-dependent information. Furthermore, computing sentence embeddings by averaging word vectors likely improves the signal-to-noise ratio. However, we suggest that the absolute accuracy of our models is not a direct indicator of prediction quality. Instead, the statistical significance offers a better measure of the model’s capability in capturing the predictive relationship.

Model prediction performance increases with context length

Previous evidence suggests that predictions of upcoming linguistic units are incrementally shaped by the preceding context. Accordingly, our models are expected to demonstrate improved performance as the length of the prior context increases, i.e., the incremental context effect10,58. To test this, we systematically varied the number of words or sentences in the prior context and assessed the impact of context length on model performance.

For the word-level model, the cosine distance between the predicted and actual vectors decreased as more preceding words were included (Fig. 2e). We identified the knee point using the Kneed Python toolbox, which detects maximum curvature via a rotation-based algorithm59. The knee point corresponded closely to the sentence boundary (Fig. 2e), based on the sentence length derived from Chinese Wikipedia (across approximately 3.9 million sentences, the median length was 15 words; Supplementary Fig. 1b). Similarly, for the sentence-level model, the cosine distance decreased as more preceding sentences were provided, with a notable knee point observed when the number of prior sentences reached 4 (Fig. 2f). Together, these results support the capacity of our models to capture the predictive relationship in natural language.
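A minimal example of this knee-point detection with the Kneed toolbox is sketched below; the curve values are synthetic placeholders standing in for the context-length results.

```python
# Sketch of knee-point detection on the context-length curve (synthetic data).
import numpy as np
from kneed import KneeLocator

context_lengths = np.arange(1, 51)                         # number of prior words
mean_distances = 0.6 + 0.4 * np.exp(-context_lengths / 8)  # decreasing convex curve

kl = KneeLocator(context_lengths, mean_distances,
                 curve="convex", direction="decreasing")
print("knee point at context length:", kl.knee)
```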

The neural underpinnings of multilevel prediction

We employed encoding models to identify the neural correlates associated with the word- and sentence-level predictions. This method has been widely recognized for its reliability and validity in producing robust results26,58,60,61. Specifically, we applied the gGLM to associate BOLD signals with the predicted vectors derived from the ridge regression models (Fig. 1b, c, see “Methods” section). The gGLM was performed separately for the word and sentence levels. We further employed the leave-one-subject-out (LOSO) cross-validation approach to avoid overfitting and reduce the non-independence error in the secondary test62. Additionally, to improve computational efficiency, a template with 400 cortical parcels was used for the gGLM analysis63. Moreover, to test the concept of “neural pre-activation” in language prediction3, we related the predicted vectors of linguistic unit N to the BOLD signals of unit N-1. A series of potential confounding factors—including the temporal delays of words and sentences, word and sentence usage frequencies, and the effect of prior linguistic context—were ruled out (see “Methods” section). A paired t-test was performed between the forward and backward conditions on the explained variance (R2) of the gGLM. The results were corrected for multiple comparisons using the false discovery rate (FDR) method, with a significance threshold of p < 0.0164.
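The sketch below shows one plausible reading of this encoding analysis for a single parcel: predictive regressors are convolved with a canonical HRF, the GLM is fit on the remaining subjects, and R2 is computed on the held-out subject. The HRF form, the group-averaging step, and the variable names are illustrative assumptions, not the authors' exact pipeline.

```python
# Hedged sketch of the LOSO group encoding analysis for one parcel (illustrative).
# features: (n_TRs, 50) predictive regressors; bold: (n_subjects, n_TRs) signals.
import numpy as np
from scipy.stats import gamma
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def canonical_hrf(tr, duration=32.0):
    """Common double-gamma approximation of the canonical HRF, sampled at the TR."""
    t = np.arange(0, duration, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6
    return hrf / hrf.sum()

def loso_r2(features, bold, tr=2.0):
    hrf = canonical_hrf(tr)
    # Convolve each regressor with the HRF and truncate to the scan length.
    X = np.apply_along_axis(lambda f: np.convolve(f, hrf)[:len(f)], 0, features)
    r2 = []
    for s in range(bold.shape[0]):                        # leave one subject out
        train = np.delete(np.arange(bold.shape[0]), s)
        glm = LinearRegression().fit(X, bold[train].mean(axis=0))  # group-level fit
        r2.append(r2_score(bold[s], glm.predict(X)))      # test on the held-out subject
    return np.array(r2)
```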

At the word level, results showed that the predictive representations of words were associated with significant activations in the bilateral STG and the upper part of the middle temporal gyrus (MTG; Fig. 3a, c; Supplementary Fig. 3a; Supplementary Table 3). To validate this result, we conducted a permutation test on the significant regions of interest (ROIs) including the STG and MTG, where the word features were shuffled to remove the contextual predictive relationship. This procedure was repeated 1000 times to generate a null distribution. Results showed that the real value was significantly higher than the null distribution (p < 0.01, FDR corrected; Fig. 3b upper panel; Supplementary Fig. 4), confirming an association between word-level predictive representations and activity in the bilateral STG and MTG.

Fig. 3: Brain responses associated with predictive representations.

a Brain regions sensitive to predictive representations across different timescales. The brain map exhibits the R2 difference between forward and backward conditions, with only significant results plotted. b Results of the permutation test for significant ROIs at the word and sentence levels. In each panel, the x-axis represents the R2 of each permutation, which has been z-scored for display purposes. Gray histograms are the null distributions, and vertical lines indicate the positions of the real values, green for word level and orange for sentence level. c R2 differences between forward and backward conditions for each ROI, where each dot represents one subject. Significance levels are indicated as p < 0.001 (***), p < 0.01 (**), p < 0.05 (*), and p ≥ 0.05 (n.s.).

At the sentence level, results showed significant activation in the right TPJ, medial PFC (mPFC), and precuneus (Fig. 3a, c; Supplementary Fig. 3b; Supplementary Table 3). The same permutation test was performed, confirming significant activation in these brain regions (TPJ: p = 0.03; mPFC: p = 0.01; precuneus: p = 0.01; overall: p = 0.01; FDR corrected; Fig. 3b lower panel; Supplementary Fig. 4).

To differentiate the prediction effect from the context effect, we applied the gGLM to examine neural representations of past context at both word and sentence levels. We found that prior contextual information was broadly represented across frontal, temporal, and parietal regions (Supplementary Fig. 5a, b), consistent with previous findings on neural encoding of past linguistic context22,24,65. Further, we performed a variance partitioning (VP) analysis to isolate the unique contribution of preceding context from the prediction effect. We observed significant representations at both word and sentence levels in the bilateral STG, whereas word-level representations were more prominent in the prefrontal cortex (Supplementary Fig. 5c, d). Please note that the predictive and context features fed into the gGLM are not linearly related due to the nonlinear operations during feature extraction (i.e., the Isomap method; see “Methods” section).
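The variance partitioning step follows the standard logic of comparing full and reduced encoding models; the sketch below is a generic illustration with random placeholder data, not the exact implementation used here.

```python
# Generic variance partitioning sketch (placeholder data; illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def explained_variance(X, y):
    """R2 of an ordinary least-squares fit (stand-in for the gGLM step)."""
    return r2_score(y, LinearRegression().fit(X, y).predict(X))

rng = np.random.default_rng(0)
pred_feats = rng.standard_normal((300, 50))    # predictive regressors
ctx_feats = rng.standard_normal((300, 50))     # past-context regressors
bold = rng.standard_normal(300)                # one parcel's time series

r2_full = explained_variance(np.hstack([pred_feats, ctx_feats]), bold)
r2_ctx = explained_variance(ctx_feats, bold)
r2_pred = explained_variance(pred_feats, bold)

unique_pred = r2_full - r2_ctx        # variance explained only by prediction features
unique_ctx = r2_full - r2_pred        # variance explained only by context features
shared = r2_pred + r2_ctx - r2_full   # variance shared between the two feature sets
```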

Together, these findings suggested that the brain predicts upcoming words and sentences in a hierarchical manner during language comprehension. This neural pattern of prediction hierarchy differed from that of past context representations.

Examining the information updating mode of the prediction hierarchy

Next, we aimed to investigate how information is updated across the two levels in the prediction hierarchy. To test the sparse and continuous updating models, we employed a series of computational modeling approaches grounded in the PC framework42,43, which could characterize the dynamic interactions between neural regions associated with word- and sentence-level predictions.

Specifically, we implemented a two-level PC architecture, in which the word and sentence levels corresponded to the lower and higher levels, respectively (Fig. 4a). According to the PC framework, the higher level would generate a top-down prediction (\({Z}_{s}\)) that guides the lower level in updating its representation. Then, the higher-level prediction error (PE, \({x}_{s}\)) was calculated as the difference between the top-down predictions and the upcoming signals. Next, \({x}_{s}\) propagated back to the higher level to optimize the next top-down prediction. At the lower level, the PE (\({x}_{w}\)) was computed as the cosine distance between the predicted and actual word vectors, a dissimilarity measure that is less sensitive to vector magnitude than other metrics (see “Methods” section).

Fig. 4: Results of predictive coding (PC) neural modeling.

a Two computational models were constructed based on the continuous and sparse updating hypotheses. Left panel: the continuous updating hypothesis assumes that the higher-level representations are updated continuously as inputs change over time. Right panel: the sparse updating hypothesis assumes that the higher level predicts and updates only at its preferred timescales (i.e., at sentence boundaries). b, c Simulated data generated by the PC models were converted into the putative BOLD signals via the hemodynamic model, and further compared with the real fMRI responses to evaluate model performance. d Model performance in the forward and backward conditions. e The sparse PC model outperforms the continuous PC model only in the forward condition. f Model performance without word-level prediction error. g Comparison of MSE values for sparse and continuous models by leveraging PE for all subjects across stories. h, i Examples of simulated signals from the PC models at the word (Subject 10, story 2) and sentence levels (Subject 04, story 2), shown alongside the corresponding real BOLD signals. Significance levels are indicated as p ≤ 0.001 (***), p ≤ 0.01 (**), p ≤ 0.05 (*), and p > 0.05 (n.s.).

In the continuous updating PC model, predictions and PEs were allowed to transmit between the lower and higher levels instantaneously (Fig. 4a, left). By contrast, in the sparse updating PC model, information transfer between levels was delayed by \(\Delta t\) (Fig. 4a, right), ensuring that predictions and PEs were exchanged only at sentence boundaries33,66,67,68. Neural activity was simulated using these PC models, and then converted into BOLD signals (Fig. 4b)69,70. In practice, neural signals at the word (\({Z}_{w}\)) and sentence (\({Z}_{s}\)) levels were calculated as the averaged BOLD signals within the corresponding significant ROIs (as shown in Fig. 3a; Supplementary Table 3). We used a gradient descent algorithm to estimate the model parameters, with performance quantified by the mean square error (MSE) between simulated and actual BOLD signals (Fig. 4c). Lower MSE values indicate better model performance (see “Methods” section).
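To make the two variants concrete, the following hedged sketch simulates the sentence-level signal as a discretized leaky integrator driven by the word-level PE, with inter-level transfer either at every TR (continuous) or only at sentence boundaries (sparse). The dynamics, time constant, and boundary handling are illustrative simplifications, not the authors' exact equations.

```python
# Hedged sketch of the two updating variants (illustrative dynamics).
# x_w: word-level prediction error per TR (NumPy array); boundaries: TR indices
# of sentence endings; bold_s: measured sentence-level parcel signal.
import numpy as np
from scipy.stats import gamma

def simulate_sentence_level(x_w, boundaries, sparse, tau=2.0, dt=1.0):
    """Leaky-integrator sketch of the sentence-level signal Z_s."""
    z_s = np.zeros(len(x_w))
    last_b, drive = 0, 0.0
    for t in range(1, len(x_w)):
        if sparse and t in boundaries:            # sparse: transfer only at boundaries
            drive = x_w[last_b:t].mean()          # error accumulated since last boundary
            last_b = t
        elif not sparse:
            drive = x_w[t]                        # continuous: transfer at every TR
        z_s[t] = z_s[t - 1] + dt / tau * (-z_s[t - 1] + drive)
    return z_s

def to_bold(neural, tr=2.0):
    """Convolve a simulated neural signal with a canonical double-gamma HRF."""
    t = np.arange(0, 32, tr)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6
    return np.convolve(neural, hrf / hrf.sum())[:len(neural)]

# Model comparison: lower MSE against the measured BOLD signal is better, e.g.
# mse = np.mean((to_bold(simulate_sentence_level(x_w, boundaries, True)) - bold_s) ** 2)
```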

First, we compared MSE values between the forward and backward conditions. Both PC models performed significantly better in the forward condition than in the backward condition (paired t-test; t(185) = −12.760, p < 0.001, d = 1.086; Fig. 4d). Moreover, the sparse PC model significantly outperformed the continuous PC model in the forward condition (paired t-test; t(92) = −17.438, p < 0.001, d = 1.110; Fig. 4e), while no significant difference was observed in the backward condition (paired t-test; t(92) = 0.990, p = 0.325, d = 0.137; Fig. 4e). These effects remained consistent across all individual stories (Supplementary Fig. 6a). Further, we trained a control sparse model that preserved temporal sparsity but removed the sentence boundary information by shuffling the delay variable (\(\Delta t\)). This control sparse model outperformed the continuous model (paired t-test; t(92) = −8.885, p < 0.001, d = 0.581; FDR corrected; Supplementary Fig. 6b), but underperformed relative to the original sparse model (paired t-test; t(92) = 6.728, p < 0.001, d = 0.552; FDR corrected; Supplementary Fig. 6b). These results suggested that both general temporal sparsity and specific sentence boundaries enhanced the performance of the sparse model.

In addition, previous studies have shown that the lower-level PE (\({x}_{w}\)) plays an important role in linguistic processing10 and event delineation71. To test this account, we replaced the lower-level PE (\({x}_{w}\)) with white-noise signals. Under the forward condition, the results supported this hypothesis for the sparse model (paired t-test; t(92) = −12.246, p < 0.001, d = 0.777) but not for the continuous model (paired t-test; t(92) = 0.787, p = 0.433, d = 0.060; Fig. 4f, g).

Together, these findings support the sparse updating hypothesis, suggesting that sentence boundaries serve as key drivers of information flow within the word-to-sentence prediction hierarchy.

Sparse updating revealed from an autocorrelation analysis

In contrast to the continuous updating hypothesis, the sparse hypothesis posits that brain responses associated with sentence prediction remain stable until the sentence boundary is reached33. Therefore, we expected brain activity to exhibit a periodic pattern if the linguistic prediction hierarchy is sparsely updated. To this end, we examined autocorrelation in brain regions associated with word- and sentence-level predictions. Specifically, we temporally shifted the BOLD signals, without pre-whitening, over time lags from 1 TR to 50 TRs. For each lag, we computed the correlation between the shifted and original signals before comparing autocorrelations between the forward and backward conditions.
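A minimal sketch of this lagged autocorrelation computation is shown below; it assumes a one-dimensional parcel-averaged time series and follows the procedure just described.

```python
# Sketch of the lagged autocorrelation analysis (no pre-whitening; lags 1-50 TRs).
import numpy as np

def lagged_autocorrelation(bold, max_lag=50):
    """Correlation between the original and lag-shifted signal at each lag (in TRs)."""
    return np.array([np.corrcoef(bold[:-lag], bold[lag:])[0, 1]
                     for lag in range(1, max_lag + 1)])

# The forward vs. backward comparison is then run on these per-lag values across
# subjects and corrected with the Bonferroni method.
```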

Our results revealed significantly stronger autocorrelation in the forward condition than in the backward condition for brain regions associated with sentence prediction, at time lags of 8–11 TRs (p < 0.01, Bonferroni corrected; Fig. 5c, d; Supplementary Fig. 7). This range corresponds to approximately twice the sentence length (median: 4 TRs, Supplementary Fig. 1c). These findings support our prediction, as updating information at sentence boundaries requires the brain to simultaneously maintain pre- and post-boundary sentence information, potentially leading to a periodic pattern repeating every two sentences. In comparison, this effect was absent in brain regions associated with word-level prediction (Fig. 5a, b). These findings provide additional support for the sparse updating hypothesis.

Fig. 5: Results of the autocorrelation analysis.

a, c Autocorrelation results at the word and sentence levels, respectively. Colored lines indicate the forward condition and gray lines represent the backward condition. b, d Difference in autocorrelation between forward and backward conditions. The vertical dashed line represents twice the sentence length (8 TRs). Shaded areas represent the standard error of the mean (SEM). Multiple comparisons were corrected using the Bonferroni method with a significance threshold of p < 0.01.

Discussion

We characterized hierarchical linguistic prediction at both the word and sentence levels and examined how these two levels interact during narrative comprehension. We observed that the predictive representations of upcoming words are associated with brain responses in the STG and MTG, while those of upcoming sentences are associated with responses in the TPJ, mPFC, and precuneus. In addition, our computational modeling results supported the sparse updating strategy, rather than the continuous strategy, for cross-level interaction within the prediction hierarchy. These results highlight the brain’s capacity to anticipate future information over both shorter and longer timescales, suggesting that sentence boundaries may serve as potential markers for updating semantic information during naturalistic language comprehension.

Our findings of linguistic prediction at the word and sentence levels are reminiscent of the research on the temporal receptive window (TRW), which examines how past context at multiple timescales influences processing of ongoing inputs22,23,24. These studies have proposed a temporal representational hierarchy of context in the cerebral cortex, ranging from the early sensory regions responding to shorter timescales (i.e., small TRW) to higher-level brain regions responding to longer timescales (i.e., large TRW). These findings suggest a retrospective timescale focusing on the prior context, which is closely related to cortical tracking of the linguistic units27,72,73,74. In contrast, we focus on the brain’s ability to anticipate future inputs across varying timescales, i.e., a prospective timescale of the future input. To our knowledge, this line of research is still understudied, with only a few studies beginning to explore it recently25,26. Consequently, it remains unclear how the prospective timescale hierarchy can be interpreted in a neurolinguistic sense. Inspired by these studies, the present study aims to address this gap and further investigate how different levels within the hierarchy interact computationally and algorithmically. We believe this prospective hierarchy deserves greater attention in future research.

Another contribution of our work is the incremental context effect observed in the multiple ridge regression models at both word and sentence levels (Fig. 2e, f), supporting the biological plausibility of our approach. While previous studies have reported a similar effect in GPT-210 and BERT models58 by manipulating context window size, these findings are largely limited to the word level. Here, we extend these results to the sentence level and demonstrate a comparable incremental pattern. Importantly, our approach offers a potential avenue for investigating how retrospective and prospective timescale hierarchies relate in both LLMs and the human brain.

Several studies on word prediction, however, have reported findings that differ from ours. For instance, some have identified associations between word prediction and widespread regions in the frontal and parietal lobes, in addition to the bilateral STG11,75,76,77. Although other studies investigating word-level syntactic prediction (i.e., how grammatical structure within a sentence influences next-word prediction) have also emphasized the roles of the STG and MTG1,12,14,15,78, some divergent results also indicate additional involvement of the lateral prefrontal cortex1,14. One possible explanation for this discrepancy is that these studies indexed word prediction via measures of lexical processing difficulty, such as cloze probability79 or entropy80, rather than via the predictive representation itself. We postulated that the involvement of the frontal cortex may reflect processing difficulty and the associated cognitive control functions81,82. However, Goldstein et al.10 employed an encoding model and found that the IFG was also significantly involved in word prediction10. Although the authors ruled out the potential context effect, their control analysis was conducted on the averaged signal across all significant electrodes, including both IFG and STG electrodes, making it difficult to disentangle potential differences between the IFG and STG. The present study investigated the neural underpinnings of linguistic prediction per se rather than processing difficulty, while controlling for potential confounding effects arising from the past linguistic context. Therefore, our results provide more direct evidence for the anatomical architecture supporting hierarchical linguistic prediction.

Furthermore, the DMN (especially the mPFC, precuneus, and TPJ) has been proposed to play a key role in processing naturalistic stimuli47, such as written or spoken stories22,48 and movies83. To investigate its functions, previous studies have, on the one hand, scrambled real-life stories at different timescales (e.g., word, sentence, paragraph, etc.)22,32 or shuffled parts of the stories to create different versions47,84. By comparing neural responses across different versions, researchers found that the DMN is largely involved in integrating external information over relatively long prior context (ranging from seconds to minutes)23,47,84. On the other hand, another possible cognitive process associated with the DMN during narrative processing is using stored information to simulate possible future events and plan ahead (i.e., prospective memory)46,85. For example, evidence indicates that imagining a plausible event that had not occurred previously engages DMN regions such as the mPFC and precuneus46,86. Interestingly, this “future envisioning” network largely overlaps with regions involved in episodic memory, supporting the constructive episodic simulation hypothesis46. These findings suggest that a key function of the DMN is to enable simulation of future events based on past experiences, a perspective closely aligned with the concept of pre-activation in linguistic prediction85,87. While anticipatory signals in DMN regions have been extensively observed25,45, the timescales underlying prospective prediction remain unclear. In the current study, we identified involvement of the TPJ and DMN midline core areas in sentence prediction, providing further evidence for the DMN’s role in predicting linguistic units over longer timescales. Additionally, our study revealed strong right-hemisphere lateralization for sentence prediction. Although recent studies have challenged the traditional view that natural language comprehension is left-lateralized, showing instead bilateral involvement88,89,90, the specific function of the right hemisphere is still poorly understood. Our results suggest that the right DMN plays a dominant role in sentence prediction, consistent with recent evidence highlighting the importance of the right hemisphere in perceptual segmentation and coarse-grained event boundaries in music91. Collectively, these findings support the notion that the right hemisphere may have a distinct role in processing longer-timescale information.

Our computational modeling results support the sparse, rather than continuous, updating strategy for cross-level interactions within the prediction hierarchy. Previous research supporting the continuous updating hypothesis typically relied on correlation-based approaches, such as inter-subject pattern correlation (ISPC)32 or cross-context correlation31. ISPC examines spatial similarities in brain responses across subjects at each moment, while cross-context correlation calculates neural similarities across trials for each time point. These approaches, however, may conflate the effects of information updating and accumulation, limiting their ability to disentangle the two. Although Chien and Honey32 employed computational models to study multilevel interactions, their models were constructed solely under the continuous updating assumption, leaving an open question of how the two rival hypotheses compare32. Most importantly, no studies have tested the two hypotheses at the sentence level. In the present study, we uncovered the hidden neural states (i.e., information updating at the sentence level) and directly compared the continuous and sparse updating models. Our results underscore the sparse updating hypothesis, consistent with previous evidence that sparse updating is more computationally efficient and resource-saving than continuous updating50. These findings further support the brain’s economy principle; that is, the human brain is organized to carefully manage the inputs in the service of delivering robust and efficient performance92,93.

In addition, although emerging models have incorporated the PC framework to study language comprehension, explicit computational accounts of multilevel interactions within a timescale hierarchy remain limited. For instance, Eddine et al.94 provided an elegant PC account of the N400, modeling lexical-semantic integration across four layers (orthographic, lexical, semantic, and conceptual)94. While their model successfully captures sentence-level context effects on N400 amplitude, it does not specifically address the communication between word and sentence levels. A potentially more biologically grounded account was proposed by Bornkessel-Schlesewsky et al.29, which posits a predictive sequence processing framework situated in the postero-dorsal auditory stream29. However, this model also lacks detailed mechanisms describing how different linguistic levels interact computationally or algorithmically.

Our work builds on these theoretical frameworks and proposes a possible mechanism for the implementation. Our results are consistent with a semantic-based framework of discourse comprehension. In this framework, Baggio28 proposed that the brain instantiates a single conscious representation of the input (e.g., a word) that remains stable unless perturbed by new information28. To implement this representational stability during discourse comprehension, the author further posited a cortical steady-state organization which could be achieved sparsely at four intermediate levels: 1) individual word; 2) content word; 3) referring expression; and 4) utterance or proposition. Our sparse updating model conceptually aligns with this cortical steady-state account and provides a possible algorithmic implementation within the PC framework. Mathematically, we formulated a first-order linear ordinary differential equation (ODE) in which the sentence-level neural signal can be maintained at the steady-state by leveraging a delay term \(\varDelta t\) that fixes the “input” within a sentence. This formulation allows the model to generate signals whose sparsity is bounded by sentence boundaries.
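One illustrative way to write such a delayed first-order linear ODE (a sketch consistent with the description above, not necessarily the exact parameterization used) is \(\tau \,\frac{d{Z}_{s}(t)}{dt}=-{Z}_{s}(t)+{x}_{s}(t-\Delta t)\), where \(\tau\) is a time constant and the delayed input \({x}_{s}(t-\Delta t)\) is held fixed within a sentence; \({Z}_{s}\) therefore relaxes toward the steady state \({Z}_{s}^{*}={x}_{s}(t-\Delta t)\) and changes only when the input advances at the next sentence boundary.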

However, recent eye-tracking and electroencephalogram studies provide evidence for the incremental nature of language processing79,95,96, which seemingly contradicts the sparse updating strategy97. Incrementality generally refers to the process by which linguistic information underlying the message-level representation accumulates gradually as context builds. Empirical support for incremental comprehension includes the modulation of the N400 amplitude across different word positions in a sentence97, or the slow drift of neural signals during continuous sentence processing98. The converging evidence suggests that incremental language processing involves an ongoing construction of meaning with each incoming word. Within this scope, we propose that this incremental semantic construction is not incompatible with our sparse updating model. First, in our model, the “word-level” does not refer solely to the brain regions encoding lexical information, but rather to the regions integrating the current sentential context to make predictions about the upcoming words. Second, sparse updating in our model is restricted to the interactions between the word and sentence levels exclusively, rather than imposing sparsity at the word level. Therefore, word-level processing can still operate in an incremental and predictive manner. Input enters the word level, and lexical information is allowed to accrue immediately within a sentence. Then, the accumulated sentence context would be further used to compute the sentence-level prediction error, which is transferred to the sentence level at sentence boundaries for updating. In fact, as Ryskin & Nieuwland99 stated, the incremental effect during sentence comprehension is also inherently aligned with the PC framework, assuming that the brain needs to employ the inputs to infer the internal model99. The realization of internal model inference largely relies upon the local prediction error (PE) at each level94, which plays an integral role in the optimization algorithm that the brain uses to approximate inference. This account aligns with our models, as the word-level integral can be viewed as the accumulation of the sentence context and thus shows an incremental effect.

While the incremental effect has been extensively documented in sentence processing, it remains unclear how such an effect manifests during naturalistic language comprehension. Intuitively, for instance, when listening to a two-hour audiobook, it is unlikely that neural activity would continuously increase from beginning to end. No study, to the best of our knowledge, has demonstrated such a pattern. One possibility is that the brain engages in event-based processing during naturalistic comprehension–an idea supported by event segmentation theory100. According to this framework, the brain parses continuous input into discrete events at multiple timescales (distinct from the “event” in ERP), which are processed separately and then integrated hierarchically at event boundaries via memory systems33,100. In other words, the incremental effect may occur within a single level but not across levels of linguistic information. This perspective aligns with our finding that sentence boundaries (sentence viewed as “event” in this sense) serve as important anchors for narrative processing, supporting the sparse updating hypothesis.

Our results also underscore the importance of sentence boundaries during narrative comprehension, consistent with recent evidence of additional processing at sentence-final positions. This “sentence wrap-up” effect may reflect either the reconstruction of grammatical structure within a sentence (syntactic effect), or the resolution of meaning inconsistencies that cannot be addressed in a sentence (semantic effect)101. We propose that, in the present study, sentence boundaries serve as semantic markers for message-level updating during naturalistic language comprehension for two primary reasons. First, we recruited multiple raters to delineate boundaries between sentences. This empirical method produces more semantically-driven segmentation. Second, the sentence embeddings we used were obtained by averaging word embeddings, an approach that emphasizes semantic content over syntactic structure. Consequently, the sentence boundary effect observed in our sparse model likely reflects the reconciliation of semantic inconsistencies, potentially corresponding to the accumulated prediction errors generated at the word level. In this view, a sentence can be considered as functionally analogous to a narrative chunk, serving as a semantic segment within the broader narrative structure. Our findings also complement prior event segmentation research by highlighting the neural signatures of updating at sentence (or narrative chunk) boundaries, in line with models of hierarchical event processing100.

However, it is important to note that the current study does not examine the multiple timescales within sentences. Many timescales could be defined for the language system, for example, the semantics-based temporal hierarchy (e.g., ranging from individual words and content words to referring expressions and entire utterances)28, or the syntactic hierarchy extending from words to noun/verb phrases, and further to sentences27. Therefore, we believe that investigating the properties of these intra-sentence timescales from a neurolinguistic perspective will be a valuable addition to the present findings.

This study has several limitations. First, we could not assess real-time attentional states of participants, as such measurements would disrupt continuous speech processing and linguistic predictions. Second, the relatively low temporal resolution of fMRI limits precise characterization of linguistic prediction at finer timescales (e.g., phonemes). Third, our averaging-based approach to sentence or context representations may overlook critical structural or sequential features essential for sentence-level processing. Future work could benefit from more advanced models (e.g., Sentence-BERT102) that better capture longer-range dependencies in text.

In conclusion, by directly examining the multiscale prediction hierarchy in the brain, we demonstrated a cortical architecture spanning from the temporal cortices involved in word prediction to the DMN regions engaged in sentence prediction. Most significantly, our results highlight the role of sparse updating in facilitating cross-level interactions within this prediction hierarchy. Together, these findings advance the understanding of the cortical organization underlying hierarchical linguistic prediction and the neurocomputational mechanisms of information updating during narrative comprehension.

Methods

Participants

Before the formal experiment, the sample size was estimated based on a pilot study with four participants listening to the story stimuli103. Using Neuropower104, we assessed whether STG voxels exhibited higher BOLD responses during forward narratives compared to backward speech (i.e., a linguistic effect in the auditory cortex). A sample size of twenty-eight participants was recommended to achieve a statistical power greater than 0.8. The pilot data were not included in the formal analysis.

Thirty-eight healthy native Chinese speakers participated in the main study. All participants were right-handed105 and self-reported no hearing, psychiatric, or neurological problems. Six participants were excluded due to excessive head motion (greater than 3 mm or 3 degrees) and one was excluded for falling asleep during the task, leaving thirty-one participants with valid data (mean age: 23 years, ranging from 19 to 26; 19 females).

The study protocol was approved by the Institutional Review Board of the State Key Laboratory of Cognitive Neuroscience and Learning at Beijing Normal University. Written informed consent was obtained from all participants. All ethical regulations relevant to human research participants were followed.

Stimuli

In narrative listening studies, it is common to include multiple runs to increase the reliability of statistical tests, reduce participant fatigue, and minimize the impact of technical issues (for example, scanner overheating)11,60,83. Therefore, three stories were employed in the present study. Stories 1 and 2 were produced by asking two female speakers to freely recount “an unforgettable experience in your college life”, while story 3 was recorded by a female speaker reading a text adapted from The Kite Runner. All stories were recorded using the FOMRI III system (Optoacoustics Ltd.) and subsequently denoised using Audacity106. These stories were matched for perceptual features such as clarity, familiarity, and complexity (see “Task and procedures”; Supplementary Table 2). Additionally, each audio was temporally inverted for the backward condition to control for acoustic features.

Task and procedures

Before the experiment, sound volume was adjusted to a comfortable level based on participants’ subjective reports. During the experiment, participants were instructed to passively listen to the three stories (i.e., forward condition; Supplementary Table 1) and the corresponding control audios (i.e., the temporally inverted audio, backward condition) while fMRI data were collected. Participants were asked to fixate on a cross at the center of the black screen during listening. The sequence of the six audios (three forward and three backward) was counterbalanced across participants, with flexible intervals inserted between runs to allow rest. All audios were preceded by a 10-s silence with a black screen to control for T1 equilibration effects11,25. Audios were played via the OptoACTIVE headset, which actively eliminates MRI scanner noise in real time and has been widely used in previous auditory studies107,108,109. E-prime (v2.0.10) was used to control stimulus presentation.

The participants were tested on both perception and comprehension at the end of each story. For perceptual evaluation, participants rated clarity, familiarity, and complexity on a 5-point Likert scale (1 was the lowest and 5 was the highest). For comprehension, participants answered several true-or-false questions based on the story contents (3 questions each for stories 1 and 2; 5 questions for story 3). These questions targeted either details (mentioned only once) or gist-level information (mentioned multiple times)110, with both types included for each story. Statistical analyses were performed on both perceptual ratings and comprehension scores to assess how well participants perceived and comprehended the stories.

Additionally, to validate these questions, we recruited an independent cohort of 21 participants who were not part of the main experiment and were unaware of the experimental purpose. They were asked to rate “How well do you think these questions could reflect the listener’s comprehension of the story?” on a 7-point Likert scale (1 was strongly disagree, and 7 was strongly agree). A one-sample t-test was performed on the scores against the scale midpoint (i.e., 3.5), and the FDR method was applied to correct for multiple comparisons64. Results showed that scores for all three stories were significantly above the midpoint (story 1: 5.71 ± 0.78, t(20) = 10.02, p < 0.05; story 2: 5.43 ± 1.08, t(20) = 6.09, p < 0.05; story 3: 5.81 ± 0.80, t(20) = 9.60, p < 0.05), indicating that these questions reliably reflected story comprehension.

Statistics and reproducibility

To assess the robustness of our findings, the following procedures were applied to both behavioral and neural data. First, the D’Agostino test was used to evaluate the normality of the data distribution. If data followed a normal distribution, parametric tests were used (e.g., paired t-test); otherwise, nonparametric tests were used (e.g., Wilcoxon signed-rank test). Unless otherwise noted, all statistical tests were two-tailed with a threshold of p < 0.05. The false discovery rate (FDR) correction was applied when multiple comparisons were conducted unless stated otherwise.

Data acquisition and preprocessing

The fMRI data were acquired with a Siemens TRIO 3-Tesla scanner at the Imaging Center for Brain Research, Beijing Normal University. The functional images were acquired using an echo planar imaging (EPI) sequence (TR = 2000 ms, TE = 30 ms, flip angle = 90°, FOV = 200 mm, voxel size = 3.1 × 3.1 × 3.5 mm3, interleaved). The structural T1-weighted images were collected using magnetization-prepared rapid gradient-echo sequence (TR = 2530 ms, TE = 3.39 ms, flip angle = 7°, FOV = 256 mm, 144 sagittal slices, voxel size = 1.3 × 1.0 × 1.3 mm).

The DPABI toolbox was used for data preprocessing111. After removing the first 5 volumes corresponding to the silent period (10 s), the images were slice-timing corrected, spatially realigned to the first image in a run using rigid-body registration, and co-registered to their corresponding anatomical images. Next, both functional and anatomical images were normalized to the standard Montreal Neurological Institute (MNI) space, with functional images resampled to 2 × 2 × 2 mm3 voxel size. Then, the data were spatially smoothed with a 6 mm full-width at half maximum (FWHM) Gaussian kernel. Finally, all data were detrended, temporally high-pass filtered (128 s cutoff), and denoised by regressing out nuisance variables (including Friston’s 24 motion parameters and five principal components of the white matter and cerebrospinal fluid signals)112.

Obtaining the predictive representations of words and sentences

Dataset generation

Chinese Wikipedia, derived from the Large Scale Chinese Corpus for NLP project (https://github.com/brightmart/nlp_chinese_corpus), was used as the corpus. During preprocessing, symbols and tokens unrelated to content were first removed. Then, the corpus was segmented into words using the jieba toolbox (https://github.com/fxsjy/jieba) and parsed into sentences based on end-of-sentence punctuation marks (i.e., period, question mark, exclamation mark, and ellipsis). Next, a document was randomly sampled from the corpus, and a linguistic unit (a word or a sentence) within it was randomly selected as the to-be-predicted target. All preceding text was treated as the prior linguistic context. Following this procedure, we constructed two datasets—one for words and another for sentences—each containing approximately 0.2 million items, with each item comprising a target and its corresponding linguistic context.
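The sampling procedure can be sketched as follows; corpus cleaning is omitted for brevity, and the `make_item` helper is a hypothetical name:

```python
import random
import re
import jieba

END_PUNCT = "。？！…"  # period, question mark, exclamation mark, ellipsis

def make_item(document, unit="word"):
    """Sample one (context, target) pair from a cleaned document.

    Assumes the document contains at least two units of the requested level.
    """
    if unit == "sentence":
        # Parse into sentences at end-of-sentence punctuation
        units = [s for s in re.split(f"(?<=[{END_PUNCT}])", document) if s.strip()]
    else:
        units = list(jieba.cut(document))  # word segmentation
    idx = random.randrange(1, len(units))  # target must have preceding context
    target = units[idx]
    context = "".join(units[:idx])         # all preceding text
    return context, target
```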

We did not remove any functional words. Intuitively, removing functional words can reduce non-informative content, allowing NLP algorithms to focus more on content words. However, this approach overlooks the fact that functional words, such as the negation words “not”, “nor”, and “never”, also carry semantic content and syntactic information crucial for understanding natural language. In fact, there is an ongoing debate on whether functional words should be removed when applying BERT-based models. The original BERT model, for example, did not recommend removing any stop words52. Moreover, Qiao et al.113 found that removing functional words does not affect BERT model performance, and Alzahrani and Jololian114 showed that removing functional words can even impair performance in a gender classification task, reducing accuracy from 86.67% to 78.86%. Therefore, we followed the original BERT practice and included both functional and content words to preserve semantic and syntactic information in the vector representations.

Vector representations

WWM-RoBERTa, a variant of the BERT model52, was applied to vectorize the prior linguistic context and the prediction target51. BERT is a pre-trained language representation model built on a multi-layer bidirectional transformer encoder that conditions on both left and right context52. Its core mechanism is multi-head self-attention, which fundamentally computes weighted sums over all input vectors115. The WWM-RoBERTa model has a larger architecture (24 layers, a hidden size of 1024, 16 self-attention heads, and 340 M total parameters) and is trained with a larger batch size. Importantly, it is trained to predict whole words rather than individual characters, providing high generalizability and adaptability for Mandarin53.

The WWM-RoBERTa model was implemented in Python (v3.7) with the bert-as-service module (https://github.com/jina-ai/clip-as-service), which maps variable-length text to a fixed-length vector (1024 dimensions). Here, a “sentence” refers to a text span from the corpus, which may extend beyond a single grammatical sentence52. To obtain comprehensive text embeddings, the bert-as-service module averages the penultimate hidden-layer vectors across all tokens in the input text, as the final-layer representations are sensitive to the model’s training tasks (i.e., masked language modeling and next-sentence prediction). Alternatively, the text vector can be derived from the [CLS] token, a special symbol added to the beginning of sentence inputs that is frequently used to represent the overall information of the input52. However, previous studies have indicated that the [CLS] embedding is less effective than the averaging approach116,117,118. Therefore, the default bert-as-service setting (i.e., the averaging approach) was used in the present study. Specifically, for word units, we obtained the vector representation with the target word as the only input (without context). For sentences and contexts, we computed the average embedding across all words in the text.

In addition, due to the quadratic relationship between text length and computational cost52, the WWM-RoBERTa model is constrained to a maximum input length of 512 characters. In practice, two additional tokens are inserted at the beginning ([CLS]) and the end ([SEP]) of the input, reducing the effective length to 510. We therefore applied a “split-and-average” method to circumvent this input length restriction: the text was divided into equal segments (e.g., 2 segments if the length was between 511 and 1020 characters) and their embeddings were averaged to produce the final representation. This method generalizes to texts of up to 4080 characters (8 segments) and can be considered an extension of the averaging approach. Consequently, each dataset item was represented by a 1024-dimensional vector for the prior linguistic context and a 1024-dimensional vector for the prediction target (word or sentence).
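A minimal sketch of the split-and-average method, assuming a running bert-serving server that hosts the WWM-RoBERTa model (the `embed_long_text` helper is hypothetical):

```python
import numpy as np
from bert_serving.client import BertClient  # bert-as-service client API

MAX_LEN = 510  # 512 minus the [CLS] and [SEP] tokens

def embed_long_text(text, bc):
    """Split-and-average embedding for texts exceeding the input limit.

    The text is divided into the smallest number of equal-sized segments
    that each fit within MAX_LEN; each segment is encoded (bert-as-service
    averages penultimate-layer token vectors by default), and the segment
    embeddings are averaged into the final 1024-dimensional representation.
    """
    n_seg = -(-len(text) // MAX_LEN)      # ceiling division: 1 to 8 segments
    seg_len = -(-len(text) // n_seg)      # equal-sized segments
    segments = [text[i:i + seg_len] for i in range(0, len(text), seg_len)]
    return bc.encode(segments).mean(axis=0)

# bc = BertClient()  # assumes a bert-serving server hosting WWM-RoBERTa
```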

To validate the split-and-average method, we randomly selected 1000 documents with text lengths ranging from 50 to 510 characters. First, each text was converted into a vector using the WWM-RoBERTa model to obtain the Whole Text Vector (WTV). The same texts were also split into N segments (\(N\in \{2,\,3,\,4,\ldots ,8\}\)), converted into vectors, and averaged to obtain the Segment Text Vector (STV). The cosine distance between each WTV and its corresponding STV was calculated as Dorig. Next, WTVs and STVs were randomly re-paired 1000 times, and the cosine distance was computed for each permutation to generate a null distribution. Results showed that Dorig was significantly smaller than the null distribution in all segment conditions (all conditions p < 0.001, FDR corrected), supporting the validity of our method for deriving context embeddings.
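The validation logic can be sketched as follows (hypothetical function names; `wtv` and `stv` are arrays of matched whole-text and segment-averaged vectors):

```python
import numpy as np

def cosine_distance(a, b):
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def validate_split_and_average(wtv, stv, n_perm=1000, seed=0):
    """wtv, stv: (n_docs, 1024) matched whole-text and segment-averaged vectors."""
    rng = np.random.default_rng(seed)
    d_orig = np.mean([cosine_distance(w, s) for w, s in zip(wtv, stv)])
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(len(stv))   # random re-pairing of WTVs and STVs
        null[k] = np.mean([cosine_distance(w, stv[j]) for w, j in zip(wtv, perm)])
    # Matched pairs should be closer (smaller distance) than random pairs
    p = (np.sum(null <= d_orig) + 1) / (n_perm + 1)
    return d_orig, null, p
```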

Model building

A multiple ridge regression approach was used to delineate the predictive relationship between the prior context and upcoming linguistic inputs. The model consisted of 1024 independent ridge regressions, with each dimension of the upcoming input vector predicted from all dimensions of the linguistic context embedding. This method potentially decorrelates the feature space, in line with recent findings that an embedding whitening procedure can enhance model performance117. Mathematically, for each ridge regression model, given n samples of one dimension of the upcoming input vector \(\boldsymbol{Y}\) (n × 1) and all dimensions of the context matrix \(\boldsymbol{X}\) (n × 1024), we estimated the coefficients \(\boldsymbol{\beta}\) (1024 × 1) and the intercept \(\beta_0\) by minimizing the following cost function:

$$\|\boldsymbol{X}\boldsymbol{\beta}+\beta_0-\boldsymbol{Y}\|_2^2+\lambda\|\boldsymbol{\beta}\|_2^2$$

where λ is the regularization term that prevents overfitting by shrinking the coefficients \(\boldsymbol{\beta}\). To estimate the parameters (\(\boldsymbol{\beta}\) and \(\beta_0\)), the dataset (see “Dataset generation”) was split into training (80%) and test (20%) sets (Fig. 1a). The optimal λ was selected using 4-fold cross-validation within the training set. The input vector \(\boldsymbol{Y}\) and the context matrix \(\boldsymbol{X}\) were normalized by column (i.e., across training samples) in advance. The model training and testing processes were implemented with the sklearn toolbox119.
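A minimal sketch of the training procedure with sklearn, using placeholder data (the alpha grid is an assumption; with a multi-output target, `RidgeCV` fits all 1024 regressions and selects a shared λ by 4-fold cross-validation, consistent with the description above):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 1024))  # placeholder context embeddings
Y = rng.standard_normal((2000, 1024))  # placeholder target embeddings

# 80/20 train-test split
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

# Column-wise normalization, fit on the training set only
sx, sy = StandardScaler(), StandardScaler()
X_train, X_test = sx.fit_transform(X_train), sx.transform(X_test)
Y_train, Y_test = sy.fit_transform(Y_train), sy.transform(Y_test)

# One ridge regression per target dimension, with lambda (alpha) selected
# by 4-fold cross-validation within the training set
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=4)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)  # predicted target embeddings
```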

Model validation

A pairwise classification task was used to evaluate model performance57. First, 1000 samples were randomly selected from the test set, each containing a vector Vreal-target for the prediction target and a vector Vreal-context for the prior linguistic context. Next, a predicted vector Vpred-target was generated by the trained models from Vreal-context. The cosine distance between Vpred-target and Vreal-target was calculated as D1, and the distance between Vpred-target and a randomly selected Vrand-target was calculated as D2. If D1 < D2 (i.e., the predicted vector was closer to the actual target than to the random target), the sample was labeled “right”, and “wrong” otherwise. Accuracy was then computed across all samples, and the procedure was repeated 1000 times for robustness. We also computed the Pearson correlation between the actual and predicted targets to supplement the classification results.
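A minimal sketch of the pairwise classification procedure (hypothetical function name; `pred` and `real` are the predicted and actual target vectors from the test set, with at least `n_items` rows):

```python
import numpy as np

def cosine_distance(a, b):
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pairwise_accuracy(pred, real, n_items=1000, n_repeats=1000, seed=0):
    """pred, real: (n_test, 1024) predicted and actual target vectors."""
    rng = np.random.default_rng(seed)
    accs = np.empty(n_repeats)
    for r in range(n_repeats):
        idx = rng.choice(len(real), size=n_items, replace=False)
        correct = 0
        for i in idx:
            j = rng.choice(np.delete(np.arange(len(real)), i))  # random distractor
            d1 = cosine_distance(pred[i], real[i])  # predicted vs. actual target
            d2 = cosine_distance(pred[i], real[j])  # predicted vs. random target
            correct += d1 < d2  # closer to the actual target counts as "right"
        accs[r] = correct / n_items
    return accs.mean()
```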

Obtaining the predictive representations of the experimental stimuli

The texts of the three stories were segmented at both the word and sentence levels (Fig. 1b). Word segmentation was performed using the jieba toolbox in Python. A sentence is typically defined as a string of words expressing a complete thought, containing at least a subject and a predicate. In spoken language, however, subjects may be omitted for simplicity, speech errors may occur, and oral language can diverge from formal grammar. Thus, to partition sentences appropriately, we recruited 10 raters to mark the text wherever they judged a sentence to end (a sentence boundary annotation task). A sentence boundary was established if at least 5 raters marked the same location33. A trained experimenter then reviewed and refined the marked positions. Praat was then used to align the segmented text (words and sentences) to the audio recordings120. Finally, the processed experimental materials were converted into vector representations using the same procedures described above.

Relating BOLD signals with the predictive representations using gGLM analysis

The gGLM analysis was conducted to identify the neural underpinnings of the predictive representations at the word and sentence levels. Specifically, we implemented a leave-one-subject-out (LOSO) cross-validation procedure, in which the model performance for each participant was evaluated using the data from all other participants. This approach effectively avoids overfitting and suppresses the non-independence error62. Additionally, to reduce the risk of overfitting due to the high dimensionality of the vector representations (1024 dimensions), we applied a feature reduction procedure combining Isomap and PCA121,122. Prior studies have shown that concatenating Isomap and PCA components can achieve performance comparable to the full feature space121. To meet the minimum criteria (i.e., PCA cumulative variance explained ≥50% and Isomap residual variance at its minimum), we retained 15 Isomap components and 35 PCA components (Supplementary Tables 4 and 5).
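A minimal sketch of the feature reduction step, assuming the 1024-dimensional representations are stored row-wise (function name hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

def reduce_features(embeddings, n_isomap=15, n_pca=35):
    """Concatenate Isomap and PCA components of the (n_units, 1024) embeddings."""
    iso = Isomap(n_components=n_isomap).fit_transform(embeddings)
    pca = PCA(n_components=n_pca).fit_transform(embeddings)
    return np.hstack([iso, pca])  # (n_units, 50) reduced feature space
```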

Furthermore, the design matrices were generated using the function “make_first_level_design_matrix” from the Python toolbox nilearn123. Specifically, the following steps were conducted within this function (Supplementary Fig. 2): (1) Oversampling. Based on the timing information of the language units in the stories (e.g., the offsets of words or sentences), a time course was generated and oversampled at 50 Hz; (2) HRF convolution. The oversampled time course was convolved with the hemodynamic response function (HRF); (3) Downsampling. The convolved time course was downsampled to 0.5 Hz, matching the fMRI sampling rate (i.e., TR = 2 s). The downsampled time course was then used to fit the fMRI signals using the gGLM.
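A minimal sketch of the design-matrix construction with nilearn, with one parametrically modulated regressor per reduced feature dimension (the wrapper function, column names, and scan parameters are hypothetical):

```python
import numpy as np
import pandas as pd
from nilearn.glm.first_level import make_first_level_design_matrix

TR, n_scans = 2.0, 300                     # illustrative scan parameters
frame_times = np.arange(n_scans) * TR

def feature_design_matrix(onsets, durations, features):
    """One parametrically modulated regressor per reduced feature dimension.

    `onsets` are the offsets of units N-1 (pre-activation of unit N);
    nilearn internally oversamples each regressor, convolves it with the
    HRF, and resamples it to the TR.
    """
    columns = []
    for k in range(features.shape[1]):
        events = pd.DataFrame({
            "onset": onsets,
            "duration": durations,
            "trial_type": f"feat_{k}",
            "modulation": features[:, k],
        })
        dm = make_first_level_design_matrix(
            frame_times, events, hrf_model="spm", drift_model=None)
        columns.append(dm[f"feat_{k}"])
    return pd.concat(columns, axis=1)
```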

The variance partitioning (VP) approach was employed to identify the prediction effects. Specifically, a model including only the context representations (MC) was used to estimate the context effect (Supplementary Tables 4 and 5). Then, a full model including both the context and predictive representations (MF) was used to capture both effects. The difference in explained variance (R²) between the two models (i.e., MF − MC) quantified the unique effect of the predictive representations. The unique effect of the prior context was estimated analogously (Supplementary Fig. 5), by training a model including only the predictive representations (MP) and subtracting its performance from that of the full model (MF). In addition, following the concept of pre-activation3, predictive representations of linguistic unit N (i.e., a word or sentence) were aligned to the offset of linguistic unit N − 1 (Supplementary Fig. 2). For each model, participants and stories were dummy-coded and included as covariates to control for individual- and story-level differences. Regressors modeling the word and sentence boundaries were included to account for the temporal delay in BOLD signals with respect to the stimuli. The log-transformed word or sentence frequencies (i.e., the average frequency of all words in a sentence) were also regressed out from the corresponding models to control for the statistical influence of everyday language usage11. Word frequencies were obtained from Cai and Brysbaert124, who derived them from film subtitles that approximate everyday language exposure. Frequencies were log-transformed due to their inherently skewed distribution.
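The core of the variance partitioning logic can be sketched as follows; for brevity, this in-sample sketch omits the LOSO cross-validation and the covariates described above (function name hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def unique_r2(X_context, X_pred, y):
    """Variance uniquely explained by the predictive representations.

    Fits the context-only model (MC) and the full model (MF) and returns
    R2(MF) - R2(MC), the unique contribution of the predictive features.
    """
    X_full = np.hstack([X_context, X_pred])
    r2_context = LinearRegression().fit(X_context, y).score(X_context, y)
    r2_full = LinearRegression().fit(X_full, y).score(X_full, y)
    return r2_full - r2_context
```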

The gGLM analysis was conducted using a parcellation approach with 400 non-overlapping parcels63. BOLD signals within each parcel were pre-whitened using an AR(1) noise model implemented in nistats125 and then averaged. A paired t-test was performed between the forward and backward conditions to identify significant parcels. Multiple comparisons were controlled using an FDR threshold of q < 0.0164. Significant parcels were visualized by projecting them onto a cortical surface using BrainNet Viewer126.

To validate the neural underpinnings of the predictive representations, a permutation test was performed. In each of 1000 iterations, the features associated with each linguistic unit were shuffled to remove the prediction effect, and the same pipeline described above was repeated, yielding a null distribution of R². Finally, p-values were obtained from the position of the original R² value within this null distribution.

PC-based computational modeling

To directly test the sparse and continuous updating hypotheses, we constructed two computational models satisfying the minimal assumptions of the PC framework43,127. The PC framework posits that the brain processes upcoming information hierarchically, with level N generating prediction signals for level N − 1. Prediction errors (PEs), defined as the differences between the predicted and actual neural responses at level N − 1, are sent back to level N to update subsequent predictions42. In our models, the architecture corresponding to the continuous updating hypothesis is formalized by the differential equations below (the continuous updating PC model):

$$\frac{dZ_w(t)}{dt}=w_0\cdot x_w(t+dt)+w_1\cdot\left(Z_s(t)-Z_w(t)\right)$$
$$x_s(t+dt)=Z_s(t)-Z_w(t+dt)$$
$$\frac{dZ_s(t)}{dt}=s_0\cdot x_s(t+dt)+s_1\cdot\left(\mathrm{Prior}(t)-Z_s(t)\right)$$

where \(dt\) was set to the repetition time (TR = 2 s) during model estimation. The word-level PE, \(x_w(t)\), was defined as the min-max normalized cosine distance between the predicted and actual word vectors, resampled to the fMRI acquisition rate (TR = 2 s). Cosine distance was used because it provides a robust measure of dissimilarity and is less sensitive to vector magnitude than alternative metrics128,129,130. Min-max normalization constrained the PEs to positive values within the range [0, 1]. \(Z_w\) and \(Z_s\) denote the neural signals associated with word- and sentence-level predictions, computed as the average signal across the parcels identified in the gGLM analyses (i.e., \(Z_w\) is the average of the bilateral STC and MTC; \(Z_s\) is the average of the right TPJ, medial PFC, and precuneus). The sentence-level PE, \(x_s(t)\), was modeled as the difference between \(Z_s(t)\) and the upcoming \(Z_w(t)\). \(\mathrm{Prior}(t)\) represents the higher-level top-down input to the sentence level \(Z_s(t)\). Following previous research, it was set to 0, under the assumption that top-down priors exert minimal influence on the information updating strategy at the levels under study43. In addition, \(w_0\), \(w_1\), \(s_0\), and \(s_1\) are parameters to be estimated. Mathematically, \(w_0\) and \(s_0\) determine how strongly the word- and sentence-level PEs drive the neural signals, while \(w_1\) and \(s_1\) primarily regulate the decay rates of \(Z_w(t)\) and \(Z_s(t)\), respectively.
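A minimal Euler-integration sketch of the continuous updating PC model, with \(dt\) equal to the TR (function name hypothetical); the sparse updating variant would instead read the delayed signals \(Z_s(t-\Delta t)\) and \(Z_w(t-\Delta t+dt)\) at the preceding sentence boundary:

```python
import numpy as np

def simulate_continuous_pc(x_w, prior, w0, w1, s0, s1, dt=2.0):
    """Euler integration of the continuous updating PC model (dt = TR = 2 s).

    x_w : word-level PEs per TR (min-max normalized cosine distances).
    prior : top-down input to the sentence level (zeros, as in the text).
    """
    n = len(x_w)
    Z_w, Z_s = np.zeros(n), np.zeros(n)
    for t in range(n - 1):
        # Word level: driven by the upcoming word-level PE and by Z_s
        Z_w[t + 1] = Z_w[t] + dt * (w0 * x_w[t + 1] + w1 * (Z_s[t] - Z_w[t]))
        # Sentence-level PE: discrepancy between Z_s and the upcoming Z_w
        x_s = Z_s[t] - Z_w[t + 1]
        Z_s[t + 1] = Z_s[t] + dt * (s0 * x_s + s1 * (prior[t] - Z_s[t]))
    return Z_w, Z_s  # to be passed through the hemodynamic model (see below)
```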

In contrast, the sparse updating hypothesis posits a discretized information exchange between adjacent levels. The corresponding model can be described by the following delay differential equations (the sparse updating PC model):

$$\frac{dZ_w(t)}{dt}=w_0\cdot x_w(t+dt)+w_1\cdot\left(Z_s(t-\Delta t)-Z_w(t)\right)$$
$$x_s(t+dt)=Z_s(t-\Delta t)-Z_w(t-\Delta t+dt)$$
$$\frac{dZ_s(t)}{dt}=s_0\cdot x_s(t+dt)+s_1\cdot\left(\mathrm{Prior}(t)-Z_s(t)\right)$$

where \(\Delta t\) quantifies the time lag between the current moment and the boundary of the preceding sentence. All other variables and parameters are identical to those in the continuous updating PC model.

Neural signals simulated by the PC models were subsequently transformed into BOLD responses using a hemodynamic model, enabling comparison with the actual BOLD signals. The hemodynamic model comprises the Balloon model and the BOLD model69,70,131. Specifically, the Balloon model describes how neural activity induces changes in blood volume and deoxy-hemoglobin (dHb), and is formulated as follows:

$$\frac{ds(t)}{dt}=Z(t)-\kappa\cdot s(t)-\gamma\left(f(t)-1\right)$$
$$\frac{df(t)}{dt}=s(t)$$
$$\tau\frac{dv(t)}{dt}=f(t)-v(t)^{1/\alpha}$$
$$\tau\frac{dq(t)}{dt}=f(t)\cdot\frac{1-\left(1-E_0\right)^{1/f(t)}}{E_0}-v(t)^{1/\alpha}\cdot\frac{q(t)}{v(t)}$$

where \(Z(t)\) is the neural response derived from the PC models; \(s(t)\) represents the vasodilatory signal; \(f(t)\) is the blood inflow; \(v(t)\) corresponds to the local change in blood volume; and \(q(t)\) indicates the proportion of dHb.

Further, the BOLD model characterizes how blood volume and dHb synergistically contribute to changes in the BOLD signal, expressed by the following non-linear equation:

$$\frac{\Delta S(t)}{S_0}\approx V_0\left[k_1\left(1-q(t)\right)+k_2\left(1-\frac{q(t)}{v(t)}\right)+k_3\left(1-v(t)\right)\right]$$

where parameters \({k}_{1}\), \({k}_{2}\), and \({k}_{3}\) are calculated through the following equations:

$$k_1=4.3\cdot\vartheta_0\cdot E_0\cdot TE$$
$$k_2=\varepsilon_h\cdot r_0\cdot E_0\cdot TE$$
$$k_3=1-\varepsilon_h$$

in which \({S}_{0}\) is the BOLD signal at rest, and \(\Delta S\) is the BOLD signal change induced by task performance. Details of all the parameters are listed in Supplementary Table 6. The simulation data for all hidden variables are visualized in Supplementary Fig. 8.
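A minimal sketch of the hemodynamic forward model (Balloon + BOLD), integrated with simple Euler steps; the parameter values below are common literature defaults and are illustrative only (see Supplementary Table 6 for the values used in the study):

```python
import numpy as np

def balloon_bold(Z, dt=0.1, kappa=0.65, gamma=0.41, tau=0.98, alpha=0.32,
                 E0=0.34, V0=0.02, TE=0.03, theta0=40.3, r0=25.0, eps=0.54):
    """Balloon + BOLD forward model, integrated with Euler steps.

    Z is the neural signal from the PC models, resampled to resolution dt.
    """
    k1 = 4.3 * theta0 * E0 * TE
    k2 = eps * r0 * E0 * TE
    k3 = 1 - eps
    s, f, v, q = 0.0, 1.0, 1.0, 1.0        # resting-state initial conditions
    bold = np.zeros(len(Z))
    for t in range(len(Z)):
        ds = Z[t] - kappa * s - gamma * (f - 1)        # vasodilatory signal
        df = s                                          # blood inflow
        dv = (f - v ** (1 / alpha)) / tau               # blood volume
        dq = (f * (1 - (1 - E0) ** (1 / f)) / E0
              - v ** (1 / alpha) * q / v) / tau         # deoxy-hemoglobin
        s, f, v, q = s + dt * ds, f + dt * df, v + dt * dv, q + dt * dq
        bold[t] = V0 * (k1 * (1 - q) + k2 * (1 - q / v) + k3 * (1 - v))
    return bold
```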

The gradient descent method was employed to estimate the parameters \(w_0\), \(w_1\), \(s_0\), and \(s_1\). Gradient descent is an iterative optimization algorithm that seeks a local minimum of the cost function. Specifically, the four parameters (\(\boldsymbol{\theta}=\{w_0,w_1,s_0,s_1\}\)) were updated simultaneously:

$$\boldsymbol{\theta}=\boldsymbol{\theta}-\alpha\frac{d}{d\boldsymbol{\theta}}J(\boldsymbol{\theta})$$

where \(\alpha\) denotes the learning rate. The cost function \(J(\boldsymbol{\theta})\) was defined as:

$$J(\boldsymbol{\theta})=\frac{1}{2n}\sum_{i=1}^{n}\frac{\left(\hat{Z}_{w_i}-Z_{w_i}\right)^2+\left(\hat{Z}_{s_i}-Z_{s_i}\right)^2}{2}$$

where \(Z_w\) and \(Z_s\) are the fMRI signals associated with the predictions of words and sentences, derived from the gGLM analysis at each level before averaging and z-scoring; \(\hat{Z}_w\) and \(\hat{Z}_s\) are the corresponding estimated signals; and n is the signal length in TRs. During model training, the learning rate \(\alpha\) was set to \(1\times{10}^{-5}\), and the convergence threshold was defined as a change in the cost function of \(dJ<1\times{10}^{-4}\). Because the cost function is not guaranteed to be convex, we randomly initialized the parameters \(\boldsymbol{\theta}\) 10,000 times to identify the best initial condition. A leave-one-subject-out cross-validation approach was applied to estimate \(J(\boldsymbol{\theta})\). Model performance was quantified using the mean squared error (MSE), which equals twice \(J(\boldsymbol{\theta})\).
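A minimal sketch of the estimation loop; since the text does not specify how the gradients were computed, central finite differences are used here, and `forward_model` is a hypothetical callable wrapping the PC and hemodynamic models:

```python
import numpy as np

def fit_pc_parameters(forward_model, theta0, Z_w_obs, Z_s_obs,
                      lr=1e-5, tol=1e-4, max_iter=100_000):
    """Gradient descent on J(theta) with finite-difference gradients.

    `forward_model(theta)` should return the simulated (Z_w_hat, Z_s_hat)
    after the PC and hemodynamic models (hypothetical callable).
    """
    def cost(theta):
        Z_w_hat, Z_s_hat = forward_model(theta)
        n = len(Z_w_obs)
        # J = (1/2n) * sum of the mean of the two squared-error terms
        return ((Z_w_hat - Z_w_obs) ** 2
                + (Z_s_hat - Z_s_obs) ** 2).sum() / (4 * n)

    theta = np.asarray(theta0, dtype=float)
    J_prev = cost(theta)
    for _ in range(max_iter):
        grad = np.zeros_like(theta)
        for i in range(theta.size):          # central finite differences
            e = np.zeros_like(theta)
            e[i] = 1e-6
            grad[i] = (cost(theta + e) - cost(theta - e)) / 2e-6
        theta -= lr * grad                   # simultaneous update of all parameters
        J = cost(theta)
        if abs(J_prev - J) < tol:            # convergence: dJ < 1e-4
            break
        J_prev = J
    return theta, J                          # MSE = 2 * J
```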

Autocorrelation analysis

The BOLD signals from significant parcels were used to calculate the autocorrelation effect. The time courses of the signals were temporally shifted forward from 1 to 50 TRs. Then, Pearson correlation was calculated between the original and shifted signals for each participant using the tsa.acf() function from the statsmodels toolbox132.
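A minimal sketch of this computation (function name hypothetical):

```python
from statsmodels.tsa.stattools import acf

def parcel_autocorrelation(bold, n_lags=50):
    """Autocorrelation of a parcel's BOLD time course at lags of 1-50 TRs."""
    return acf(bold, nlags=n_lags)[1:]  # statsmodels' acf; drop the lag-0 term
```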

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.