Fig. 1: Schematic illustration of the analytic approach.

a Training and testing of the multiple ridge regression models. The dataset (~0.2 million samples) was generated from Chinese Wikipedia by randomly selecting a prior linguistic context and an upcoming linguistic unit (a word or a sentence). The context and the linguistic unit were transformed into fixed-length vectors via the WWM-RoBERTa model. Then, 80% of the samples were used to train the multiple ridge regression models to capture the predictive relationship between the context and the linguistic unit, and the remaining 20% were used for model evaluation. The word and sentence prediction models were trained separately.

b Processing of the experimental materials. The story audio recordings were transcribed, segmented, and aligned at both the word level (via the "jieba" toolbox implemented in Python) and the sentence level (via a sentence boundary segmentation task), and the resulting transcripts were used to generate predictive representations with the ridge regression models. These representations were reduced to 50 dimensions, resampled, and convolved with the hemodynamic response function (HRF) for the encoding model analyses.

c Roadmap of the group-based general linear model (gGLM). BOLD signals were collected while participants listened to the stories. The BOLD signals were then preprocessed and grouped into 400 parcels according to ref. 63. For each parcel, leave-one-subject-out (LOSO) cross-validation was employed to obtain the explained variance (R²) across participants.
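The regression step in panel a can be sketched as follows. This is a minimal illustration, not the authors' code: random vectors stand in for the WWM-RoBERTa embeddings, and the embedding dimensionality, sample count, and ridge penalty are assumptions.

```python
# Sketch of panel a: ridge regression from context embeddings to
# upcoming-unit embeddings, with an 80/20 train/test split.
# Random vectors stand in for WWM-RoBERTa embeddings (dims are assumed).
import numpy as np
from numpy.linalg import solve

rng = np.random.default_rng(0)
n_samples, ctx_dim, unit_dim = 1000, 768, 768   # assumed sizes
X = rng.standard_normal((n_samples, ctx_dim))   # context embeddings
Y = rng.standard_normal((n_samples, unit_dim))  # upcoming-unit embeddings

# 80/20 split, as described in the figure
n_train = int(0.8 * n_samples)
X_tr, X_te = X[:n_train], X[n_train:]
Y_tr, Y_te = Y[:n_train], Y[n_train:]

# Closed-form ridge solution: W = (X'X + alpha*I)^-1 X'Y
alpha = 1.0  # assumed penalty; in practice tuned on held-out data
W = solve(X_tr.T @ X_tr + alpha * np.eye(ctx_dim), X_tr.T @ Y_tr)

# Predicted representation of the upcoming unit for each test context
Y_pred = X_te @ W
```

Separate models of this form would be fit for word and sentence prediction, as the caption notes.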
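The final steps of panel b can be sketched as below. The caption does not name the reduction method or HRF parameters, so the SVD-based reduction, the double-gamma HRF shape, and all sizes here are illustrative assumptions.

```python
# Sketch of panel b's last steps: reduce predictive representations to
# 50 dimensions and convolve each dimension with a canonical HRF.
# SVD reduction, HRF parameters, and input sizes are assumptions.
import numpy as np
from math import gamma

def double_gamma_hrf(tr=2.0, duration=32.0):
    """Canonical double-gamma HRF sampled once per TR (assumed TR = 2 s)."""
    t = np.arange(0, duration, tr)
    peak = t ** 5 * np.exp(-t) / gamma(6)          # response peaking ~5 s
    undershoot = t ** 15 * np.exp(-t) / gamma(16)  # undershoot ~15 s
    h = peak - undershoot / 6.0
    return h / h.max()

rng = np.random.default_rng(2)
reps = rng.standard_normal((300, 768))  # time points x representation dims

# Dimensionality reduction to 50 via SVD (the paper's method may differ)
U, S, Vt = np.linalg.svd(reps, full_matrices=False)
reduced = reps @ Vt[:50].T  # shape (300, 50)

# Convolve each dimension with the HRF, truncating to the original length
hrf = double_gamma_hrf()
convolved = np.stack(
    [np.convolve(reduced[:, i], hrf)[:300] for i in range(50)], axis=1
)
```

The convolved time courses would then serve as regressors in the encoding model analyses.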
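The LOSO scheme in panel c can be sketched per parcel as follows. Synthetic data replace real BOLD time courses; the subject count, time-point count, and the simple least-squares encoding model are illustrative assumptions, not the gGLM implementation itself.

```python
# Sketch of panel c: leave-one-subject-out (LOSO) cross-validation for
# one parcel, yielding explained variance (R^2) per held-out subject.
# All data and the fitting procedure here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_timepoints, n_features = 5, 200, 10
X = rng.standard_normal((n_timepoints, n_features))  # shared regressors
betas = rng.standard_normal(n_features)
# Simulated parcel time courses: shared signal plus subject-specific noise
bold = X @ betas + 0.1 * rng.standard_normal((n_subjects, n_timepoints))

r2_scores = []
for s in range(n_subjects):
    train = np.delete(np.arange(n_subjects), s)
    # Fit least squares on the held-in subjects' mean signal
    y_train = bold[train].mean(axis=0)
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    # Evaluate explained variance on the held-out subject
    y_test = bold[s]
    y_hat = X @ w
    ss_res = np.sum((y_test - y_hat) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    r2_scores.append(1 - ss_res / ss_tot)

mean_r2 = float(np.mean(r2_scores))
```

Repeating this over all 400 parcels would give a parcel-wise map of cross-validated explained variance.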