Abstract
The unfolding of conversation facilitates the exchange of intricate thoughts and emotions. Meaning is progressively constructed by integrating both produced and perceived speech into hierarchical linguistic structures across multiple timescales, including words, sentences and discourse. However, the neural mechanisms underlying these interactive sense-making processes remain largely unknown. Here we used functional magnetic resonance imaging to measure brain activity during hours of spontaneous conversations, modelling neural representations of conversational content using contextual embeddings derived from a large language model (GPT) at varying timescales. Our results reveal that linguistic representations are both shared and distinct between production and comprehension, distributed across various functional networks. Shared representations, predominantly localized within language-selective regions, were consistently observed at shorter timescales, corresponding to words and single sentences. By contrast, modality-specific representations exhibited opposing timescale selectivity: shorter for production and longer for comprehension, suggesting that distinct mechanisms are involved in contextual integration. These findings suggest that conversational meaning emerges from the interplay between shared linguistic codes and modality-specific temporal integration, facilitating context-dependent comprehension and adaptive speech production.
Main
Humans can share complex thoughts and emotions through conversation, from casual greetings to formal discussions. Successful communication assumes a shared understanding of the context and goals, which provide a framework for what is talked about (content) and how the conversation unfolds (process)1. Conversational content can be conveyed not only through linguistic alignment between interlocutors2,3,4 but also through extralinguistic abilities such as social cognition, world knowledge and situation modelling5,6,7. By contrast, the process of conversation requires temporal alignment, enabling the seamless alternation between language production and comprehension8,9. Backchannel responses (for example, ‘yeah’) and fillers (for example, ‘uh’) play critical roles in reinforcing shared understanding and facilitating the coordination of speech planning and listening10,11,12. Moreover, conveying complex ideas often relies on structured narratives, which form the foundation of effective knowledge sharing13. Thus, conversation emerges as a multidimensional (linguistic and extralinguistic) and multitimescale joint activity involving the intricate interplay of language production and comprehension.
The neural underpinnings of language processing have been extensively investigated using functional magnetic resonance imaging (fMRI), particularly in the context of naturalistic narrative comprehension. Comprehension can be viewed as the transformation of low-level sensory inputs (for example, speech sounds and written text) into high-level hierarchical linguistic structures14,15,16,17,18,19,20. These structures are created by integrating linguistic information—both semantic and syntactic—across multiple timescales, ranging from word-level semantics to sentence meaning, ultimately culminating in a coherent narrative14,15,16,17,18,19,20,21,22,23,24. While these studies have highlighted the brain’s capacity to encode hierarchical linguistic structures across various sensory modalities (for example, listening and reading) and timescales (words, sentences, discourse and narrative arcs), the functioning of these mechanisms during real-time conversations remains largely unexplored.
In this study, we address this gap by investigating the neural representation of conversational content, focusing on two key hypotheses. The first posits the existence of a unified linguistic representation that integrates information across production and comprehension. This hypothesis aligns with evidence of shared neural representations for semantic and syntactic information across modalities25,26,27,28,29. The second proposes distinct representations for production and comprehension, enabling differentiation between self-generated speech and that of an interlocutor30. This dual representation aligns with findings of modality-specific syntactic processes31 and mechanisms that facilitate concurrent speech planning during listening32, along with the predictive coding of an interlocutor’s speech20,23.
To test these hypotheses, we recorded fMRI during natural conversations and used voxel-wise encoding modelling33,34 to characterize neural representations of conversational content across multiple timescales, ranging from 1 to 32 s. Our analyses addressed two key questions: (1) Are linguistic representations shared between production and comprehension? (2) Are certain linguistic representations uniquely associated with production or comprehension, or shared by both? Throughout the study, the term linguistic encompasses both semantic and syntactic dimensions, reflecting their tight interdependence35,36.
Results
Shared representations between production and comprehension
We collected fMRI data from eight native Japanese speakers during spontaneous conversations with an experimenter (Fig. 1a). Each participant talked about 27 topics across 27 runs (Supplementary Table 5), with each run lasting 7 min and 10 s, resulting in approximately 3 h of data per participant over three to four sessions. Despite efforts to balance the amount of speech production and comprehension, individual variation in speech samples was observed (Supplementary Fig. 1). Rigorous preprocessing minimized potential confounds from motion artifacts and peripheral articulatory and auditory processes, effectively isolating blood-oxygen-level-dependent (BOLD) responses associated with higher-level linguistic representations (Supplementary Figs. 2–5 and Supplementary Results).
a, Participants (N = 8) engaged in natural conversations on topics provided by the experimenter while undergoing fMRI. Speech was transcribed and temporally segmented for each fMRI TR (1,000 ms intervals). b, Utterances were concatenated with prior context (1–32 s) and processed by an instruction-tuned GPT model. The embeddings were extracted from the input layer and every third transformer layer for subsequent analysis. c, Voxel-wise encoding modelling with the separate linguistic model. Modality-specific GPT embeddings were combined across modalities and used to train an FIR regression model with banded ridge regularization for each voxel. The prediction performance was evaluated by correlating predicted and observed BOLD responses on held-out test data. d, The cross-modality prediction. The model weights were exchanged between modalities, and the prediction performance was assessed. e, The unified linguistic model. The GPT embeddings were extracted uniformly from concatenated content spanning both modalities.
We used voxel-wise encoding modelling, utilizing GPT embeddings extracted from conversational content as linguistic features (Fig. 1b). The transcriptions were transformed into contextual embeddings using an instruction-tuned GPT model37 that was fine-tuned for interactive language tasks. This GPT model comprises an input embedding layer and 36 transformer layers, each containing 2,816 hidden units. We extracted embeddings from 13 hierarchical layers (the input layer and every third transformer layer) across six context lengths (1, 2, 4, 8, 16 and 32 s). We averaged the embeddings across all tokens within each segment (or fMRI volume, repetition time (TR) of 1,000 ms), resulting in 78 feature combinations per modality (13 layers × 6 context lengths). These features were integrated into a joint model, referred to as the separate linguistic model, with 5,632 features (2 modalities × 2,816 features). A finite impulse response (FIR) model was used to predict BOLD responses with delays ranging from 2 to 7 s (5,632 features × 6 delays = 33,792 features). This joint model was fit to BOLD responses using banded ridge regression38,39 for each voxel. We used leave-one-session-out cross-validation, training the model on N − 1 sessions and testing it on the held-out session. The prediction accuracy was evaluated using Pearson’s correlation coefficients between observed and predicted BOLD responses. To account for autocorrelation, the data were divided into 20-s blocks, permuted to estimate the null distribution, and the correlations were calculated across 1,000 permutations to obtain P values. The reported prediction accuracy was averaged across cross-validation folds, and the combined P values were calculated using Fisher’s method.
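A minimal sketch of this fitting and evaluation procedure is given below. It uses scikit-learn's Ridge with a single regularization parameter as a simplified stand-in for the banded ridge regression (Himalaya) used in the study; the array names, shapes and feature ordering are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.linear_model import Ridge

DELAYS = (2, 3, 4, 5, 6, 7)  # hemodynamic delays in TRs (TR = 1 s)

def make_fir_features(X, delays=DELAYS):
    """Stack delayed copies of X (n_TRs x n_features) to build the FIR design."""
    n_trs, n_feats = X.shape
    X_fir = np.zeros((n_trs, n_feats * len(delays)))
    for i, d in enumerate(delays):
        X_fir[d:, i * n_feats:(i + 1) * n_feats] = X[:n_trs - d]
    return X_fir

def fit_and_test(X_sessions, Y_sessions, test_idx, alpha=1.0):
    """Leave-one-session-out fit of a voxel-wise linear model.

    X_sessions: list of (n_TRs x 5632) embedding matrices (production + comprehension)
    Y_sessions: list of (n_TRs x n_voxels) BOLD matrices, one per session
    Returns the voxel-wise Pearson correlation on the held-out session.
    """
    X_train = np.vstack([make_fir_features(X) for i, X in enumerate(X_sessions) if i != test_idx])
    Y_train = np.vstack([Y for i, Y in enumerate(Y_sessions) if i != test_idx])
    model = Ridge(alpha=alpha, fit_intercept=False).fit(X_train, Y_train)
    Y_pred = model.predict(make_fir_features(X_sessions[test_idx]))
    Y_test = Y_sessions[test_idx]
    # voxel-wise Pearson correlation between observed and predicted responses
    Yt = (Y_test - Y_test.mean(0)) / Y_test.std(0)
    Yp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
    return (Yt * Yp).mean(0)
```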
The separate linguistic model achieved good prediction accuracy across extensive cortical regions. For example, in participant P7, embeddings derived from an 8-s context at layer 18 exhibited high prediction accuracy in the bilateral prefrontal, temporal and parietal cortices (Fig. 2a and see Supplementary Figs. 6 and 9 for results from individual layers and participants). To investigate whether the average prediction accuracy across the cortex was influenced by layer position and context length, we conducted a linear mixed-effects (LME) model analysis. The participants were specified as random effects, permitting variations in the effects of context length and its squared term across participants (Methods). We found an inverted U-shaped relationship across timescales (context length: t(7) = 4.41, P = 0.0031, β = 0.74, 95% confidence interval (CI) 0.39 to 1.08; context length squared: t(7) = −4.54, P = 0.0027, β = −0.43, 95% CI −0.63 to −0.24, summarized in Supplementary Table 1) and across layers (layer position: t(597) = 11.36, P < 0.001, β = 0.15, 95% CI 0.12 to 0.17; layer position squared: t(597) = −17.95, β = −0.26, P < 0.001, 95% CI −0.29 to −0.23). In addition, we found a significant interaction effect between context length and layer position (t(597) = −6.95, P < 0.001, β = −0.09, 95% CI −0.11 to −0.06).
a, The prediction accuracy of the separate linguistic model is visualized on the flattened cortical surface of one participant under a specific condition (P7, context length of 32 s, layer 18) (see Supplementary Figs. 6 and 9 for individual layers and participants). The voxels with significant prediction accuracy (one-sided permutation test, P < 0.05, FDR corrected) are displayed. PFC, prefrontal cortex; MPC, medial parietal cortex; AC, auditory cortex; VC, visual cortex. b, The mean prediction performance across participants, averaged over voxels and layers. The cross-modality prediction was significantly less accurate than same-modality predictions (actual model: t(7) = 7.05, two-sided parametric test, P = 2.0 × 10−4, β = 0.057, 95% CI 0.040 to 0.074). c, Cross-modal voxel weight correlations at two context lengths (1 and 32 s), shown for one participant under a specific layer condition (P7, layer 18) (see Supplementary Fig. 14 for individual participant data). Only the voxels with prediction accuracy above 0.05 in both actual and cross-modality conditions are shown. d, The mean weight correlations across participants, highlighting linguistic (purple) and cross-modal (yellow) voxels. The cross-modal voxels exhibited significantly higher weight correlations (cross-modal voxels: t(7) = 11.35, two-sided parametric test, P = 9.2 × 10−6, β = 0.20, 95% CI 0.17 to 0.24). The shaded regions in b and d represent the standard deviation across participants. LH, left hemisphere; RH, right hemisphere.
Next, we addressed the question of how neural linguistic representations are shared between production and comprehension. To evaluate this, we assessed cross-modality prediction accuracy by interchanging the model weights between production and comprehension (Fig. 1d). This analysis was restricted to ‘linguistic voxels,’ defined as those exhibiting significant prediction accuracy in the separate linguistic model. We found a notable reduction in cross-modality prediction accuracy (Fig. 2b and see Supplementary Figs. 7 and 9 for individual layers and participants), consistent with findings from an LME model (actual model: t(7) = 7.05, P < 0.001, β = 0.057, 95% CI 0.040 to 0.074). Significantly predicted voxels were scattered across prefrontal, temporal, parietal and occipital cortices (Fig. 2a). Notably, cross-modality prediction accuracy increased with longer context lengths, as revealed by a significant fixed effect of context length (t(7) = 4.01, P = 0.0052, β = 0.40, 95% CI 0.15 to 0.66). These findings suggest that while linguistic representations are partially shared between production and comprehension, the topographic organization is modulated by timescale.
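Cross-modality prediction can be sketched as a block swap of the fitted weights. The feature ordering assumed here (production followed by comprehension embeddings within each delay block) is an assumption of this illustration, not a description of the authors' implementation.

```python
import numpy as np

def swap_modality_weights(W, n_per_mod=2816, n_delays=6):
    """Exchange production and comprehension weight blocks of a fitted model.

    W: (n_delays * 2 * n_per_mod, n_voxels) weight matrix, assumed to be ordered
    [production, comprehension] within each delay block.
    """
    W_swap = W.copy()
    block = 2 * n_per_mod
    for d in range(n_delays):
        prod = slice(d * block, d * block + n_per_mod)
        comp = slice(d * block + n_per_mod, (d + 1) * block)
        W_swap[prod], W_swap[comp] = W[comp], W[prod]
    return W_swap

# cross-modality predictions are then obtained by applying the swapped weights
# to the held-out FIR features: Y_cross = X_test_fir @ swap_modality_weights(W)
```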
Although these results demonstrated generalizable linguistic representations across modalities, another critical question arises: is a unified linguistic representation sufficient for accurate predictions? To address this, we developed a unified linguistic model that extracted GPT embeddings from combined transcripts (Fig. 1e). Compared with the separate linguistic model, the unified linguistic model showed a slight reduction in prediction accuracy (Fig. 2b and see Supplementary Figs. 8 and 9 for individual layers and participants), as indicated by a significant fixed effect of model type (separate linguistic model: t(1227) = 4.01, P < 0.001, β = 3.4 × 10−3, 95% CI 2.7 × 10−3 to 4.1 × 10−3). Further LME analysis of the unified linguistic model revealed significant effects of layer position (layer position: t(597) = 7.45, P < 0.001, β = 0.08, 95% CI 0.06 to 0.11; layer position squared: t(597) = −22.79, P < 0.001, β = −0.29, 95% CI −0.31 to −0.26) and its interaction with context length (t(597) = −9.45, P < 0.001, β = −0.11, 95% CI −0.13 to −0.08), whereas no significant fixed effect of context length was detected. These findings suggest that the lower prediction performance of the unified linguistic model, relative to the separate linguistic model, may be attributable to its limited ability to leverage longer contextual information to enhance predictions.
Next, we quantified the similarities in the linguistic representations across modalities. Given the considerably decreased prediction accuracy from same-modality to cross-modality predictions, we hypothesized that these voxels might exhibit similar yet unique linguistic tuning for each modality (that is, a weak positive correlation). Here, we focus on ‘cross-modal voxels’ that demonstrated robust prediction in both same- and cross-modality conditions (r > 0.05). For each voxel, the weight correlation was calculated as Pearson’s correlation coefficient between the production and comprehension weights of the separate linguistic model. Cross-modal voxels exhibited moderately positive correlations across layers and timescales (Fig. 2d and see Supplementary Fig. 11 for individual layers and participants) and showed higher weight correlations compared with linguistic voxels (cross-modal voxels: t(7) = 11.35, P < 0.001, β = 0.20, 95% CI 0.17 to 0.24). Notably, positively correlated voxels clustered within the prefrontal, temporal and parietal cortices at shorter timescales (1–4 s) (Fig. 2c and see Supplementary Fig. 14 for individual participant data). By contrast, at longer timescales (16–32 s), positively correlated voxels were more diffusely distributed and idiosyncratic among participants. These findings suggest that while cross-modal voxels may share some aspects of linguistic representation, they also exhibit unique tuning across modalities.
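The voxel-wise weight correlation between modalities can be computed as follows; the inputs are assumed to be the (for example, delay-averaged) production and comprehension weight blocks of the separate linguistic model, and this sketch is illustrative rather than the authors' code.

```python
import numpy as np

def modality_weight_correlation(W_prod, W_comp):
    """Pearson correlation between production and comprehension weight vectors,
    computed independently for each voxel (inputs: n_features x n_voxels)."""
    Wp = (W_prod - W_prod.mean(0)) / W_prod.std(0)
    Wc = (W_comp - W_comp.mean(0)) / W_comp.std(0)
    return (Wp * Wc).mean(0)
```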
Modality-specific timescale selectivity
We next explored modality-specific linguistic representations by fitting the production-only and comprehension-only linguistic models. These models utilized modality-specific contextual embeddings to quantify the variance in BOLD responses that could be uniquely attributed to each modality. Variance partitioning16,40 was used to assign variance to either production or comprehension using the following equations
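$$R^2_{\text{unique, production}} = R^2_{\text{separate}} - R^2_{\text{comprehension only}}, \qquad R^2_{\text{unique, comprehension}} = R^2_{\text{separate}} - R^2_{\text{production only}},$$

where $R^2_{\text{separate}}$, $R^2_{\text{production only}}$ and $R^2_{\text{comprehension only}}$ denote the variance explained by the separate, production-only and comprehension-only linguistic models, respectively.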
We found that production explained more variance at shorter timescales (1–4 s) (Fig. 3a and see Supplementary Fig. 10 for individual layers and participants). The LME analysis revealed significant effects of context length (t(7) = −3.87, P = 0.0061, β = −0.40, 95% CI −0.62 to −0.18), layer position (layer position: t(597) = 3.72, P < 0.001, β = 0.049, 95% CI 0.023 to 0.075; layer position squared: t(597) = −8.82, P < 0.001, β = −0.13, 95% CI −0.16 to −0.10) and their interaction (t(597) = −5.27, P < 0.001, β = −0.070, 95% CI −0.096 to −0.044). By contrast, comprehension explained more variance at longer timescales (16–32 s). The LME analysis indicated an inverted U-shaped relationship for context length (context length: t(7) = 4.55, P = 0.0026, β = 1.02, 95% CI 0.55 to 1.49; context length squared: t(7) = −2.42, P = 0.046, β = −0.41, 95% CI −0.75 to −0.06) and layer position (layer position: t(598) = 8.33, P < 0.001, β = 0.11, 95% CI 0.08 to 0.13; layer position squared: t(598) = −11.57, P < 0.001, β = −0.16, 95% CI −0.19 to −0.13). Across participants, the context length that maximized prediction accuracy consistently varied between modalities, with production peaking at shorter timescales and comprehension at longer timescales (Fig. 3b). These findings suggest distinct timescale selectivity for production and comprehension.
a, The mean variance (across voxels and layers) uniquely explained by production or comprehension, averaged across participants. b, The mean context lengths (across layers) maximizing unique variance explained for each modality and participant. c, The mean weight correlation (across voxels and layers) between the unified and separate linguistic models, calculated for each modality and averaged across participants. d, The changes in weight correlations between the unified and separate linguistic models for two context lengths (1 and 32 s) on the flattened cortical surface of one participant under one condition (P7, layer 18) (see Supplementary Fig. 15 for individual participant data). The voxels with good unified model prediction accuracy (r ≥ 0.05) are shown. PFC, prefrontal cortex; MPC, medial parietal cortex; AC, auditory cortex; VC, visual cortex. The shaded areas in a–c represent the standard deviation across layers.
To further investigate modality-specific timescale selectivity, we compared the weights of the unified linguistic model with those of the separate linguistic model by calculating voxel-wise weight correlations. For production, the LME analysis revealed a U-shaped relationship across timescales (context length: t(7) = −8.59, P < 0.001, β = −1.01, 95% CI −1.26 to −0.77; context length squared: t(7) = 5.59, P < 0.001, β = 0.32, 95% CI 0.20 to 0.44) (Fig. 3c and see Supplementary Fig. 12 for individual layers and participants). A similar U-shaped relationship was observed for comprehension (context length: t(7) = 0.70, P = 0.50, β = 0.12, 95% CI −0.23 to 0.47; context length squared: t(7) = 3.30, P = 0.013, β = 0.23, 95% CI 0.08 to 0.37). The weights of the unified linguistic model were more closely aligned with production at shorter timescales, while they resembled comprehension at longer timescales (Fig. 3d and Supplementary Fig. 12 for individual layers and participants). These results underscore modality-specific timescale selectivity, with production favouring shorter contexts and comprehension benefiting from longer contexts.
To mitigate potential biases in variance partitioning results due to disparities in sample sizes, we examined the correlation between production-to-comprehension sample size ratios and the corresponding variance explained. A significant correlation was observed for early layers at a 1-s context length, with Spearman’s rank correlation rho of 1.00 for layer 0 and 0.93 for layer 3 (P < 0.05, false discovery rate (FDR) corrected). The participants who produced more speech demonstrated greater variance explained by production under these conditions. Importantly, the sample proportions were balanced overall, with four participants producing more speech (P3, P4, P6 and P7) and the remaining four comprehending more (P1, P2, P5 and P8). These findings confirm that variance partitioning results were not systematically biased towards either modality across participants.
Dual linguistic representations in bimodal voxels
After analysing the variance uniquely explained by production and comprehension, we investigated the shared variance explained by both modalities. The shared variance was calculated as follows
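$$R^2_{\text{shared}} = R^2_{\text{production only}} + R^2_{\text{comprehension only}} - R^2_{\text{separate}}.$$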
We found that shared variance increased progressively with longer context lengths, peaking at an average of 8 s (Fig. 4a and see Supplementary Fig. 10 for individual layers and participants). Notably, for all participants, the context length that maximized shared variance exceeded 8 s (Fig. 4b). The LME analysis revealed an inverted U-shaped relationship across timescales (context length: t(7) = 4.55, P < 0.001, β = 1.02, 95% CI 0.68 to 1.36; context length squared: t(7) = −4.59, P = 0.0025, β = −0.49, 95% CI −0.71 to −0.27). A significant quadratic effect of layer position was also observed (layer position squared: t(597) = −12.30, P < 0.001, β = −0.20, 95% CI −0.23 to −0.17).
a, The mean variance (across voxels and layers) explained by the intersection of production and comprehension (shared variance), averaged across participants. b, The mean context lengths (across layers) maximizing the variance explained by the intersection for each participant. c, The cortical surface showing the best variance partition at two context lengths (1 and 32 s) for one participant in one condition (P7, layer 18) (see Supplementary Fig. 16 for individual layers and participants). The voxels with good prediction accuracy (r > 0.05) in the separate linguistic model are shown. PFC, prefrontal cortex; MPC, medial parietal cortex; AC, auditory cortex; VC, visual cortex. d, The mean weight correlation (across voxels and layers) for bimodal voxels compared with production-only and comprehension-only voxels, averaged across participants. The bimodal voxels showed significantly lower weight correlation than production-only (production: t(8.4) = 11.99, two-sided parametric test, P = 1.4 × 10−6, β = 0.038, 95% CI 0.032 to 0.045) and comprehension-only voxels (comprehension: t(1227) = −21.00, two-sided parametric test, P < 2.2 × 10−16, β = 0.021, 95% CI 0.019 to 0.023). e, The weight correlations for bimodal linguistic voxels at two context lengths (1 and 32 s) shown for one participant in one condition (P7, layer 18) (see Supplementary Fig. 16 for all participants). The voxels with good prediction accuracy (r > 0.05) in the separate linguistic model are shown. The shaded areas in a, b and d represent the standard deviation across layers.
To map the topographic organization of selectivity to a single modality or shared between both, we created cortical maps depicting the patterns that explained the largest variance for each voxel. These maps revealed that voxels with the largest shared variance, hereafter referred to as ‘bimodal voxels,’ were distributed across various cortical regions (Fig. 4c and see Supplementary Fig. 16 for individual layers and participants). Notably, contextual information spanning 8 s or longer appeared to drive substantial bimodal responses, suggesting the integration of linguistic information across modalities.
We then examined whether bimodal voxels exhibited distinct linguistic tuning for production and comprehension. To achieve this, we calculated Pearson’s correlations of the separate linguistic model weights for production and comprehension specifically for bimodal voxels (Fig. 4c). The correlations were slightly negative and close to zero (Fig. 4d and see Supplementary Fig. 13 for individual layers and participants), suggesting that bimodal voxels are independently or dissimilarly tuned for the two modalities. In comparison with unimodal voxels—those with the largest unique variance for either production or comprehension—the bimodal voxels exhibited more negative correlations. The LME analysis confirmed this difference (production: t(8.4) = 11.99, P < 0.001, β = 0.038, 95% CI 0.032 to 0.045; comprehension: t(1227) = −21.00, P < 0.001, β = 0.021, 95% CI 0.019 to 0.023). These findings indicate that the bimodal voxels are independently tuned for each modality, reflecting the distinct linguistic demands of production and comprehension.
To ensure that these findings were not influenced by the instruction-tuned GPT model41, we replicated the core analyses using a base GPT model before instruction tuning. The results were consistent, confirming that instruction tuning did not affect the observed effects (Supplementary Fig. 18).
Revealing semantic tuning to interactive language
To elucidate the linguistic organization underlying conversational content, we conducted principal component analysis (PCA) on the 2,816-dimensional separate linguistic model weights for each modality and participant. Building on previous research that mapped cortical semantic representations during natural speech comprehension21,42, we adapted this framework for natural conversations. Due to variability in conversational content, a PCA was conducted separately for each participant and independently for production and comprehension, yielding modality-specific principal components (PCs). To assess the statistical robustness of the identified PCs, we performed a comparable PCA on GPT embeddings of the corresponding speech stimuli (‘stimulus PCA’). We quantified the variance explained by each PC through 1,000 bootstrap resampling iterations to establish statistical significance.
Our analysis revealed the highest number of significant PCs in the embedding layer (layer 0) at a context length of 1 s for both production and comprehension (Fig. 5b,f). For production, four PCs were identified in four participants, three PCs in three participants and five PCs in one participant (P < 0.001, bootstrap test) (Supplementary Fig. 19). For comprehension, five PCs were identified in four participants, while four PCs were identified in the remaining four participants (P < 0.001, bootstrap test).
Production PC results are shown in a–d; comprehension PC results are shown in e–h. a,e, The variance explained in the separate linguistic model weights by each of the top ten PCs identified for a participant (P7) at one context-layer condition. The grey lines represent the variance explained by the PCs of the conversation content. The stimulus PCs were aligned with weight PCs using the Gale–Shapley stable marriage algorithm. The data are presented as the original PCA results, with the error bars representing the standard deviation across 1,000 bootstrap samples. The standard deviations for the model weight PCs are very small. b,f, The mean number of significant PCs across participants (see Supplementary Fig. 17 for individual participant data). c,g, The utterances most strongly correlated with significant PCs for a participant (P7). The numbers in parentheses indicate the correlations between PC coefficients and GPT embeddings. d,h, The common PCs across participants identified for a 1-s context and layer 0 using ChatGPT.
To interpret these significant PCs, we analysed the conversation content most strongly correlated with each PC. The majority of highly correlated content comprised words and phrases characteristic of interactive contexts, such as backchannel responses and conversational fillers (for example, participant P7) (Fig. 5c,g). To identify PCs consistently observed across participants, we utilized ChatGPT to analyse the top 20 most strongly correlated phrases for each PC, summarizing the common patterns. For production, four components emerged: (1) clear and structured speech versus ambiguous and spontaneous speech, (2) active discourse development versus conversational maintenance with backchannels, (3) emotional and empathetic speech versus factual and logical speech and (4) cautious speech with high cognitive load versus fluent and cooperative speech. For comprehension, four components were identified: (1) immediate conversational flow versus deliberate speech planning, (2) emotional and empathetic versus logical and information-driven, (3) passive listening versus active speech leadership and (4) concrete, experience-based versus abstract, conceptual speech. These findings demonstrate that both production and comprehension are tuned to the semantic demands of interactive language, revealing shared lexical–semantic components across participants. This highlights a consistent semantic organization that supports real-time social communication.
Discussion
This study explored the neural representations of conversational content across production and comprehension modalities and multiple timescales. We identified shared linguistic representations exhibiting timescale-dependent topographic organization (Fig. 2). For shorter contexts (1–4 s), corresponding to words and single sentences, shared representations were localized in higher-order association cortices, including the prefrontal, temporal and parietal regions. By contrast, for longer contexts (16–32 s), spanning multiple conversational turns, these shared representations were more distributed and idiosyncratic among participants. Furthermore, modality-specific timescale selectivity revealed enhanced encoding for shorter contexts during production and for longer contexts during comprehension (Fig. 3), suggesting distinct temporal integration processes. We also identified dual linguistic representations in bimodal voxels, encoding modality-specific information for both production and comprehension (Fig. 4). Despite these timescale-specific patterns, our analysis of low-dimensional linguistic representations revealed lexical–semantic components predominantly associated with shorter timescales (Fig. 5).
Theoretical models for the neural mechanism of conversation have proposed a common neural basis for language production and comprehension2,43. Empirical studies have adopted two primary approaches to examine this commonality: (1) the between-subjects approach, which examines the transmission of messages from speaker to listener3,4,44 and (2) the within-subject approach, which investigates shared neural mechanisms within individuals25,26,27,28,29. Our study contributes to the within-subject approach by revealing both shared and distinct neural representations and their modulation by contextual timescales within individual participants.
Previous neuroimaging studies, using the within-subject approach, have manipulated the semantic and syntactic dimensions of stimuli to reveal shared representations25,26,27,28. Recent research utilizing spontaneously generated sentences and conversations has enhanced ecological validity45,46, uncovering shared semantic and syntactic representations during natural language use29,31. For instance, recent research29 used electrocorticography during natural conversations and modelled transient neural activity before and after word onset, identifying overlapping regions for word production and comprehension. However, two critical questions remain unresolved: (1) whether linguistic representations generalize across modalities and (2) how these shared representations vary across timescales. Our study addresses these gaps, demonstrating the generalizability of shared representations and their modulation by the amount of contextual information.
The topographic organization of shared neural representations varied across multiple timescales. At shorter timescales (1–4 s), corresponding to the duration of words and single sentences, shared representations were localized in higher-order brain regions, including the bilateral prefrontal, temporal and parietal cortices. This finding aligns with previous studies that mapped neural representations of intermediate linguistic structures, such as words and single sentences, onto these regions during naturalistic narrative comprehension14,15,19,24. These regions have consistently been associated with sentence-level processing in traditional neuroimaging studies of isolated sentences presented at shorter timescales (less than 6 s)25,27,28. Furthermore, these brain regions partially overlap with those involved in linguistic knowledge and processes that are shared across both production and comprehension26,47. Therefore, the shared representations observed at shorter timescales suggest the presence of a common neural code for sentence-level linguistic information (‘sentence meaning’).
By contrast, at longer timescales (16–32 s), spanning multiple conversational turns, shared representations were distributed across broader cortical regions, exhibiting notable interindividual variability. Some participants (P1, P2, P4, P6 and P7) demonstrated shared representations extending into brain regions associated with the default mode network and the theory of mind (ToM) network. The default mode network has been implicated in representing higher-order discourse and narrative frameworks by integrating extrinsic information (for example, utterances) with intrinsic information (that is, prior context and memory)7,14,48. Similarly, the ToM network supports reasoning about others’ mental states, a critical function in both language production and comprehension during conversations47,49,50. This network is particularly engaged in inferring the mental states of conversational partners, thereby facilitating pragmatic inferences about that particular individual49,51,52. These findings suggest that shared representations at longer timescales support the integration of incoming conversational content with prior conversational context, as well as with broader social knowledge and beliefs. Such integration may support the formation of a psychological model of the situation, enabling inferences about the interlocutor’s intended meaning (‘speaker meaning’). Individual differences in the spatial distribution of these shared representations may reflect variability in discourse-level integration strategies.
The contrasting timescale selectivity between production and comprehension may reflect their distinct functional demands in processing linguistic input and output. Our findings demonstrate that language comprehension exhibits enhanced encoding for longer timescales, consistent with the requirements of real-world language comprehension. Effective comprehension necessitates the integration of linguistic input with world knowledge, beliefs and memory to extract meaning from extended contexts5,7,48. This is consistent with evidence indicating that the brain prioritizes understanding broader discourse-level and overarching meanings over shorter units, such as individual words or sentences14,19. By contrast, we found that language production is characterized by enhanced encoding for shorter timescales. Production involves extensive preparatory processes, including ideation, lexical selection, syntactic structuring and speech planning, all of which occur before speech output9,32,51,53,54,55,56. Furthermore, production must dynamically adapt to the interlocutor’s immediate reactions, ensuring fluid and responsive communication57,58. These demands suggest that production prioritizes responsiveness and flexibility over reliance on extended contextual information.
Dual representations in the bimodal voxels exhibited selectivity for longer timescales (exceeding 8 s), corresponding to the integration of multiple sentences into coherent discourse. These representations probably facilitate the ability to maintain and distinguish perspectives, a critical function during conversation30. Conversations inherently require participants to navigate distinct perspectives, which must be differentiated at the neural level. Such interpersonal cognitive processes, integral to managing multiple perspectives, are probably not limited to external communication but may also underpin internal speech processes59.
Despite this modality-specific timescale selectivity, our PCA identified similar lexical–semantic components across modalities within the embedding layer at short timescales (1–4 s). Our results potentially extend the seminal work of Huth and colleagues21, which comprehensively mapped fine-grained semantic representations during natural speech listening using word embeddings. Specifically, that study identified the first PC differentiating between ‘humans and social interaction’ and ‘perceptual descriptions, quantitative descriptions and setting’, thereby separating social content from physical content. Our conversational data offered a unique opportunity to examine the semantic space surrounding social words in greater depth. Notably, our identified PCs reflected social interaction nuances, such as backchannels, confirmations and fillers. These elements require minimal cognitive effort yet are vital for maintaining conversational flow1,10,11,12. By contrast, PCs linked to ‘factual and logical speech’ or ‘logical and information-driven’, such as referring to locations or objects, were identified as the opposite pole of the social components (‘emotional and empathetic’). This suggests that interactive language enhances the neural representation of social content, highlighting the interplay between semantic representations and social cognition.
Several limitations of the present study should be noted. We did not conduct functional localizer tasks to delineate specific functional networks, such as the language network and ToM network. Thus, our analysis could not precisely attribute voxel clusters to specific functional networks.
Our findings shed light on temporally hierarchical neural linguistic representations underlying both sentence meaning and speaker meaning during real-world conversations. Modality-aligned representations were primarily localized to brain regions involved in processing word- and sentence-level linguistic information over shorter timescales, while modality-specific representations exhibited distinct timescale selectivity: shorter contexts for production and longer contexts for comprehension. These findings emphasize the importance of investigating the neurobiological basis of language within socially interactive contexts to comprehensively understand human language use.
Methods
Participants
Eight healthy, right-handed native Japanese speakers (P1–P8) participated in the fMRI experiment. The participants comprised five males (P1: age 22, P2: age 22, P3: age 23, P5: age 20 and P8: age 20) and three females (P4: age 22, P6: age 20 and P7: age 20). All participants were confirmed as right-handed through the Edinburgh Handedness Inventory60 (with a laterality quotient score of 75–100), and they had normal hearing as well as normal or corrected-to-normal vision. The experimental protocol was approved by the Ethics and Safety Committee of the National Institute of Information and Communications Technology, Osaka, Japan. Written informed consent was obtained from all participants before the experiment.
Natural dialogue experiment
The experiment consisted of 27 conversation topics, including self-introduction and favourite classes (Supplementary Table 5). These topics were selected to cover a wide range of semantic domains relevant to daily life, such as knowledge, memory, imagination and temporal and spatial cognition, referencing the Corpus of Everyday Japanese. Each fMRI run lasted 7 min and 10 s and focused on a specific topic. The participants engaged in unscripted, natural dialogues, freely expressing their thoughts and emotions while responding in real time to their interlocutor’s input. Speech was delivered and recorded via fMRI-compatible insert earphones and a noise-cancelling microphone, respectively. Both the participants’ and interlocutor’s speech were recorded separately for subsequent analysis. Each participant completed 27 runs across four sessions, except for P3, who completed three sessions. Because only a single valid run was collected in one session, only three sessions were analysed for P2 and P5, and the analysis included two to ten runs per session (Supplementary Table 4). On average, the participants produced speech during 217.1 ± 26.0 (mean ± standard deviation) fMRI volumes per run (range 170.6–262.1), while comprehending speech during 214.4 ± 11.8 volumes (range 199.2–234.0) (Supplementary Fig. 1).
MRI data acquisition
Magnetic resonance imaging (MRI) data were collected on a 3T MRI scanner at CiNet. Participants P1–P5 were scanned on a Siemens MAGNETOM Prisma, while P6–P8 were scanned on a Siemens MAGNETOM Prisma Fit, both equipped with 64-channel head coils. Functional images were acquired using a T2-weighted gradient echo multiband echo-planar imaging sequence61 in interleaved order, covering the entire brain. The imaging parameters were as follows: TR of 1.0 s, echo time (TE) of 30 ms, flip angle of 60°, matrix size of 96 × 96, field of view of 192 mm × 192 mm, voxel size of 2 mm × 2 mm × 2 mm, slice gap of 0 mm, 72 axial slices, multiband factor of 6. High-resolution anatomical images were obtained using a T1-weighted MPRAGE sequence with the following parameters: TR of 2.53 s, TE of 3.26 ms, flip angle of 9°, matrix size of 256 × 256, field of view of 256 mm × 256 mm, voxel size of 1 mm × 1 mm × 1 mm.
fMRI data preprocessing
The fMRI data were preprocessed using the Statistical Parametric Mapping toolbox (SPM8). Motion correction was applied to each run, aligning all volumes to the first echo-planar imaging frame for each participant. To remove low-frequency drift, we used a median filter with a 120-s window. The response for each voxel was then normalized by subtracting the mean response and scaling to unit variance. The cortical surfaces were identified using FreeSurfer62,63, which registered the anatomical data with the functional data. For each participant, only voxels identified within the cerebral cortex were included in the analysis, ranging from 64,072 to 72,018 voxels per participant. The flatmaps were generated by projecting voxel values onto cortical surfaces using Pycortex64. Cortical anatomical parcellation was performed using the Destrieux Atlas65, and the resulting parcellations were visualized on cortical surface maps.
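A minimal sketch of the drift removal and normalization step, assuming one BOLD matrix of shape (n_TRs × n_voxels) per run; the running-median implementation shown here is illustrative rather than the authors' exact code.

```python
import numpy as np
from scipy.ndimage import median_filter

def detrend_and_zscore(bold, tr=1.0, window_s=120):
    """Remove low-frequency drift with a 120-s running-median filter and
    z-score each voxel's time course (bold: n_TRs x n_voxels, one run)."""
    size = int(window_s / tr)
    trend = median_filter(bold, size=(size, 1), mode="nearest")
    detrended = bold - trend
    return (detrended - detrended.mean(0)) / detrended.std(0)
```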
Transcription and temporal alignment
Conversational speech was transcribed morphologically using Microsoft Azure Speech-to-Text, followed by manual correction for accuracy. The morphemes were grouped into meaningful semantic chunks approximating the fMRI TR (1,000 ms) and temporally aligned to the corresponding fMRI volumes using the midpoint of each chunk’s duration.
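The midpoint-based alignment can be illustrated as follows; the chunk tuple format is an assumption of this sketch.

```python
def assign_chunks_to_trs(chunks, tr=1.0):
    """Assign each transcribed chunk to the fMRI volume whose acquisition window
    contains the chunk's temporal midpoint.

    chunks: iterable of (text, onset_s, offset_s) tuples
    returns: dict mapping TR index -> concatenated text for that volume
    """
    aligned = {}
    for text, onset, offset in chunks:
        tr_index = int(((onset + offset) / 2) // tr)
        aligned.setdefault(tr_index, []).append(text)
    return {k: " ".join(v) for k, v in aligned.items()}
```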
Contextual embedding extraction
To extract contextual embeddings from the content of conversations, we utilized an instruction-tuned language model (GPT) fine-tuned specifically for Japanese37 (https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft). This model is built on the open-source GPT-NeoX architecture66 and was pretrained to predict the next word on the basis of preceding context using 312.5 billion tokens from various Japanese text datasets: Japanese CC-100, Japanese C4 and Japanese Wikipedia. For comparative purposes, we also replicated our analysis using the non-instruction-tuned version of the model (https://huggingface.co/rinna/japanese-gpt-neox-3.6b) as detailed in Supplementary Fig. 18. Instruction tuning was performed using datasets translated into Japanese, including Anthropic HH RLHF data, FLAN Instruction Tuning data and the Stanford Human Preferences Dataset. The resulting model architecture comprises 36 transformer layers with hidden unit dimensions of 2,816.
We processed transcribed utterances using GPT-NeoX with context lengths of 1, 2, 4, 8, 16 and 32 s, extracting embeddings by averaging the internal representations of all tokens within each utterance. To investigate differences in prediction accuracy across model layers, we extracted embeddings from the input layer (embedding layer), as well as every third layer within the model. As a control to account for predictions potentially driven by low-level sensory or motor brain activity, we generated random normal embeddings4 with the same dimensionality as the GPT embeddings (2,816 features). These embeddings were matched to individual utterance instances corresponding to each TR (for example, ‘something I’m thinking’ in Fig. 1a).
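A sketch of the embedding extraction using the Hugging Face transformers library and the model named above; the construction of the context window is simplified, variable names are assumptions, and only the tokens of the current TR's segment are averaged, as described in the main text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "rinna/japanese-gpt-neox-3.6b-instruction-sft"
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def segment_embeddings(utterance, context="", layers=range(0, 37, 3)):
    """Return one 2,816-dimensional embedding per selected layer (0 = input
    embedding layer), averaged over the tokens of the current segment and
    contextualized by the preceding 1-32 s of conversation."""
    ids_ctx = tokenizer(context, add_special_tokens=False)["input_ids"]
    ids_utt = tokenizer(utterance, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor([ids_ctx + ids_utt])
    with torch.no_grad():
        out = model(input_ids=input_ids)
    n_utt = len(ids_utt)
    # out.hidden_states: tuple of 37 tensors of shape (1, n_tokens, 2816)
    return {l: out.hidden_states[l][0, -n_utt:].mean(dim=0).numpy() for l in layers}
```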
Head motion model construction
To account for BOLD signal variance attributable to head motion, six translational and rotational motion parameters estimated during preprocessing were included as regressors. Frame-wise displacement values, calculated following previous research67, were also incorporated. A distance of 50 mm between the cerebral cortex and the head centre was assumed in accordance with a prior study67.
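Frame-wise displacement as defined in ref. 67 can be computed from the six realignment parameters as follows; the parameter ordering (translations in mm, then rotations in radians) is the SPM convention assumed in this sketch.

```python
import numpy as np

def framewise_displacement(motion_params, radius=50.0):
    """Frame-wise displacement: sum of absolute frame-to-frame differences of
    the six rigid-body parameters, with rotations converted to millimetres as
    arc length on a sphere of the assumed 50 mm head radius.

    motion_params: (n_TRs, 6) array of translations (mm) and rotations (radians).
    """
    diffs = np.abs(np.diff(motion_params, axis=0))
    diffs[:, 3:] *= radius  # radians -> arc length in mm
    return np.concatenate([[0.0], diffs.sum(axis=1)])
```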
Separate and unified linguistic model construction
We constructed two linguistic models to evaluate hypotheses regarding the neural representation of language production and comprehension. The separate linguistic model assumes independent representations for production and comprehension, combining contextual embeddings extracted separately for each modality. Each embedding set comprised 2,816 features, derived from one of the 78 combinations of 13 layers (0, 3, …, 36) and 6 context lengths (1, 2, 4, 8, 16 and 32 s) per modality, and the two modalities together yielded a total of 5,632 features (2 modalities × 2,816 features). Identical layer–context combinations were used for both modalities. The unified linguistic model assumes shared neural representations for production and comprehension. It utilized 2,816 contextual embeddings derived from the combined speech content of both modalities within each TR. If only one modality was present, embeddings were derived solely from that modality.
Voxel-wise model estimation and testing
To model cortical activity in individual voxels, we used an FIR model, which accounts for the slow hemodynamic responses and their coupling to neural activity. Although the canonical hemodynamic response function (HRF) is widely used in fMRI studies, it assumes a uniform HRF shape across cortical voxels. This simplification can result in inaccuracies, given that the shape of the hemodynamic response varies across cortical regions68. To address this variability, we concatenated 5,632 linguistic features with time delays spanning two to seven samples (2–7 s), yielding a total of 33,792 features. We modelled the BOLD responses as a linear combination of these features, with weights estimated using banded ridge regression, implemented via the Himalaya package38,39. Regularization parameters were optimized through fivefold cross-validation, exploring ten values between 10−2 and 107. Model testing utilized leave-one-session-out cross-validation, in which one session was withheld for testing while the remaining sessions served as training data. The prediction accuracy was evaluated by calculating Pearson’s correlation coefficient between observed and predicted BOLD responses in the test dataset. Statistical significance was determined through a one-sided permutation test. A null distribution was generated by permuting 20-TR blocks (20 s) of the left-out test data 1,000 times, recalculating the correlation for each permutation. Multiple comparisons were corrected using the FDR procedure69.
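The block permutation test for a single voxel can be sketched as follows; variable names and the handling of partial blocks are illustrative assumptions. Across voxels, the resulting P values can then be corrected with a standard FDR procedure.

```python
import numpy as np

def block_permutation_pvalue(y_true, y_pred, block=20, n_perm=1000, seed=0):
    """One-sided permutation test of a voxel's prediction accuracy.

    The held-out time series is cut into 20-TR blocks that are shuffled as
    units, preserving autocorrelation within blocks, and the observed
    correlation is compared with the resulting null distribution."""
    rng = np.random.default_rng(seed)
    n = (len(y_true) // block) * block
    yt, yp = y_true[:n], y_pred[:n]
    r_obs = np.corrcoef(yt, yp)[0, 1]
    null = np.empty(n_perm)
    for i in range(n_perm):
        order = rng.permutation(n // block)
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in order])
        null[i] = np.corrcoef(yt[idx], yp)[0, 1]
    return (null >= r_obs).mean()

# P values across voxels can then be corrected, for example with
# statsmodels.stats.multitest.multipletests(pvals, method="fdr_bh")
```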
LME model analysis
To explore how timescales and layer positions influence prediction accuracy and weight correlations, we conducted LME model analyses using the lmer function from the lmerTest package (version 3.1-3)70 in R (version 4.3.3). Fixed effects included layer position, context length, their interaction and quadratic terms for both predictors to capture potential non-linear relationships. The models included by-participant random intercepts and random slopes for context length and its quadratic term, allowing for individual variability in the effects of contextual information. To assess the influence of encoding model type or voxel type, we extended the LME model structure by adding type as both a fixed effect and a by-participant random slope. Finally, we simplified the models by stepwise removal of non-significant predictors, selecting the model structure with the lowest Akaike information criterion; degrees of freedom were estimated using the Kenward–Roger approximation. P values smaller than 2.2 × 10−16 are reported as <2.2 × 10−16, which is the lower limit of the default precision in R.
Variance partitioning
We performed variance partitioning to quantify the unique contributions of linguistic features to BOLD responses in production, comprehension and their intersection. Following methods from previous voxel-wise modelling studies16,40, we used three models: a production-only model, a comprehension-only model and their combination (that is, the separate linguistic model). We used set theory to calculate the unique and common variances explained as follows. Unique variance was calculated as follows
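$$R^2_{\text{unique, production}} = R^2_{\text{separate}} - R^2_{\text{comprehension only}}, \qquad R^2_{\text{unique, comprehension}} = R^2_{\text{separate}} - R^2_{\text{production only}},$$

where $R^2_{\text{separate}}$, $R^2_{\text{production only}}$ and $R^2_{\text{comprehension only}}$ denote the variance explained by the corresponding models.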
Shared variance was calculated as follows
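$$R^2_{\text{shared}} = R^2_{\text{production only}} + R^2_{\text{comprehension only}} - R^2_{\text{separate}}.$$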
While variance partitioning is typically reported using R2 values, we report the square roots of these values to align with our primary evaluation metric—correlation coefficients—thereby facilitating direct comparison and consistent interpretation across all reported results. Variance partitioning was applied to all layer-context combinations. In principle, variance partitioning assumes equal sample sizes across conditions. However, in our naturalistic dialogue experiment, individual fMRI frames (TRs) may correspond to production, comprehension, both or neither, resulting in inherent unequal sample sizes across conditions. To preserve the ecological validity of the dataset and avoid imposing artificial constraints, we applied variance partitioning uniformly across all TRs.
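A compact sketch of this partitioning and of the square-root reporting; the use of a signed square root for any negative differences is an assumption of the illustration, not a detail stated in the text.

```python
import numpy as np

def partition_variance(r2_separate, r2_prod_only, r2_comp_only):
    """Unique and shared variance per voxel, returned as signed square roots so
    that values are on the same scale as correlation coefficients."""
    unique_prod = r2_separate - r2_comp_only
    unique_comp = r2_separate - r2_prod_only
    shared = r2_prod_only + r2_comp_only - r2_separate
    signed_sqrt = lambda x: np.sign(x) * np.sqrt(np.abs(x))
    return signed_sqrt(unique_prod), signed_sqrt(unique_comp), signed_sqrt(shared)
```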
PCA
To identify low-dimensional representations of the separate linguistic model weights, we performed a PCA following previous studies21,42 separately for production and comprehension. Model weights, averaged across six delays for each feature (33,792/6 weights = 5,632 mean weights) and sessions, were scaled by prediction accuracy to reduce contributions from voxels with lower prediction accuracy. A PCA was performed separately for each modality on these scaled weights in all cortical voxels (2,816 weights × all cortical voxels), yielding 2,816 orthogonal PCs. We assessed the significance of the first 20 weight PCs by comparing their explained variance with that of the first 20 stimulus PCs (derived from GPT embeddings) using bootstrapping (1,000 iterations). Correspondence between weight and stimulus PCs was established using the Gale–Shapley stable marriage algorithm. The PCs were deemed significant if the stimulus PC never explained more variance than the corresponding weight PC across all bootstrap samples (P < 0.001).
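A simplified sketch of the weight PCA and its bootstrap test; it pairs weight and stimulus PCs by rank rather than by the Gale–Shapley matching used in the study, bootstraps only the stimulus embeddings, and all variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def significant_weight_pcs(W, acc, X_embed, n_pcs=20, n_boot=1000, seed=0):
    """Test which PCs of the accuracy-scaled weights explain more variance than
    the corresponding PCs of the stimulus embeddings.

    W: (2816, n_voxels) delay-averaged weights for one modality
    acc: (n_voxels,) prediction accuracy used to scale each voxel's weights
    X_embed: (n_TRs, 2816) GPT embeddings of the corresponding speech
    """
    rng = np.random.default_rng(seed)
    var_weight = PCA(n_components=n_pcs).fit((W * acc).T).explained_variance_ratio_
    exceed = np.zeros(n_pcs, dtype=int)
    for _ in range(n_boot):
        idx = rng.integers(0, X_embed.shape[0], X_embed.shape[0])
        var_stim = PCA(n_components=n_pcs).fit(X_embed[idx]).explained_variance_ratio_
        exceed += var_stim >= var_weight  # stimulus PC explains at least as much variance
    return exceed == 0  # a PC is deemed significant if it is never exceeded
```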
For our current analysis, we focused on a context length of 1 s for layer 0, which yielded the highest number of significant PCs across participants (Supplementary Fig. 19). Interpretation involved three steps: (1) Identification of correlated utterances: for each PC, the top 20 positively and negatively correlated utterances were identified for each participant (Fig. 5c,g). (2) Interpretation using ChatGPT: utterances and correlation coefficients were input into ChatGPT (GPT-4o) for consistent interpretations across PCs, modalities and participants. (3) Synthesis of common components: ChatGPT synthesized interpretations to identify common components across participants.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The MRI data and preprocessed stimulus features used in the current study are available via OpenNeuro at https://openneuro.org/datasets/ds004669. The Destrieux Atlas can be accessed via the FreeSurfer software package (https://surfer.nmr.mgh.harvard.edu/fswiki/CorticalParcellation). The Corpus of Everyday Japanese is available from the National Institute for Japanese Language and Linguistics (https://www2.ninjal.ac.jp/conversation/cejc-monitor.html). Because the free-form conversations include complex details that could reveal participants’ identities, the raw speech data—after removal of personal identifiers—will be provided only to researchers who (1) contact the corresponding author (S.N.) and (2) sign a data-sharing agreement that complies with the regulations of the relevant ethics committees and with applicable privacy laws.
Code availability
The custom code for this study is available via GitHub at https://github.com/yamashita-lang/dialogue. All model fitting and analyses were performed using custom software written in Python, utilizing libraries such as NumPy71, SciPy72, Scikit-learn73, Matplotlib74, Himalaya38 and Pycortex64. An exception was the LME analysis, which was performed using the lmerTest70 package in R. The GPT models were accessed via Hugging Face75.
References
Clark, H. H. & Brennan, S. E. Grounding in communication. In Perspectives on Socially Shared Cognition (eds Resnick, L. B. et al.) 127–149 (American Psychological Association, 1991); https://doi.org/10.1037/10096-006
Pickering, M. J. & Garrod, S. Toward a mechanistic psychology of dialogue. Behav. Brain Sci. 27, 169–190 (2004).
Silbert, L. J., Honey, C. J., Simony, E., Poeppel, D. & Hasson, U. Coupled neural systems underlie the production and comprehension of naturalistic narrative speech. Proc. Natl Acad. Sci. USA 111, E4687–E4696 (2014).
Zada, Z. et al. A shared model-based linguistic space for transmitting our thoughts from brain to brain in natural conversations. Neuron 112, 3211–3222.e5 (2024).
Hagoort, P., Hald, L., Bastiaansen, M. & Petersson, K. M. Integration of word meaning and world knowledge in language comprehension. Science 304, 438–441 (2004).
Frank, M. C. & Goodman, N. D. Predicting pragmatic reasoning in language games. Science 336, 998 (2012).
Yeshurun, Y. et al. Same story, different story: the neural representation of interpretive frameworks. Psychol. Sci. 28, 307–319 (2017).
Sacks, H., Schegloff, E. A. & Jefferson, G. in Studies in the Organization of Conversational Interaction (ed. Schenkein, J.) 7–55 (Academic Press, 1978); https://doi.org/10.1016/B978-0-12-623550-0.50008-2
Levinson, S. C. Turn-taking in human communication—origins and implications for language processing. Trends Cogn. Sci. 20, 6–14 (2016).
Clark, H. H. & Fox Tree, J. E. Using uh and um in spontaneous speaking. Cognition 84, 73–111 (2002).
Fox Tree, J. E. Listeners’ uses of um and uh in speech comprehension. Mem. Cognit. 29, 320–326 (2001).
Knudsen, B., Creemers, A. & Meyer, A. S. Forgotten little words: how backchannels and particles may facilitate speech planning in conversation? Front. Psychol. 11, 593671 (2020).
Bellana, B., Mahabal, A. & Honey, C. J. Narrative thinking lingers in spontaneous thought. Nat. Commun. 13, 4585 (2022).
Lerner, Y., Honey, C. J., Silbert, L. J. & Hasson, U. Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. J. Neurosci. 31, 2906–2915 (2011).
Regev, M., Honey, C. J., Simony, E. & Hasson, U. Selective and invariant neural responses to spoken and written narratives. J. Neurosci. 33, 15978–15988 (2013).
de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L. & Theunissen, F. E. The hierarchical cortical organization of human speech processing. J. Neurosci. 37, 6539–6557 (2017).
Deniz, F., Nunez-Elizalde, A. O., Huth, A. G. & Gallant, J. L. The representation of semantic information across human cerebral cortex during listening versus reading is invariant to stimulus modality. J. Neurosci. 39, 7722–7736 (2019).
Popham, S. F. et al. Visual and linguistic semantic representations are aligned at the border of human visual cortex. Nat. Neurosci. 24, 1628–1636 (2021).
Deniz, F., Tseng, C., Wehbe, L., Dupré la Tour, T. & Gallant, J. L. Semantic representations during language comprehension are affected by context. J. Neurosci. 43, 3144–3158 (2023).
Caucheteux, C., Gramfort, A. & King, J.-R. Evidence of a predictive coding hierarchy in the human brain listening to speech. Nat. Hum. Behav. 7, 430–441 (2023).
Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E. & Gallant, J. L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458 (2016).
Schrimpf, M. et al. The neural architecture of language: integrative modeling converges on predictive processing. Proc. Natl Acad. Sci. USA 118, e2105646118 (2021).
Goldstein, A. et al. Shared computational principles for language processing in humans and deep language models. Nat. Neurosci. 25, 369–380 (2022).
Chen, C., Dupré la Tour, T., Gallant, J. L., Klein, D. & Deniz, F. The cortical representation of language timescales is shared between reading and listening. Commun. Biol. 7, 284 (2024).
Menenti, L., Gierhan, S. M. E., Segaert, K. & Hagoort, P. Shared language: overlap and segregation of the neuronal infrastructure for speaking and listening revealed by functional MRI. Psychol. Sci. 22, 1173–1182 (2011).
Hu, J. et al. Precision fMRI reveals that the language-selective network supports both phrase-structure building and lexical access during language production. Cereb. Cortex 33, 4384–4404 (2022).
Giglio, L., Ostarek, M., Weber, K. & Hagoort, P. Commonalities and asymmetries in the neurobiological infrastructure for language production and comprehension. Cereb. Cortex 32, 1405–1418 (2022).
Patel, T., Morales, M., Pickering, M. J. & Hoffman, P. A common neural code for meaning in discourse production and comprehension. NeuroImage 279, 120295 (2023).
Goldstein, A. et al. A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations. Nat. Hum. Behav. 9, 1041–1055 (2025).
Hasson, U. & Frith, C. D. Mirroring and beyond: coupled dynamics as a generalized framework for modelling social interactions. Philos. Trans. R. Soc. B 371, 20150366 (2016).
Giglio, L., Ostarek, M., Sharoh, D. & Hagoort, P. Diverging neural dynamics for syntactic structure building in naturalistic speaking and listening. Proc. Natl Acad. Sci. USA 121, e2310766121 (2024).
Castellucci, G. A., Kovach, C. K., Howard, M. A., Greenlee, J. D. W. & Long, M. A. A speech planning network for interactive language use. Nature 602, 117–122 (2022).
Naselaris, T., Kay, K. N., Nishimoto, S. & Gallant, J. L. Encoding and decoding in fMRI. NeuroImage 56, 400–410 (2011).
Nishimoto, S. et al. Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol. 21, 1641–1646 (2011).
Caucheteux, C., Gramfort, A. & King, J.-R. Disentangling syntax and semantics in the brain with deep networks. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 1336–1348 (PMLR, 2021).
Oota, S., Gupta, M. & Toneva, M. Joint processing of linguistic properties in brains and language models. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 18001–18014 (Curran Associates, 2023).
Sawada, K. et al. Release of pre-trained models for the Japanese language. Preprint at https://arxiv.org/abs/2404.01657 (2024).
Nunez-Elizalde, A. O., Huth, A. G. & Gallant, J. L. Voxelwise encoding models with non-spherical multivariate normal priors. NeuroImage 197, 482–492 (2019).
Dupré la Tour, T., Eickenberg, M., Nunez-Elizalde, A. O. & Gallant, J. L. Feature-space selection with banded ridge regression. NeuroImage 264, 119728 (2022).
Lescroart, M. D., Stansbury, D. E. & Gallant, J. L. Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Front. Comput. Neurosci. 9, 135 (2015).
Aw, K. L., Montariol, S., AlKhamissi, B., Schrimpf, M. & Bosselut, A. Instruction-tuning aligns LLMs to the human brain. In First Conference on Language Modeling https://openreview.net/forum?id=nXNN0x4wbl (2024).
Huth, A. G., Nishimoto, S., Vu, A. T. & Gallant, J. L. A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron 76, 1210–1224 (2012).
Pickering, M. J. & Garrod, S. An integrated theory of language production and comprehension. Behav. Brain Sci. 36, 329–347 (2013).
Stephens, G. J., Silbert, L. J. & Hasson, U. Speaker–listener neural coupling underlies successful communication. Proc. Natl Acad. Sci. USA 107, 14425–14430 (2010).
Nastase, S. A., Goldstein, A. & Hasson, U. Keep it real: rethinking the primacy of experimental control in cognitive neuroscience. NeuroImage 222, 117254 (2020).
Hamilton, L. S. & Huth, A. G. The revolution will not be controlled: natural stimuli in speech neuroscience. Lang. Cogn. Neurosci. 35, 573–582 (2018).
Fedorenko, E., Ivanova, A. A. & Regev, T. I. The language network as a natural kind within the broader landscape of the human brain. Nat. Rev. Neurosci. 25, 289–312 (2024).
Yeshurun, Y., Nguyen, M. & Hasson, U. The default mode network: where the idiosyncratic self meets the shared social world. Nat. Rev. Neurosci. 22, 181–192 (2021).
Bašnáková, J., Weber, K., Petersson, K. M., van Berkum, J. & Hagoort, P. Beyond the language given: the neural correlates of inferring speaker meaning. Cereb. Cortex 24, 2572–2578 (2014).
Paunov, A. M. et al. Differential tracking of linguistic vs. mental state content in naturalistic stimuli by language and theory of mind (ToM) brain networks. Neurobiol. Lang. 3, 413–440 (2022).
Kuhlen, A. K., Bogler, C., Brennan, S. E. & Haynes, J.-D. Brains in dialogue: decoding neural preparation of speaking to a conversational partner. Soc. Cogn. Affect. Neurosci. 12, 871–880 (2017).
Olson, H. A., Chen, E. M., Lydic, K. O. & Saxe, R. R. Left-hemisphere cortical language regions respond equally to dialogue and monologue. Neurobiol. Lang. 4, 575–610 (2023).
Indefrey, P. The spatial and temporal signatures of word production components: a critical update. Front. Psychol. 2, 255 (2011).
Bögels, S., Magyari, L. & Levinson, S. C. Neural signatures of response planning occur midway through an incoming question in conversation. Sci. Rep. 5, 12881 (2015).
Bögels, S. & Levinson, S. C. The brain behind the response: insights into turn-taking in conversation from neuroimaging. Res. Lang. Soc. Interact. 50, 71–89 (2017).
Rastelli, C. et al. Neural dynamics of semantic control underlying generative storytelling. Commun. Biol. 8, 513 (2025).
Vanlangendonck, F., Willems, R. M., Menenti, L. & Hagoort, P. An early influence of common ground during speech planning. Lang. Cogn. Neurosci. 31, 741–750 (2016).
Kuhlen, A. K. & Abdel Rahman, R. Beyond speaking: neurocognitive perspectives on language production in social interaction. Philos. Trans. R. Soc. B 378, 20210483 (2023).
Alderson-Day, B. & Fernyhough, C. Inner speech: development, cognitive functions, phenomenology, and neurobiology. Psychol. Bull. 141, 931–965 (2015).
Oldfield, R. C. The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9, 97–113 (1971).
Moeller, S. et al. Multiband multislice GE-EPI at 7 tesla, with 16-fold acceleration using partial parallel imaging with application to high spatial and temporal whole-brain fMRI. Magn. Reson. Med. 63, 1144–1153 (2010).
Dale, A. M., Fischl, B. & Sereno, M. I. Cortical surface-based analysis: I. Segmentation and surface reconstruction. NeuroImage 9, 179–194 (1999).
Fischl, B., Sereno, M. I. & Dale, A. M. Cortical surface-based analysis: II: inflation, flattening, and a surface-based coordinate system. NeuroImage 9, 195–207 (1999).
Gao, J. S., Huth, A. G., Lescroart, M. D. & Gallant, J. L. Pycortex: an interactive surface visualizer for fMRI. Front. Neuroinform. 9, 23 (2015).
Destrieux, C., Fischl, B., Dale, A. & Halgren, E. Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage 53, 1–15 (2010).
Black, S. et al. GPT-NeoX-20B: an open-source autoregressive language model. In Proc. BigScience Episode #5—Workshop on Challenges & Perspectives in Creating Large Language Models (eds Fan, A. et al.) 95–136 (2022).
Power, J. D., Barnes, K. A., Snyder, A. Z., Schlaggar, B. L. & Petersen, S. E. Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. NeuroImage 59, 2142–2154 (2012).
Handwerker, D. A., Ollinger, J. M. & D’Esposito, M. Variation of BOLD hemodynamic responses across subjects and brain regions and their effects on statistical analyses. NeuroImage 21, 1639–1651 (2004).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
Kuznetsova, A., Brockhoff, P. B. & Christensen, R. H. B. lmerTest package: tests in linear mixed effects models. J. Stat. Softw. 82, 1–26 (2017).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 38–45 (Association for Computational Linguistics, 2020); https://doi.org/10.18653/v1/2020.emnlp-demos.6
Acknowledgements
This work was supported by JST ERATO (grant no. JPMJER1801 to S.N.), AIP Acceleration Research (grant no. JPMJCR24U2 to S.N.), MIRAI (grant no. JPMJMI19D1 to S.N.) and JSPS (grant nos. JP23H05493 and JP24H00619 to S.N.). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
S.N. conceptualized the experiment. R.K. and S.N. designed the experiment. R.K. performed the experiment. M.Y., R.K. and S.N. analysed and interpreted the results. M.Y. and S.N. wrote the paper. S.N. supervised the study.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review information
Nature Human Behaviour thanks Evelina Fedorenko, Shailee Jain and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Figs. 1–19, Tables 1–5 and results.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Yamashita, M., Kubo, R. & Nishimoto, S. Conversational content is organized across multiple timescales in the brain. Nat. Hum. Behav. (2025). https://doi.org/10.1038/s41562-025-02231-4