Main

Humans can share complex thoughts and emotions through conversation, from casual greetings to formal discussions. Successful communication assumes a shared understanding of the context and goals, which provide a framework for what is talked about (content) and how the conversation unfolds (process)1. Conversational content can be conveyed not only through linguistic alignment between interlocutors2,3,4 but also through extralinguistic abilities such as social cognition, world knowledge and situation modelling5,6,7. By contrast, the process of conversation requires temporal alignment, enabling the seamless alternation between language production and comprehension8,9. Backchannel responses (for example, ‘yeah’) and fillers (for example, ‘uh’) play critical roles in reinforcing shared understanding and facilitating the coordination of speech planning and listening10,11,12. Moreover, conveying complex ideas often relies on structured narratives, which form the foundation of effective knowledge sharing13. Thus, conversation emerges as a multidimensional (linguistic and extralinguistic) and multitimescale joint activity involving the intricate interplay of language production and comprehension.

The neural underpinnings of language processing have been extensively investigated using functional magnetic resonance imaging (fMRI), particularly in the context of naturalistic narrative comprehension. Comprehension can be viewed as the transformation of low-level sensory inputs (for example, speech sounds and written text) into high-level hierarchical linguistic structures14,15,16,17,18,19,20. These structures are created by integrating linguistic information—both semantic and syntactic—across multiple timescales, ranging from word-level semantics to sentence meaning, ultimately culminating in a coherent narrative14,15,16,17,18,19,20,21,22,23,24. While these studies have highlighted the brain’s capacity to encode hierarchical linguistic structures across various sensory modalities (for example, listening and reading) and timescales (words, sentences, discourse and narrative arcs), the functioning of these mechanisms during real-time conversations remains largely unexplored.

In this study, we address this gap by investigating the neural representation of conversational content, focusing on two key hypotheses. The first posits the existence of a unified linguistic representation that integrates information across production and comprehension. This hypothesis aligns with evidence of shared neural representations for semantic and syntactic information across modalities25,26,27,28,29. The second proposes distinct representations for production and comprehension, enabling differentiation between self-generated speech and that of an interlocutor30. This dual representation aligns with findings of modality-specific syntactic processes31 and mechanisms that facilitate concurrent speech planning during listening32, along with the predictive coding of an interlocutor’s speech20,23.

To test these hypotheses, we used fMRI during natural conversations and utilized voxel-wise encoding modelling33,34 to characterize neural representations of conversational content across multiple timescales, ranging from 1 to 32 s. Our analyses addressed two key questions: (1) Are linguistic representations shared between production and comprehension? (2) Are certain linguistic representations uniquely associated with production, comprehension or both? Throughout the study, the term linguistic encompasses both semantic and syntactic dimensions, reflecting their tight interdependence35,36.

Results

Shared representations between production and comprehension

We collected fMRI data from eight native Japanese speakers during spontaneous conversations with an experimenter (Fig. 1a). Each participant talked about 27 topics across 27 runs (Supplementary Table 5), with each run lasting 7 min and 10 s, resulting in approximately 3 h of data per participant over three to four sessions. Despite efforts to balance the amount of speech production and comprehension, individual variation in speech samples was observed (Supplementary Fig. 1). Rigorous preprocessing minimized potential confounds from motion artefacts and peripheral articulatory and auditory processes, effectively isolating blood-oxygen-level-dependent (BOLD) responses associated with higher-level linguistic representations (Supplementary Figs. 2–5 and Supplementary Results).

Fig. 1: Dialogue experiment and voxel-wise encoding models for predicting brain activity from contextual linguistic embeddings.

a, Participants (N = 8) engaged in natural conversations on topics provided by the experimenter while undergoing fMRI. Speech was transcribed and temporally segmented for each fMRI TR (1,000 ms intervals). b, Utterances were concatenated with prior context (1–32 s) and processed by an instruction-tuned GPT model. The embeddings were extracted from layers in multiples of three for subsequent analysis. c, Voxel-wise encoding modelling in the separate linguistic model. The GPT embeddings were extracted separately for each modality, concatenated across modalities and used to train a FIR regression model with banded ridge regularization for each voxel. The prediction performance was evaluated by correlating predicted and observed BOLD responses on held-out test data. d, The cross-modality prediction. The model weights were exchanged between modalities, and the prediction performance was assessed. e, The unified linguistic model. The GPT embeddings were extracted uniformly from concatenated content spanning both modalities.

We used voxel-wise encoding modelling with GPT embeddings extracted from conversational content as linguistic features (Fig. 1b). The transcriptions were transformed into contextual embeddings using an instruction-tuned GPT model37 that was fine-tuned for interactive language tasks. This GPT model comprises an input embedding layer and 36 transformer layers, each containing 2,816 hidden units. We extracted embeddings from 13 hierarchical layers (the input layer and every third transformer layer) across six context lengths (1, 2, 4, 8, 16 and 32 s). We averaged the embeddings across all tokens within each segment (or fMRI volume, repetition time (TR) of 1,000 ms), resulting in 78 feature combinations per modality (13 layers × 6 context lengths). These features were integrated into a joint model, referred to as the separate linguistic model, with 5,632 features (2 modalities × 2,816 features). A finite impulse response (FIR) model was used to predict BOLD responses with delays ranging from 2 to 7 s (5,632 features × 6 delays = 33,792 features). This joint model was fit to BOLD responses using banded ridge regression38,39 for each voxel. We used leave-one-session-out cross-validation, training the model on N − 1 sessions and testing it on the held-out session. The prediction accuracy was evaluated using Pearson’s correlation coefficients between observed and predicted BOLD responses. To account for temporal autocorrelation, the test data were divided into 20-s blocks and permuted 1,000 times to estimate a null distribution of correlations, from which P values were obtained. The reported prediction accuracy is the average across cross-validation folds, and combined P values were calculated using Fisher’s method.
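The core of this pipeline can be sketched as follows. This is a minimal illustration rather than the analysis code: plain ridge regression (scikit-learn) stands in for the banded ridge solver used here, and the array names (prod_train, comp_train, Y_train and so on) are placeholders.

```python
# Minimal sketch of the voxel-wise encoding pipeline (illustrative, not the analysis code).
# Plain RidgeCV stands in for banded ridge regression; array names are placeholders.
import numpy as np
from sklearn.linear_model import RidgeCV

def make_fir_design(features, delays=(2, 3, 4, 5, 6, 7)):
    """Stack copies of the feature matrix shifted by 2-7 samples (TR = 1 s)."""
    n_tr, n_feat = features.shape
    delayed = np.zeros((n_tr, n_feat * len(delays)))
    for i, d in enumerate(delays):
        delayed[d:, i * n_feat:(i + 1) * n_feat] = features[:n_tr - d]
    return delayed

# Production and comprehension GPT embeddings (n_TR x 2,816 each) are delayed
# separately and concatenated, giving the 33,792-column joint design matrix.
X_train = np.hstack([make_fir_design(prod_train), make_fir_design(comp_train)])
X_test = np.hstack([make_fir_design(prod_test), make_fir_design(comp_test)])

model = RidgeCV(alphas=np.logspace(-2, 7, 10))
model.fit(X_train, Y_train)                 # Y_train: BOLD responses, n_TR x n_voxels
Y_pred = model.predict(X_test)

# Prediction accuracy: Pearson correlation per voxel on the held-out session.
Yt = (Y_test - Y_test.mean(0)) / Y_test.std(0)
Yp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
accuracy = (Yt * Yp).mean(axis=0)
```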

The separate linguistic model achieved good prediction accuracy across extensive cortical regions. For example, in participant P7, embeddings derived from an 8-s context at layer 18 exhibited high prediction accuracy in the bilateral prefrontal, temporal and parietal cortices (Fig. 2a and see Supplementary Figs. 6 and 9 for results from individual layers and participants). To investigate whether the average prediction accuracy across the cortex was influenced by layer position and context length, we conducted a linear mixed-effects (LME) model analysis. The participants were specified as random effects, permitting variations in the effects of context length and its squared term across participants (Methods). We found an inverted U-shaped relationship across timescales (context length: t(7) = 4.41, P = 0.0031, β = 0.74, 95% confidence interval (CI) 0.39 to 1.08; context length squared: t(7) = −4.54, P = 0.0027, β = −0.43, 95% CI −0.63 to −0.24, summarized in Supplementary Table 1) and across layers (layer position: t(597) = 11.36, P < 0.001, β = 0.15, 95% CI 0.12 to 0.17; layer position squared: t(597) = −17.95, P < 0.001, β = −0.26, 95% CI −0.29 to −0.23). In addition, we found a significant interaction effect between context length and layer position (t(597) = −6.95, P < 0.001, β = −0.09, 95% CI −0.11 to −0.06).

Fig. 2: Shared linguistic representations between production and comprehension.

a, The prediction accuracy of the separate linguistic model is visualized on the flattened cortical surface of one participant under a specific condition (P7, context length of 32 s, layer 18) (see Supplementary Figs. 6 and 9 for individual layers and participants). The voxels with significant prediction accuracy (one-sided permutation test, P < 0.05, FDR corrected) are displayed. PFC, prefrontal cortex; MPC, medial parietal cortex; AC, auditory cortex; VC, visual cortex. b, The mean prediction performance across participants, averaged over voxels and layers. The cross-modality prediction was significantly less accurate than same-modality predictions (actual model: t(7) = 7.05, two-sided parametric test, P = 2.0 × 10−4, β = 0.057, 95% CI 0.040 to 0.074). c, Cross-modal voxel weight correlations at two context lengths (1 and 32 s), shown for one participant under a specific layer condition (P7, layer 18) (see Supplementary Fig. 14 for individual participant data). Only the voxels with prediction accuracy above 0.05 in both actual and cross-modality conditions are shown. d, The mean weight correlations across participants, highlighting linguistic (purple) and cross-modal (yellow) voxels. The cross-modal voxels exhibited significantly higher weight correlations (cross-modal voxels: t(7) = 11.35, two-sided parametric test, P = 9.2 × 10−6, β = 0.20, 95% CI 0.17 to 0.24). The shaded regions in b and d represent the standard deviation across participants. LH, left hemisphere; RH, right hemisphere.

Next, we addressed the question of how neural linguistic representations are shared between production and comprehension. To evaluate this, we assessed cross-modality prediction accuracy by interchanging the model weights between production and comprehension (Fig. 1d). This analysis was restricted to ‘linguistic voxels,’ defined as those exhibiting significant prediction accuracy in the separate linguistic model. We found a notable reduction in cross-modality prediction accuracy (Fig. 2b and see Supplementary Figs. 7 and 9 for individual layers and participants), consistent with findings from an LME model (actual model: t(7) = 7.05, P < 0.001, β = 0.057, 95% CI 0.040 to 0.074). Significantly predicted voxels were scattered across prefrontal, temporal, parietal and occipital cortices (Fig. 2a). Notably, cross-modality prediction accuracy increased with longer context lengths, as revealed by a significant fixed effect of context length (t(7) = 4.01, P = 0.0052, β = 0.40, 95% CI 0.15 to 0.66). These findings suggest that while linguistic representations are partially shared between production and comprehension, their topographic organization is modulated by timescale.
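Continuing the illustrative sketch above, the cross-modality prediction amounts to exchanging the fitted production and comprehension weight blocks before predicting the held-out responses (a sketch under the same placeholder names; the intercept is ignored for brevity).

```python
# Cross-modality prediction by exchanging modality weight blocks (illustrative).
n_block = 2816 * 6                 # 2,816 embedding features x 6 FIR delays per modality
W = model.coef_.T                  # (2 * n_block) x n_voxels; intercept omitted for brevity

W_prod, W_comp = W[:n_block], W[n_block:]
W_swapped = np.vstack([W_comp, W_prod])       # comprehension weights now read production features

Y_pred_cross = X_test @ W_swapped             # columns of X_test ordered [production, comprehension]
```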

Although these results demonstrated generalizable linguistic representations across modalities, another critical question arises: is a unified linguistic representation sufficient for accurate predictions? To address this, we developed a unified linguistic model that extracted GPT embeddings from combined transcripts (Fig. 1e). Compared with the separate linguistic model, the unified linguistic model showed a slight reduction in prediction accuracy (Fig. 2b and see Supplementary Figs. 8 and 9 for individual layers and participants), as indicated by a significant fixed effect of model type (separate linguistic model: t(1227) = 4.01, P < 0.001, β = 3.4 × 10−3, 95% CI 2.7 × 10−3 to 4.1 × 10−3). Further LME analysis of the unified linguistic model revealed significant effects of layer position (layer position: t(597) = 7.45, P < 0.001, β = 0.08, 95% CI 0.06 to 0.11; layer position squared: t(597) = −22.79, P < 0.001, β = −0.29, 95% CI −0.31 to −0.26) and its interaction with context length (t(597) = −9.45, P < 0.001, β = −0.11, 95% CI −0.13 to −0.08), whereas no significant fixed effect of context length was detected. These findings suggest that the lower prediction performance of the unified linguistic model, relative to the separate linguistic model, may be attributable to its limited ability to leverage longer contextual information to enhance predictions.

Next, we quantified the similarities in the linguistic representations across modalities. Given the considerably decreased prediction accuracy from same-modality to cross-modality predictions, we hypothesized that these voxels might exhibit similar yet unique linguistic tuning for each modality (that is, a weak positive correlation). Here, we focused on ‘cross-modal voxels’ that demonstrated robust prediction in both same- and cross-modality conditions (r > 0.05). For each voxel, the weight correlation was calculated as Pearson’s correlation coefficient between the production and comprehension weights of the separate linguistic model. Cross-modal voxels exhibited moderately positive correlations across layers and timescales (Fig. 2d and see Supplementary Fig. 11 for individual layers and participants) and showed higher weight correlations compared with linguistic voxels (cross-modal voxels: t(7) = 11.35, P < 0.001, β = 0.20, 95% CI 0.17 to 0.24). Notably, positively correlated voxels clustered within the prefrontal, temporal and parietal cortices at shorter timescales (1–4 s) (Fig. 2c and see Supplementary Fig. 14 for individual participant data). By contrast, at longer timescales (16–32 s), positively correlated voxels were more diffusely distributed and idiosyncratic among participants. These findings suggest that while cross-modal voxels may share some aspects of linguistic representation, they also exhibit unique tuning across modalities.
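For reference, the per-voxel weight correlation can be computed from the W_prod and W_comp blocks of the sketch above. Averaging the weights over the six FIR delays first is a simplification we assume for illustration, not a statement of the exact procedure.

```python
# Per-voxel Pearson correlation between production and comprehension tuning (sketch).
def delay_average(W_block, n_emb=2816, n_delays=6):
    """Average a (n_emb * n_delays) x n_voxels weight block over its FIR delays."""
    return W_block.reshape(n_delays, n_emb, -1).mean(axis=0)

Wp = delay_average(W_prod)                    # 2,816 x n_voxels
Wc = delay_average(W_comp)

Wp_z = (Wp - Wp.mean(0)) / Wp.std(0)
Wc_z = (Wc - Wc.mean(0)) / Wc.std(0)
weight_corr = (Wp_z * Wc_z).mean(axis=0)      # one correlation value per voxel
```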

Modality-specific timescale selectivity

We next explored modality-specific linguistic representations by fitting the production-only and comprehension-only linguistic models. These models utilized modality-specific contextual embeddings to quantify the variance in BOLD responses that could be uniquely attributed to each modality. Variance partitioning16,40 was used to assign variance to either production or comprehension using the following equations

$${\rm{Production}}\setminus {\rm{comprehension}}=({\rm{production}}\cup {\rm{comprehension}})-{\rm{comprehension}}$$
$${\rm{Comprehension}}\setminus {\rm{production}}=({\rm{production}}\cup {\rm{comprehension}})-{\rm{production}}.$$

We found that production explained more variance at shorter timescales (1–4 s) (Fig. 3a and see Supplementary Fig. 10 for individual layers and participants). The LME analysis revealed significant effects of context length (t(7) = −3.87, P = 0.0061, β = −0.40, 95% CI −0.62 to −0.18), layer position (layer position: t(597) = 3.72, P < 0.001, β = 0.049, 95% CI 0.023 to 0.075; layer position squared: t(597) = −8.82, P < 0.001, β = −0.13, 95% CI −0.16 to −0.10) and their interaction (t(597) = −5.27, P < 0.001, β = −0.070, 95% CI −0.096 to −0.044). By contrast, comprehension explained more variance at longer timescales (16–32 s). The LME analysis indicated an inverted U-shaped relationship for context length (context length: t(7) = 4.55, P = 0.0026, β = 1.02, 95% CI 0.55 to 1.49; context length squared: t(7) = −2.42, P = 0.046, β = −0.41, 95% CI −0.75 to −0.06) and layer position (layer position: t(598) = 8.33, P < 0.001, β = 0.11, 95% CI 0.08 to 0.13; layer position squared: t(598) = −11.57, P < 0.001, β = −0.16, 95% CI −0.19 to −0.13). Across participants, the context length that maximized prediction accuracy consistently varied between modalities, with production peaking at shorter timescales and comprehension at longer timescales (Fig. 3b). These findings suggest distinct timescale selectivity for production and comprehension.

Fig. 3: Timescale selectivity for short contexts in production and long contexts in comprehension.

a, The mean variance (across voxels and layers) uniquely explained by production or comprehension, averaged across participants. b, The mean context lengths (across layers) maximizing unique variance explained for each modality and participant. c, The mean weight correlation (across voxels and layers) between the unified and separate linguistic models, calculated for each modality and averaged across participants. d, The changes in weight correlations between the unified and separate linguistic models for two context lengths (1 and 32 s) on the flattened cortical surface of one participant under one condition (P7, layer 18) (see Supplementary Fig. 15 for individual participant data). The voxels with good unified model prediction accuracy (r ≥ 0.05) are shown. PFC, prefrontal cortex; MPC, medial parietal cortex; AC, auditory cortex; VC, visual cortex. The shaded areas in a–c represent the standard deviation across layers.

To further investigate modality-specific timescale selectivity, we compared the weights of the unified linguistic model with those of the separate linguistic model by calculating voxel-wise weight correlations. For production, the LME analysis revealed a U-shaped relationship across timescales (context length: t(7) = −8.59, P < 0.001, β = −1.01, 95% CI −1.26 to −0.77; context length squared: t(7) = 5.59, P < 0.001, β = 0.32, 95% CI 0.20 to 0.44) (Fig. 3c and see Supplementary Fig. 12 for individual layers and participants). A similar U-shaped relationship was observed for comprehension (context length: t(7) = 0.70, P = 0.50, β = 0.12, 95% CI −0.23 to 0.47; context length squared: t(7) = 3.30, P = 0.013, β = 0.23, 95% CI 0.08 to 0.37). The weights of the unified linguistic model were more closely aligned with production at shorter timescales, whereas they resembled comprehension at longer timescales (Fig. 3d and Supplementary Fig. 12 for individual layers and participants). These results underscore modality-specific timescale selectivity, with production favouring shorter contexts and comprehension benefiting from longer contexts.

To mitigate potential biases in variance partitioning results due to disparities in sample sizes, we examined the correlation between production-to-comprehension sample size ratios and the corresponding variance explained. A significant correlation was observed for early layers at a 1-s context length, with Spearman’s rank correlation rho of 1.00 for layer 0 and 0.93 for layer 3 (P < 0.05, false discovery rate (FDR) corrected). The participants who produced more speech demonstrated greater variance explained by production under these conditions. Importantly, the sample proportions were balanced overall, with four participants producing more speech (P3, P4, P6 and P7) and the remaining four comprehending more (P1, P2, P5 and P8). These findings confirm that variance partitioning results were not systematically biased towards either modality across participants.

Dual linguistic representations in bimodal voxels

After analysing the variance uniquely explained by production and comprehension, we investigated the shared variance explained by both modalities. The shared variance was calculated as follows

$${\rm{Production}}\cap {\rm{comprehension}}={\rm{production}}+{\rm{comprehension}}-({\rm{production}}\cup {\rm{comprehension}}).$$

We found that shared variance increased progressively with longer context lengths, peaking at an average of 8 s (Fig. 4a and see Supplementary Fig. 10 for individual layers and participants). Notably, for all participants, the context length that maximized shared variance exceeded 8 s (Fig. 4b). The LME analysis revealed an inverted U-shaped relationship across timescales (context length: t(7) = 4.55, P < 0.001, β = 1.02, 95% CI 0.68 to 1.36; context length squared: t(7) = −4.59, P = 0.0025, β = −0.49, 95% CI −0.71 to −0.27). A significant quadratic effect of layer position was also observed (layer position squared: t(597) = −12.30, P < 0.001, β = −0.20, 95% CI −0.23 to −0.17).

Fig. 4: Dual linguistic representations in bimodal linguistic voxels.

a, The mean variance (across voxels and layers) uniquely explained by the intersection of production and comprehension, averaged across participants. b, The mean context lengths (across layers) maximizing unique variance explained by the intersection for each participant. c, The cortical surface showing the best variance partition at two context lengths (1 and 32 s) for one participant in one condition (P7, layer 18) (see Supplementary Fig. 16 for individual layers and participants). The voxels with good prediction accuracy (r > 0.05) in the separate linguistic model are shown. PFC, prefrontal cortex; MPC, medial parietal cortex; AC, auditory cortex; VC, visual cortex. d, The mean weight correlation (across voxels and layers) for bimodal voxels compared with Production-only and Comprehension-only voxels, averaged across participants. The bimodal voxels showed significantly lower weight correlation than Production-only (production: t(8.4) = 11.99, two-sided parametric test, P = 1.4 × 10−6, β = 0.038, 95% CI 0.032 to 0.045) and Comprehension-only voxels (comprehension: t(1227) = −21.00, two-sided parametric test, P < 2.2 × 10−16, β = 0.021, 95% CI 0.019 to 0.023). e, The weight correlations for bimodal linguistic voxels at two context lengths (1 and 32 s) shown for one participant in one condition (P7, layer 18) (see Supplementary Fig. 16 for all participants). The voxels with good prediction accuracy (r > 0.05) in the separate linguistic model are shown. The shaded areas in a, b and d represent the standard deviation across layers.

To map the topographic organization of selectivity to a single modality or shared between both, we created cortical maps depicting the patterns that explained the largest variance for each voxel. These maps revealed that voxels with the largest shared variance, hereafter referred to as ‘bimodal voxels,’ were distributed across various cortical regions (Fig. 4c and see Supplementary Fig. 16 for individual layers and participants). Notably, contextual information spanning 8 s or longer appeared to drive substantial bimodal responses, suggesting the integration of linguistic information across modalities.

We then examined whether bimodal voxels exhibited distinct linguistic tuning for production and comprehension. To achieve this, we calculated Pearson’s correlations of the separate linguistic model weights for production and comprehension specifically for bimodal voxels (Fig. 4c). The correlations were slightly negative and close to zero (Fig. 4d and see Supplementary Fig. 13 for individual layers and participants), suggesting that bimodal voxels are independently or dissimilarly tuned for the two modalities. In comparison with unimodal voxels—those with the largest unique variance for either production or comprehension—the bimodal voxels exhibited more negative correlations. The LME analysis confirmed this difference (production: t(8.4) = 11.99, P < 0.001, β = 0.038, 95% CI 0.032 to 0.045; comprehension: t(1227) = −21.00, P < 0.001, β = 0.021, 95% CI 0.019 to 0.023). These findings indicate that the bimodal voxels are independently tuned for each modality, reflecting the distinct linguistic demands of production and comprehension.

To ensure that these findings were not influenced by the instruction-tuned GPT model41, we replicated the core analyses using a base GPT model before instruction tuning. The results were consistent, confirming that instruction tuning did not affect the observed effects (Supplementary Fig. 18).

Revealing semantic tuning to interactive language

To elucidate the linguistic organization underlying conversational content, we conducted principal component analysis (PCA) on the 2,816-dimensional separate linguistic model weights for each modality and participant. Building on previous research that mapped cortical semantic representations during natural speech comprehension21,42, we adapted this framework for natural conversations. Due to variability in conversational content, a PCA was conducted separately for each participant and independently for production and comprehension, yielding modality-specific principal components (PCs). To assess the statistical robustness of the identified PCs, we performed a comparable PCA on GPT embeddings of the corresponding speech stimuli (‘stimulus PCA’). We quantified the variance explained by each PC through 1,000 bootstrap resampling iterations to establish statistical significance.

Our analysis revealed the highest number of significant PCs in the embedding layer (layer 0) at a context length of 1 s for both production and comprehension (Fig. 5b,f). For production, four PCs were identified in four participants, three PCs in three participants and five PCs in one participant (P < 0.001, bootstrap test) (Supplementary Fig. 19). For comprehension, five PCs were identified in four participants, while four PCs were identified in the remaining four participants (P < 0.001, bootstrap test).

Fig. 5: PCs highlight conversation-specific linguistic representations.

Production PC results are shown in a–d; comprehension PC results are shown in e–h. a,e, The variance explained in the separate linguistic model weights by each of the top ten PCs identified for a participant (P7) at one context-layer condition. The grey lines represent the variance explained by the PCs of the conversation content. The stimulus PCs were aligned with weight PCs using the Gale–Shapley stable-match algorithm. The data are presented as the original PCA results, with the error bars representing the standard deviation across 1,000 bootstrap samples. The standard deviations for the model weight PCs are very small. b,f, The mean number of significant PCs across participants (see Supplementary Fig. 17 for individual participant data). c,g, The utterances most strongly correlated with significant PCs for a participant (P7). The numbers in parentheses indicate the correlations between PC coefficients and GPT embeddings. d,h, The common PCs across participants identified for a 1-s context and layer 0 using ChatGPT.

To interpret these significant PCs, we analysed the conversation content most strongly correlated with each PC. The majority of highly correlated content comprised words and phrases characteristic of interactive contexts, such as backchannel responses and conversational fillers (for example, participant P7) (Fig. 5c,g). To identify PCs consistently observed across participants, we utilized ChatGPT to analyse the top 20 most strongly correlated phrases for each PC, summarizing the common patterns. For production, four components emerged: (1) clear and structured speech versus ambiguous and spontaneous speech, (2) active discourse development versus conversational maintenance with backchannels, (3) emotional and empathetic speech versus factual and logical speech and (4) cautious speech with high cognitive load versus fluent and cooperative speech. For comprehension, four components were identified: (1) immediate conversational flow versus deliberate speech planning, (2) emotional and empathetic versus logical and information-driven, (3) passive listening versus active speech leadership and (4) concrete, experience-based versus abstract, conceptual speech. These findings demonstrate that both production and comprehension are tuned to the semantic demands of interactive language, revealing shared lexical–semantic components across participants. This highlights a consistent semantic organization that supports real-time social communication.

Discussion

This study explored the neural representations of conversational content across production and comprehension modalities and multiple timescales. We identified shared linguistic representations exhibiting timescale-dependent topographic organization (Fig. 2). For shorter contexts (1–4 s), corresponding to words and single sentences, shared representations were localized in higher-order association cortices, including the prefrontal, temporal and parietal regions. By contrast, for longer contexts (16–32 s), spanning multiple conversational turns, these shared representations were more distributed and idiosyncratic among participants. Furthermore, modality-specific timescale selectivity revealed enhanced encoding for shorter contexts during production and for longer contexts during comprehension (Fig. 3), suggesting distinct temporal integration processes. We also identified dual linguistic representations in bimodal voxels, encoding modality-specific information for both production and comprehension (Fig. 4). Despite these timescale-specific patterns, our analysis of low-dimensional linguistic representations revealed lexical–semantic components predominantly associated with shorter timescales (Fig. 5).

Theoretical models for the neural mechanism of conversation have proposed a common neural basis for language production and comprehension2,43. Empirical studies have adopted two primary approaches to examine this commonality: (1) the between-subjects approach, which examines the transmission of messages from speaker to listener3,4,44 and (2) the within-subject approach, which investigates shared neural mechanisms within individuals25,26,27,28,29. Our study contributes to the within-subject approach by revealing both shared and distinct neural representations and their modulation by contextual timescales within individual participants.

Previous neuroimaging studies, using the within-subject approach, have manipulated the semantic and syntactic dimensions of stimuli to reveal shared representations25,26,27,28. Recent research utilizing spontaneously generated sentences and conversations has enhanced ecological validity45,46, uncovering shared semantic and syntactic representations during natural language use29,31. For instance, recent research29 used electrocorticography during natural conversations and modelled transient neural activity before and after word onset, identifying overlapping regions for word production and comprehension. However, two critical questions remain unresolved: (1) whether linguistic representations generalize across modalities and (2) how these shared representations vary across timescales. Our study addresses these gaps, demonstrating the generalizability of shared representations and their modulation by the amount of contextual information.

The topographic organization of shared neural representations varied across multiple timescales. At shorter timescales (1–4 s), corresponding to the duration of words and single sentences, shared representations were localized in higher-order brain regions, including the bilateral prefrontal, temporal and parietal cortices. This finding aligns with previous studies that mapped neural representations of intermediate linguistic structures, such as words and single sentences, onto these regions during naturalistic narrative comprehension14,15,19,24. These regions have consistently been associated with sentence-level processing in traditional neuroimaging studies of isolated sentences presented at shorter timescales (less than 6 s)25,27,28. Furthermore, these brain regions partially overlap with those involved in linguistic knowledge and processes that are shared across both production and comprehension26,47. Therefore, the shared representations observed at shorter timescales suggest the presence of a common neural code for sentence-level linguistic information (‘sentence meaning’).

By contrast, at longer timescales (16–32 s), spanning multiple conversational turns, shared representations were distributed across broader cortical regions, exhibiting notable interindividual variability. Some participants (P1, P2, P4, P6 and P7) demonstrated shared representations extending into brain regions associated with the default mode network and the theory of mind (ToM) network. The default mode network has been implicated in representing higher-order discourse and narrative frameworks by integrating extrinsic information (for example, utterances) with intrinsic information (that is, prior context and memory)7,14,48. Similarly, the ToM network supports reasoning about others’ mental states, a critical function in both language production and comprehension during conversations47,49,50. This network is particularly engaged in inferring the mental states of conversational partners, thereby facilitating pragmatic inferences about that particular individual49,51,52. These findings suggest that shared representations at longer timescales support the integration of incoming conversational content with prior conversational context, as well as with broader social knowledge and beliefs. Such integration may support the formation of a psychological model of the situation, enabling inferences about the interlocutor’s intended meaning (‘speaker meaning’). Individual differences in the spatial distribution of these shared representations may reflect variability in discourse-level integration strategies.

The contrasting timescale selectivity between production and comprehension may reflect their distinct functional demands in processing linguistic input and output. Our findings demonstrate that language comprehension exhibits enhanced encoding for longer timescales, consistent with the requirements of real-world language comprehension. Effective comprehension necessitates the integration of linguistic input with world knowledge, beliefs and memory to extract meaning from extended contexts5,7,48. This is consistent with evidence indicating that the brain prioritizes understanding broader discourse-level and overarching meanings over shorter units, such as individual words or sentences14,19. By contrast, we found that language production is characterized by enhanced encoding for shorter timescales. Production involves extensive preparatory processes, including ideation, lexical selection, syntactic structuring and speech planning, all of which occur before speech output9,32,51,53,54,55,56. Furthermore, production must dynamically adapt to the interlocutor’s immediate reactions, ensuring fluid and responsive communication57,58. These demands suggest that production prioritizes responsiveness and flexibility over reliance on extended contextual information.

Dual representations in the bimodal voxels exhibited selectivity for longer timescales (exceeding 8 s), corresponding to the integration of multiple sentences into coherent discourse. These representations probably facilitate the ability to maintain and distinguish perspectives, a critical function during conversation30. Conversations inherently require participants to navigate distinct perspectives, which must be differentiated at the neural level. Such interpersonal cognitive processes, integral to managing multiple perspectives, are probably not limited to external communication but may also underpin internal speech processes59.

Despite this modality-specific timescale selectivity, our PCA identified similar lexical–semantic components across modalities within the embedding layer at short timescales (1–4 s). Our results potentially extend the seminal work of Huth and colleagues21, which comprehensively mapped fine-grained semantic representations during natural speech listening using word embeddings. Specifically, that study identified the first PC differentiating between ‘humans and social interaction’ and ‘perceptual descriptions, quantitative descriptions and setting’, thereby separating social content from physical content. Our conversational data offered a unique opportunity to examine the semantic space surrounding social words in greater depth. Notably, our identified PCs reflected social interaction nuances, such as backchannels, confirmations and fillers. These elements require minimal cognitive effort yet are vital for maintaining conversational flow1,10,11,12. By contrast, PCs linked to ‘factual and logical speech’ or ‘logical and information-driven’, such as referring to locations or objects, were identified as the opposite pole of the social components (‘emotional and empathetic’). This suggests that interactive language enhances the neural representation of social content, highlighting the interplay between semantic representations and social cognition.

Several limitations of the present study should be noted. We did not conduct functional localizer tasks to delineate specific functional networks, such as the language network and ToM network. Thus, our analysis could not precisely attribute voxel clusters to specific functional networks.

Our findings shed light on temporally hierarchical neural linguistic representations underlying both sentence meaning and speaker meaning during real-world conversations. Modality-aligned representations were primarily localized to brain regions involved in processing word- and sentence-level linguistic information over shorter timescales, while modality-specific representations exhibited distinct timescale selectivity: shorter contexts for production and longer contexts for comprehension. These findings emphasize the importance of investigating the neurobiological basis of language within socially interactive contexts to comprehensively understand human language use.

Methods

Participants

Eight healthy, right-handed native Japanese speakers (P1–P8) participated in the fMRI experiment. The participants comprised five males (P1: age 22, P2: age 22, P3: age 23, P5: age 20 and P8: age 20) and three females (P4: age 22, P6: age 20 and P7: age 20). All participants were confirmed as right-handed through the Edinburgh Handedness Inventory60 (with a laterality quotient score of 75–100), and they had normal hearing as well as normal or corrected-to-normal vision. The experimental protocol was approved by the Ethics and Safety Committee of the National Institute of Information and Communications Technology, Osaka, Japan. Written informed consent was obtained from all participants before the experiment.

Natural dialogue experiment

The experiment consisted of 27 conversation topics, including self-introduction and favourite classes (Supplementary Table 5). These topics were selected to cover a wide range of semantic domains relevant to daily life, such as knowledge, memory, imagination and temporal and spatial cognition, referencing the Corpus of Everyday Japanese. Each fMRI run lasted 7 min and 10 s and focused on a specific topic. The participants engaged in unscripted, natural dialogues, freely expressing their thoughts and emotions while responding in real time to their interlocutor’s input. Speech was delivered and recorded via fMRI-compatible insert earphones and a noise-cancelling microphone, respectively. Both the participants’ and interlocutor’s speech were recorded separately for subsequent analysis. Each participant completed 27 runs across four sessions, except for P3, who completed them across three sessions. Because one session yielded only a single valid run, three sessions were analysed for P2 and P5; the analysis included two to ten runs per session (Supplementary Table 4). On average, the participants produced speech during 217.1 ± 26.0 (mean ± standard deviation) fMRI volumes per run (range 170.6–262.1), while comprehending speech during 214.4 ± 11.8 volumes (range 199.2–234.0) (Supplementary Fig. 1).

MRI data acquisition

Magnetic resonance imaging (MRI) data were collected on a 3T MRI scanner at CiNet. Participants P1–P5 were scanned on a Siemens MAGNETOM Prisma, while P6–P8 were scanned on a Siemens MAGNETOM Prisma Fit, both equipped with 64-channel head coils. Functional images were acquired using a T2*-weighted gradient echo multiband echo-planar imaging sequence61 in interleaved order, covering the entire brain. The imaging parameters were as follows: TR of 1.0 s, echo time (TE) of 30 ms, flip angle of 60°, matrix size of 96 × 96, field of view of 192 mm × 192 mm, voxel size of 2 mm × 2 mm × 2 mm, slice gap of 0 mm, 72 axial slices and multiband factor of 6. High-resolution anatomical images were obtained using a T1-weighted MPRAGE sequence with the following parameters: TR of 2.53 s, TE of 3.26 ms, flip angle of 9°, matrix size of 256 × 256, field of view of 256 mm × 256 mm and voxel size of 1 mm × 1 mm × 1 mm.

fMRI data preprocessing

The fMRI data were preprocessed using the Statistical Parametric Mapping toolbox (SPM8). Motion correction was applied to each run, aligning all volumes to the first echo-planar imaging frame for each participant. To remove low-frequency drift, we used a median filter with a 120-s window. The response for each voxel was then normalized by subtracting the mean response and scaling to unit variance. The cortical surfaces were identified using FreeSurfer62,63, which registered the anatomical data with the functional data. For each participant, only voxels identified within the cerebral cortex were included in the analysis, ranging from 64,072 to 72,018 voxels per participant. The flatmaps were generated by projecting voxel values onto cortical surfaces using Pycortex64. Cortical anatomical parcellation was performed using the Destrieux Atlas65, and the resulting parcellations were visualized on cortical surface maps.
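A rough sketch of the temporal filtering and normalization steps is given below. It assumes a TR of 1 s, so the 120-s window corresponds to a 121-sample odd-length median-filter kernel; this is an illustration, not the SPM8 pipeline itself.

```python
# Illustrative detrending and normalization of one run (not the SPM8 pipeline).
import numpy as np
from scipy.signal import medfilt

def detrend_and_zscore(bold, tr=1.0, window_s=120):
    """bold: n_TR x n_voxels array; removes a running-median drift and z-scores each voxel."""
    kernel = int(window_s / tr) + 1                      # 121 samples (kernel must be odd)
    drift = np.apply_along_axis(medfilt, 0, bold, kernel_size=kernel)
    cleaned = bold - drift
    return (cleaned - cleaned.mean(0)) / cleaned.std(0)
```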

Transcription and temporal alignment

Conversational speech was transcribed at the morpheme level using Microsoft Azure Speech-to-Text, followed by manual correction for accuracy. The morphemes were grouped into meaningful semantic chunks approximating the fMRI TR (1,000 ms) and temporally aligned to the corresponding fMRI volumes using the midpoint of each chunk’s duration.
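The midpoint-based alignment can be sketched as follows; the chunk format (a list of text/onset/offset triples in seconds) is an assumption for illustration.

```python
# Illustrative alignment of transcript chunks to fMRI volumes by chunk midpoint.
def align_chunks_to_trs(chunks, tr=1.0):
    """chunks: list of (text, onset_s, offset_s); returns {TR index: [chunk texts]}."""
    aligned = {}
    for text, onset, offset in chunks:
        midpoint = (onset + offset) / 2.0
        aligned.setdefault(int(midpoint // tr), []).append(text)
    return aligned
```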

Contextual embedding extraction

To extract contextual embeddings from the content of conversations, we utilized an instruction-tuned language model (GPT) fine-tuned specifically for Japanese37 (https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft). This model is built on the open-source GPT-NeoX architecture66 and was pretrained to predict the next word on the basis of preceding context using 312.5 billion tokens from various Japanese text datasets: Japanese CC-100, Japanese C4 and Japanese Wikipedia. For comparative purposes, we also replicated our analysis using the non-instruction-tuned version of the model (https://huggingface.co/rinna/japanese-gpt-neox-3.6b) as detailed in Supplementary Fig. 18. Instruction tuning was performed using datasets translated into Japanese, including Anthropic HH RLHF data, FLAN Instruction Tuning data and the Stanford Human Preferences Dataset. The resulting model architecture comprises 36 transformer layers with hidden unit dimensions of 2,816.

We processed transcribed utterances using GPT-NeoX with context lengths of 1, 2, 4, 8, 16 and 32 s, extracting embeddings by averaging the internal representations of all tokens within each utterance. To investigate differences in prediction accuracy across model layers, we extracted embeddings from the input layer (embedding layer), as well as every third layer within the model. As a control to account for predictions potentially driven by low-level sensory or motor brain activity, we generated random normal embeddings4 with the same dimensionality as the GPT embeddings (2,816 features). These embeddings were matched to individual utterance instances corresponding to each TR (for example, ‘something I’m thinking’ in Fig. 1a).
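A minimal sketch of the embedding extraction with the Hugging Face transformers library is shown below. Tokenization details are simplified, and averaging over all input tokens here is a simplification of our procedure, which averaged over the tokens of the current segment only (with the preceding context prepended).

```python
# Sketch of contextual embedding extraction (simplified; not the exact preprocessing).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "rinna/japanese-gpt-neox-3.6b-instruction-sft"
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

def embed_utterance(text_with_context, layers=range(0, 37, 3)):
    """Return {layer index: 2,816-dim vector}, averaging hidden states over all tokens."""
    inputs = tokenizer(text_with_context, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states   # tuple: embedding layer + 36 transformer layers
    return {l: hidden[l][0].mean(dim=0).numpy() for l in layers}
```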

Head motion model construction

To account for BOLD signal variance attributable to head motion, six translational and rotational motion parameters estimated during preprocessing were included as regressors. Frame-wise displacement values, calculated following previous research67, were also incorporated. A distance of 50 mm between the cerebral cortex and the head centre was assumed in accordance with a prior study67.
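The frame-wise displacement computation can be sketched as follows, assuming three translations in millimetres followed by three rotations in radians (the parameter ordering is an assumption for illustration).

```python
# Frame-wise displacement following the definition in the text (illustrative sketch).
import numpy as np

def framewise_displacement(motion, radius=50.0):
    """motion: n_TR x 6 realignment parameters (translations in mm, rotations in rad)."""
    diff = np.diff(motion, axis=0, prepend=motion[:1])   # first frame has zero displacement
    diff[:, 3:] *= radius                                # rotations -> arc length on a 50-mm sphere
    return np.abs(diff).sum(axis=1)
```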

Separate and unified linguistic model construction

We constructed two linguistic models to evaluate hypotheses regarding the neural representation of language production and comprehension. The separate linguistic model assumes independent representations for production and comprehension and combines contextual embeddings extracted separately for each modality. Each embedding set comprised 2,816 features, derived from combinations of 13 layers (0, 3, …, 36) and six context lengths (1, 2, 4, 8, 16 and 32 s), yielding 78 feature combinations per modality and a total of 5,632 features. Identical feature pairs were used across both modalities. The unified linguistic model assumes shared neural representations for production and comprehension. It utilized 2,816 contextual embeddings derived from the combined speech content of both modalities within each TR. If only one modality was present, embeddings were derived solely from that modality.

Voxel-wise model estimation and testing

To model cortical activity in individual voxels, we used a FIR model, which accounts for the slow hemodynamic responses and their coupling to neural activity. Although the canonical hemodynamic response function (HRF) is widely used in fMRI studies, it assumes a uniform HRF shape across cortical voxels. This simplification can result in inaccuracies, given that the shape of the hemodynamic response varies across cortical regions68. To address this variability, we concatenated the 5,632 linguistic features with time delays spanning two to seven samples (2–7 s), yielding a total of 33,792 features. We modelled the BOLD responses as a linear combination of these features, with weights estimated using banded ridge regression, implemented via the Himalaya package38,39. Regularization parameters were optimized through fivefold cross-validation, exploring ten values between 10−2 and 107. Model testing utilized leave-one-session-out cross-validation, in which one session was withheld for testing while the remaining sessions served as training data. The prediction accuracy was evaluated by calculating Pearson’s correlation coefficient between observed and predicted BOLD responses in the test dataset. The statistical significance was determined through a one-sided permutation test. A null distribution was generated by permuting 20-TR blocks (20 s) of the left-out test data 1,000 times, recalculating the correlation for each permutation. Multiple comparisons were corrected using the FDR procedure69.
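The block permutation test for a single voxel can be sketched as follows. In this illustration the predicted time course is permuted in 20-TR blocks, which yields an equivalent null distribution to permuting the observed data; function and variable names are placeholders.

```python
# Sketch of the one-sided block permutation test for a single voxel.
import numpy as np

def block_permutation_pvalue(y_true, y_pred, block=20, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_blocks = len(y_true) // block
    y_true = y_true[:n_blocks * block]
    blocks = y_pred[:n_blocks * block].reshape(n_blocks, block)
    r_obs = np.corrcoef(y_true, blocks.ravel())[0, 1]
    null = np.array([
        np.corrcoef(y_true, blocks[rng.permutation(n_blocks)].ravel())[0, 1]
        for _ in range(n_perm)
    ])
    return (np.sum(null >= r_obs) + 1) / (n_perm + 1)
```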

LME model analysis

To explore how timescales and layer positions influence prediction accuracy and weight correlations, we conducted LME model analyses using the lmer function from the lmerTest package (version 3.1-3)70 in R (version 4.3.3). Fixed effects included layer position, context length, their interaction and quadratic terms for both predictors to capture potential non-linear relationships. The models included by-participant random intercepts and random slopes for context length and its quadratic term, allowing for individual variability in the effects of contextual information. To assess the influence of encoding model type or voxel type, we extended the LME model structure by adding type as both a fixed effect and a by-participant random slope. Finally, we simplified the models by stepwise removal of non-significant predictors, selecting the model structure with the lowest Akaike Information Criterion values using the Kenward–Roger approximation. The P values smaller than 2.2 × 10−16 are reported as <2.2 × 10−16, which is the lower limit of the default precision in R.
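The LME models were fit with lmerTest in R; a rough Python analogue of the same model structure is sketched below. It omits the Kenward–Roger correction, and the data-frame column names are assumptions for illustration.

```python
# Rough analogue of the LME structure (the study used lmerTest in R; column names assumed).
import statsmodels.formula.api as smf

model = smf.mixedlm(
    "accuracy ~ context + I(context**2) + layer + I(layer**2) + context:layer",
    data=df,                                  # one row per participant x layer x context length
    groups=df["participant"],
    re_formula="~context + I(context**2)",    # by-participant random slopes for context terms
)
result = model.fit()
print(result.summary())
```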

Variance partitioning

We performed variance partitioning to quantify the unique contributions of linguistic features to BOLD responses in production, comprehension and their intersection. Following methods from previous voxel-wise modelling studies16,40, we used three models: a production-only model, a comprehension-only model and their combination (that is, the separate linguistic model). We used set-theoretic notation to calculate the unique and shared variance explained. Unique variance was calculated as

$${\rm{Production}}\setminus {\rm{comprehension}}=({\rm{production}}\cup {\rm{comprehension}})-{\rm{comprehension}}$$
$${\rm{Comprehension}}\setminus {\rm{production}}=({\rm{production}}\cup {\rm{comprehension}})-{\rm{production}}.$$

Shared variance was calculated as follows

$${\rm{Production}}\cap {\rm{comprehension}}={\rm{production}}+{\rm{comprehension}}-({\rm{production}}\cup {\rm{comprehension}}).$$

While variance partitioning is typically reported using R2 values, we report the square roots of these values to align with our primary evaluation metric—correlation coefficients—thereby facilitating direct comparison and consistent interpretation across all reported results. Variance partitioning was applied to all layer-context combinations. In principle, variance partitioning assumes equal sample sizes across conditions. However, in our naturalistic dialogue experiment, individual fMRI frames (TRs) may correspond to production, comprehension, both or neither, resulting in inherent unequal sample sizes across conditions. To preserve the ecological validity of the dataset and avoid imposing artificial constraints, we applied variance partitioning uniformly across all TRs.
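In practice, the partitioning reduces to simple arithmetic on the squared prediction correlations of the three models, as sketched below. Variable names are placeholders, and clipping small negative values to zero before taking the square root is our simplification for the illustration, not a step stated here.

```python
# Variance partitioning from per-voxel prediction correlations (illustrative).
import numpy as np

R2_joint = r_joint ** 2        # separate (joint) model = production "union" comprehension
R2_prod = r_prod ** 2          # production-only model
R2_comp = r_comp ** 2          # comprehension-only model

unique_prod = R2_joint - R2_comp                 # production \ comprehension
unique_comp = R2_joint - R2_prod                 # comprehension \ production
shared = R2_prod + R2_comp - R2_joint            # production "intersection" comprehension

def to_r(v):
    """Report on the correlation scale: square root of the partitioned variance."""
    return np.sqrt(np.clip(v, 0, None))

unique_prod_r, unique_comp_r, shared_r = to_r(unique_prod), to_r(unique_comp), to_r(shared)
```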

PCA

To identify low-dimensional representations of the separate linguistic model weights, we performed a PCA following previous studies21,42 separately for production and comprehension. Model weights, averaged across the six delays for each feature (33,792/6 weights = 5,632 mean weights) and across sessions, were scaled by prediction accuracy to reduce contributions from voxels with lower prediction accuracy. A PCA was performed separately for each modality on these scaled weights in all cortical voxels (2,816 weights × all cortical voxels), yielding 2,816 orthogonal PCs. We assessed the significance of the first 20 weight PCs by comparing their explained variance with that of the first 20 stimulus PCs (derived from GPT embeddings) using bootstrapping (1,000 iterations). Correspondence between weight and stimulus PCs was determined using the Gale–Shapley stable marriage algorithm. The PCs were deemed significant if the stimulus PC never explained more variance than the corresponding weight PC in any bootstrap sample (P < 0.001).
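The weight PCA and its bootstrap comparison against stimulus PCs can be sketched as follows. This is illustrative only: the Gale–Shapley matching between weight and stimulus PCs is omitted, and the exact form of the accuracy scaling is an assumption.

```python
# Sketch of the weight PCA and bootstrap significance test (matching step omitted).
import numpy as np
from sklearn.decomposition import PCA

# mean_weights: 2,816 x n_voxels (delay- and session-averaged); acc: accuracy per voxel.
scaled = mean_weights * np.clip(acc, 0, None)          # down-weight poorly predicted voxels

var_weight = PCA(n_components=20).fit(scaled.T).explained_variance_ratio_

# Bootstrap the stimulus PCA (embeddings: n_TR x 2,816) and count exceedances.
rng = np.random.default_rng(0)
n_boot, n_tr = 1000, embeddings.shape[0]
exceed = np.zeros(20, dtype=int)
for _ in range(n_boot):
    resampled = embeddings[rng.integers(0, n_tr, n_tr)]
    var_stim = PCA(n_components=20).fit(resampled).explained_variance_ratio_
    exceed += var_stim > var_weight
# A weight PC is deemed significant if it is never exceeded across bootstrap samples.
significant = exceed == 0
```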

For our current analysis, we focused on a context length of 1 s for layer 0, which yielded the highest number of significant PCs across participants (Supplementary Fig. 19). Interpretation involved three steps: (1) Identification of correlated utterances: for each PC, the top 20 positively and negatively correlated utterances were identified for each participant (Fig. 5c,g). (2) Interpretation using ChatGPT: utterances and correlation coefficients were input into ChatGPT (GPT-4o) for consistent interpretations across PCs, modalities and participants. (3) Synthesis of common components: ChatGPT synthesized interpretations to identify common components across participants.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.