Unveiling the Spatiotemporal Dynamics of Global Brain Circulation: A Comprehensive Corpus (2000–2024)

Hu, Zhiwen; Qiu, Yang; Jiang, Haihua; Ma, Xiao; Han, Lv; Lei, Saihua; Niu, Haojia

doi:10.1038/s41597-025-05268-2

Data Descriptor
Open access
Published: 04 June 2025

Zhiwen Hu^1,2,3 (ORCID: orcid.org/0000-0002-0382-7770),
Yang Qiu¹,
Haihua Jiang¹,
Xiao Ma¹,
Lv Han¹,
Saihua Lei¹ &
Haojia Niu¹

Scientific Data volume 12, Article number: 938 (2025)

Abstract

The global competition for human capital is fuelled by intricate brain circulation dynamics, where individuals with specialized skills traverse geographic, organizational, and national boundaries to address workforce demands. However, a comprehensive framework for integrating and interpreting heterogeneous data on global brain circulation remains elusive. Here we introduce the Global Brain Circulation Dynamics (GBCD) corpus, a longitudinally integrated repository of geo-information encompassing 223 countries/regions from 2000 to 2024. Garnered from diachronic narrative texts, the GBCD corpus provides granular insights into transnational brain circulation patterns and their interconnections with sociocultural progress. Continuously updated to reflect spatiotemporal dynamics, the GBCD corpus serves as a definitive reference for real-time and ex-post analysis of global brain circulation. Our analysis reveals two pivotal findings: (i) narrative brain circulation closely mirrors physical brain mobility, and (ii) geopolitical relations and spatiotemporal dynamics exhibit distinct patterns across countries/regions. The GBCD corpus establishes a novel benchmark for examining spatiotemporal brain circulation worldwide, empowering policymakers to develop evidence-based strategies for attracting and retaining human capital in a rapidly evolving global landscape.

Background & Summary

In contemporary knowledge-based economies^1,2, brain circulation has emerged as a pressing concern in the global competition for human capital³. The phenomenon of brain circulation refers to the complex dynamics of skilled individuals traversing geographic, organizational, or national boundaries to address workforce demands and facilitate the redistribution of expertise. Highly skilled individuals, including researchers, professionals, and practitioners, play a pivotal role in consolidating existing information and disseminating knowledge across various fields⁴. Unlike the traditional concept of brain drain, brain circulation emphasizes the reciprocal and often cyclical nature of talent movement, where individuals may relocate multiple times throughout their careers, creating multi-directional knowledge flows between regions. Recognizing the significance of brain circulation, research in this area has experienced rapid growth, encompassing a wide range of topics, including integration and return intentions^5,6, transnational networks and practices^7,8, and international students⁹. However, existing studies have largely adopted a unilateral perspective, defining developing countries as source countries and economically developed countries as destination countries¹⁰. This approach has resulted in a limited understanding of the bilateral relationships and reciprocal dynamics in brain circulation. Furthermore, while numerous studies have conducted in-depth analyses of brain drain and gain flows to specific countries, they have not explored the global linkages and interconnectedness between countries in terms of brain circulation¹¹. This oversight has left a significant gap in our understanding of how countries and regions at various development levels have successfully attracted, retained, and developed highly skilled individuals, underscoring the need for a more nuanced and comprehensive approach to brain circulation research.

Recent research has predominantly relied on geospatial metadata analysis to uncover patterns of brain circulation. Geospatial metadata can be obtained from large-scale sources, including social networks, such as X (formerly Twitter)¹² and Facebook¹³, and mobile devices¹⁴ offering Location-Based Services (LBS). However, this approach has significant limitations. While geospatial metadata analysis offers the advantage of high-precision location information, which can be valuable for studying brain circulation patterns, it also raises concerns about personal privacy¹⁵. The collection of location information can be sensitive, and users may be hesitant to share this geospatial information. Furthermore, geospatial metadata analysis only superficially captures circulation between countries and regions as changes in signal points, neglecting the relevant attributes of the circulating individuals. Consequently, conclusions drawn from geospatial metadata analysis cannot be accurately applied to discussions about highly skilled individuals¹⁶.

Alternatively, the empirical indicator analysis method is a commonly employed approach in brain circulation research. For instance, a recent study utilized the Organisation for Economic Co-operation and Development (OECD)¹⁷ international migration database to map the annual dynamics of the global brain circulation. While this method offers diverse features compared to single signal changes, it has significant limitations when applied to brain circulation research. A major limitation of the empirical indicator analysis method is its inability to effectively handle unquantifiable features, such as geographic entities. Geographic entities, including countries, regions, and cities, are complex and multifaceted, making it challenging to quantify their characteristics and dynamics. As a result, this method may oversimplify or neglect the nuances of geographic entities, leading to incomplete or inaccurate conclusions¹⁸. Additionally, it is influenced by the prolonged time costs and intervals associated with empirical surveys, limiting its ability to reflect the evolving dynamics of brain circulation in a timely manner¹⁹.

The limitations of geospatial metadata analysis and empirical indicator analysis underscore the need for innovative approaches to analysing global brain circulation. As a complex sociocultural phenomenon, brain circulation can be measured in three principal ways: as a behaviour, as a transition, or as a duration²⁰. However, these measures are not directly comparable, highlighting the need for a more nuanced approach that can capture the intricacies of brain circulation. To address these challenges, we propose a novel approach that conceptualizes brain circulation as collective behaviours of highly skilled individuals in diachronic narrative circulation texts²¹. Narrative circulation refers to the textual representation and documentation of talent mobility in various sources (academic papers, news articles, institutional reports, etc.). This concept captures how information about brain circulation is communicated, framed, and disseminated through language. By leveraging large language models (LLMs), we develop a diachronic analysis method that can effectively capture the complexities of brain circulation and provide a more comprehensive understanding of this phenomenon. Diachronic narrative text provides a unique advantage in tracing long-term dynamics due to its capacity to document information over time²², rendering it an ideal source for extensive spatiotemporal research. Furthermore, diachronic narrative text has been successfully applied to various tasks, including word frequency analysis²³ and syntactic parsing²⁴. However, when dealing with complex tasks, conventional diachronic narrative text processing methods, such as dependency parsing²⁵ and part-of-speech tagging²⁶, often fail to capture the intricate relationships between entities, events, and concepts due to their limited ability to understand the context. 
To overcome these limitations, we employ LLMs, which excel at comprehending semantic relationships in context and are effective in generating outputs (“responses”) from given inputs (“prompts”). LLMs have been demonstrated to be effective in extracting structured information from diachronic narrative texts²⁷, including tasks such as named entity recognition (NER) and relation extraction (RE)²⁸. By harnessing the power of LLMs, we can develop a more sophisticated understanding of brain circulation and its underlying dynamics.

Building on the LLM-based diachronic analysis method, we developed the Global Brain Circulation Dynamics (GBCD) corpus by systematically organizing the extracted features related to transnational brain circulation²⁹. The GBCD corpus enables the reconstruction of transnational circulation behaviours by mapping geographic entities in diachronic narrative texts at the national level. By leveraging the semantic meaning of circulation-related words, the corpus distinguishes between the origin and destination, providing a bilateral description of brain circulation. Spanning a prolonged period from 2000 to 2024, with continuous updates at the forefront of temporal developments, the GBCD corpus serves as a hallmark reference for brain circulation, allowing for real-time or ex-post profiling. To extract features from diachronic narrative texts, we employ a preliminary text classification step and locate keywords that describe the transferred subjects, focusing on skilled individuals. In subsequent fine-tuning of the LLMs, we emphasize attention bias towards skilled individuals in the prompt descriptions, thereby standardizing the research object. This approach ensures that the extracted features are relevant to the analysis of brain circulation and enables the GBCD corpus to provide a comprehensive understanding of this complex phenomenon.

Through a rigorous multifaceted data mining approach on the GBCD corpus, our study reveals several groundbreaking findings that shed new light on the complex dynamics of brain circulation. Specifically, we uncover four key insights that significantly advance our understanding:

(i)
A paradigmatic mapping relationship is observed between the narrative and real-world patterns, indicating a universal connection between narrative and physical brain circulation. This finding highlights the synergistic effects of brain circulation, where virtual and physical patterns converge to shape the global brain circulation landscape.
(ii)
The GBCD corpus reveals that the discursive power of the Global North manifests in the dissemination and discussion of brain circulation topics in the network. This finding underscores the influence of the Global North in shaping the global brain circulation narrative, emphasizing the need for more nuanced and inclusive approaches to understanding brain circulation.
(iii)
Countries and regions with similar geographical heterogeneity exhibit specific patterns in their geographical conditions, highlighting the importance of considering geographical factors in understanding brain circulation dynamics. This finding underscores the need for more spatially aware approaches to brain circulation research.
(iv)
Major international events have a significant impact on brain circulation, with a lagging dynamic trend in the network. The timing of these events can be linked to the underlying societal factors that affect brain circulation, emphasizing the need for more temporally aware approaches to brain circulation research.

Ultimately, our research demonstrates the benefits of applying scientifically informed insights to evidence-based policy making. By leveraging the GBCD corpus, policymakers can gain a deeper understanding of the complex dynamics underlying brain circulation and make informed decisions to address the challenges. This study highlights the potential of data-driven approaches to inform policy and promote more effective brain circulation strategies.

Methods

To construct brain circulation patterns from diachronic narrative texts, it is essential to select an appropriate text processing method that can accurately identify the brain circulation features in the texts. However, relying solely on the part-of-speech and structure of tokens can be insufficient for analysing the hidden brain circulation information in the texts. Our research identifies a type of text that requires contextual understanding to analyse brain circulation information, which we refer to as implicit circulation texts. These texts typically lack characteristic narrative expressions about brain circulation, such as personal initiative behaviour, national geographic entity, or obvious behaviour indicators, presenting a significant challenge to Natural Language Processing (NLP) methods relying on programmatic condition evaluations rather than semantic understanding.

Methodological workflow

To effectively extract brain circulation information, we develop a specialized ensemble of LLMs tailored to this task. We employ a two-stage approach, involving information construction and structural fine-tuning of the models, to enable task-specific extraction of brain circulation behaviours from diachronic narrative texts (Fig. 1).

We differentiated diachronic narrative texts into explicit and implicit circulation texts and demonstrated the superiority of LLMs (Fig. 1). Using the narrative text from the C4 dataset³⁰ as an example, LLMs, owing to their built-in knowledge and semantic analysis capabilities, can effectively complete tasks such as geographic entity mapping and brain circulation subject analysis. To further improve the performance of LLMs, we selected prompts adapted to the brain circulation extraction task to guide the responses, making the models sensitive to task-specific extraction and reducing hallucinations³¹. We then filtered the generated text for responses containing the necessary brain circulation features and used them with the LoRA (Low-Rank Adaptation) method³² to further fine-tune the LLMs, enabling them to output structured data.

Inference result validation

To ensure the accuracy and reliability of our brain circulation feature extraction, we employ a multi-faceted approach that considers multiple essential features and mitigates potential biases, including feature selection, timestamp extraction and standardization (Fig. 2). We identify origin and destination as crucial features for positive samples, as they provide critical context for understanding brain circulation patterns. We extract timestamps for brain circulation events to capture the temporal dynamics of this phenomenon. To ensure consistency and precision, we standardize timestamps to the YYYY-MM format, allowing for a monthly granular analysis. Furthermore, recognized geographic entities (e.g. cities, organization names, landmarks, etc.) are normalized and mapped to the country level for extraction. Our national geographic divisions adhere to methods endorsed by the United Nations Statistics Division for international statistical data collection, ensuring consistency and compatibility with global standards³³.
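The standardization steps described above can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the function names, the accepted date formats, and the tiny `ENTITY_TO_ISO3` lookup table are all hypothetical assumptions introduced here.

```python
import re
from typing import Optional

# Hypothetical lookup table (an assumption for illustration); a real
# gazetteer would cover far more cities, organizations, and landmarks.
ENTITY_TO_ISO3 = {
    "boston": "USA",
    "mit": "USA",
    "oxford": "GBR",
    "eiffel tower": "FRA",
}

MONTH_ABBR = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def normalize_timestamp(raw: str) -> Optional[str]:
    """Return the timestamp as YYYY-MM, or None if it cannot be parsed."""
    text = raw.strip().lower()
    # Numeric forms such as "2021-03-15" or "2021/3".
    iso = re.match(r"(\d{4})[-/](\d{1,2})\b", text)
    if iso:
        year, month = int(iso.group(1)), int(iso.group(2))
    else:
        # Named-month forms such as "March 2021" or "Sep 5, 2019".
        named = re.search(
            r"([a-z]{3})[a-z]*\.?\s+(?:\d{1,2},?\s+)?(\d{4})", text
        )
        if not named or named.group(1) not in MONTH_ABBR:
            return None
        year, month = int(named.group(2)), MONTH_ABBR[named.group(1)]
    if not 1 <= month <= 12:
        return None
    return f"{year:04d}-{month:02d}"


def map_to_country(entity: str) -> Optional[str]:
    """Map a recognized geographic entity to a country-level ISO3 code."""
    return ENTITY_TO_ISO3.get(entity.strip().lower())
```

Returning `None` for unparseable inputs keeps ambiguous records out of the monthly series instead of guessing a date or country.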

To minimize bias and uncertainty in responses, we align the extraction results from different models within the LLMs ensemble. This is necessary because variation in training data, architectures, and randomness can lead to differing outputs from multiple models, making reliance on a single model potentially risky³⁴. By ignoring valuable insights from other models, a single model may introduce errors or biases that can impact the accuracy of feature extraction³⁵. To address this challenge, we employ ensemble methods to combine the predictions from multiple models. Recent studies on multitask prompted training have shown that aligning outputs across tasks within a single model ensures consistency and accuracy³⁶. In our approach, we take a conservative stance by only retaining outputs that are fully consistent across all models, discarding any discrepancies. This strict filtering reduces semantic bias and enhances the accuracy of feature extraction, as different models may emphasize various aspects of the text³⁷.
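The conservative alignment rule, which keeps an extraction only when every model in the ensemble returns the same result, might look like the sketch below. The tuple layout `(origin, destination, timestamp)`, the model names, and the function names are assumptions made for illustration; the source does not specify the actual data structures.

```python
from typing import Dict, List, Optional, Tuple

# Assumed shape of one extraction: (origin, destination, timestamp).
Extraction = Tuple[str, str, str]


def unanimous(outputs: List[Optional[Extraction]]) -> Optional[Extraction]:
    """Return the shared extraction if all models agree, else None."""
    if not outputs or any(o is None for o in outputs):
        return None
    first = outputs[0]
    return first if all(o == first for o in outputs) else None


def align(
    per_model: Dict[str, Dict[str, Optional[Extraction]]]
) -> Dict[str, Extraction]:
    """Keep only texts whose extraction is identical across every model.

    `per_model` maps a (hypothetical) model name to {text_id: extraction}.
    """
    if not per_model:
        return {}
    # Only texts processed by every model can be compared.
    shared_ids = set.intersection(*(set(r) for r in per_model.values()))
    kept: Dict[str, Extraction] = {}
    for text_id in sorted(shared_ids):
        agreed = unanimous([per_model[m][text_id] for m in per_model])
        if agreed is not None:
            kept[text_id] = agreed
    return kept
```

Strict unanimity trades recall for precision: any disagreement discards the record, which matches the conservative stance described above.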

After completing the data extraction and alignment process, we obtain the comprehensive GBCD corpus. This multidimensional corpus encompasses a wide range of features, including country information, diachronic narrative texts, source URL, circulation timestamp and locations (Table 1). This categorization enables us to effectively organize and analyze the data, facilitating the identification of patterns and trends in brain circulation. As the diachronic narrative text is updated over time, we will dynamically inject it into the GBCD corpus to ensure the cutting-edge nature of the data. The insights derived from this corpus both contribute to a deeper understanding of global brain circulation and provide valuable support for policy formulation and international cooperation strategies. By capturing the evolving dynamics of brain circulation, we can identify key drivers, as well as regions and sectors most impacted by it. Furthermore, the temporal and spatial resolution of the data allows for a more nuanced exploration of circulation flows, revealing shifts and emerging trends that might otherwise go unnoticed. The GBCD corpus, therefore, serves as a powerful tool for both academic research and practical applications, offering a comprehensive foundation for further investigations into the global brain landscape.

Table 1 Summary information about the GBCD corpus.

Data Records

The Global Brain Circulation Dynamics (GBCD) corpus, constructed in this study, is publicly available on the Figshare repository (https://doi.org/10.6084/m9.figshare.28031471)²⁹. The corpus captures key attributes relevant to brain circulation, including origin, destination, diachronic narrative text, URL, and timestamp (Table 1). Notably, geographic entities are mapped to the global country or region level, facilitating the analysis of transnational brain circulation. The GBCD corpus spans 223 countries and regions worldwide, encompassing 193 UN member states, one observer state, and 29 non-sovereign island territories. Each country or region is accompanied by Countrycode, ISO2, and ISO3 identifiers, enabling multidimensional organization of brain circulation data. Furthermore, we distinguish between origin and destination in geographic entities related to circulation flow, allowing for the representation of brain gain and brain drain, and providing insights into bilateral brain circulation between countries.
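As a rough illustration of how the origin and destination attributes support a bilateral view, the sketch below counts directed flows and derives a net gain per country. The dictionary keys `origin` and `destination` and the use of ISO3 codes as values are assumptions; the actual field names in the released files may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def bilateral_flows(entries: Iterable[Mapping[str, str]]) -> Counter:
    """Count directed (origin, destination) pairs, skipping incomplete rows."""
    flows: Counter = Counter()
    for entry in entries:
        origin = entry.get("origin")            # assumed field name
        destination = entry.get("destination")  # assumed field name
        if origin and destination and origin != destination:
            flows[(origin, destination)] += 1
    return flows


def net_gain(flows: Mapping[Tuple[str, str], int]) -> Dict[str, int]:
    """Inflow minus outflow per country: positive = gain, negative = drain."""
    net: Counter = Counter()
    for (origin, destination), count in flows.items():
        net[destination] += count
        net[origin] -= count
    return dict(net)
```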

The GBCD corpus is a comprehensive dataset comprising 2,904,663,710 tokens, structured into two distinct corpora: diachronic and synchronic. The corpus encompasses 1,564,262 entries related to brain circulation features, with the diachronic corpus accounting for 1,111,644 entries that span a 24-year period (2000–2024). Notably, the diachronic corpus is continuously updated in real-time, ensuring the data remains current and relevant for both real-time and ex-post analyses of brain circulation. In contrast, the synchronic corpus contains 452,618 entries, deliberately excluding timestamp features to facilitate synchronic research.

To maintain data quality and integrity, we employed a rigorous data cleaning process to eliminate redundancy in narrative text and URLs, thereby mitigating the impact of duplicate news stories from multiple sources. Furthermore, geographic entities associated with brain circulation were mapped at the national level to ensure consistency and accuracy. Each data entry is accompanied by two temporal features: the brain circulation timestamp and the timestamp corresponding to the narrative text’s download from the source. These data tuples are consolidated into individual JSON files, enhancing accessibility and facilitating further analysis.
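A minimal sketch of the redundancy-elimination step is given below, assuming each entry is a JSON object with `url` and `text` keys (hypothetical names). It treats two entries as duplicates when they share a URL or when their whitespace- and case-normalized texts are identical; the authors' actual cleaning rules are not specified in the source and may be stricter or looser.

```python
import hashlib
import json
from typing import Dict, List


def _fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first entry for each distinct URL and each distinct text."""
    seen_urls, seen_texts = set(), set()
    kept: List[Dict[str, str]] = []
    for entry in entries:
        url = entry.get("url", "").strip().rstrip("/")  # assumed field name
        text_hash = _fingerprint(entry.get("text", ""))  # assumed field name
        if (url and url in seen_urls) or text_hash in seen_texts:
            continue
        if url:
            seen_urls.add(url)
        seen_texts.add(text_hash)
        kept.append(entry)
    return kept


def load_entries(path: str) -> List[Dict[str, str]]:
    """Read a JSON file that holds a list of entry objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```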

Technical Validation

NER and RE performance

To evaluate the efficacy of task-specific fine-tuning for LLMs in identifying and organizing brain circulation features, we conducted a comprehensive assessment using named entity recognition (NER) and relation extraction (RE) performance tests, along with inference result validation. A random sample of 11,479 narrative texts with responses was selected for performance testing (Table 2).

Table 2 Validation of task-specific fine-tuning for reasoning performance.

The results demonstrate that task-specific fine-tuning significantly enhances the compliance rate (CR) of all models, yielding scores above 0.957. This represents a substantial improvement over the highest CR score of 0.051 achieved by the untuned models. The structure of the reasoning results is consequently largely consistent with our task requirements following fine-tuning. Furthermore, the average increase in true positive rate (TPR) scores is approximately 0.09, implying a positive optimization for all models. These findings suggest that fine-tuned LLMs can substantially improve extraction performance, which can be leveraged to enrich existing corpora and enable LLMs to generate more accurate and informed decision-making outputs. The results validate the effectiveness of task-specific fine-tuning for LLMs in recognizing and organizing brain circulation features, highlighting the potential of this approach for enhancing the accuracy and reliability of LLM-driven decision-making.

To assess the accuracy of brain circulation feature extraction, we randomly selected 3,174 responses from the positive samples and validated the inference results of the LLMs. We calculated the F₁ scores for each relation according to the following metrics:

$$\mathrm{recall}=\frac{\text{No. of correct relations retrieved}}{\text{No. of relations in the set}}$$

(1)

$$\mathrm{precision}=\frac{\text{No. of correct relations retrieved}}{\text{No. of relations retrieved}}$$

(2)

$$F_{1}=\frac{2\left(\mathrm{recall}\cdot\mathrm{precision}\right)}{\mathrm{recall}+\mathrm{precision}}$$

(3)

where recall represents the ratio of correct relations retrieved to total relations in the set, and precision is calculated as the ratio of correct relations retrieved to total relations retrieved. Correct relations retrieved are those that are both relevant and correctly identified by the responses. The F₁ score represents the harmonic mean of recall and precision, providing a balanced measure of the LLMs’ performance.
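Equations (1) to (3) translate directly into a few lines of code. In the sketch below, relations are represented as hashable tuples and compared by exact match, which is an assumption about how "correct" is decided; the function name is hypothetical.

```python
from typing import Hashable, Iterable, Tuple


def relation_scores(
    gold: Iterable[Hashable], retrieved: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Return (recall, precision, F1) for a set of retrieved relations."""
    gold_set, retrieved_set = set(gold), set(retrieved)
    correct = len(gold_set & retrieved_set)  # relevant and correctly identified
    recall = correct / len(gold_set) if gold_set else 0.0
    precision = correct / len(retrieved_set) if retrieved_set else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```

For instance, with four annotated relations and three retrieved ones of which two are correct, recall is 2/4, precision is 2/3, and F₁ is 4/7.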

To assess the efficacy of our entity extraction approach, we manually annotated the entity reasoning results of diachronic narrative text based on the calculation formula and grouped the results by token size. We then computed the recall, precision, and F₁ scores (Table 3). Our analysis of the manual scores reveals that the F₁ scores not only meet the expected requirements of the entity extraction task but also exhibit a consistent pattern of variation with respect to token size. Notably, the optimal F₁ score for each entity approximates 0.9, substantially exceeding the performance of other methods commonly employed in recent research for extracting entities with LLMs³⁸. Furthermore, our results indicate that increasing token size has a positive impact on entity extraction performance³⁹. The highest F₁ values for each entity are observed in the control class with longer token sizes, which can be attributed to the rich contextual information present in the narrative data. This provides comprehensive prior information for LLMs to recognize entities, thereby enhancing F₁ scores. These findings suggest that our entity extraction approach is effective in extracting entities from diachronic narrative text, and that increasing token size can improve performance. The results also highlight the importance of contextual information in entity recognition and demonstrate the potential of our approach for extracting entities with high accuracy.

Table 3 Kappa values for the task-specific extraction using LLMs.

Our analysis reveals a significant correlation between token size and entity extraction performance. Specifically, when token size exceeds 200 and continues to increase, the F₁ scores of all entity groups show a significant increase of approximately 0.04. However, as token size approaches the processing limit of LLMs, the F₁ scores exhibit a slight decline, resulting in an optimal F₁ performance in the range of 1000–3000 tokens. This forms an inverted U-shaped curve⁴⁰, suggesting that excessive contextual information may lead to redundancy and negatively impact performance. Notably, our findings highlight the potential of the hierarchical attention mechanism to alleviate the redundancy problem caused by excessive context information. By adaptively adjusting the weights of target entities and optimizing the use of context information, this mechanism can mitigate the negative impact of excessive contextual information on entity extraction performance⁴¹.

In conclusion, our performance scores demonstrate the potential of LLMs in entity recognition tasks and underscore the importance of considering token size and contextual information. Our analysis reveals that entity extraction performance exhibits an inverted U-shaped curve with respect to token size, highlighting the importance of token size optimization in entity extraction tasks. By optimizing these factors, we can enhance the accuracy and efficiency of entity extraction, resulting in higher-quality information.

Synergistic effects

To validate the accuracy of brain circulation patterns depicted in diachronic narrative texts, we conducted a comparative analysis with recent studies on human migration based on real-world statistical data. Our objective was to investigate the intrinsic correlations between the GBCD and these studies. We selected two studies as mapping references: the bilateral flows of international migration of scholars (IMS)⁴² and the global record of annual terrestrial Human Footprint (HF)⁴³. These studies investigate circulation patterns of scholars and humans, respectively, which exhibit thematic overlap with the GBCD in terms of research subjects and topics. To facilitate a comparative analysis, we selected the United States as a representative case from each study and applied the Convergent Cross Mapping (CCM) method^44,45 to quantify the corresponding time series spanning the period 2013–2020 (Fig. 3). To account for differences in temporal coverage, we restricted the HF time series to 2013–2018. This comparative analysis enabled us to explore the synergistic effects between the GBCD and the IMS and HF studies, providing insights into the accuracy and reliability of brain circulation patterns depicted in diachronic narrative texts.
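To make the CCM step more concrete, the following is a simplified, dependency-free sketch of cross-map skill estimation. It is not the authors' implementation: the embedding dimension `E`, the lag `tau`, the exponential neighbour weighting, and all function names are assumptions chosen for illustration. The idea is to reconstruct a delay-embedded "shadow manifold" from one series, use nearest neighbours on it to estimate the other series, and report the correlation ρ between estimates and observations; convergence is then inspected by repeating this over growing library lengths.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0


def cross_map_skill(
    x: Sequence[float], y: Sequence[float], E: int = 2, tau: int = 1
) -> float:
    """Estimate y from the shadow manifold of x and return rho(y, y_hat)."""
    start = (E - 1) * tau
    times = list(range(start, len(x)))
    if len(times) < E + 2:
        raise ValueError("series too short for the chosen E and tau")
    # Delay embedding: each point is (x[t], x[t - tau], ..., x[t - (E-1)tau]).
    manifold = [[x[t - k * tau] for k in range(E)] for t in times]
    truth: List[float] = []
    estimate: List[float] = []
    for i, point in enumerate(manifold):
        # E + 1 nearest neighbours on the manifold, excluding the point itself.
        neighbours = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(manifold)
            if j != i
        )[: E + 1]
        d_min = max(neighbours[0][0], 1e-12)  # guard against zero distance
        weights = [math.exp(-d / d_min) for d, _ in neighbours]
        total = sum(weights)
        estimate.append(
            sum(w * y[times[j]] for w, (_, j) in zip(weights, neighbours)) / total
        )
        truth.append(y[times[i]])
    return pearson(truth, estimate)


def convergence_curve(
    x: Sequence[float], y: Sequence[float], lengths: Sequence[int]
) -> List[float]:
    """Cross-map skill computed on prefixes of increasing library length."""
    return [cross_map_skill(x[:n], y[:n]) for n in lengths]
```

With annual series as short as those compared here, any such estimate rests on very few manifold points, so the resulting ρ values should be read with corresponding caution.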

Our analysis reveals a strong and persistent correlation between the GBCD and IMS: after an initial rapid ascent, the correlation coefficient (ρ) consistently exceeds 0.9, indicating a robust intrinsic connection and synergistic relationship between narrative brain circulation patterns and physical migration trends. Notably, GBCD exhibits a slight lead in the relationship, which we hypothesize may be attributed to differences in research subjects. The brain circulation patterns captured by GBCD, focusing on highly skilled individuals, including scholars, provide strong explanatory power for understanding international migration dynamics. In contrast, the correlation coefficient between HF and GBCD is substantially lower (ρ < 0.2). Moreover, when HF is used as the observed value, the estimated values of GBCD display high convergence, suggesting a lack of intrinsic correlation between the two datasets. While the reduced temporal resolution due to the narrower research timeframe of HF may contribute to this disparity, it is unlikely to be the primary cause. Instead, methodological limitations or biases may be responsible for the observed lack of correlation.

Our findings demonstrate a core intrinsic connection between GBCD and real-world migration statistics, supporting the use of the GBCD corpus to investigate brain circulation patterns and draw realistic conclusions. The significant differences in correlation coefficients between GBCD, IMS, and HF highlight the targeted focus of GBCD on highly skilled individuals, enabling the derivation of reliable conclusions about real-world brain circulation paradigms. The results show a notably higher synergy between GBCD and IMS, as compared to the synergy between talent mobility and general HF. This finding underscores that our corpus is well-targeted towards the talent group, and that scientists have a particularly strong connection with the broader category of talent.

Data mining

To ensure the diversity and comprehensiveness of data sources in the GBCD corpus, we conducted a thorough analysis of the domain categories and quantities present in narrative texts from web snapshots. Our analysis involved large-scale distribution experiments on 348,008 global domains, which were systematically categorized and ranked by continent and field (Fig. 4). This approach enabled us to identify potential biases and gaps in our data sources, informing the development of a more comprehensive and representative corpus. By examining the distribution of domains across continents and fields, we were able to assess the geographic and thematic coverage of our data sources. This analysis provides a foundation for evaluating the validity and generalizability of our findings, as well as identifying areas for future improvement and expansion of the GBCD corpus.
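The domain-level tally underlying such a distribution analysis can be sketched with the standard library alone. The example assumes each corpus entry carries a source URL; stripping a leading `www.` is a simplification introduced here, and grouping by continent or field would need an additional lookup that is not shown.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_counts(urls: Iterable[str]) -> Counter:
    """Count how many source URLs fall under each domain."""
    return Counter(d for d in map(domain_of, urls) if d)
```

Calling `domain_counts(urls).most_common(10)` then gives a quick view of which sources dominate the sample.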

Network domain distribution

Our analysis of the network domain distribution related to brain circulation reveals a trend of Western-dominated discussion, with North America at the forefront. However, this dominance may lead to biased perspectives and conclusions. Specifically, we found that North America and Asia are the hubs of brain circulation, with the highest frequency of domains. Notably, Antarctica has a higher ranking than anticipated in terms of domain frequency⁴⁶, likely due to the high-frequency brain circulation in the natural ecology field, which is widely active in Antarctica⁴⁷.

Among the domains categorized by field, the economy field has the most direct impact on brain circulation. However, within the economy field, we observed that cnbc.com and businessinsider.com, both owned by organizations in the Global North, make up a notably high percentage of domains. This suggests that the Global North is at the centre of talent circulation, both in terms of continental distribution and field distribution, which may perpetuate a Western trend in brain circulation⁴⁸. Consequently, North America has absolute discourse power in the network, dominating the global narrative and potentially perpetuating a biased perspective.

Mitigating Western discourse power

The biased perspective is particularly pronounced in studies with limited data sources, which makes it difficult to overcome. A narrow research scope may inevitably fail to mitigate the impact of Western discourse power, leading to biased conclusions and a certain degree of distortion in statistical data and results. To address this issue, our study has made efforts to mitigate the impact of Western discourse power by sampling enough source domains. By doing so, we aim to provide a more comprehensive and balanced understanding of brain circulation, untainted by the dominance of Western discourse power.

Geographical heterogeneity

To further validate and characterize differences in brain circulation from a global geographical perspective, we conduct a comprehensive analysis of geographical heterogeneity. By expanding the scope of the analysis from continents to individual countries and regions, we aim to capture the nuances of brain circulation dynamics across diverse geographical contexts. As a measure of geographical heterogeneity, we employed the Geodetector, which applies a t-test to evaluate the significance of the difference between the means of two distributions with different variances⁴⁹. The test statistic can be expressed as follows:

$${t}_{\bar{{R}_{z=1}}-\bar{{R}_{z=2}}}=\frac{\bar{{R}_{z=1}}-\bar{{R}_{z=2}}}{{\left[\frac{1}{{n}_{z=1}}{\sigma }_{\bar{{R}_{z=1}}}^{2}+\frac{1}{{n}_{z=2}}{\sigma }_{\bar{{R}_{z=2}}}^{2}\right]}^{1/2}}$$

(4)

where n_z denotes the number of countries in zone z, $\bar{{R}_{z}}$ represents the average score in zone z, and ${\sigma }_{\bar{{R}_{z}}}^{2}$ represents the variance of the scores in zone z. The statistic approximately follows a Student's t distribution with degrees of freedom equal to:

$${df}=\frac{{\left[\frac{1}{{n}_{z=1}}{\sigma }_{\bar{{R}_{z=1}}}^{2}+\frac{1}{{n}_{z=2}}{\sigma }_{\bar{{R}_{z=2}}}^{2}\right]}^{2}}{\frac{1}{{n}_{z=1}-1}{\left[\frac{1}{{n}_{z=1}}{\sigma }_{\bar{{R}_{z=1}}}^{2}\right]}^{2}+\frac{1}{{n}_{z=2}-1}{\left[\frac{1}{{n}_{z=2}}{\sigma }_{\bar{{R}_{z=2}}}^{2}\right]}^{2}}$$

(5)
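Eqs. 4 and 5 together correspond to a two-sample t-test with unequal variances. The sketch below computes both quantities for two zones using only the Python standard library. The function name `zone_t_test`, the use of the sample variance for σ², and the example scores are assumptions made for illustration, not part of the published method.

```python
import math
from statistics import mean, variance


def zone_t_test(scores_1, scores_2):
    """Return (t, df) for the difference in mean score between two zones."""
    n1, n2 = len(scores_1), len(scores_2)
    # Variance of each zone divided by its size: the bracketed terms in Eqs. 4 and 5.
    # Assumption: the sample variance (n - 1 denominator) is used for sigma^2.
    v1 = variance(scores_1) / n1
    v2 = variance(scores_2) / n2
    t = (mean(scores_1) - mean(scores_2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Made-up scores for two hypothetical zones.
t, df = zone_t_test([62.0, 58.0, 60.0, 64.0], [41.0, 45.0, 39.0, 43.0, 42.0])
```

Each zone needs at least two countries, since the variance of a single observation is undefined. The resulting `t` would then be compared against a Student's t distribution with `df` degrees of freedom.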

Following the correction of a typographical error in Eq. 5 of the original article⁵⁰, we recalculated the geographical heterogeneity of brain circulation using the revised Geodetector formula. The resultant quantified distribution reveals distinct patterns of brain drain and gain across countries and regions (Fig. 5). Notably, island nations exhibit elevated levels of brain circulation activity, which we attribute to their limited geographical adjacency, resulting in reduced competition and increased connectivity.

Our analysis also highlights the prominent positions of China and the United States in global brain circulation, with indices exceeding 60. In stark contrast, countries with substantially lower indices are concentrated in Africa⁵¹ and South America⁵², emphasizing the need for targeted policy interventions to bolster talent competitiveness in these regions and mitigate the risk of brain drain. A notable exception in Africa is South Africa, which exhibits a unique geographical heterogeneity in both brain drain and gain patterns, surpassing its continental peers. We propose that South Africa’s strategic location at the southern tip of Africa contributes to its distinctive brain circulation profile⁵³. As a critical hub for international trade, commerce, and cultural exchange, South Africa’s location may facilitate the attraction and retention of high-skilled individuals, thereby driving its exceptional brain circulation patterns.

In conclusion, these findings suggest that geographical location and the advantages it confers play a crucial role in shaping brain circulation patterns. Consequently, governments and policymakers should consider these factors when designing policies to attract and retain talent. The GBCD corpus provides a valuable resource for analysing geographical heterogeneity, enabling policymakers to identify regions with lower levels of brain circulation and develop strategies to promote regional cooperation and knowledge sharing.

Transnational brain circulation network

To further elucidate the complex patterns of global interaction in brain circulation, we leveraged the GBCD to investigate the distinct tendencies of brain drain and gain in countries exhibiting significant geographical heterogeneity. Focusing on China and the United States as paradigmatic examples, we constructed a transnational brain circulation network by integrating GBCD brain circulation features with international flight data. This network analysis enables the interpretation of intricate brain circulation trajectories between these two countries and the rest of the world, providing valuable insights into the dynamics of global brain circulation (Fig. 6). By analysing brain circulation trajectories, we can identify key routes and hubs of high-skilled migration, providing actionable insights for policymakers and stakeholders.

The results indicate that both China and the United States exert a strong brain attraction effect on other countries and regions, drawing on countries across various regions and continents⁵⁴. The net circulation for both countries reveals that they are brain gain nations, with brain gain proportions of 54.6% and 56.1%, respectively. However, the dynamics of brain circulation differ between the two countries. China’s outflow and inflow are primarily concentrated on interactions with North America, accounting for 30.5% and 43.3% of total circulation trajectories, respectively, exhibiting a slight polarization trend⁵⁵. In contrast, the United States presents a more symmetrical global brain circulation profile, with relatively minor differences in the volume of intercontinental flows, indicating a more stable and even distribution of talent across the globe. Asia and Europe account for the largest shares, with inflow proportions of 22.6% and 19.6%, and outflow proportions of 19.4% and 16.7%, respectively, indicating no significant directional bias. The findings emphasize the necessity for targeted policies to steer brain circulation in a direction that maximizes economic and social benefits.
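The proportions reported above can be illustrated with a small aggregation over directed trajectories. This is a sketch under stated assumptions: the row layout `(origin, destination, partner_continent)`, the function name `circulation_profile`, the definition of the gain proportion as inflow divided by total trajectories, and the normalisation of continental shares within each direction are all hypothetical, since the text does not spell out the exact denominators.

```python
from collections import Counter


def circulation_profile(trajectories, country):
    """Summarise inflow and outflow of one country.

    Each row is (origin, destination, partner_continent), where
    partner_continent is the continent of the other end of the trajectory.
    """
    inflow, outflow = Counter(), Counter()
    for origin, destination, partner_continent in trajectories:
        if destination == country:
            inflow[partner_continent] += 1
        elif origin == country:
            outflow[partner_continent] += 1
    total_in, total_out = sum(inflow.values()), sum(outflow.values())
    total = total_in + total_out
    return {
        # Assumed definition: share of all trajectories that are inflows.
        "gain_share": total_in / total if total else 0.0,
        "inflow_shares": {c: n / total_in for c, n in inflow.items()},
        "outflow_shares": {c: n / total_out for c, n in outflow.items()},
    }


# Made-up trajectories for a hypothetical country code "CN".
rows = [
    ("US", "CN", "North America"),
    ("JP", "CN", "Asia"),
    ("CN", "US", "North America"),
    ("DE", "CN", "Europe"),
]
profile = circulation_profile(rows, "CN")
```

Comparing `inflow_shares` and `outflow_shares` continent by continent gives a simple view of directional bias of the kind discussed for China and the United States.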

The differences in brain circulation patterns between China and the United States may be attributed to their distinct economic, political, and cultural contexts, highlighting the need for tailored policies to address their specific brain circulation challenges and opportunities. These findings underscore the value of the GBCD corpus in informing talent policy and regional development strategies. By analysing brain circulation patterns at the regional level, policymakers can identify areas of strength and weakness and develop targeted strategies to promote regional growth⁵⁶.

Spatiotemporal dynamics of brain circulation

In addition to characterizing the static state of national brain circulation flows, the GBCD also captures the dynamic evolution of brain circulation from a time series perspective, uncovering emerging trends. We grouped the circulation data of each country by timestamps and calculated the flux between inflows and outflows, organizing the data into time series for further analysis. The flux of brain circulation in each country can be expressed as follows:

$${flux}=\frac{\mathop{\sum }\limits_{i=1}^{n}\,{D}_{i}}{\mathop{\sum }\limits_{i=1}^{n}\,{G}_{i}}\times \frac{\mathop{\sum }\limits_{i=1}^{n}{({G}_{i}-\bar{G})}^{2}}{\mathop{\sum }\limits_{i=1}^{n}{({D}_{i}-\bar{D})}^{2}}$$

(6)

where D_i and G_i represent the brain drain and brain gain of the country in year i, respectively. To prevent the flux of individual countries from fluctuating dramatically across time periods, we use the variance as a bias term to dampen this effect.
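Eq. 6 can be sketched in a few lines of Python. The sketch squares the deviations from the mean, reading them as the variance-like terms the text refers to, because unsquared deviations from a mean always sum to zero. The function name `flux`, the error handling, and the example series are assumptions for illustration.

```python
def flux(drain, gain):
    """Flux of one country from yearly drain (D_i) and gain (G_i) series."""
    if len(drain) != len(gain) or not drain:
        raise ValueError("drain and gain must be non-empty and equally long")
    mean_d = sum(drain) / len(drain)
    mean_g = sum(gain) / len(gain)
    # Assumption: squared deviations (variance-like spread) for each series.
    spread_d = sum((d - mean_d) ** 2 for d in drain)
    spread_g = sum((g - mean_g) ** 2 for g in gain)
    if sum(gain) == 0 or spread_d == 0:
        raise ValueError("flux is undefined for zero total gain or constant drain")
    # Ratio of totals, scaled by the ratio of spreads.
    return (sum(drain) / sum(gain)) * (spread_g / spread_d)


# Made-up yearly counts for a hypothetical country.
example = flux([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The two guard clauses make the undefined cases explicit, which matters when the series for a country is short or flat.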

By analysing the flux of the top ten countries with the highest total brain circulation, we obtained the temporal evolution trends of brain circulation from 2000 to 2024 (Fig. 7). Our analysis reveals a lack of correlation between the total amount of brain circulation and the change in the flux of drain and gain. Notably, the United States, which exhibits the most active brain circulation dynamics, maintains a relatively balanced inflow and outflow, resulting in a stable flux that fluctuates within a narrow range of 0.55 to 1.33. In contrast, countries like France and Japan experience more significant fluctuations in brain circulation due to an imbalance in the circulation direction. For instance, Japan’s index peaked at 4.52 in 2010, representing a level two to three times higher than that during the period of downward trend. This suggests that policymakers should consider the dynamic evolution of brain circulation when designing policies to attract and retain high-skilled individuals.

Notably, our analysis exposes a precipitous decline in global brain circulation flux around 2020, with the aggregate flux indicator plummeting from 17.88 to 6.83 over a two-year period. This drastic reduction coincides with the onset of the COVID-19 pandemic, suggesting a significant disruption to brain circulation and labor transfer⁵⁷. To contextualize this finding, we examined changes in brain circulation flux during other Public Health Emergencies of International Concern (PHEIC) in the 21st century, such as the SARS and H1N1 outbreaks. Our results show that during each PHEIC, brain circulation flux either declined or stabilized, with no instances of increase. By comparing the temporal changes in flux across different countries, we observe that national brain circulation flux exhibits varying degrees of sensitivity to PHEIC. Moreover, countries that maintain a long-term balance between brain drain and gain tend to perform better in responding to PHEIC events, with minimal disruptions to their flux indices⁵⁸. For example, the brain circulation flux of China remained relatively stable during the COVID-19 pandemic. This suggests that countries should develop targeted strategies to manage brain circulation, taking into account their unique drain and gain dynamics⁵⁹.
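To put the reported drop in proportion, the relative change between the two aggregate flux values quoted above can be computed directly. The helper name `relative_change` is an assumption; the two inputs are the figures given in the text, and the result works out to a decline of roughly 62%.

```python
def relative_change(before, after):
    """Return the relative change from `before` to `after` as a fraction."""
    return (after - before) / before


# Aggregate flux values quoted in the text for the period around 2020.
decline = relative_change(17.88, 6.83)
print(f"{decline:.1%}")
```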

The study highlights the importance of considering the impact of significant international events on brain circulation patterns, particularly the dynamic changes that occur in response to such events. The temporal dynamics of the GBCD are sensitive to real-world events, and the convergence and divergence of trends exhibit lag effects, which can affect the results differently at different time scales. This supports both the spatiotemporal scale of the GBCD corpus and its mapping to real-world phenomena, reflecting the robustness and quality of the corpus. Moreover, the framework is designed to update the brain circulation paradigms with each new version of the corpus, keeping them relevant to and interpretable against current global trends. This enables policymakers and researchers to stay informed about the latest developments in brain circulation and make data-driven decisions to address the associated challenges.

Usage Notes

The GBCD corpus enables the comprehensive assessment and characterization of global brain circulation, facilitating planning and analysis at the national and geographic levels. To ensure high data quality and extensive geographic coverage, specific names, materials, and map layouts have been employed. It is essential to note that these choices do not imply any endorsement or stance by the authors or their respective countries regarding the legal status of any nation, territory, or region. Additionally, the depiction of borders and boundaries on the maps is purely indicative and does not signify formal recognition or acceptance by the publisher. The maps and database are intended to provide a neutral representation of geographic information, and any interpretation or inference of political boundaries or affiliations is explicitly excluded.

Ethical approval

Not applicable as this study did not involve human participants.

Informed consent

This study does not contain any studies with human participants performed by any of the authors.

Consent to participate

All the authors have approved this submission.

Consent for publication

All the authors have approved publication.

Code availability

All code, data, and tools used in this study are openly available on GitHub at https://github.com/Computational-social-science. The repository includes entity extraction algorithms for narrative text and fine-tuning inference methods for LLMs, which can be accessed, referenced, and modified by the research community.

References

Lane, R. E. The decline of politics and ideology in a knowledgeable society. Am. Sociol. Rev. 31, 649–662 (1966).
Stehr, N. Societal transformations, globalisation and the knowledge society. Int. J. Knowl. Learn. 3, 139–153 (2007).
Kerr, S. P., Kerr, W., Ozden, C. & Parsons, C. Global talent flows. J. Econ. Perspect. 30, 83–106 (2016).
Wible, B. Reservoir of foreign talent. Science. 356, 694 (2017).
Anniste, K. & Tammaru, T. Ethnic differences in integration levels and return migration intentions: A study of Estonian migrants in Finland. Demogr. Res. 30, 377–412 (2014).
Carling, J. & Pettersen, S. V. Return migration intentions in the integration–transnationalism matrix. Int. Migr. 52, 13–30 (2014).
Carling, J. & Erdal, M. B. Return migration and transnationalism: How are the two connected? Int. Migr. 52, 2–12 (2014).
de Haas, H. & Fokkema, T. The effects of integration and transnational ties on international return migration intentions. Demogr. Res. 25, 755–782 (2011).
King, R. & Raghuram, P. International student migration: Mapping the field and new research agendas. Popul. Space Place 19, 127–137 (2013).
Docquier, F., Lohest, O. & Marfouk, A. Brain drain in developing countries. World Bank Econ. Rev. 21, 193–218 (2007).
Ushkalov, I. G. & Malakha, I. A. The “Brain Drain” as a global phenomenon and its characteristics in Russia. Russ. Soc. Sci. Rev. 42, 79–95 (2001).
Hawelka, B. et al. Geo-located Twitter as proxy for global mobility patterns. Cartogr. Geogr. Inf. Sci. 41, 260–271 (2014).
Pötzschke, S. & Braun, M. Migrant sampling using Facebook advertisements: A case study of polish migrants in four European countries. Soc. Sci. Comput. Rev. 35, 633–653 (2017).
Kraemer, M. U. G. et al. Mapping global variation in human mobility. Nat. Hum. Behav. 4, 800–810 (2020).
Zurbarán, M. A. et al. An evaluation framework for assessing the impact of location privacy on geospatial analysis. IEEE Access 8, 158224–158236 (2020).
Alamri, S. The geospatial crowd: emerging trends and challenges in crowdsourced spatial analytics. ISPRS Int. J. Geo-Information 13, 168 (2024).
Willekens, F., Massey, D., Raymer, J. & Beauchemin, C. International migration under the microscope. Science. 352, 897–899 (2016).
Abel, G. J. & Sander, N. Quantifying global international migration flows. Science. 343, 1520–1522 (2014).
Cui, Z. et al. DyGCN: Efficient dynamic graph embedding with graph convolutional network. IEEE Trans. Neural Networks Learn. Syst. 35, 4635–4646 (2024).
Bell, M. et al. Cross-national Comparison of Internal Migration: Issues and Measures. J. R. Stat. Soc. Ser. A (Statistics Soc.) 165, 435–464 (2002).
Ilyinova, E. & Kochetova, L. Diachronic perspective in text and discourse studies: Review of approaches. Vestn. Volgogr. Gos. Univ. Ser. 2. Jazyk. 15, 18–25 (2016).
Pearce, N., Weller, M., Scanlon, E. & Ashleigh, M. Digital scholarship considered: How new technologies could transform academic work. Educ. 16, 33–44 (2010).
Camacho-Collados, J. & Pilehvar, M. T. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res. 63, 743–788 (2018).
Kiperwasser, E. & Goldberg, Y. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Trans. Assoc. Comput. Linguist. 4, 313–327 (2016).
Tian, Y., Song, Y. & Xia, F. Enhancing structure-aware encoder with extremely limited data for graph-based dependency parsing. Proc. 29th Int. Conf. Comput. Linguist. 29, 5438–5449 (2022).
Gómez-olmos, B. J. Part-of-speech tagging with rule-based data preprocessing and transformer. Electronics 2, 113–120 (2022).
Zheng, Y. et al. Large language models for medicine: a survey. Int. J. Mach. Learn. Cybern. 18 (2024).
Hu, Y. et al. Improving large language models for clinical named entity recognition via prompt engineering. J. Am. Med. Informatics Assoc. 31, 1812–1820 (2024).
Qiu, Y. Unveiling the spatiotemporal dynamics of global brain circulation: A comprehensive corpus (2000–2024). figshare. Dataset. https://doi.org/10.6084/m9.figshare.28031471 (2024).
Zhu, W. et al. Multimodal C4: An open, billion-scale corpus of images interleaved with text. Adv. Neural Inf. Process. Syst. 36 (2023).
Azamfirei, R., Kudchadkar, S. R. & Fackler, J. Large language models and the perils of their hallucinations. Crit. Care 27, 1–2 (2023).
Hu, E. et al. Lora: Low-Rank Adaptation of Large Language Models. ICLR 2022 - 10th Int. Conf. Learn. Represent. 1–26 (2022).
Kaminska, O. & Lynn, P. Survey-based cross-country comparisons where countries vary in sample design: Issues and solutions. J. Off. Stat. 33, 123–136 (2017).
Molnar, C. et al. General pitfalls of model-agnostic interpretation methods for machine learning models. Lect. Notes Comput. Sci. 1320, 39–68 (2022).
Raiaan, M. A. K. et al. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access 12, 26839–26874 (2024).
Jiang, D., Ren, X. & Lin, B. Y. LLM-blender: Ensembling large language models with pairwise ranking and generative fusion. Proc. Annu. Meet. Assoc. Comput. Linguist. 1, 14165–14178 (2023).
Jiang, Z., Xu, F. F., Araki, J. & Neubig, G. How can we know what language models know? Trans. Assoc. Comput. Linguist. 8, 423–438 (2020).
Dagdelen, J. et al. Structured information extraction from scientific text with large language models. Nat. Commun. 15, 1418 (2024).
Younes, Y. & Scherp, A. Question answering versus named entity recognition for extracting unknown datasets. IEEE Access 11, 92775–92787 (2023).
Zhang, X. et al. ∞ bench: Extending long context evaluation beyond 100k tokens. Proc. Annu. Meet. Assoc. Comput. Linguist. 6, 15262–15277 (2024).
Zeng, J., Xiong, D. & Liu, Y. A hierarchy-to-sequence attentional neural. IEEE/ACM Trans. Audio, Speech, Lang. Process. 26, 623–632 (2018).
Akbaritabar, A., Theile, T. & Zagheni, E. Bilateral flows and rates of international migration of scholars for 210 countries for the period 1998–2020. Sci. Data 11, 1–14 (2024).
Mu, H. et al. A global record of annual terrestrial Human Footprint dataset from 2000 to 2018. Sci. Data 9, 176 (2022).
Sugihara, G. et al. Detecting causality in complex ecosystems. Science. 338, 496–500 (2012).
Frank, M. R. et al. Detecting reciprocity at a global scale. Sci. Adv. 4, 1–7 (2018).
Niva, V. et al. World’s human migration patterns in 2000–2019 unveiled by high-resolution data. Nat. Hum. Behav. 7, 2023–2037 (2023).
Tin, T. et al. Impacts of local human activities on the Antarctic environment. Antarct. Sci. 21, 3–33 (2009).
Bailey, A. & Mulder, C. H. Highly skilled migration between the Global North and South: gender, life courses and institutions. J. Ethn. Migr. Stud. 43, 2689–2703 (2017).
Barber, R. M. et al. Estimating global, regional, and national daily and cumulative infections with SARS-CoV-2 through Nov 14, 2021: a statistical analysis. Lancet 399, 2351–2380 (2022).
Proctor, E. K. & Geng, E. A new lane for science. Science. 374, 659–659 (2021).
Adesote, S. A. & Osunkoya, O. A. The brain drain, skilled labour migration and its impact on Africa’s development, 1990s–2000s. Africology J. Pan African Stud. 12, 395–420 (2018).
Pellegrino, A. Trends in Latin American skilled migration: “brain drain” or “brain exchange”? Int. Migr. 39, 111–132 (2001).
Birt, M., Wallis, T. & Winternitz, G. Talent retention in a changing workplace: An investigation of variables considered important to South African talent. South African J. Bus. Manag. 35, 25–32 (2004).
Li, W., Bakshi, K., Tan, Y. & Huang, X. Policies for recruiting talented professionals from the diaspora: India and China compared. Int. Migr. 57, 373–391 (2019).
Yuping, M. A. & Suyan, P. A. N. Chinese returnees from overseas study: An understanding of brain gain and brain circulation in the age of globalization. Front. Educ. China 10, 306–329 (2015).
Peri, G. Skills and talent of immigrants: a comparison between the European Union and the United States. Inst. Eur. Stud. 15, 250–260 (2013).
Sah, R. P. et al. Impact of water deficit stress in maize: Phenology and yield components. Sci. Rep. 10, 1–15 (2020).
Lee, J. Y., Yahiaoui, D., Lee, K. P. & Cooke, F. L. Global talent management and multinational subsidiaries’ resilience in the Covid-19 crisis: Moderating roles of regional headquarters’ support and headquarters–subsidiary friction. Hum. Resour. Manage. 61, 355–372 (2022).
Chamie, J. International migration amid a world in crisis. J. Migr. Hum. Secur. 8, 230–245 (2020).

Acknowledgements

The work was supported by the Natural Science Foundation of Zhejiang Province (LZ21F020004) and the Major Project of Digital and Cutting-edge Disciplines Construction, Zhejiang Gongshang University (SZJ2022B007).

Author information

Authors and Affiliations

School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, 310018, China
Zhiwen Hu, Yang Qiu, Haihua Jiang, Xiao Ma, Lv Han, Saihua Lei & Haojia Niu
Collaborative Innovation Center of Computational Social Science, Zhejiang Gongshang University, Hangzhou, 310018, China
Zhiwen Hu
Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, 310018, China
Zhiwen Hu

Contributions

Z.W.H. conceived of the research and supervised the project. Z.W.H. and Y.Q. performed the experiments and analysed the data. Z.W.H. and Y.Q. wrote the manuscript. All authors discussed the results and commented on the manuscript.

Corresponding author

Correspondence to Zhiwen Hu.

Ethics declarations

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Hu, Z., Qiu, Y., Jiang, H. et al. Unveiling the Spatiotemporal Dynamics of Global Brain Circulation: A Comprehensive Corpus (2000–2024). Sci Data 12, 938 (2025). https://doi.org/10.1038/s41597-025-05268-2

Received: 30 December 2024
Accepted: 22 May 2025
Published: 04 June 2025
Version of record: 04 June 2025
DOI: https://doi.org/10.1038/s41597-025-05268-2
