Abstract
The global competition for human capital is fuelled by intricate brain circulation dynamics, where individuals with specialized skills traverse geographic, organizational, and national boundaries to address workforce demands. However, a comprehensive framework for integrating and interpreting heterogeneous data on global brain circulation remains elusive. Here we introduce the Global Brain Circulation Dynamics (GBCD) corpus, a longitudinally integrated repository of geo-information encompassing 223 countries/regions from 2000 to 2024. Garnered from diachronic narrative texts, the GBCD corpus provides granular insights into transnational brain circulation patterns and their interconnections with sociocultural progress. Continuously updated to reflect spatiotemporal dynamics, the GBCD corpus serves as a definitive reference for real-time and ex-post analysis of global brain circulation. Our analysis reveals two pivotal findings: (i) narrative brain circulation closely mirrors physical brain mobility, and (ii) geopolitical relations and spatiotemporal dynamics exhibit distinct patterns across countries/regions. The GBCD corpus establishes a novel benchmark for examining spatiotemporal brain circulation worldwide, empowering policymakers to develop evidence-based strategies for attracting and retaining human capital in a rapidly evolving global landscape.
Background & Summary
In contemporary knowledge-based economies1,2, brain circulation has emerged as a pressing concern in the global competition for human capital3. The phenomenon of brain circulation refers to the complex dynamics of skilled individuals traversing geographic, organizational, or national boundaries to address workforce demands and facilitate the redistribution of expertise. Highly skilled individuals, including researchers, professionals, and practitioners, play a pivotal role in consolidating existing information and disseminating knowledge across various fields4. Unlike the traditional concept of brain drain, brain circulation emphasizes the reciprocal and often cyclical nature of talent movement, where individuals may relocate multiple times throughout their careers, creating multi-directional knowledge flows between regions. Recognizing the significance of brain circulation, research in this area has experienced rapid growth, encompassing a wide range of topics, including integration and return intentions5,6, transnational networks and practices7,8, and international students9. However, existing studies have largely adopted a unilateral perspective, defining developing countries as source countries and economically developed countries as destination countries10. This approach has resulted in a limited understanding of the bilateral relationships and reciprocal dynamics in brain circulation. Furthermore, while numerous studies have conducted in-depth analyses of brain drain and gain flows to specific countries, they have not explored the global linkages and interconnectedness between countries in terms of brain circulation11. This oversight has left a significant knowledge gap in our understanding of how different countries and regions at various development levels have successfully attracted, retained, and developed highly skilled individuals, underscoring the need for a more nuanced and comprehensive approach to brain circulation research.
Recent research has predominantly relied on geospatial metadata analysis to uncover patterns of brain circulation. Geospatial metadata can be obtained from large-scale sources, including social networks, such as X (formerly Twitter)12 and Facebook13, and mobile devices14 offering Location-Based Services (LBS). However, this approach has significant limitations. While geospatial metadata analysis offers the advantage of high-precision location information, which can be valuable for studying brain circulation patterns, it also raises concerns about personal privacy15. The collection of location information can be sensitive, and users may be hesitant to share this geospatial information. Furthermore, geospatial metadata analysis only superficially captures circulation between countries and regions as changes in signal points, neglecting the relevant attributes of the circulating individuals. Consequently, conclusions drawn from geospatial metadata analysis cannot be accurately applied to discussions about highly skilled individuals16.
Alternatively, the empirical indicator analysis method is a commonly employed approach in brain circulation research. For instance, a recent study utilized the Organisation for Economic Co-operation and Development (OECD)17 international migration database to map the annual dynamics of the global brain circulation. While this method offers diverse features compared to single signal changes, it has significant limitations when applied to brain circulation research. A major limitation of the empirical indicator analysis method is its inability to effectively handle unquantifiable features, such as geographic entities. Geographic entities, including countries, regions, and cities, are complex and multifaceted, making it challenging to quantify their characteristics and dynamics. As a result, this method may oversimplify or neglect the nuances of geographic entities, leading to incomplete or inaccurate conclusions18. Additionally, it is influenced by the prolonged time costs and intervals associated with empirical surveys, limiting its ability to reflect the evolving dynamics of brain circulation in a timely manner19.
The limitations of geospatial metadata analysis and empirical indicator analysis underscore the need for innovative approaches to analysing global brain circulation. As a complex sociocultural phenomenon, brain circulation can be measured in three principal ways: as a behaviour, as a transition, or as a duration20. However, these measures are not directly comparable, highlighting the need for a more nuanced approach that can capture the intricacies of brain circulation. To address these challenges, we propose a novel approach that conceptualizes brain circulation as collective behaviours of highly skilled individuals in diachronic narrative circulation texts21. Narrative circulation refers to the textual representation and documentation of talent mobility in various sources (academic papers, news articles, institutional reports, etc.). This concept captures how information about brain circulation is communicated, framed, and disseminated through language. By leveraging large language models (LLMs), we develop a diachronic analysis method that can effectively capture the complexities of brain circulation and provide a more comprehensive understanding of this phenomenon. Diachronic narrative text provides a unique advantage in tracing long-term dynamics due to its capacity to document information over time22, rendering it an ideal source for extensive spatiotemporal research. Furthermore, diachronic narrative text has been successfully applied to various tasks, including word frequency analysis23 and syntactic parsing24. However, when dealing with complex tasks, conventional diachronic narrative text processing methods, such as dependency parsing25 and part-of-speech tagging26, often fail to capture the intricate relationships between entities, events, and concepts due to their limited ability to understand the context. 
To overcome these limitations, we employ LLMs, which excel at comprehending semantic relationships in context and are effective in generating outputs (“responses”) from given inputs (“prompts”). LLMs have been demonstrated to be effective in extracting structured information from diachronic narrative texts27, including tasks such as named entity recognition (NER) and relation extraction (RE)28. By harnessing the power of LLMs, we can develop a more sophisticated understanding of brain circulation and its underlying dynamics.
Building on the LLM-based diachronic analysis method, we developed the Global Brain Circulation Dynamics (GBCD) corpus by systematically organizing the extracted features related to transnational brain circulation29. The GBCD corpus enables the reconstruction of transnational circulation behaviours by mapping geographic entities in diachronic narrative texts at the national level. By leveraging the semantic meaning of circulation-related words, the corpus distinguishes between the origin and destination, providing a bilateral description of brain circulation. Spanning a prolonged period from 2000 to 2024, with continuous updates at the forefront of temporal developments, the GBCD corpus serves as a hallmark reference for brain circulation, allowing for real-time or ex-post profiling. To extract features from diachronic narrative texts, we employ a preliminary text classification step and locate keywords that describe the transferred subjects, focusing on skilled individuals. In subsequent fine-tuning of the LLMs, we emphasize attention bias towards skilled individuals in the prompt descriptions, thereby standardizing the research object. This approach ensures that the extracted features are relevant to the analysis of brain circulation and enables the GBCD corpus to provide a comprehensive understanding of this complex phenomenon.
Through a rigorous multifaceted data mining approach on the GBCD corpus, our study reveals several groundbreaking findings that shed new light on the complex dynamics of brain circulation. Specifically, we uncover four key insights that significantly advance our understanding:
(i) A paradigmatic mapping relationship is observed between the narrative and real-world patterns, indicating a universal connection between narrative and physical brain circulation. This finding highlights the synergistic effects of brain circulation, where virtual and physical patterns converge to shape the global brain circulation landscape.
(ii) The GBCD corpus reveals that the discursive power of the Global North manifests in the dissemination and discussion of brain circulation topics in the network. This finding underscores the influence of the Global North in shaping the global brain circulation narrative, emphasizing the need for more nuanced and inclusive approaches to understanding brain circulation.
(iii) Countries and regions with similar geographical heterogeneity exhibit specific patterns in their geographical conditions, highlighting the importance of considering geographical factors in understanding brain circulation dynamics. This finding underscores the need for more spatially aware approaches to brain circulation research.
(iv) Major international events have a significant impact on brain circulation, with a lagging dynamic trend in the network. The timing of these events can be linked to the underlying societal factors that affect brain circulation, emphasizing the need for more temporally aware approaches to brain circulation research.
Ultimately, our research demonstrates the benefits of applying scientifically informed insights to evidence-based policy making. By leveraging the GBCD corpus, policymakers can gain a deeper understanding of the complex dynamics underlying brain circulation and make informed decisions to address the challenges. This study highlights the potential of data-driven approaches to inform policy and promote more effective brain circulation strategies.
Methods
To construct brain circulation patterns from diachronic narrative texts, it is essential to select an appropriate text processing method that can accurately identify the brain circulation features in the texts. However, relying solely on the part-of-speech and structure of tokens can be insufficient for analysing the hidden brain circulation information in the texts. Our research identifies a type of text that requires contextual understanding to analyse brain circulation information, which we refer to as implicit circulation texts. These texts typically lack characteristic narrative expressions about brain circulation, such as personal initiative behaviour, national geographic entity, or obvious behaviour indicators, presenting a significant challenge to Natural Language Processing (NLP) methods relying on programmatic condition evaluations rather than semantic understanding.
Methodological workflow
To effectively extract brain circulation information, we develop a specialized ensemble of LLMs tailored to this task. We employ a two-stage approach, involving information construction and structural fine-tuning of the models, to enable task-specific extraction of brain circulation behaviours from diachronic narrative texts (Fig. 1).
Methodological workflow for task-specific feature extraction and refinement of LLMs. The workflow utilizes narrative datasets as input, applying LLM inference to derive comprehensive brain circulation features. The ChatGPT shaped label represents a typical LLM. The conditions in case studies are used to illustrate the significant difference between explicit and implicit circulation in circulation texts, specifically whether the semantics can directly reflect a cross-border circulation behavior with person as the theme, informing the underlying motivations of LLMs. Targeted prompts and responses that meet predefined requirements are then used to augment the task-specific adaptation and reasoning capabilities of LLMs, allowing fine-tuned models to accurately identify entities in narrative texts and establish meaningful connections between them and brain circulation features.
We differentiated diachronic narrative texts into explicit and implicit circulation texts and demonstrated the superiority of LLMs on both (Fig. 1). Using a narrative text from the C4 dataset30 as an example, LLMs, owing to their built-in knowledge and semantic analysis capabilities, can effectively complete tasks such as geographic entity mapping and brain circulation subject analysis. To further improve the performance of LLMs on these tasks, we selected prompts adapted to the brain circulation extraction task to guide the responses of LLMs, making them sensitive to task-specific extraction and reducing the occurrence of hallucinations31. We then filtered the generated text of the LLMs for responses containing the necessary brain circulation features and used them with the LoRA (Low-Rank Adaptation) method32 to further fine-tune the LLMs, enabling them to output structured data.
Inference result validation
To ensure the accuracy and reliability of our brain circulation feature extraction, we employ a multi-faceted approach that considers multiple essential features and mitigates potential biases, including feature selection, timestamp extraction and standardization (Fig. 2). We identify origin and destination as crucial features for positive samples, as they provide critical context for understanding brain circulation patterns. We extract timestamps for brain circulation events to capture the temporal dynamics of this phenomenon. To ensure consistency and precision, we standardize timestamps to the YYYY-MM format, allowing for a monthly granular analysis. Furthermore, recognized geographic entities (e.g. cities, organization names, landmarks, etc.) are normalized and mapped to the country level for extraction. Our national geographic divisions adhere to methods endorsed by the United Nations Statistics Division for international statistical data collection, ensuring consistency and compatibility with global standards33.
Methodological framework for integrating named entity recognition (NER) and relation extraction (RE) in structural alignment. This framework outlines a systematic approach for identifying brain circulation behaviors in narrative texts, enabling the distinction between positive and negative responses. The workflow involves three key steps: (1) NER-based identification and extraction of relevant entities, including origin and destination, from narrative texts; (2) RE-facilitated extraction of temporal and spatial relationships between identified entities; and (3) structural alignment of extracted entities and relationships with corresponding timestamps, allowing for the differentiation of diachronic and synchronic data.
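The timestamp standardization and country-level mapping described above can be illustrated with a minimal sketch. It is not the pipeline used to build the corpus: the lookup table `ENTITY_TO_ISO3`, the accepted date formats, and the function names are assumptions chosen for illustration, and a real system would need a full gazetteer aligned with the United Nations Statistics Division country list.

```python
import re
from typing import Optional

# Hypothetical lookup from recognized entities (cities, organizations,
# landmarks) to ISO3 country codes; entries are illustrative only.
ENTITY_TO_ISO3 = {
    "boston": "USA",
    "mit": "USA",
    "shanghai": "CHN",
    "eiffel tower": "FRA",
}

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}


def to_year_month(raw: str) -> Optional[str]:
    """Normalize a loosely formatted date string to YYYY-MM, or return None."""
    text = raw.strip().lower()
    # Numeric forms such as 2015-03-21, 2015/3 or 2015.03
    match = re.match(r"^(\d{4})[-/.](\d{1,2})(?:[-/.]\d{1,2})?$", text)
    if match:
        year, month = int(match.group(1)), int(match.group(2))
        return f"{year:04d}-{month:02d}" if 1 <= month <= 12 else None
    # Textual forms such as "March 2015" or "21 March 2015"
    match = re.search(r"([a-z]+)\s+(\d{4})$", text)
    if match and match.group(1) in MONTHS:
        return f"{int(match.group(2)):04d}-{MONTHS[match.group(1)]:02d}"
    return None


def to_country(entity: str) -> Optional[str]:
    """Map a recognized geographic entity to an ISO3 code, or return None."""
    return ENTITY_TO_ISO3.get(entity.strip().lower())
```

Returning `None` for unparseable dates or unknown entities keeps the decision about discarding a sample outside the normalization step.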
To minimize bias and uncertainty in responses, we align the extraction results from different models within the LLMs ensemble. This is necessary because variation in training data, architectures, and randomness can lead to differing outputs from multiple models, making reliance on a single model potentially risky34. By ignoring valuable insights from other models, a single model may introduce errors or biases that can impact the accuracy of feature extraction35. To address this challenge, we employ ensemble methods to combine the predictions from multiple models. Recent studies on multitask prompted training have shown that aligning outputs across tasks within a single model ensures consistency and accuracy36. In our approach, we take a conservative stance by only retaining outputs that are fully consistent across all models, discarding any discrepancies. This strict filtering reduces semantic bias and enhances the accuracy of feature extraction, as different models may emphasize various aspects of the text37.
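The conservative retention rule, keeping an output only when every model in the ensemble agrees, can be sketched as follows. The tuple layout `(origin, destination, timestamp)` and the model names in the usage note are assumptions for illustration; the source does not specify how agreement is compared in practice.

```python
from typing import Dict, Optional, Tuple

# Assumed shape of one extraction: (origin, destination, timestamp).
Extraction = Tuple[str, str, str]


def consensus(outputs: Dict[str, Optional[Extraction]]) -> Optional[Extraction]:
    """Return the extraction only if every model produced the identical one."""
    values = list(outputs.values())
    # A missing output from any model counts as a discrepancy.
    if not values or any(value is None for value in values):
        return None
    first = values[0]
    return first if all(value == first for value in values) else None


def filter_corpus(
    per_text: Dict[str, Dict[str, Optional[Extraction]]]
) -> Dict[str, Extraction]:
    """Keep only the texts on which all models fully agree."""
    kept = {}
    for text_id, outputs in per_text.items():
        agreed = consensus(outputs)
        if agreed is not None:
            kept[text_id] = agreed
    return kept
```

For example, a text for which hypothetical models `model_a` and `model_b` both return `("CHN", "USA", "2015-03")` is retained, while a text on which they differ in any field is dropped. Exact equality is the strictest possible criterion and trades recall for precision.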
After completing the data extraction and alignment process, we obtain the comprehensive GBCD corpus. This multidimensional corpus encompasses a wide range of features, including country information, diachronic narrative texts, source URL, circulation timestamp and locations (Table 1). This categorization enables us to effectively organize and analyze the data, facilitating the identification of patterns and trends in brain circulation. As the diachronic narrative text is updated over time, we will dynamically inject it into the GBCD corpus to ensure the cutting-edge nature of the data. The insights derived from this corpus both contribute to a deeper understanding of global brain circulation and provide valuable support for policy formulation and international cooperation strategies. By capturing the evolving dynamics of brain circulation, we can identify key drivers, as well as regions and sectors most impacted by it. Furthermore, the temporal and spatial resolution of the data allows for a more nuanced exploration of circulation flows, revealing shifts and emerging trends that might otherwise go unnoticed. The GBCD corpus, therefore, serves as a powerful tool for both academic research and practical applications, offering a comprehensive foundation for further investigations into the global brain landscape.
Data Records
The Global Brain Circulation Dynamics (GBCD) corpus, constructed in this study, is publicly available on the Figshare repository (https://doi.org/10.6084/m9.figshare.28031471)29. The corpus captures key attributes relevant to brain circulation, including origin, destination, diachronic narrative text, URL, and timestamp (Table 1). Notably, geographic entities are mapped to the global country or region level, facilitating the analysis of transnational brain circulation. The GBCD corpus spans 223 countries and regions worldwide, encompassing 193 UN member states, one observer state, and 29 non-sovereign island territories. Each country or region is accompanied by Countrycode, ISO2, and ISO3 identifiers, enabling multidimensional organization of brain circulation data. Furthermore, we distinguish between origin and destination in geographic entities related to circulation flow, allowing for the representation of brain gain and brain drain, and providing insights into bilateral brain circulation between countries.
The GBCD corpus is a comprehensive dataset comprising 2,904,663,710 tokens, structured into two distinct corpora: diachronic and synchronic. The corpus encompasses 1,564,262 entries related to brain circulation features, with the diachronic corpus accounting for 1,111,644 entries that span a 24-year period (2000–2024). Notably, the diachronic corpus is continuously updated in real-time, ensuring the data remains current and relevant for both real-time and ex-post analyses of brain circulation. In contrast, the synchronic corpus contains 452,618 entries, deliberately excluding timestamp features to facilitate synchronic research.
To maintain data quality and integrity, we employed a rigorous data cleaning process to eliminate redundancy in narrative text and URLs, thereby mitigating the impact of duplicate news stories from multiple sources. Furthermore, geographic entities associated with brain circulation were mapped at the national level to ensure consistency and accuracy. Each data entry is accompanied by two temporal features: the brain circulation timestamp and the timestamp corresponding to the narrative text’s download from the source. These data tuples are consolidated into individual JSON files, enhancing accessibility and facilitating further analysis.
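A minimal sketch of the redundancy-elimination step is given below, assuming each record is a dictionary with hypothetical `text` and `url` fields. The normalization rules (whitespace and case folding for text, trailing-slash stripping for URLs) are simplifications chosen for illustration and are not taken from the source.

```python
import json
from typing import Dict, Iterable, List


def normalize_text(text: str) -> str:
    """Collapse whitespace and case so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop records whose narrative text or URL has already been seen."""
    seen_texts, seen_urls = set(), set()
    unique = []
    for record in records:
        text_key = normalize_text(record["text"])
        url_key = record["url"].strip().rstrip("/")
        if text_key in seen_texts or url_key in seen_urls:
            continue
        seen_texts.add(text_key)
        seen_urls.add(url_key)
        unique.append(record)
    return unique


def to_json(records: List[Dict[str, str]]) -> str:
    """Serialize the cleaned records as a JSON array."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Keeping the first occurrence preserves the earliest-seen source for a story that was syndicated across several outlets.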
Technical Validation
NER and RE performance
To evaluate the efficacy of task-specific fine-tuning for LLMs in identifying and organizing brain circulation features, we conducted a comprehensive assessment using named entity recognition (NER) and relation extraction (RE) performance tests, along with inference result validation. A random sample of 11,479 narrative texts with responses was selected for performance testing (Table 2).
The results demonstrate that task-specific fine-tuning significantly enhances the compliance rate (CR) of all models, yielding scores above 0.957. This represents a substantial improvement over the highest CR score of 0.051 achieved by the untuned models. The structure of the reasoning results is consequently largely consistent with our task requirements following fine-tuning. Furthermore, the average increase in true positive rate (TPR) scores is approximately 0.09, implying a positive optimization for all models. These findings suggest that fine-tuned LLMs can substantially improve extraction performance, which can be leveraged to enrich existing corpora and enable LLMs to generate more accurate and informed decision-making outputs. The results validate the effectiveness of task-specific fine-tuning for LLMs in recognizing and organizing brain circulation features, highlighting the potential of this approach for enhancing the accuracy and reliability of LLM-driven decision-making.
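To make the compliance rate concrete, the sketch below treats a response as compliant when it parses as a JSON object containing a required set of keys. This definition is an assumption: the source reports CR scores but does not give the formula, and the key names in `REQUIRED_KEYS` are hypothetical.

```python
import json
from typing import Iterable

# Hypothetical schema; the actual required fields are not given in the source.
REQUIRED_KEYS = {"origin", "destination", "timestamp"}


def is_compliant(response: str) -> bool:
    """True if the response is a JSON object containing all required keys."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)


def compliance_rate(responses: Iterable[str]) -> float:
    """Share of responses that satisfy the structural requirement."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(is_compliant(r) for r in responses) / len(responses)
```

Under this reading, free-text answers from an untuned model would count as non-compliant even when they mention the correct countries, which is consistent with the very low CR reported before fine-tuning.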
To assess the accuracy of brain circulation feature extraction, we randomly selected 3,174 responses from the positive samples and validated the inference results of the LLMs. We calculated the F1 scores for each relation according to the following metrics:

$$\mathrm{Recall} = \frac{\text{correct relations retrieved}}{\text{total relations in the set}}, \qquad \mathrm{Precision} = \frac{\text{correct relations retrieved}}{\text{total relations retrieved}}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where recall represents the ratio of correct relations retrieved to total relations in the set, and precision is calculated as the ratio of correct relations retrieved to total relations retrieved. Correct relations retrieved are those that are both relevant and correctly identified by the responses. The F1 score represents the harmonic mean of recall and precision, providing a balanced measure of the LLMs’ performance.
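These definitions translate directly into a short computation over sets of relations. The sketch below assumes relations are represented as `(origin, destination, timestamp)` tuples and that a relation counts as correct only on an exact match with the annotated set; both choices are illustrative.

```python
from typing import Set, Tuple

# Assumed representation of one relation.
Relation = Tuple[str, str, str]


def precision_recall_f1(
    predicted: Set[Relation], gold: Set[Relation]
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for retrieved versus annotated relations."""
    correct = len(predicted & gold)  # relevant and correctly identified
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Returning zero when a denominator is empty avoids division errors on texts with no predicted or no annotated relations.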
To assess the efficacy of our entity extraction approach, we manually annotated the entity reasoning results of diachronic narrative text based on the calculation formula and grouped the results by token size. We then computed the recall, precision, and F1 scores (Table 3). Our analysis of the manual scores reveals that the F1 scores not only meet the expected requirements of the entity extraction task but also exhibit a consistent pattern of variation with respect to token size. Notably, the optimal F1 score for each entity approximates 0.9, substantially exceeding the performance of other methods commonly employed in recent research for extracting entities with LLMs38. Furthermore, our results indicate that increasing token size has a positive impact on entity extraction performance39. The highest F1 values for each entity are observed in the control class with longer token sizes, which can be attributed to the rich contextual information present in the narrative data. This provides comprehensive prior information for LLMs to recognize entities, thereby enhancing F1 scores. These findings suggest that our entity extraction approach is effective in extracting entities from diachronic narrative text, and that increasing token size can improve performance. The results also highlight the importance of contextual information in entity recognition and demonstrate the potential of our approach for extracting entities with high accuracy.
Our analysis reveals a significant correlation between token size and entity extraction performance. Specifically, when token size exceeds 200 and continues to increase, the F1 scores of all entity groups show a significant increase of approximately 0.04. However, as token size approaches the processing limit of LLMs, the F1 scores exhibit a slight decline, resulting in an optimal F1 performance in the range of 1000–3000 tokens. This forms an inverted U-shaped curve40, suggesting that excessive contextual information may lead to redundancy and negatively impact performance. Notably, our findings highlight the potential of the hierarchical attention mechanism to alleviate the redundancy problem caused by excessive context information. By adaptively adjusting the weights of target entities and optimizing the use of context information, this mechanism can mitigate the negative impact of excessive contextual information on entity extraction performance41.
In conclusion, our performance scores demonstrate the potential of LLMs in entity recognition tasks and underscore the importance of considering token size and contextual information. Our analysis reveals that entity extraction performance follows an inverted U-shaped curve with respect to token size, highlighting the importance of token size optimization in entity extraction tasks. By optimizing these factors, we can enhance the accuracy and efficiency of entity extraction, resulting in higher-quality information.
Synergistic effects
To validate the accuracy of brain circulation patterns depicted in diachronic narrative texts, we conducted a comparative analysis with recent studies on human migration based on real-world statistical data. Our objective was to investigate the intrinsic correlations between the GBCD and these studies. We selected two studies as mapping references: the bilateral flows of international migration of scholars (IMS)42 and the global record of annual terrestrial Human Footprint (HF)43. These studies investigate circulation patterns of scholars and humans, respectively, which exhibit thematic overlap with the GBCD in terms of research subjects and topics. To facilitate a comparative analysis, we selected the United States as a representative case from each study and applied the Convergent Cross Mapping (CCM) method44,45 to quantify the corresponding time series spanning the period 2013–2020 (Fig. 3). To account for differences in temporal spans, we restricted the HF comparison to 2013–2018. This comparative analysis enabled us to explore the synergistic effects between the GBCD and the IMS and HF studies, providing insights into the accuracy and reliability of brain circulation patterns depicted in diachronic narrative texts.
Synergistic effects among the GBCD corpus, Human Footprint (HF), and international migration of scholars (IMS). (a,b) Correlation analysis using the Convergent Cross Mapping (CCM) method reveals a strong association between GBCD and IMS, with a correlation coefficient exceeding 0.9 from 2013 to 2020. In contrast, the correlation coefficient between GBCD and HF diverges significantly, dropping below 0.2, indicating no evident intrinsic relationship. (c–f) When GBCD and IMS are used as observed values for each other, their estimated value distributions exhibit a high degree of similarity and a linear pattern. However, when HF serves as the observed value, the GBCD distribution becomes highly convergent, consistent with the trends observed in the correlation coefficient. These findings highlight the synergistic effects between GBCD and IMS, while suggesting a lack of association between GBCD and HF.
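For readers unfamiliar with cross mapping, the core computation can be sketched in pure Python. This is a simplified illustration rather than the analysis code behind Fig. 3: the embedding dimension `E`, the lag `tau`, and the function names are assumptions, and the sketch reports the cross-map correlation at a single library size instead of checking convergence as the library grows.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0


def cross_map_skill(source: Sequence[float], target: Sequence[float],
                    E: int = 2, tau: int = 1) -> float:
    """Estimate `target` from the shadow manifold of `source`; return rho."""
    start = (E - 1) * tau
    times = range(start, len(source))
    # Time-delay embedding: the point at time t is
    # (source[t], source[t - tau], ..., source[t - (E - 1) * tau]).
    points = {t: [source[t - k * tau] for k in range(E)] for t in times}
    observed, estimated = [], []
    for t in times:
        # E + 1 nearest neighbours on the manifold, excluding the point itself.
        neighbours = sorted(
            (math.dist(points[t], points[s]), s) for s in times if s != t
        )[: E + 1]
        d_min = max(neighbours[0][0], 1e-12)  # guard against zero distance
        weights = [math.exp(-d / d_min) for d, _ in neighbours]
        total = sum(weights)
        estimate = sum(w * target[s] for w, (_, s) in zip(weights, neighbours))
        estimated.append(estimate / total)
        observed.append(target[t])
    return pearson(observed, estimated)
```

A full CCM analysis would repeat this over increasing library lengths and look for the correlation to rise and level off; with short annual series such as 2013–2020, the available library is small and results should be read with corresponding caution.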
Our analysis reveals a strong and persistent correlation between the GBCD and IMS: after an initial rapid ascent, the correlation coefficient (ρ) consistently exceeds 0.9, indicating a robust intrinsic connection and synergistic relationship between narrative brain circulation patterns and physical migration trends. Notably, GBCD exhibits a slight lead in the relationship, which we hypothesize may be attributed to differences in research subjects. The brain circulation patterns captured by GBCD, focusing on highly skilled individuals, including scholars, provide strong explanatory power for understanding international migration dynamics. In contrast, the correlation coefficient between HF and GBCD is substantially lower (ρ < 0.2). Moreover, when HF is used as the observed value, the estimated values of GBCD display high convergence, suggesting a lack of intrinsic correlation between the two datasets. While the narrower research timeframe of HF may contribute to this disparity, it is unlikely to be the primary cause. Instead, methodological limitations or biases may be responsible for the observed lack of correlation.
Our findings demonstrate a core intrinsic connection between GBCD and real-world migration statistics, supporting the use of the GBCD corpus to investigate brain circulation patterns and draw realistic conclusions. The significant differences in correlation coefficients between GBCD, IMS, and HF highlight the targeted focus of GBCD on highly skilled individuals, enabling the derivation of reliable conclusions about real-world brain circulation paradigms. The results show a notably higher synergy between GBCD and IMS, as compared to the synergy between talent mobility and general HF. This finding underscores that our corpus is well-targeted towards the talent group, and that scientists have a particularly strong connection with the broader category of talent.
Data mining
To ensure the diversity and comprehensiveness of data sources in the GBCD corpus, we conducted a thorough analysis of the domain categories and quantities present in narrative texts from web snapshots. Our analysis involved large-scale distribution experiments on 348,008 global domains, which were systematically categorized and ranked by continent and field (Fig. 4). This approach enabled us to identify potential biases and gaps in our data sources, informing the development of a more comprehensive and representative corpus. By examining the distribution of domains across continents and fields, we were able to assess the geographic and thematic coverage of our data sources. This analysis provides a foundation for evaluating the validity and generalizability of our findings, as well as identifying areas for future improvement and expansion of the GBCD corpus.
Domain name distribution by continents and fields. (a) Geographically, domain names are predominantly associated with North America and Asia, exhibiting a substantial gap with the underrepresented continents. Notably, comprehensive website domain names appear frequently within each group. (b) Domain name topic classification indicates that economy, politics, and culture are the most prevalent categories. Furthermore, cutting-edge domain names within these groups tend to originate from North America. The distribution reveals quantitative differences between domain names across various categories, with both inter-group and intra-group comparisons sorted by frequency of occurrence.
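Counting source domains from the URL field is the first step of such a distribution analysis. The sketch below uses only the Python standard library; the function names are illustrative, and the continent and field labels used in Fig. 4 would require an additional classification that is not shown here.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Extract a lower-cased host name, dropping a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def rank_domains(urls: Iterable[str], top: int = 10) -> List[Tuple[str, int]]:
    """Return the most frequent domains with their counts."""
    counts = Counter(d for d in map(domain_of, urls) if d)
    return counts.most_common(top)
```

URLs without a scheme yield no host name under `urlparse` and are skipped, so malformed entries do not inflate any domain's count.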
Network domain distribution
Our analysis of the network domain distribution related to brain circulation reveals a trend of Western-dominated discussion, with North America at the forefront. However, this dominance may lead to biased perspectives and conclusions. Specifically, we found that North America and Asia are the hubs of brain circulation, with the highest frequency of domains. Notably, Antarctica has a higher ranking than anticipated in terms of domain frequency46, likely due to the high-frequency brain circulation in the natural ecology field, which is widely active in Antarctica47.
Among the domains categorized by field, the economy field has the most direct impact on brain circulation. Within this field, however, cnbc.com and businessinsider.com, both based in the Global North, account for a disproportionately high percentage of domains. This suggests that the Global North is at the centre of talent circulation in terms of both continental and field distribution, which may perpetuate a Western trend in brain circulation48. Consequently, North America holds dominant discourse power in the network, shaping the global narrative and potentially perpetuating a biased perspective.
Mitigating Western discourse power
The biased perspective is particularly pronounced in studies with limited data sources, where it is difficult to overcome. A narrow research scope cannot offset the impact of Western discourse power, leading to biased conclusions and a degree of distortion in statistical data and results. To address this issue, we sampled a large and diverse set of source domains. In doing so, we aim to provide a more comprehensive and balanced understanding of brain circulation that is less affected by the dominance of Western discourse power.
Geographical heterogeneity
To further validate and characterize the differences in brain circulation from a global geographical perspective, we conducted a comprehensive analysis of geographical heterogeneity. By expanding the scope of our analysis from continents to individual countries and regions, we aim to capture the nuances of brain circulation dynamics across diverse geographical contexts. As a measure of geographical heterogeneity, we employed the Geodetector, which applies a statistical test to evaluate the significance of the difference between the means of two distributions with different variances49. For two zones z = 1 and z = 2, this distinction can be expressed as follows:
\[t=\frac{\bar{R}_{z=1}-\bar{R}_{z=2}}{\sqrt{\frac{\sigma_{\bar{R}_{z=1}}^{2}}{n_{z=1}}+\frac{\sigma_{\bar{R}_{z=2}}^{2}}{n_{z=2}}}}\]
where \({n}_{z}\) denotes the number of countries in zone z, \(\bar{{R}_{z}}\) represents the average score in zone z, and \({\sigma }_{\bar{{R}_{z}}}^{2}\) represents the variance. The statistic approximately follows Student's t distribution with degrees of freedom equal to:
\[df=\frac{\left(\frac{\sigma_{\bar{R}_{z=1}}^{2}}{n_{z=1}}+\frac{\sigma_{\bar{R}_{z=2}}^{2}}{n_{z=2}}\right)^{2}}{\frac{1}{n_{z=1}-1}\left(\frac{\sigma_{\bar{R}_{z=1}}^{2}}{n_{z=1}}\right)^{2}+\frac{1}{n_{z=2}-1}\left(\frac{\sigma_{\bar{R}_{z=2}}^{2}}{n_{z=2}}\right)^{2}}\]
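The two-zone comparison described above is a test for a difference in means under unequal variances. A minimal sketch follows, assuming the standard Welch statistic with Welch-Satterthwaite degrees of freedom and treating the variance term as the sample variance of the scores in each zone; the function name `zone_t_test` and the sample scores are hypothetical.

```python
import math
from statistics import mean, variance

def zone_t_test(scores_a, scores_b):
    """Return (t, df) for the difference in mean score between two zones.

    Assumes unequal variances (Welch form); each input is a list of
    per-country scores with at least two entries.
    """
    n_a, n_b = len(scores_a), len(scores_b)
    # Sample variance of each zone divided by its number of countries.
    se_a = variance(scores_a) / n_a
    se_b = variance(scores_b) / n_b
    t = (mean(scores_a) - mean(scores_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation: note the squared numerator.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Illustrative scores for two zones (not GBCD data).
t_stat, dof = zone_t_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

The resulting `t_stat` would then be compared against a Student's t distribution with `dof` degrees of freedom to decide whether the two zones differ significantly.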
Following the correction of a typographical error in Eq. 5 of the original article50, we recalculated the geographical heterogeneity of brain circulation using the revised Geodetector formula. The resultant quantified distribution reveals distinct patterns of brain drain and gain across countries and regions (Fig. 5). Notably, island nations exhibit elevated levels of brain circulation activity, which we attribute to their limited geographical adjacency, resulting in reduced competition and increased connectivity.
Geographical heterogeneity of national brain circulation frequency. Using the number and average distance of a country’s neighbours as weights, the geographic map depicts the distribution of transfer frequencies around the world. Notably, every continent exhibits geographical heterogeneity, with countries like South Africa, China, and the United States demonstrating exceptional performance in brain circulation, while the United Kingdom bucks the European trend of stagnation in brain gain.
Our analysis also highlights the prominent positions of China and the United States in global brain circulation, with indices exceeding 60. In stark contrast, countries with substantially lower indices are concentrated in Africa51 and South America52, emphasizing the need for targeted policy interventions to bolster talent competitiveness in these regions and mitigate the risk of brain drain. A notable exception in Africa is South Africa, which exhibits a unique geographical heterogeneity in both brain drain and gain patterns, surpassing its continental peers. We propose that South Africa’s strategic location at the southern tip of Africa contributes to its distinctive brain circulation profile53. As a critical hub for international trade, commerce, and cultural exchange, South Africa’s location may facilitate the attraction and retention of high-skilled individuals, thereby driving its exceptional brain circulation patterns.
In conclusion, these findings suggest that geographical location and its associated advantages play a crucial role in shaping brain circulation patterns. Consequently, governments and policymakers should consider these factors when designing policies to attract and retain talent. The GBCD corpus provides a valuable resource for analysing geographical heterogeneity, enabling policymakers to identify regions with lower levels of brain circulation and develop strategies to promote regional cooperation and knowledge sharing.
Transnational brain circulation network
To further elucidate the complex patterns of global interaction in brain circulation, we leveraged the GBCD to investigate the distinct tendencies of brain drain and gain in countries exhibiting significant geographical heterogeneity. Focusing on China and the United States as paradigmatic examples, we constructed a transnational brain circulation network by integrating GBCD brain circulation features with international flight data. This network analysis enables the interpretation of intricate brain circulation trajectories between these two countries and the rest of the world, providing valuable insights into the dynamics of global brain circulation (Fig. 6). By analysing brain circulation trajectories, we can identify key routes and hubs of high-skilled migration, providing actionable insights for policymakers and stakeholders.
Geographical trajectory network of transnational brain circulation. (a) China's overall brain circulation pattern is one of brain gain, but its outflow and inflow converge: both are concentrated mainly in North America, while the remainder is spread in similar proportions across the other continents. (b) The United States also shows a brain gain pattern, but its flows are more evenly distributed: on the inflow side, the United States draws strongly from Asia and Europe, yet the combined inflows from these regions do not exceed half of the total, unlike the monopolar inflow observed for China. On the outflow side, the distribution across continents is more balanced. The world map is slightly reduced in scale relative to the geographical heterogeneity map to focus on circulation between countries.
The results indicate that both China and the United States exert a strong attraction effect on other countries and regions, drawing talent from a diverse set of countries across regions and continents54. The net circulation shows that both are brain gain nations, with brain gain proportions of 54.6% and 56.1%, respectively. However, the dynamics of brain circulation differ between the two countries. China's outflow and inflow are primarily concentrated on interactions with North America, accounting for 30.5% and 43.3% of total circulation trajectories, respectively, exhibiting a slight polarization trend55. In contrast, the United States presents a more symmetrical global brain circulation profile, with relatively minor differences in the volume of intercontinental flows, indicating a more stable and even distribution of talent across the globe. Asia and Europe account for the largest shares, with inflow proportions of 22.6% and 19.6% and outflow proportions of 19.4% and 16.7%, respectively, indicating no significant directional bias. These findings emphasize the need for targeted policies that steer brain circulation in a direction that maximizes economic and social benefits.
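The proportions reported above can be illustrated with a short sketch. It assumes that the brain gain proportion is the inflow share of all circulation trajectories and that continental shares are computed within a single direction; both definitions, the function names, and the toy counts are assumptions made for illustration rather than the authors' stated procedure.

```python
def gain_proportion(inflow, outflow):
    """Share of all trajectories that are inflows (assumed definition)."""
    return inflow / (inflow + outflow)

def continent_shares(flows):
    """Convert per-continent trajectory counts into proportions."""
    total = sum(flows.values())
    return {continent: count / total for continent, count in flows.items()}

# Toy counts, not GBCD data.
proportion = gain_proportion(inflow=60, outflow=40)
shares = continent_shares({"North America": 30, "Asia": 10, "Europe": 10})

# A value above 0.5 marks a net brain gain under this assumed definition.
is_brain_gain = proportion > 0.5
```

Under this reading, a country is a brain gain nation when the proportion exceeds 0.5, and a single continent holding a large share signals the kind of polarization described for China.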
The differences in brain circulation patterns between China and the United States may be attributed to their distinct economic, political, and cultural contexts, highlighting the need for tailored policies that address each country's specific challenges and opportunities. These findings underscore the value of the GBCD corpus in informing talent policy and regional development strategies. By analysing brain circulation patterns at the regional level, policymakers can identify areas of strength and weakness and develop targeted, data-driven strategies to promote regional growth56.
Spatiotemporal dynamics of brain circulation
In addition to characterizing the static state of national brain circulation flows, the GBCD also captures the dynamic evolution of brain circulation from a time series perspective, uncovering emerging trends. We grouped the circulation data of each country by timestamps and calculated the flux between inflows and outflows, organizing the data into time series for further analysis. The flux of brain circulation in each country can be expressed as follows:
where \({D}_{i}\) and \({G}_{i}\) represent the brain drain and brain gain of the country in year i, respectively. To prevent the flux of individual countries from fluctuating dramatically across time periods, we use the variance as a bias term to reduce this impact.
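The grouping step described above can be sketched as follows. The sketch assumes a hypothetical event layout of (country, year, direction) and uses a plain gain-to-drain ratio per year as a stand-in for the flux; it does not reproduce the variance-based adjustment, whose exact form is defined by the flux expression rather than by this code.

```python
from collections import defaultdict

# Hypothetical extracted events: (country, year, direction). Not GBCD data.
events = [
    ("US", 2019, "gain"), ("US", 2019, "drain"), ("US", 2019, "gain"),
    ("US", 2020, "gain"), ("US", 2020, "drain"),
    ("JP", 2019, "gain"), ("JP", 2019, "gain"), ("JP", 2019, "drain"),
]

def yearly_ratio(events):
    """Group events by country and year, then compute gain / drain per year."""
    counts = defaultdict(lambda: {"gain": 0, "drain": 0})
    for country, year, direction in events:
        counts[(country, year)][direction] += 1
    series = defaultdict(dict)
    for country, year in sorted(counts):
        tally = counts[(country, year)]
        # Skip years with no recorded drain to avoid division by zero.
        if tally["drain"]:
            series[country][year] = tally["gain"] / tally["drain"]
    return dict(series)

series = yearly_ratio(events)
```

Each country's entry in `series` is a year-indexed time series that could then be smoothed or adjusted before plotting.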
By analysing the flux of the ten countries with the highest total brain circulation, we obtained the temporal evolution trends of brain circulation from 2000 to 2024 (Fig. 7). Our analysis reveals a lack of correlation between the total amount of brain circulation and the change in the flux of drain and gain. Notably, the United States, which exhibits the most active brain circulation dynamics, maintains a relatively balanced inflow and outflow, resulting in a stable flux that fluctuates within a narrow range of 0.55 to 1.33. In contrast, countries such as France and Japan experience larger fluctuations in brain circulation due to an imbalance in the circulation direction. For instance, Japan's index peaked at 4.52 in 2010, two to three times its level during the period of downward trend. This suggests that policymakers should consider the dynamic evolution of brain circulation when designing policies to attract and retain high-skilled individuals.
Dynamic indicators of national brain circulation flux. The streamgraph illustrates the dynamic evolution of the mobility ratio from 2000 to 2024 for the ten countries with the highest mobility frequency, ordered by total brain circulation volume. A clear hysteretic trend is observed in 2020, linked to the COVID-19 event. Using PHEIC events as reference points, we marked the start and end time of each event to reflect the general pattern of change in national transfer rates under major global epidemics.
Notably, our analysis exposes a precipitous decline in global brain circulation flux around 2020, with the aggregate flux indicator plummeting from 17.88 to 6.83 over a two-year period. This drastic reduction coincides with the onset of the COVID-19 pandemic, suggesting a significant disruption to brain circulation and labor transfer57. To contextualize this finding, we examined changes in brain circulation flux during other Public Health Emergencies of International Concern (PHEIC) in the 21st century, such as the SARS and H1N1 outbreaks. Our results show that during each PHEIC, brain circulation flux either declined or stabilized, with no instances of increase. By comparing the temporal changes in flux across different countries, we observe that national brain circulation flux exhibits varying degrees of sensitivity to PHEIC. Moreover, countries that maintain a long-term balance between brain drain and gain tend to perform better in responding to PHEIC events, with minimal disruptions to their flux indices58. For example, the brain circulation flux of China remained relatively stable during the COVID-19 pandemic. This suggests that countries should develop targeted strategies to manage brain circulation, taking into account their unique drain and gain dynamics59.
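As a worked check on the scale of the decline reported above, the relative change between the two aggregate flux values can be computed directly. The helper names and the 5% tolerance used to label a window as stable are illustrative assumptions, not values from the study.

```python
def relative_change(before, after):
    """Relative change of a flux indicator between two time points."""
    return (after - before) / before

def classify(before, after, tolerance=0.05):
    """Label an event window as declined, stable, or increased.

    The tolerance is an assumed threshold chosen for illustration.
    """
    change = relative_change(before, after)
    if change < -tolerance:
        return "declined"
    if change > tolerance:
        return "increased"
    return "stable"

# Aggregate flux values reported around 2020.
drop = relative_change(17.88, 6.83)
print(f"{drop:.1%}")  # prints -61.8%
label = classify(17.88, 6.83)
```

Applying `classify` to the flux at the start and end of each marked PHEIC window would give a simple per-country summary of sensitivity to these events.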
The study highlights the importance of considering the impact of significant international events on brain circulation patterns, particularly the dynamic changes that occur in response to such events. The temporal dynamics of the GBCD are sensitive to real-world events, and the convergence and divergence of trends exhibit certain lag effects, which can affect the results differently at different time scales. This supports both the spatiotemporal scale of the GBCD corpus and its mapping to real-world phenomena, reflecting the robustness and quality of the corpus. Moreover, the framework is designed to iterate the brain circulation paradigms with each update of the corpus version, ensuring their continued relevance and interpretability with respect to current global trends. This enables policymakers and researchers to stay informed about the latest developments in brain circulation and to make data-driven decisions that address the associated challenges.
Usage Notes
The GBCD corpus enables the comprehensive assessment and characterization of global brain circulation, facilitating planning and analysis at the national and geographic levels. To ensure high data quality and extensive geographic coverage, specific names, materials, and map layouts have been employed. It is essential to note that these choices do not imply any endorsement or stance by the authors or their respective countries regarding the legal status of any nation, territory, or region. Additionally, the depiction of borders and boundaries on the maps is purely indicative and does not signify formal recognition or acceptance by the publisher. The maps and database are intended to provide a neutral representation of geographic information, and any interpretation or inference of political boundaries or affiliations is explicitly excluded.
Ethical approval
Not applicable as this study did not involve human participants.
Informed consent
This study does not contain any studies with human participants performed by any of the authors.
Consent to participate
All the authors have approved this submission.
Consent for publication
All the authors have approved publication.
Code availability
All code, data, and tools used in this study are openly available on GitHub at https://github.com/Computational-social-science. The repository includes entity extraction algorithms for narrative text and fine-tuning inference methods for LLMs, which can be accessed, referenced, and modified by the research community.
References
Lane, R. E. The decline of politics and ideology in a knowledgeable society. Am. Sociol. Rev. 31, 649–662 (1966).
Stehr, N. Societal transformations, globalisation and the knowledge society. Int. J. Knowl. Learn. 3, 139–153 (2007).
Kerr, S. P., Kerr, W., Ozden, C. & Parsons, C. Global talent flows. J. Econ. Perspect. 30, 83–106 (2016).
Wible, B. Reservoir of foreign talent. Science 356, 694 (2017).
Anniste, K. & Tammaru, T. Ethnic differences in integration levels and return migration intentions: A study of Estonian migrants in Finland. Demogr. Res. 30, 377–412 (2014).
Carling, J. & Pettersen, S. V. Return migration intentions in the integration–transnationalism matrix. Int. Migr. 52, 13–30 (2014).
Carling, J. & Erdal, M. B. Return migration and transnationalism: How are the two connected? Int. Migr. 52, 2–12 (2014).
de Haas, H. & Fokkema, T. The effects of integration and transnational ties on international return migration intentions. Demogr. Res. 25, 755–782 (2011).
King, R. & Raghuram, P. International student migration: Mapping the field and new research agendas. Popul. Space Place 19, 127–137 (2013).
Docquier, F., Lohest, O. & Marfouk, A. Brain drain in developing countries. World Bank Econ. Rev. 21, 193–218 (2007).
Ushkalov, I. G. & Malakha, I. A. The “Brain Drain” as a global phenomenon and its characteristics in Russia. Russ. Soc. Sci. Rev. 42, 79–95 (2001).
Hawelka, B. et al. Geo-located Twitter as proxy for global mobility patterns. Cartogr. Geogr. Inf. Sci. 41, 260–271 (2014).
Pötzschke, S. & Braun, M. Migrant sampling using Facebook advertisements: A case study of Polish migrants in four European countries. Soc. Sci. Comput. Rev. 35, 633–653 (2017).
Kraemer, M. U. G. et al. Mapping global variation in human mobility. Nat. Hum. Behav. 4, 800–810 (2020).
Zurbarán, M. A. et al. An evaluation framework for assessing the impact of location privacy on geospatial analysis. IEEE Access 8, 158224–158236 (2020).
Alamri, S. The geospatial crowd: emerging trends and challenges in crowdsourced spatial analytics. ISPRS Int. J. Geo-Information 13, 168 (2024).
Willekens, F., Massey, D., Raymer, J. & Beauchemin, C. International migration under the microscope. Science 352, 897–899 (2016).
Abel, G. J. & Sander, N. Quantifying global international migration flows. Science 343, 1520–1522 (2014).
Cui, Z. et al. DyGCN: Efficient dynamic graph embedding with graph convolutional network. IEEE Trans. Neural Networks Learn. Syst. 35, 4635–4646 (2024).
Bell, M. et al. Cross-national comparison of internal migration: Issues and measures. J. R. Stat. Soc. Ser. A (Statistics Soc.) 165, 435–464 (2002).
Ilyinova, E. & Kochetova, L. Diachronic perspective in text and discourse studies: Review of approaches. Vestn. Volgogr. Gos. Univ. Ser. 2. Jazyk. 15, 18–25 (2016).
Pearce, N., Weller, M., Scanlon, E. & Ashleigh, M. Digital scholarship considered: How new technologies could transform academic work. Educ. 16, 33–44 (2010).
Camacho-Collados, J. & Pilehvar, M. T. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res. 63, 743–788 (2018).
Kiperwasser, E. & Goldberg, Y. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Trans. Assoc. Comput. Linguist. 4, 313–327 (2016).
Tian, Y., Song, Y. & Xia, F. Enhancing structure-aware encoder with extremely limited data for graph-based dependency parsing. Proc. 29th Int. Conf. Comput. Linguist. 29, 5438–5449 (2022).
Gómez-olmos, B. J. Part-of-speech tagging with rule-based data preprocessing and transformer. Electronics 2, 113–120 (2022).
Zheng, Y. et al. Large language models for medicine: a survey. Int. J. Mach. Learn. Cybern. 18 (2024).
Hu, Y. et al. Improving large language models for clinical named entity recognition via prompt engineering. J. Am. Med. Informatics Assoc. 31, 1812–1820 (2024).
Qiu, Y. Unveiling the spatiotemporal dynamics of global brain circulation: A comprehensive corpus (2000–2024). figshare. Dataset. https://doi.org/10.6084/m9.figshare.28031471 (2024).
Zhu, W. et al. Multimodal C4: An open, billion-scale corpus of images interleaved with text. Adv. Neural Inf. Process. Syst. 36 (2023).
Azamfirei, R., Kudchadkar, S. R. & Fackler, J. Large language models and the perils of their hallucinations. Crit. Care 27, 1–2 (2023).
Hu, E. et al. Lora: Low-Rank Adaptation of Large Language Models. ICLR 2022 - 10th Int. Conf. Learn. Represent. 1–26 (2022).
Kaminska, O. & Lynn, P. Survey-based cross-country comparisons where countries vary in sample design: Issues and solutions. J. Off. Stat. 33, 123–136 (2017).
Molnar, C. et al. General pitfalls of model-agnostic interpretation methods for machine learning models. Lect. Notes Comput. Sci. 1320, 39–68 (2022).
Raiaan, M. A. K. et al. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access 12, 26839–26874 (2024).
Jiang, D., Ren, X. & Lin, B. Y. LLM-blender: Ensembling large language models with pairwise ranking and generative fusion. Proc. Annu. Meet. Assoc. Comput. Linguist. 1, 14165–14178 (2023).
Jiang, Z., Xu, F. F., Araki, J. & Neubig, G. How can we know what language models know? Trans. Assoc. Comput. Linguist. 8, 423–438 (2020).
Dagdelen, J. et al. Structured information extraction from scientific text with large language models. Nat. Commun. 15, 1418 (2024).
Younes, Y. & Scherp, A. Question answering versus named entity recognition for extracting unknown datasets. IEEE Access 11, 92775–92787 (2023).
Zhang, X. et al. ∞ bench: Extending long context evaluation beyond 100k tokens. Proc. Annu. Meet. Assoc. Comput. Linguist. 6, 15262–15277 (2024).
Zeng, J., Xiong, D. & Liu, Y. A hierarchy-to-sequence attentional neural. IEEE/ACM Trans. Audio, Speech, Lang. Process. 26, 623–632 (2018).
Akbaritabar, A., Theile, T. & Zagheni, E. Bilateral flows and rates of international migration of scholars for 210 countries for the period 1998–2020. Sci. Data 11, 1–14 (2024).
Mu, H. et al. A global record of annual terrestrial Human Footprint dataset from 2000 to 2018. Sci. Data 9, 176 (2022).
Sugihara, G. et al. Detecting causality in complex ecosystems. Science 338, 496–500 (2012).
Frank, M. R. et al. Detecting reciprocity at a global scale. Sci. Adv. 4, 1–7 (2018).
Niva, V. et al. World’s human migration patterns in 2000–2019 unveiled by high-resolution data. Nat. Hum. Behav. 7, 2023–2037 (2023).
Tin, T. et al. Impacts of local human activities on the Antarctic environment. Antarct. Sci. 21, 3–33 (2009).
Bailey, A. & Mulder, C. H. Highly skilled migration between the Global North and South: gender, life courses and institutions. J. Ethn. Migr. Stud. 43, 2689–2703 (2017).
Barber, R. M. et al. Estimating global, regional, and national daily and cumulative infections with SARS-CoV-2 through Nov 14, 2021: a statistical analysis. Lancet 399, 2351–2380 (2022).
Proctor, E. K. & Geng, E. A new lane for science. Science 374, 659 (2021).
Adesote, S. A. & Osunkoya, O. A. The brain drain, skilled labour migration and its impact on Africa’s development, 1990s–2000s. Africology J. Pan African Stud. 12, 395–420 (2018).
Pellegrino, A. Trends in Latin American skilled migration: “brain drain” or “brain exchange”? Int. Migr. 39, 111–132 (2001).
Birt, M., Wallis, T. & Winternitz, G. Talent retention in a changing workplace: An investigation of variables considered important to South African talent. South African J. Bus. Manag. 35, 25–32 (2004).
Li, W., Bakshi, K., Tan, Y. & Huang, X. Policies for recruiting talented professionals from the diaspora: India and China compared. Int. Migr. 57, 373–391 (2019).
Yuping, M. A. & Suyan, P. A. N. Chinese returnees from overseas study: An understanding of brain gain and brain circulation in the age of globalization. Front. Educ. China 10, 306–329 (2015).
Peri, G. Skills and talent of immigrants: a comparison between the European Union and the United States. Inst. Eur. Stud. 15, 250–260 (2013).
Sah, R. P. et al. Impact of water deficit stress in maize: Phenology and yield components. Sci. Rep. 10, 1–15 (2020).
Lee, J. Y., Yahiaoui, D., Lee, K. P. & Cooke, F. L. Global talent management and multinational subsidiaries’ resilience in the Covid-19 crisis: Moderating roles of regional headquarters’ support and headquarters–subsidiary friction. Hum. Resour. Manage. 61, 355–372 (2022).
Chamie, J. International migration amid a world in crisis. J. Migr. Hum. Secur. 8, 230–245 (2020).
Acknowledgements
The work was supported by the Natural Science Foundation of Zhejiang Province (LZ21F020004) and the Major Project of Digital and Cutting-edge Disciplines Construction, Zhejiang Gongshang University (SZJ2022B007).
Author information
Contributions
Z.W.H. conceived of the research and supervised the project. Z.W.H. and Y.Q. performed the experiments and analysed the data. Z.W.H. and Y.Q. wrote the manuscript. All authors discussed the results and commented on the manuscript.
Ethics declarations
Competing interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Hu, Z., Qiu, Y., Jiang, H. et al. Unveiling the Spatiotemporal Dynamics of Global Brain Circulation: A Comprehensive Corpus (2000–2024). Sci Data 12, 938 (2025). https://doi.org/10.1038/s41597-025-05268-2