Background & Summary

As economic globalization and informatization progress, cities are becoming increasingly interconnected through flows of people, information, goods, and capital1,2,3. The emergence of the “space of flows” has shifted urban research from hierarchical systems toward a network-based approach that emphasizes spatial interactions between cities4,5,6. Enterprises, as key drivers of urban economic growth, play a crucial role in shaping economic connections between cities2,7,8. Constructing networks based on enterprise-enterprise linkages has become a major focus in recent urban network research. During China’s rapid urbanization in recent decades, industries have expanded significantly, and a large number of enterprises have been established annually9,10, forming increasingly close ties that reshape economic interactions and functional organization between cities. To capture this dynamic process, accurate data are urgently needed to quantitatively describe the intercity network of economic connections and to provide a scientific foundation for further research and policy development.

Traditional studies on urban networks have primarily relied on data such as population mobility, transportation flows, technology/knowledge exchanges, and logistics11,12,13,14,15,16,17. While valuable, these sources often overlook actual economic flows such as capital investments. Current approaches for constructing urban networks from enterprise data generally fall into several categories. One method uses the interlocking network model, which establishes a service value matrix between cities to construct urban networks6,18. This method, although widely used in world city network research, is complex and does not directly capture city-to-city connections19,20. Another widely used approach focuses on the headquarters-branch relationship, emphasizing the control exerted by headquarters over branch locations in different cities19,21,22,23,24. While this approach effectively represents vertical, intra-enterprise connections, it is limited in scope because it covers only enterprises with headquarters and neglects the broader spectrum of intercity economic connections.

Recently, constructing intercity investment networks (IINs) based on inter-enterprise investment activities has gained traction. These activities typically refer to cross-regional investments, where capital flows from one city to another2. By analyzing intercity investment behaviors and equity relationships, researchers can construct complex capital connection networks that represent real economic ties between cities. Inter-enterprise investments reflect not just capital flows, but also the transfer of information, technology, equipment, and human resources10,20. As such, enterprise investment data provide a more accurate picture of intercity economic linkages. Several empirical studies have explored IINs, particularly in China. For example, Li et al.2 analyzed the spatial patterns and influencing factors of the IIN in the Yangtze River Delta based on historical investment data from 3,698 listed firms. Similarly, Guo et al.25 constructed an intercity investment network for 53 coastal cities in China from 2006 to 2016. Other studies, such as Lu & Sun26, constructed national-scale networks based on headquarters-branch and inter-enterprise investments, while Zhang et al.20 focused on the influence of institutions and markets on the network structure using large-scale enterprise investment records. Additionally, some studies have explored venture capital networks in Chinese cities16,27.

Despite the increasing recognition of the importance of IIN in China, existing research still faces several limitations, primarily due to data availability and quality. Many studies either focus on developed regions such as the Yangtze River Delta, which has a high degree of marketization28, or on specific groups such as listed companies or venture capital networks, as these datasets are more easily accessible. Even the very few studies that have constructed urban networks based on large-scale enterprise investment data often lack a focus on the data itself, do not include validation, and fail to make the data publicly available. Moreover, the limited temporal scope of existing datasets restricts the ability to depict the long-term structural dynamics of IIN, especially in rapidly urbanizing regions like China. In reality, a comprehensive characterization of IIN requires micro-level data on investment activities between enterprises across all types of industries and regions20, which entails an enormous volume of data. However, due to the challenges of accessing enterprise-level data and concerns over privacy, there is currently no publicly available dataset capable of capturing the spatiotemporal dynamics of IIN. This is the critical issue that our study seeks to address.

Our study addresses these gaps by providing a publicly available and validated dataset of China’s IIN. Using a large-scale enterprise registration dataset covering the period from 2000 to 2020, we extracted 11,954,035 inter-enterprise investment records from a total of 17,273,411 enterprise entries. These enterprise-enterprise investment records were then transformed into enterprise-city connections, and ultimately aggregated into city-city investment connections. The resulting IIN represents cities as nodes, with investment frequency between cities serving as weighted edges. Our dataset includes 367 cities nationwide, with 134,666 directed and 92,994 undirected city dyads, ensuring its representativeness. The dataset consists of directed and undirected weighted IINs for every five-year interval from 2000 to 2020, alongside key network metrics such as indegree, outdegree, and node degree, and we also provide an additional version of the IIN that incorporates intracity investment. All data are publicly available for download on the Figshare platform29. To ensure data quality, we conducted several technical validations, including random sampling and manual verification with official records, comparing the dataset with urban networks derived from China’s population mobility data and economic gravity model, testing the distance decay effect of intercity investment, and cross-referencing results from previous studies.

Methods

Overview

Our workflow for establishing the IIN, illustrated in Fig. 1, comprises five main stages: (1) collecting original enterprise registration records from an enterprise registration information platform; (2) extracting records of investor and investee enterprises involved in investment activities; (3) adding information on the cities where the investor and investee enterprises are located; (4) aggregating investment frequency between enterprises to the city level and removing records where both the investor and investee enterprises belong to the same city; and (5) constructing the IIN dataset and validating it from four aspects.

Fig. 1

Workflow of this study. NECIPS: National Enterprise Credit Information Publicity System.

Data sources and cleaning

The National Enterprise Credit Information Publicity System (NECIPS) provides detailed registration information for all enterprises established in China, including rich historical information such as enterprise name, address, registered capital, industry classification, and outward investment records30. The outward investment records contain detailed information such as the name of the investor enterprise, the name of the investee enterprise, registered capital, paid-in capital, and investment date. However, due to privacy restrictions, it is no longer possible to obtain data in bulk directly from the official system. We therefore used a third-party query platform, Qichacha (www.qcc.com), to collect raw data. Qichacha is one of China’s most authoritative enterprise information platforms, having obtained certification from the People’s Bank of China for enterprise credit investigation and being officially registered with NECIPS31. Its fundamental data originate from NECIPS, where all enterprise registrations in China must be filed. Beyond this baseline, Qichacha employs technologies including big data mining and artificial intelligence to process and structure massive amounts of additional information from sources such as enterprise annual reports, bidding documents, and other official records, with automated timely updates (see http://www.ixy360.com/). The platform has accumulated registration information for over 200 million enterprises across 8,000 industries (see http://www.ixy360.com/), effectively covering nearly all registered businesses in mainland China. This results in broader coverage than NECIPS alone while maintaining the reliability of the official system, which makes the platform particularly suitable for our research purposes.

On the Qichacha platform, each enterprise has a dedicated webpage, where we can extract three key sections related to enterprise investment (see the example named “webpage example.pdf” in Figshare repository29). These three essential sections include basic registration information, outward investment records, and shareholder information. We specifically focus on enterprises with outward investment records, from which we can obtain all investee enterprise information through both current and historical outward investments, including enterprise names and place of registration. We then query the shareholder information of these investee enterprises to retrieve the actual paid-in capital amount and investment date. In summary, the integration of information across these three sections allows us to systematically construct inter-enterprise investment relationships. Section 1 provides information about the investor enterprise and its place of registration, Section 2 reveals the investee enterprise and its place of registration, and Section 3 helps us determine the investment year through paid-in capital dates. While the platform also provides investment amount data, we chose not to use it as network weights in our subsequent analysis. This decision was made because we discovered significant missing data, particularly in earlier years, and anomalous values that could potentially bias our dataset and compromise its representativeness (see detailed discussion in Usage Notes section).

Following the above procedure, we used web crawling on the Qichacha platform to collect a total of 17,273,411 enterprise registration records involving investment activities across all industries in all cities of China from 2000 to 2020 at five-year intervals. We then removed duplicate records and excluded records whose registered cities lie outside mainland China. The next task was to fill in the information on the cities where the investor and investee enterprises were registered. In our study, cities are defined based on China’s administrative divisions as of 2020, covering 367 cities, including 4 municipalities, 293 prefecture-level cities, 30 county-level cities or counties under direct provincial control, 30 autonomous prefectures, 7 regions, and 3 leagues32. While the majority of enterprises had their registration city directly available in their records (place of registration), some lacked this information. To address these cases, we implemented a three-step strategy: (1) for records with address information, we first used Python’s “cpca” package33 to extract city names; (2) if this failed, we passed the address and enterprise name to the Amap API for geocoding34 to obtain city information; (3) if both methods failed or the original data lacked an address, we manually searched for the enterprise name on the Internet to fill in the missing field. Records that could not be resolved by any of these methods were deleted. To avoid issues with city name duplication, all city data were processed using their respective unique administrative division codes (see “CityInfo.xlsx” in the data repository29). Ultimately, we obtained 11,954,035 inter-enterprise investment records, with 754,438 in 2000, 1,103,421 in 2005, 1,551,400 in 2010, 2,649,989 in 2015, and 5,894,787 in 2020. Considering privacy concerns, we provide a sample of records with anonymized enterprise names in the file “Inter-enterprise investment records (sample).xlsx”, available on Figshare29.
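The fallback strategy for missing city information can be sketched as a chain of resolvers tried in order. This is a minimal illustration under stated assumptions, not the code used to build the dataset: `resolve_city`, the three resolver functions, the record field names, and the toy lookup tables are all hypothetical stand-ins for the cpca address parser, the Amap geocoding call, and the manual search.

```python
from typing import Callable, Iterable, Optional

# A resolver takes a record (dict) and returns an administrative division
# code, or None if it cannot determine the city.
Resolver = Callable[[dict], Optional[str]]


def resolve_city(record: dict, resolvers: Iterable[Resolver]) -> Optional[str]:
    """Return the first city code found, trying each resolver in order."""
    # Most records already carry the place of registration.
    if record.get("city_code"):
        return record["city_code"]
    for resolver in resolvers:
        code = resolver(record)
        if code:
            return code
    return None  # unresolved records are dropped by the caller


# Hypothetical stand-ins for the real steps. Each one is a toy lookup so
# that the sketch runs offline; none of them calls cpca or Amap.
def parse_address(record: dict) -> Optional[str]:
    address = record.get("address") or ""
    return "310000" if "Shanghai" in address else None


def geocode(record: dict) -> Optional[str]:
    if not record.get("address"):
        return None
    known = {"Acme Trading Co.": "440300"}
    return known.get(record.get("name", ""))


def manual_lookup(record: dict) -> Optional[str]:
    return {"Beta Holdings": "110000"}.get(record.get("name", ""))


CHAIN = [parse_address, geocode, manual_lookup]
```

Returning division codes rather than city names mirrors the choice described above of keying all city data on unique administrative division codes.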

Construction of intercity investment network

The aim of this study is to construct an IIN that captures the economic connections between cities. To achieve this, we need to aggregate the inter-enterprise investments from the micro level to the city level. The microdata consist of enterprise–enterprise connections across all industries in China. A single enterprise may invest in multiple other enterprises, and the investee enterprises may be located either in the same city as the investor enterprise or in different cities. Therefore, we aggregated these data based on the city where each enterprise is registered, resulting in “enterprise-city” connections, which were then transformed into “city-city” connections. In this relationship, the strength of connection between cities is represented by the total number of investments between them, defined as investment frequency here. Since our focus is on IIN, we dropped “city-city” connections where both the investor and investee enterprises belong to the same city.
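The two-stage aggregation described above can be sketched as follows. This is an illustrative sketch only: the record layout, the function name `aggregate_to_cities`, and the assumption that each investor enterprise has exactly one registration city are ours, not taken from the released code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# One record per inter-enterprise investment (hypothetical layout):
# (investor_enterprise, investor_city, investee_enterprise, investee_city)
Record = Tuple[str, str, str, str]


def aggregate_to_cities(records: Iterable[Record],
                        keep_intracity: bool = False) -> Dict[Tuple[str, str], int]:
    """Count investments per ordered (investor_city, investee_city) pair."""
    # Stage 1: enterprise-city connections (investor enterprise -> investee city).
    enterprise_city: Counter = Counter()
    investor_home: Dict[str, str] = {}
    for investor, src_city, _investee, dst_city in records:
        enterprise_city[(investor, dst_city)] += 1
        investor_home[investor] = src_city  # assumes one registration city
    # Stage 2: collapse investor enterprises into their registration city.
    city_city: Counter = Counter()
    for (investor, dst_city), n in enterprise_city.items():
        src_city = investor_home[investor]
        if src_city == dst_city and not keep_intracity:
            continue  # drop same-city investments for the intercity network
        city_city[(src_city, dst_city)] += n
    return dict(city_city)
```

Setting `keep_intracity=True` corresponds to the additional dataset version that retains intracity investment.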

Formally, we construct the IIN by defining cities as network nodes, the connections between cities as edges, and the investment frequency between cities as the edge weights. Accordingly, we can form two types of networks, namely directed and undirected weighted IINs. In the undirected weighted network, the strength of connection between nodes (cities) is given by:

$${S}_{{ij}}={I}_{i\to j}+{I}_{j\to i}\left(i\ne j\right)$$
(1)

where \({S}_{{ij}}\) represents the connection strength, i.e., the edge weight in the undirected weighted network; \({I}_{i\to j}\) and \({I}_{j\to i}\) denote the directed investment frequency between cities i and j, which are the edge weights in the directed IIN and are directly derived from the previously mentioned aggregated “city-city” connections data. Notably, when we include the case where \(i=j\), the IIN incorporates intracity investment, which is useful for understanding the localization characteristics of the network. Therefore, in the Data Records section, we have also provided datasets that include intracity investment.
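As a small worked example of Eq. (1), the undirected weights can be derived from the directed ones by summing both directions of each dyad. The function name `to_undirected` and the dictionary layout are illustrative assumptions. When intracity pairs are present, this sketch counts the self-pair once; that is our choice for the illustration and not something the text specifies.

```python
from collections import defaultdict
from typing import Dict, Tuple


def to_undirected(directed: Dict[Tuple[str, str], int]) -> Dict[Tuple[str, str], int]:
    """Apply S_ij = I_{i->j} + I_{j->i}; keys are alphabetically ordered pairs."""
    undirected: Dict[Tuple[str, str], int] = defaultdict(int)
    for (src, dst), freq in directed.items():
        # Use one canonical key per unordered pair so both directions add up.
        key = (src, dst) if src <= dst else (dst, src)
        undirected[key] += freq
    return dict(undirected)
```

For instance, 120 investments from city A to city B and 30 from B to A give an undirected weight of 150 for the dyad (A, B).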

Based on the constructed IIN, we can calculate three essential network metrics for each city. The first metric is outdegree, which represents the total number of investments a city makes in other cities. This metric reflects the city’s ability to exert capital control over other cities in the network. The second metric is indegree, which represents the total number of investments received by a city from other cities, providing an indication of the city’s ability to attract capital in the network19. The third metric is node degree, defined as the sum of outdegree and indegree, which reflects the overall influence of a city within the IIN. These metrics are formally expressed as follows:

$${Indegree}_{i}={\sum }_{j}{N}_{j\to i}\left(i\ne j\right)$$
(2)
$${Outdegree}_{i}={\sum }_{j}{N}_{i\to j}\left(i\ne j\right)$$
(3)
$${Node}\_{degree}_{i}={Indegree}_{i}+{Outdegree}_{i}$$
(4)

where \({N}_{i\to j}\) denotes the number of investments made by city i in city j, and \({N}_{j\to i}\) represents the number of investments city i receives from city j.
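The three metrics can be computed directly from the directed edge list. The sketch below follows the definitions given in the text: outdegree counts investments a city makes in other cities, and indegree counts investments it receives from them. The function name and data layout are illustrative assumptions.

```python
from typing import Dict, Tuple


def degree_metrics(directed: Dict[Tuple[str, str], int]) -> Dict[str, Dict[str, int]]:
    """Return outdegree, indegree and node degree for every city."""
    out: Dict[str, int] = {}
    inn: Dict[str, int] = {}
    for (src, dst), freq in directed.items():
        if src == dst:
            continue  # the metrics are defined for i != j only
        out[src] = out.get(src, 0) + freq  # investments made by src
        inn[dst] = inn.get(dst, 0) + freq  # investments received by dst
    cities = set(out) | set(inn)
    return {
        c: {
            "outdegree": out.get(c, 0),
            "indegree": inn.get(c, 0),
            "node_degree": out.get(c, 0) + inn.get(c, 0),
        }
        for c in cities
    }
```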

As a result, we derived a directed IIN dataset covering 367 cities and 134,666 city dyad edges from 2000 to 2020, and we also constructed an undirected IIN dataset with 92,994 city dyad edges. The descriptive statistics of these networks are summarized in Table 1.

Table 1 Descriptive statistics of intercity investment network on investment frequency.

Characteristics of intercity investment network

Using the geographic coordinates (latitude and longitude) of each city’s centroid, we mapped the IINs onto the map of China to provide an intuitive view of the network. Here, we present only the undirected IIN to show the dynamic economic interactions between Chinese cities from 2000 to 2020 (Fig. 2). Over this period, Chinese cities have become increasingly interconnected through enterprise investments, forming a broad network of economic interactions. The overall structure shows a diamond-shaped pattern with Beijing, Shanghai, Shenzhen, and Chengdu serving as the key vertices. Moreover, the investment network has grown progressively more complex, indicating stronger economic interactions between cities over time. However, the most densely connected regions remain concentrated in a few urban agglomerations, such as Beijing-Tianjin-Hebei, the Yangtze River Delta, and the Pearl River Delta.

Fig. 2

The undirected intercity investment networks from 2000 to 2020. (a,b,c,d,e) respectively show the network of 2000, 2005, 2010, 2015 and 2020. Maps are created using ArcGIS Pro software (version 3.0.2).

To better understand the network structure, Fig. 3 visualizes the distribution of network metrics (outdegree, indegree, and node degree) across cities. It is revealed that cities with the highest node degree, such as Beijing, Shanghai, and Shenzhen, have consistently dominated the network, reflecting their leading role in China’s economic landscape. An interesting pattern emerges when examining the composition of degree metrics: large cities, especially in eastern regions, typically show higher outdegree than indegree, while smaller cities generally exhibit higher indegree than outdegree. This suggests that large eastern cities tend to function as investment initiators while smaller cities are more likely to be investment recipients.

Fig. 3

Spatial distribution of outdegree, indegree and node degree between 2000 and 2020. Maps are created using ArcGIS Pro software (version 3.0.2).

Further, based on the undirected IIN, we employed the Leiden algorithm35 for community detection to analyze city clustering patterns. As shown in Fig. 4, the analysis reveals distinct spatial communities that evolved over time. In 2000, the network exhibited 12 communities, with clear regional boundaries largely corresponding to China’s traditional economic regions. Over time, the number of communities gradually decreased to 9 by 2020, suggesting increased integration of the investment network. Notably, three major economic regions - the Beijing-Tianjin-Hebei region, Yangtze River Delta, and Pearl River Delta - consistently formed their own distinct communities throughout the study period.

Fig. 4

Community detection results using the Leiden algorithm from 2000 to 2020.

Lastly, utilizing the IIN dataset that includes intracity investment, we examined the frequency of intracity investment and outward investment for each city to assess whether a city’s investment is more localized or outward-oriented. As shown in Fig. 5, the investment networks exhibit a significant localization tendency, with intracity investment frequency substantially higher than outward investment frequency. However, we also observed that this localization phenomenon has gradually weakened over time, indicating an evolving trend toward broader geographical investment connections.
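One simple way to quantify this localization tendency is the share of each city's investments that stay within the city. The sketch below assumes a directed edge dictionary that includes intracity pairs; the function name and the share definition are ours and are not necessarily the measure plotted in Fig. 5.

```python
from typing import Dict, Tuple


def localization_share(directed: Dict[Tuple[str, str], int]) -> Dict[str, float]:
    """Share of each investor city's investments that stay within the city."""
    intracity: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for (src, dst), freq in directed.items():
        total[src] = total.get(src, 0) + freq
        if src == dst:
            intracity[src] = intracity.get(src, 0) + freq
    # Cities with no outgoing or intracity investment are omitted.
    return {city: intracity.get(city, 0) / n for city, n in total.items() if n > 0}
```

A share above 0.5 indicates that a city invests more often locally than in all other cities combined.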

Fig. 5

Comparison of intracity and outward investment frequency of IINs (2000–2020).

Data Records

The dataset we produced, along with its supporting code and additional data, can be accessed on Figshare29. The dataset consists of three parts, all stored in Excel format. The first part includes two versions of the directed IIN dataset: one excluding intracity investments (“Directed intercity investment network dataset.xlsx”) and another including intracity investments (“Directed intercity investment network dataset (including intracity investment).xlsx”). Both versions contain information such as the investor city, investee city, investment frequency, and the latitude and longitude of the cities’ centroids. The data fields and descriptions are as follows (Table 2).

Table 2 Field description of the directed intercity investment network dataset.

Similarly, the second part also provides two versions of the undirected IIN dataset: the standard version (“Undirected intercity investment network dataset.xlsx”) and the comprehensive version (“Undirected intercity investment network dataset (including intracity investment).xlsx”). Both datasets document information about the cities involved in the investment activities and the investment frequency between them. The data fields and descriptions are as follows (Table 3).

Table 3 Field description of the undirected intercity investment network dataset.

The third part is the dataset of three network metrics at the city node level, titled “Three network metric dataset.xlsx”. It includes three important network metrics: outdegree, indegree, and node degree, as well as the city rankings for these metrics across different years. The data fields and descriptions are as follows (Table 4).

Table 4 Field description of three network metrics dataset.

In addition, we provide several datasets used to support the production and validation (see Technical Validation section) of the main dataset. These include “CityInfo.xlsx” (Information of 367 cities), “Inter-enterprise investment records (sample).xlsx”, “Population mobility in 2020.xlsx”, “Urban GDP (2000–2020).xlsx” and “Investment records of specialized investment institutions in 2020.xlsx”. Detailed descriptions for each dataset can be found in the notes of the individual Excel files.

Technical Validation

To verify the reliability and validity of our dataset, we validated it in four ways. First, we performed a consistency check between the collected inter-enterprise investment records and data queried from the official NECIPS. Second, we compared the produced IIN data with urban networks derived from China’s population mobility data and from an economic gravity model. Third, we tested the distance decay effect of intercity investments. Fourth, we compared the patterns revealed by our data with findings from existing related studies.

Consistency check with official query from national enterprise credit information publicity system

Our initial data were obtained from the Qichacha platform. Although Qichacha’s data are sourced from NECIPS enterprise registration records, we conducted an additional verification by cross-checking our inter-enterprise investment records with data queried directly from NECIPS. Since NECIPS does not support bulk queries, we performed a manual, one-by-one verification process. We randomly sampled 100 records each for 2000, 2005, 2010, 2015, and 2020, resulting in a total of 500 samples. Five co-authors of this paper manually verified these records over the course of a week by entering either the investor or investee enterprise into the NECIPS query and checking the “Shareholder and Investment Information” subsection to confirm the accuracy of each investment record. Our checks show that 100% of the records that could be found in the NECIPS system matched our intercity investment data. However, we also noticed that some investment records obtained from Qichacha could not be found in the NECIPS system, suggesting that the official system may not be updated as frequently. This indicates that our data collection is more comprehensive, while still ensuring consistency with official records.
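Drawing a fixed number of records per year, as in the manual check above, amounts to stratified random sampling. The following sketch shows one way to do this reproducibly; the `year` field name, the default seed, and the function name are hypothetical, and the sampling procedure actually used may have differed.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_per_year(records: List[dict], n_per_year: int = 100,
                    seed: int = 0) -> Dict[int, List[dict]]:
    """Draw up to n_per_year records at random from each year's records."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    by_year: Dict[int, List[dict]] = defaultdict(list)
    for rec in records:
        by_year[rec["year"]].append(rec)
    return {
        year: rng.sample(recs, min(n_per_year, len(recs)))
        for year, recs in sorted(by_year.items())
    }
```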

Comparison with population mobility networks and economic gravity model

Since the dataset we produced is unique, it is challenging to find direct references for detailed comparison and validation. Therefore, we opted for an indirect validation of the dataset. IIN not only reflects economic interactions between cities but may also be related to the flow of human resources20. Thus, we validated the dataset indirectly by examining the population mobility network and the economic gravity model (albeit imperfectly). This choice is based on the following theoretical assumptions. For population mobility, the investment activity is often accompanied by the movement of labor, and it may create more job opportunities, attracting population flows36. Conversely, population mobility may also bring new investment opportunities, further promoting capital flows. Therefore, if our dataset is valid, we would expect a significant correlation and similar distribution between the IIN and the population mobility network. Regarding the economic gravity model, it assumes that the intensity of interactions between cities is proportional to their economic size (e.g., GDP)37. This suggests that cities with larger economies may interact more strongly. As a form of economic interaction, intercity investments should theoretically align with the gravity model’s expectations. If our dataset is reliable, the predicted economic interaction levels from the gravity model should exhibit similar characteristics to the IIN in terms of interaction intensity and distribution.

We obtained the population mobility data for 2020 from the Amap Population Migration Data Platform38, matching the final year of our dataset (data for earlier years are not available). We aggregated the daily migration flow index to annual values, resulting in a population mobility network between 367 cities.

To compare the IIN with the population mobility network, we employed two methods. First, we created scatter plots to compare the two datasets directly and assess the fit between them. A positive slope in the fitted curve indicates a positive correlation between the two networks. Second, following the method of Ref. 39, we used a Q-Q (quantile-quantile) plot to assess the similarity in distribution of the two datasets. If the two network metrics follow the same distribution, their data points should align along a line defined by \(y={kx}\), where x and y represent the percentiles of the two network metrics, and k is the coefficient. Due to differences in data sources and scales, we first scaled both the intercity investment data and the population mobility data using the following formula:

$${flow}\_{scaled}=\frac{{flow}-\mu }{\sigma }$$
(5)

where flow denotes the investment frequency or population mobility volume between cities, and flow_scaled is the standardized value; μ and σ are the sample mean and standard deviation, respectively. In this study, we compared both the directed and undirected networks in 2020. To account for the potential bias introduced by city distance, we also examined the correlation between investment frequency and population mobility volume across three distance intervals: 0–100 km, 100–500 km, and greater than 500 km. The 0–100 km range approximates the radius of metropolitan regions in China, while the 100–500 km range corresponds to the distance between cities within urban agglomerations. Typically, cities within a metropolitan region exhibit the strongest linkage40, followed by those within urban agglomerations. Therefore, we expect a stronger correlation between investment frequency and mobility flow within the metropolitan regions.
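The standardization in Eq. (5) and the quantile pairing behind a Q-Q plot can be sketched with the Python standard library. This is an illustrative sketch, not the authors' implementation: the function names and the default number of quantiles are assumptions, and `statistics.stdev` is used because the text refers to the sample standard deviation.

```python
import statistics
from typing import List, Sequence, Tuple


def standardize(values: Sequence[float]) -> List[float]:
    """Z-score scaling: (flow - mean) / sample standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


def qq_points(x: Sequence[float], y: Sequence[float],
              n_quantiles: int = 100) -> List[Tuple[float, float]]:
    """Pair the quantile cut points of two samples for a Q-Q plot."""
    qx = statistics.quantiles(x, n=n_quantiles)
    qy = statistics.quantiles(y, n=n_quantiles)
    return list(zip(qx, qy))
```

If the two standardized samples follow the same distribution, the points returned by `qq_points` fall close to the line y = x.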

Figure 6a,c demonstrate a significant positive correlation between investment frequency and population mobility volume (both log-transformed) in both the undirected and directed networks. The undirected network shows a better fit, with an R² of 0.33, compared to the directed network, which has an R² of 0.26. This indicates that the investment frequency between cities is closely related to population mobility, and the undirected network better captures this relationship. When considering different distance intervals, cities within shorter distances exhibit a stronger model fit between investment frequency and mobility volume. For cities within the 0–100 km range (roughly corresponding to the radius of metropolitan regions in China), the undirected network achieves an R² as high as 0.74, while the directed network shows an R² of 0.69. As the distance increases, the model fit weakens, suggesting that investment flows and population mobility are more closely linked within the scale of metropolitan regions.
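The R² of a log-log linear fit, as reported above, can be computed with ordinary least squares. The text does not spell out how scaling and the log transform were combined, so this sketch simply assumes strictly positive input values (dyads with zero values would need to be dropped beforehand); the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def loglog_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Fit log(y) = a + b * log(x) by least squares; return (slope b, R^2)."""
    lx = [math.log(v) for v in x]  # requires v > 0
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation of the logs
    return slope, r2
```

For a simple linear fit, R² equals the squared Pearson correlation, which is why the two sums of squares suffice here.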

Fig. 6

Comparison between intercity investment network and population mobility network in 2020. (a) scatter plot for undirected network. (b) Q-Q plot for undirected network. (c) scatter plot for directed network. (d) Q-Q plot for directed network.

Figure 6b,d further illustrate this relationship through Q-Q plots, where the quantiles of investment frequency and population mobility volume lie almost exactly on the line y = x. The R² values are 0.96 for the undirected network and 0.98 for the directed network, indicating a high degree of alignment between the distributions of the two datasets. Although a few data points deviate from the fitted line in the high-investment range, these outliers are few and do not significantly affect the overall data distribution. In conclusion, the positive correlation and consistent distribution between the IIN and the population mobility network support the reliability of our dataset. The data suggest that intercity investment flows are closely linked to population mobility, particularly within metropolitan regions, which further supports the robustness of the IIN dataset we constructed.

The gravity model, in its economic form, has also been widely used to estimate economic interactions between cities37,41,42,43. Specifically, the theory suggests that the economic interaction between two cities is stronger when their economic sizes are larger and their geographical distance is shorter. This can be expressed by the following formula:

$${E}_{{ij}}=k\frac{{GD}{P}_{i}\times {GD}{P}_{j}}{{D}_{{ij}}^{\beta }}$$
(6)

where \({E}_{{ij}}\) refers to economic gravity between cities i and j, mirroring the strength of economic interaction; k is a gravitational constant; \({GD}{P}_{i}\) and \({GD}{P}_{j}\) are the economic sizes of cities i and j, respectively; \({D}_{{ij}}\) is the geographical distance between the centroids of cities i and j; and β is the distance decay coefficient. To estimate parameters k and β, following existing literature39,44, we employed the particle swarm optimization (PSO) algorithm to minimize the root mean square error (RMSE) between estimated economic gravity and actual investment frequency. Specifically, we collected GDP data for the years 2000, 2005, 2010, 2015, and 2020 from the China City Statistical Yearbook. The distance between cities was calculated as the straight-line distance between their geographic centroids. Using the “pso” package in R45, we set the search ranges for k and β to 0–1 and 0–2, respectively, and obtained the parameter estimates for each year (Table 5). Based on these parameters, we calculated the economic gravity values using Eq. (6) and then computed their Pearson’s correlation with the actual investment frequency. We also calculated the correlation coefficients for different distance intervals (Table 5).
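The estimation logic of Eq. (6) can be illustrated in a few lines. Note that this sketch deliberately replaces PSO with an exhaustive grid search over the same search ranges, which is simpler to follow but is not the optimizer used in the study; the function names, grid resolution, and dyad layout are all assumptions made for illustration.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

# One dyad: (gdp_i, gdp_j, distance_km, observed_investment_frequency)
Dyad = Tuple[float, float, float, float]


def gravity(gdp_i: float, gdp_j: float, dist: float, k: float, beta: float) -> float:
    """Economic gravity E_ij = k * GDP_i * GDP_j / D_ij ** beta."""
    return k * gdp_i * gdp_j / dist ** beta


def rmse(pred: Sequence[float], obs: Sequence[float]) -> float:
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def fit_gravity(dyads: List[Dyad], k_grid: Sequence[float],
                beta_grid: Sequence[float]) -> Tuple[float, float, float]:
    """Grid search (a stand-in for PSO) returning (best RMSE, k, beta)."""
    obs = [d[3] for d in dyads]
    best = None
    for k, beta in product(k_grid, beta_grid):
        pred = [gravity(gi, gj, dist, k, beta) for gi, gj, dist, _ in dyads]
        err = rmse(pred, obs)
        if best is None or err < best[0]:
            best = (err, k, beta)
    return best


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, e.g. between fitted gravity and observed frequency."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A grid search scales poorly with resolution, which is one practical reason to prefer a heuristic optimizer such as PSO when finer parameter estimates are needed.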

Table 5 Estimation results of gravity model parameters and Pearson’s correlation between economic gravity and investment frequency.

The results show strong correlations (overall above 0.6) that increase over time (despite a slight fluctuation in 2015). Notably, the strongest correlations were observed in the 0–100 km distance range, reaching above 0.9 in 2020, providing preliminary validation of our dataset’s effectiveness. Following the analysis procedure for the IIN and population mobility networks, we examined the model fit and distributional similarity between economic gravity and investment frequency.

Figure 7a,c,e,g,i show a significant positive correlation between economic gravity and investment frequency (both log-transformed), with the R² value for the fit increasing over time, reaching 0.47 in 2020. Similar to the population mobility network, the best fit between economic gravity and investment frequency is observed in the 0–100 km distance range. In 2020, the R² for this distance range was 0.67, indicating that the IIN better mirrors economic connections between cities at shorter distances. From the Q-Q plots (Fig. 7b,d,f,h,j), we can see that the quantiles of investment frequency and economic gravity are almost perfectly aligned along the y = x line across all years, with only a very few outliers at high values. The R² for all years exceeds 0.80, demonstrating a strong fit. This indicates a high degree of similarity in the data distribution between intercity investment frequency and economic interaction from economic gravity model across different years. To conclude, these findings provide further evidence of the scientific validity of the dataset we produced. The strong correlation between the IIN and the economic gravity supports the reliability of our data.

Fig. 7

Comparison between intercity investment network and economic gravity model. (a,c,e,g,i) are scatter plots for the years of 2000, 2005, 2010, 2015, 2020. (b,d,f,h,j) are Q-Q plots for the years of 2000, 2005, 2010, 2015, 2020.
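The two diagnostics in Fig. 7 can be sketched as follows. This is one possible reading, not the exact procedure of the study. The scatter-plot R² is taken as the squared Pearson correlation of the log-transformed values, which equals the R² of a simple linear fit. The Q-Q R² is taken as the squared correlation of paired quantiles after log-transforming and standardising both series. The standardisation step and all function names are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def zscore(values):
    """Standardise values to zero mean and unit (population) variance."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / s for v in values]

def loglog_r2(gravity, frequency):
    """R^2 of a simple linear fit of log(frequency) on log(gravity)."""
    lx = [math.log(g) for g in gravity]
    ly = [math.log(f) for f in frequency]
    return pearson(lx, ly) ** 2

def qq_r2(gravity, frequency):
    """Squared correlation of sorted, standardised log-values (paired quantiles)."""
    qx = sorted(zscore([math.log(g) for g in gravity]))
    qy = sorted(zscore([math.log(f) for f in frequency]))
    return pearson(qx, qy) ** 2

# Toy values for five city dyads; not taken from the dataset
toy_gravity = [1.0, 2.0, 4.0, 8.0, 16.0]
toy_frequency = [3.0, 20.0, 5.0, 30.0, 100.0]
scatter_fit = loglog_r2(toy_gravity, toy_frequency)
quantile_fit = qq_r2(toy_gravity, toy_frequency)
```

Because sorting pairs the smallest value with the smallest and so on, the quantile-based R² cannot fall below the dyad-level R² when the two series are positively correlated, which is consistent with the higher values reported for the Q-Q plots.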

Test on the distance decay effect of intercity investments

The distance decay effect indicates that as the distance between two cities increases, the intensity of economic activities such as investments tends to decrease46,47. This is due to higher transaction costs, including transportation and communication, as well as the increasing difficulty of maintaining efficient information flow. As a result, cities that are geographically closer to each other tend to engage in more frequent investment activities, while distant cities interact less. To verify whether our data satisfy this principle, we calculated, for each year, the share of total intercity investment frequency falling within each distance range, and plotted the shares against distance (Fig. 8). Investment intensity clearly decreases as distance increases. Over time, the absolute value of the slope of the fitted line has decreased, suggesting that the constraint of distance on investment is weakening. This shows that our data are consistent with theoretical expectations, which indirectly supports the reliability of the dataset.

Fig. 8

The relationship between geographical distance and intercity investments. (a,b,c,d,e) respectively show the scatter plots and fitted lines of 2000, 2005, 2010, 2015 and 2020.
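The distance decay check can be sketched as below, assuming fixed-width distance bins and an ordinary least-squares line through the bin shares. The bin width of 100 km, the untransformed axes, and the function names are assumptions, since the text does not specify them.

```python
def distance_shares(dyads, bin_km=100):
    """Share of total investment frequency falling in each distance bin.

    dyads: iterable of (distance_km, frequency) pairs.
    Returns a sorted list of (bin_lower_bound_km, share) pairs.
    """
    totals = {}
    for dist_km, freq in dyads:
        lower = int(dist_km // bin_km) * bin_km
        totals[lower] = totals.get(lower, 0) + freq
    grand_total = sum(totals.values())
    return sorted((lower, total / grand_total) for lower, total in totals.items())

def ols_slope(points):
    """Slope of the least-squares line through (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# Toy dyads (distance_km, frequency); not taken from the dataset
toy_dyads = [(40, 500), (80, 300), (150, 120), (260, 60), (270, 20), (390, 10)]
shares = distance_shares(toy_dyads)
slope = ols_slope(shares)  # negative when the share falls with distance
```

Repeating the calculation for each year and comparing the absolute slopes gives the kind of trend described above.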

Comparison with existing research findings

Here, we compare the findings from our dataset with results from existing related studies. First, the overall “diamond” structure of the IIN, as visualized on the map of China (Fig. 2), is similar to that observed in other nationwide studies using intercity investment data, such as Lu & Sun26, Zhang & Tang48, Zhang et al.20, and Wu & Yao24, although the data sources may differ. Additionally, we examined the nodal influence of city nodes in our network and compared our findings with results reported in comparable studies. Since different studies use different metrics to measure the importance of cities within a network, we focused on the overlap in the top 20 most influential cities rather than comparing the specific influence values. In our study, city influence is measured by node degree. Due to the limited availability of city-level information in other studies, we were able to obtain comparable data only from Wu & Yao24 and Zhang et al.20, which primarily cover the years 2000, 2005, and 2010 (see Table 6). As shown in Table 6, about 15 cities consistently overlap with those in our study, accounting for the majority of the top cities in both studies. This high level of overlap suggests that our dataset is reasonable and aligns well with previous studies, reinforcing the validity of our IIN dataset.

Table 6 Comparison of the top 20 cities in node influence of intercity investment network in our study with other similar works.
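The overlap comparison summarized in Table 6 amounts to ranking cities by node degree and intersecting the top-ranked set with the list reported by another study. A minimal sketch follows, with placeholder city labels and degree values that are not taken from the dataset; the function names are hypothetical.

```python
def top_n(degree, n=20):
    """Return the n cities with the highest degree (ties broken by name)."""
    ranked = sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))
    return [city for city, _ in ranked[:n]]

def overlap(cities_a, cities_b):
    """Cities that appear in both top lists, in alphabetical order."""
    return sorted(set(cities_a) & set(cities_b))

# Placeholder degrees from our network and a top list from another study
degree_ours = {"A": 90, "B": 75, "C": 60, "D": 40, "E": 10}
other_top3 = ["A", "C", "F"]

common = overlap(top_n(degree_ours, n=3), other_top3)
```

Comparing set membership rather than scores sidesteps the fact that the studies measure influence with different metrics.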

Usage Notes

Our dataset provides investment frequency between city dyads as well as their geographic coordinates (latitude and longitude), making it easy for potential users to visualize the network using GIS software. For example, in ArcGIS Pro, the “XY To Line” tool can be used to generate the investment network. Additionally, advanced network visualization tools like Gephi can be employed to create visually appealing and detailed network graphs. Beyond visualization, users can apply social network analysis methods49 to further explore the dataset. In the context of this study, several aspects of social network analysis hold potential. For example, users can examine centrality measures (such as betweenness and closeness) to identify key cities that act as critical hubs in the IIN. Another avenue is analyzing network modularity to investigate how cohesive sub-networks (city clusters) evolve over time, potentially offering insights into how urban regions are economically structured. Additionally, users can combine our network metrics (such as node degree, outdegree, and indegree) with other variables to explore interaction mechanisms. For instance, as Zhang et al.20 demonstrated, institutions and markets can influence the evolution of China’s IIN.
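As a starting point for such analyses, the node-level metrics mentioned above can be derived directly from the dyad records. The sketch below assumes each record is a (source city, target city, investment frequency) triple and treats degree as frequency-weighted; the record layout, the weighting, and the function name are assumptions rather than the documented structure of the dataset files.

```python
from collections import defaultdict

def degree_metrics(edges):
    """Frequency-weighted outdegree, indegree, and total degree per city.

    edges: iterable of (source_city, target_city, frequency) triples.
    """
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for source, target, freq in edges:
        out_deg[source] += freq   # investments sent by the source city
        in_deg[target] += freq    # investments received by the target city
    cities = set(out_deg) | set(in_deg)
    return {
        city: {
            "outdegree": out_deg[city],
            "indegree": in_deg[city],
            "degree": out_deg[city] + in_deg[city],
        }
        for city in cities
    }

# Placeholder dyads; not taken from the dataset
toy_edges = [("A", "B", 5), ("B", "A", 2), ("A", "C", 3)]
metrics = degree_metrics(toy_edges)
```

The resulting table can be joined with other city-level variables, or the same triples can be loaded into a graph library to compute betweenness, closeness, or modularity.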

While we have invested considerable effort in producing and validating the IIN dataset, like any dataset, it has certain limitations that require acknowledgment. First, our choice to use investment frequency rather than investment amount as network weights warrants specific discussion. While investment amount could indeed better capture capital flows, three factors influenced our methodological choice. The first is data completeness and quality. In our dataset, the paid-in capital data from investor enterprise to investee enterprise is largely missing, especially for the years 2000, 2005, and 2010, which could lead to a lack of representativeness. The second is that we observed several anomalous capital flows between certain city dyads. For instance, in 2020, while the investment frequency between Shenzhen and Sansha was relatively low (70 occurrences, ranking 2,639th among all city dyads), their total capital flow reached 240.5 billion RMB (ranking third overall), a clear outlier that could distort network analysis. The third is that frequency data ensure robust comparability across our long time series (2000–2020), because they are not affected by such anomalous investment amounts. Despite this limitation, our IIN dataset is sufficient to capture the economic connectivity between cities, with strong comparability, and the visualized dataset aligns well with the actual situation in China (see Section Characteristics of intercity investment network).

To address this concern empirically, we conducted a detailed comparison on a small sample using 2020 data from the Yangtze River Delta (YRD) and Pearl River Delta (PRD) regions, where data quality is higher and market mechanisms are more mature50. As shown in Fig. 9, we compared networks constructed using investment frequency with those constructed using investment amount. The results show that the structures of the two networks are highly similar, with high-frequency investments concentrated among a few large cities; this concentration is slightly more pronounced in the network constructed using investment amount, but the overall differences are minimal. The scatter plots (Fig. 9c) also demonstrate a high correlation, with an R² value of 0.91 for both regions. This supports the validity of our IIN dataset constructed based on investment frequency, despite its imperfections. It is important to note, however, that if researchers are interested in specific investment amounts among enterprises rather than the macro-level patterns of investment connectivity, our dataset may not fully meet their needs.

Fig. 9

Comparison of intercity investment networks based on investment frequency and investment amount in the Yangtze River Delta and Pearl River Delta regions (2020).

Another limitation concerns investments by specialized investment institutions (SIIs), such as venture capital (VC) and private equity (PE) institutions. In China’s investment landscape, SIIs often establish separate investment entities that may be located in different cities from their operational controls, typically in areas offering tax incentives or favorable policies. To examine this potential bias, we collected data from PEDATA of Zero2IPO Group (https://max.pedata.cn/), a leading integrated service provider in China’s private equity industry. Using their comprehensive database of SII investment deals from 2020, supplemented with city location information from the Qichacha platform (see detailed records in “Investment records of specialized investment institutions in 2020.xlsx” of the data repository29), we identified a complex three-layer structure of investment relationships. As shown in Table 7, while 53.6% of investment institutions establish their investment entities in the same city (Group A), the proportion of intracity investments drops to 28.7% when examining the relationship between investment entities and investees (Group B). When considering the indirect relationship between investment institutions and their ultimate investees (Group C), only 26.2% occur within the same city. However, although 46.4% (4,500) of the SII investment events in our analysis are intercity investments, these potentially biased cases account for only 0.25% of our dataset’s 1,808,636 intercity investment relationships in 2020. Therefore, unless these SII investments are highly concentrated in specific cities, this limitation is unlikely to significantly affect the overall network structure. Nevertheless, researchers should exercise caution when using this dataset for micro-level analysis of enterprise-to-enterprise investment patterns.

Table 7 Distribution of intracity and intercity investment relationships across different layers of specialized investment institutions in 2020.

Despite these limitations, our IIN dataset focuses on intercity economic connections and, as the validation above shows, is robust and reliable. For users investigating economic connectivity between cities, this dataset should sufficiently meet their needs.