Abstract
Socio-economic data with fine-grained spatial resolution forms the basis of socio-spatial analysis and policymaking. In response to the limited availability of such data in China, this study provides an open-access, community-level dataset on education percentile rank, a more accurate indicator of social status than years of education. Our dataset comprises 122,126 communities, covering 97.9% of prefecture-level administrative units and 81.8% of county-level administrative units. The data is estimated using an XGBoost machine learning model based on the relationship between mean education percentile rank and community-level characteristics of the built environment, including functions and facilities, street scene elements, vitality, human perception, physical disorder, and topography. Model training and inference use multi-source data, including the Chinese General Social Survey, points of interest, road networks, night-time lighting, and street view images processed using computer vision techniques such as semantic segmentation, object detection, and image regression. Our final education predictions are highly accurate at the prefecture, county, and community levels. This dataset enables fine-grained socio-spatial analyses across disciplines.
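To make the target variable concrete, the following is a minimal sketch of how an education percentile rank and its community-level mean could be computed from survey respondents. It is illustrative only and is not the authors' code: the function names, the toy respondent records, and the mid-rank convention for ties are all assumptions, and the abstract does not say whether ranks are computed within the full sample or within subgroups such as birth cohorts.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from statistics import mean


def percentile_ranks(values):
    """Return the mid-rank percentile (0-100) of each value within `values`.

    Assumed convention: rank = 100 * (count below + 0.5 * count equal) / n,
    so tied respondents share the same rank.
    """
    ordered = sorted(values)
    n = len(ordered)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)           # values strictly smaller
        equal = bisect_right(ordered, v) - below  # values tied with v
        ranks.append(100 * (below + 0.5 * equal) / n)
    return ranks


def community_mean_rank(respondents):
    """Average respondents' percentile ranks within each community.

    `respondents` is a hypothetical list of (community_id, years_of_education)
    pairs; ranks are computed over the whole sample before averaging.
    """
    ranks = percentile_ranks([years for _, years in respondents])
    by_community = defaultdict(list)
    for (community_id, _), rank in zip(respondents, ranks):
        by_community[community_id].append(rank)
    return {cid: mean(r) for cid, r in by_community.items()}


if __name__ == "__main__":
    # Toy data: two hypothetical communities, five respondents.
    sample = [("A", 6), ("A", 9), ("B", 12), ("B", 16), ("B", 9)]
    print(community_mean_rank(sample))
```

Because the rank depends only on a respondent's position in the distribution, it stays comparable across groups whose typical years of schooling differ. A community-level mean of this kind is the quantity a regression model such as XGBoost would then be trained to predict from built-environment features.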
Data availability
The datasets of community-level education percentile rank estimation in China are openly available on Figshare at https://doi.org/10.6084/m9.figshare.29654591 (ref. 56).
Code availability
The community-level education percentile rank dataset was created using Python 3.9.7 and the ArcGIS 10.6 software platform. The code for the extreme gradient boosting (XGBoost) model used to predict community education percentile ranks is available in a public Figshare repository (https://doi.org/10.6084/m9.figshare.29648798; ref. 57).
References
Ganzeboom, H. B. G., Graaf, P. M. D. & Treiman, D. J. A standard international socio-economic index of occupational status. Social Science Research. 21, 1–56 (1992).
Xiao, Y. & Bian, Y. The influence of hukou and college education in China’s labour market. Urban Studies. 55, 1504–1524 (2018).
Xie, Y., Dong, H., Zhou, X. & Song, X. Trends in social mobility in postrevolution China. Proceedings of the National Academy of Sciences of the United States of America. 119, e2117471119 (2022).
Walder, A. G., Li, B. & Treiman, D. J. Politics and life chances in a state socialist regime: Dual career paths into the urban Chinese elite, 1949 to 1996. American Sociological Review. 65, 191–209 (2000).
Nee, V. A theory of market transition: From redistribution to markets in state socialism. American Sociological Review. 54, 663–681 (1989).
Yan, W. & Deng, X. Intergenerational income mobility and transmission channels in a transition economy: Evidence from China. Economics of Transition and Institutional Change. 30, 183–207 (2022).
Wu, X. & Treiman, D. J. Inequality and equality under Chinese socialism: The Hukou system and intergenerational occupational mobility. American Journal of Sociology. 113, 415–445 (2007).
Goodman, D. S. G. Middle class China: Dreams and aspirations. Journal of Chinese Political Science. 19, 49–67 (2014).
Ponzini, A. Educating the new Chinese middle-class youth: The role of quality education on ideas of class and status. The Journal of Chinese Sociology. 7, 1–18 (2020).
Sampson, R. J. Great American city: Chicago and the enduring neighborhood effect. (University of Chicago Press, 2013).
He, Q., Musterd, S. & Boterman, W. Understanding different levels of segregation in urban China: A comparative study among 21 cities in Guangdong province. Urban Geography. 43, 1036–1061 (2022).
Zhang, Y., Wang, J. & Kan, C. Temporal variation in activity-space-based segregation: A case study of Beijing using location-based service data. Journal of Transport Geography. 98, 103239 (2022).
Chen, Y., He, J., Wei, W., Zhu, N. & Yu, C. A multi-model approach for user portrait. Future Internet. 13, 147 (2021).
Zhang, F. et al. Urban visual intelligence: Studying cities with artificial intelligence and street-level imagery. Annals of the American Association of Geographers. 114, 876–897 (2024).
Gebru, T. et al. Using deep learning and Google street view to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences of the United States of America. 114, 13108–13113 (2017).
Suel, E., Polak, J. W., Bennett, J. E. & Ezzati, M. Measuring social, environmental and health inequalities using deep learning and street imagery. Scientific Reports. 9, 6229 (2019).
Suel, E., Bhatt, S., Brauer, M., Flaxman, S. & Ezzati, M. Multimodal deep learning from satellite and street-level imagery for measuring income, overcrowding, and environmental deprivation in urban areas. Remote Sensing of Environment. 257, 112339 (2021).
Fan, Z., Zhang, F., Loo, B. P. Y. & Ratti, C. Urban visual intelligence: Uncovering hidden city profiles with street view images. Proceedings of the National Academy of Sciences of the United States of America. 120, e2220417120 (2023).
Naik, N. et al. Computer vision uncovers predictors of urban change. Proceedings of the National Academy of Sciences of the United States of America. 114, 7571–7576 (2017).
Rossetti, T., Lobel, H., Rocco, V. & Hurtubia, R. Explaining subjective perceptions of public spaces as a function of the built environment: A massive data approach. Landscape and Urban Planning. 181, 169–178 (2019).
Zhang, Y. et al. Quantifying physical and psychological perceptions of urban scenes using deep learning. Land Use Policy. 111, 105762 (2021).
Xie, Y. & Zhang, C. The long-term impact of the Communist Revolution on social stratification in contemporary China. Proceedings of the National Academy of Sciences of the United States of America. 116, 19392–19397 (2019).
Li, X. et al. Mapping global urban boundaries from the global artificial impervious area (GAIA) data. Environmental Research Letters. 15, 094044 (2020).
Bian, Y. & Li, L. The Chinese General Social Survey (2003–2008): Sample designs and data evaluation. Chinese Sociological Review. 45, 70–97 (2012).
Chen, Y., Naidu, S., Yu, T. & Yuchtman, N. Intergenerational mobility and institutional change in 20th century China. Explorations in Economic History. 58, 44–73 (2015).
Li, M. & Cao, J. Multi-generational educational mobility in China in the twentieth century. China Economic Review. 80, 101990 (2023).
Shi, Z. et al. A data-driven framework for analyzing spatial distribution of the elderly cardholders by using smart card data. ISPRS International Journal of Geo-Information. 10, 728 (2021).
Wang, D. & Li, S. Socio-economic differentials and stated housing preferences in Guangzhou, China. Habitat International. 30, 305–326 (2006).
Cervero, R. & Kockelman, K. Travel demand and the 3Ds: Density, diversity and design. Transportation Research Part D: Transport and Environment. 2, 199–219 (1997).
Sung, H., Lee, S. & Cheon, S. Operationalizing Jane Jacobs’s urban design theory: Empirical verification from the great city of Seoul, Korea. Journal of Planning Education and Research. 35, 117–130 (2015).
Che, Y. et al. 3D-GloBFP: The first global three-dimensional building footprint dataset. Earth System Science Data. 16, 5357–5374 (2024).
Huang, S., Tang, L., Hupy, J. P. & Shao, G. A commentary review on the use of normalized difference vegetation index (NDVI) in the era of popular remote sensing. Journal of Forestry Research. 32, 1–6 (2021).
Liu, Z. et al. Swin Transformer: Hierarchical vision transformer using shifted windows. (IEEE/CVF International Conference on Computer Vision, 2021).
Zhou, B. et al. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision. 127, 302–321 (2019).
Paszkowski, W. & Sobiech, M. The modeling of the acoustic condition of urban environment using noise annoyance assessment. Environmental Modeling & Assessment. 24, 319–330 (2019).
Chen, L., Zhao, L., Xiao, Y. & Lu, Y. Investigating the spatiotemporal pattern between the built environment and urban vibrancy using big data in Shenzhen, China. Computers, Environment and Urban Systems. 95, 101827 (2022).
Lan, F., Gong, X., Da, H. & Wen, H. How do population inflow and social infrastructure affect urban vitality? Evidence from 35 large- and medium-sized cities in China. Cities. 100, 102454 (2020).
Xia, C., Yeh, A. G. & Zhang, A. Analyzing spatial relationships between urban land use intensity and urban vitality at street block level: A case study of five Chinese megacities. Landscape and Urban Planning. 193, 103669 (2020).
Lebakula, V. et al. LandScan global 30 arcsecond annual global gridded population datasets from 2000 to 2022. Scientific Data. 12, 495 (2025).
Wilson, J. Q. & Kelling, G. L. Broken windows: The police and neighborhood safety. Atlantic Monthly. 249, 29–38 (1982).
Sampson, R. J. & Raudenbush, S. W. Systematic social observation of public spaces: A new look at disorder in urban neighborhoods. American Journal of Sociology. 105, 603–651 (1999).
Bader, M. D., Mooney, S. J., Bennett, B. & Rundle, G. A. The promise, practicalities, and perils of virtually auditing neighborhoods using Google Street view. The ANNALS of the American Academy of Political and Social Science. 669, 18–40 (2017).
Hwang, J. & Naik, N. Systematic social observation at scale: Using crowdsourcing and computer vision to measure visible neighborhood conditions. Sociological Methodology. 53, 183–216 (2023).
Hoeben, E. M., Steenbeek, W. & Pauwels, L. J. R. Measuring disorder: Observer bias in systematic social observations at streets and neighborhoods. Journal of Quantitative Criminology. 34, 221–249 (2018).
Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. (IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023).
Yao, Y. et al. Discovering the homogeneous geographic domain of human perceptions from street view images. Landscape and Urban Planning. 212, 104125 (2021).
Salesses, P., Schechtner, K. & Hidalgo, C. A. The collaborative image of the city: Mapping the inequality of urban perception. PLOS ONE. 8, e68400 (2013).
Zhang, F. et al. Measuring human perceptions of a large-scale urban region using machine learning. Landscape and Urban Planning. 180, 148–160 (2018).
Tan, M. & Le, Q. EfficientNetV2: Smaller models and faster training. International Conference on Machine Learning. 139, 7102–7110 (2021).
Rubin, D. B. Inference and missing data. Biometrika. 63, 581–590 (1976).
van Buuren, S. & Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software. 45, 1–67 (2011).
Rácz, A. & Gere, A. Comparison of missing value imputation tools for machine learning models based on product development cases studies. LWT - Food Science and Technology. 221, 117585 (2025).
Bergstra, J., Bardenet, R., Bengio, Y. & Kegl, B. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems. 24, 2546–2554 (2011).
Nishio, M. et al. Computer-aided diagnosis of lung nodule using gradient tree boosting and Bayesian optimization. PLOS ONE. 13, e0195875 (2018).
Echabarri, S., Do, P., Vu, H. C. & Bornand, B. Machine learning and Bayesian optimization for performance prediction of proton-exchange membrane fuel cells. Energy and AI. 17, 100380 (2024).
Zhang, Y., Pan, Z., You, Y., Cai, L. & Qin, B. Datasets of community-level education percentile rank estimation in China. figshare. Dataset. https://doi.org/10.6084/m9.figshare.29654591 (2025).
Zhang, Y., Pan, Z., You, Y., Cai, L. & Qin, B. XGBoost regressor for estimating community-level education percentile rank in China. figshare. Dataset. https://doi.org/10.6084/m9.figshare.29648798 (2025).
Acknowledgements
The authors would like to thank Guangwen Song of Guangzhou University for providing some of the street view images and for their technical support. We would like to express our gratitude to the editor and the anonymous reviewers for their valuable comments and suggestions.
Author information
Authors and Affiliations
Contributions
Y.J. Zhang conceived the original idea and supervised the research. Z.Y. Pan and B. Qin collected and cleaned the data. Z.Y. Pan and Y.Y. You developed the methodological framework, produced the dataset, and analyzed the results. Y.J. Zhang and L. Cai wrote the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, Y., Pan, Z., You, Y. et al. Community-level education percentile rank estimation in China using multi-source big data and machine learning. Sci Data (2026). https://doi.org/10.1038/s41597-026-06664-y
DOI: https://doi.org/10.1038/s41597-026-06664-y


