Scientific Data
Community-level education percentile rank estimation in China using multi-source big data and machine learning
  • Data Descriptor
  • Open access
  • Published: 26 January 2026

  • Yanji Zhang¹ (ORCID: orcid.org/0000-0003-1652-4944),
  • Zhenyu Pan¹ (ORCID: orcid.org/0009-0004-9053-6518),
  • Yongyi You² (ORCID: orcid.org/0009-0001-4875-5627),
  • Liang Cai³ (ORCID: orcid.org/0000-0002-5599-4183) &
  • Bo Qin⁴ (ORCID: orcid.org/0000-0002-3020-6371)

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Education
  • Geography
  • Socioeconomic scenarios

Abstract

Socio-economic data with fine-grained spatial resolution forms the basis of socio-spatial analysis and policymaking. In response to the limited availability of such data in China, this study provides an open-access, community-level dataset on education percentile rank — a more accurate indicator of social status than years of education. Our dataset comprises 122,126 communities, covering 97.9% of prefecture-level administrative units and 81.8% of county-level administrative units. The data is estimated using an XGBoost machine learning model based on the relationship between mean education percentile rank and the characteristics of the built environment, including functions and facilities, street scene elements, vitality, human perception, physical disorder, and topography at the community level. Multi-source data, including the Chinese General Social Survey, points of interest, road networks, night-time lighting, and street view images processed using computer vision techniques such as semantic segmentation, object detection, and image regression, are used for model training and inference. Our final education predictions are highly accurate at prefecture, county, and community levels. This dataset enables fine-grained socio-spatial analyses across disciplines.
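
The abstract does not spell out how an education percentile rank is computed, so the following is only a minimal sketch of one common convention: the mid-rank percentile, in which tied values share the midpoint of their rank range, followed by a per-community mean. The reference population (here, all records pooled together), the function names `percentile_ranks` and `community_means`, and the sample records are assumptions made for illustration and are not taken from the paper or its dataset.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def percentile_ranks(values: Sequence[float]) -> List[float]:
    """Return the mid-rank percentile (0-100) of each value within `values`."""
    ordered = sorted(values)
    n = len(ordered)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)           # records strictly below v
        equal = bisect_right(ordered, v) - below  # records tied with v
        ranks.append(100.0 * (below + 0.5 * equal) / n)
    return ranks


def community_means(records: Sequence[Tuple[str, float]]) -> Dict[str, float]:
    """Average the percentile ranks of respondents within each community.

    `records` holds (community_id, years_of_education) pairs. Ranking is done
    over the pooled sample, which is an assumption of this sketch.
    """
    ranks = percentile_ranks([years for _, years in records])
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for (community, _), rank in zip(records, ranks):
        totals[community] += rank
        counts[community] += 1
    return {c: totals[c] / counts[c] for c in totals}


# Hypothetical survey records: (community id, years of education).
sample = [("A", 9), ("A", 12), ("B", 12), ("B", 16)]
means = community_means(sample)
print(means)
```

Unlike raw years of schooling, a rank of this kind is always expressed relative to the chosen reference population, so the same number of years can map to different ranks in different populations.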

Data availability

The datasets of community-level education percentile rank estimation in China are openly available on Figshare at https://doi.org/10.6084/m9.figshare.29654591 (ref. 56).

Code availability

The community-level education percentile rank dataset was created using Python 3.9.7 and the ArcGIS 10.6 software platform. The code for our extreme gradient boosting (XGBoost) machine learning algorithm, which is used to predict community education percentile ranks, is available in the public repository Figshare (https://doi.org/10.6084/m9.figshare.29648798) (ref. 57).
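
The released code is not reproduced on this page. As a rough illustration of the gradient-boosting idea that XGBoost builds on, the sketch below repeatedly fits one-split regression trees (stumps) to the current residuals, using only the Python standard library. It is a simplified stand-in, not XGBoost itself (no regularisation, no second-order terms, no deeper trees) and not the authors' code. The feature columns, target values, function names, and hyperparameters are all hypothetical.

```python
from typing import List, Sequence, Tuple

# A stump is (feature index, threshold, value if x <= threshold, value otherwise).
Stump = Tuple[int, float, float, float]
Model = Tuple[float, float, List[Stump]]


def fit_stump(X: Sequence[Sequence[float]], residuals: Sequence[float]) -> Stump:
    """Find the single split that minimises squared error on the residuals."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            l_mean = sum(left) / len(left)
            r_mean = sum(right) / len(right)
            err = (sum((r - l_mean) ** 2 for r in left)
                   + sum((r - r_mean) ** 2 for r in right))
            if err < best_err:
                best, best_err = (j, thr, l_mean, r_mean), err
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best


def fit_boosting(X, y, n_rounds: int = 30, learning_rate: float = 0.3) -> Model:
    """Start from the mean of y, then add shrunken stumps fitted to residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        j, thr, l_val, r_val = stump
        pred = [p + learning_rate * (l_val if row[j] <= thr else r_val)
                for p, row in zip(pred, X)]
    return base, learning_rate, stumps


def predict(model: Model, row: Sequence[float]) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        l_val if row[j] <= thr else r_val for j, thr, l_val, r_val in stumps
    )


def mse(y_true, y_pred) -> float:
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Synthetic toy data, NOT taken from the dataset. Columns are hypothetical
# built-environment features (POI density, green view share, night-time light).
X = [
    [0.2, 0.10, 5.0],
    [0.5, 0.15, 12.0],
    [0.9, 0.30, 30.0],
    [0.4, 0.05, 8.0],
    [0.8, 0.25, 25.0],
    [0.1, 0.20, 3.0],
]
y = [35.0, 48.0, 72.0, 40.0, 66.0, 30.0]  # hypothetical mean percentile ranks

model = fit_boosting(X, y)
fitted = [predict(model, row) for row in X]
baseline = [sum(y) / len(y)] * len(y)
print(mse(y, baseline), mse(y, fitted))
```

In practice one would use the `xgboost` package and evaluate on held-out communities rather than on the training data, as the training error shown here says nothing about generalisation.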

References

  1. Ganzeboom, H. B. G., De Graaf, P. M. & Treiman, D. J. A standard international socio-economic index of occupational status. Social Science Research. 21, 1–56 (1992).

  2. Xiao, Y. & Bian, Y. The influence of hukou and college education in China’s labour market. Urban Studies. 55, 1504–1524 (2018).

  3. Xie, Y., Dong, H., Zhou, X. & Song, X. Trends in social mobility in postrevolution China. Proceedings of the National Academy of Sciences of the United States of America. 119, e2117471119 (2022).

  4. Walder, A. G., Li, B. & Treiman, D. J. Politics and life chances in a state socialist regime: Dual career paths into the urban Chinese elite, 1949 to 1996. American Sociological Review. 65, 191–209 (2000).

  5. Nee, V. A theory of market transition: From redistribution to markets in state socialism. American Sociological Review. 54, 663–681 (1989).

  6. Yan, W. & Deng, X. Intergenerational income mobility and transmission channels in a transition economy: Evidence from China. Economics of Transition and Institutional Change. 30, 183–207 (2022).

  7. Wu, X. & Treiman, D. J. Inequality and equality under Chinese socialism: The Hukou system and intergenerational occupational mobility. American Journal of Sociology. 113, 415–445 (2007).

  8. Goodman, D. S. G. Middle class China: Dreams and aspirations. Journal of Chinese Political Science. 19, 49–67 (2014).

  9. Ponzini, A. Educating the new Chinese middle-class youth: The role of quality education on ideas of class and status. The Journal of Chinese Sociology. 7, 1–18 (2020).

  10. Sampson, R. J. Great American city: Chicago and the enduring neighborhood effect. (University of Chicago Press, 2013).

  11. He, Q., Musterd, S. & Boterman, W. Understanding different levels of segregation in urban China: A comparative study among 21 cities in Guangdong province. Urban Geography. 43, 1036–1061 (2022).

  12. Zhang, Y., Wang, J. & Kan, C. Temporal variation in activity-space-based segregation: A case study of Beijing using location-based service data. Journal of Transport Geography. 98, 103239 (2022).

  13. Chen, Y., He, J., Wei, W., Zhu, N. & Yu, C. A multi-model approach for user portrait. Future Internet. 13, 147 (2021).

  14. Zhang, F. et al. Urban visual intelligence: Studying cities with artificial intelligence and street-level imagery. Annals of the American Association of Geographers. 114, 876–897 (2024).

  15. Gebru, T. et al. Using deep learning and Google street view to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences of the United States of America. 114, 13108–13113 (2017).

  16. Suel, E., Polak, J. W., Bennett, J. E. & Ezzati, M. Measuring social, environmental and health inequalities using deep learning and street imagery. Scientific Reports. 9, 6229 (2019).

  17. Suel, E., Bhatt, S., Brauer, M., Flaxman, S. & Ezzati, M. Multimodal deep learning from satellite and street-level imagery for measuring income, overcrowding, and environmental deprivation in urban areas. Remote Sensing of Environment. 257, 112339 (2021).

  18. Fan, Z., Zhang, F., Loo, B. P. Y. & Ratti, C. Urban visual intelligence: Uncovering hidden city profiles with street view images. Proceedings of the National Academy of Sciences of the United States of America. 120, e2220417120 (2023).

  19. Naik, N. et al. Computer vision uncovers predictors of urban change. Proceedings of the National Academy of Sciences of the United States of America. 114, 7571–7576 (2017).

  20. Rossetti, T., Lobel, H., Rocco, V. & Hurtubia, R. Explaining subjective perceptions of public spaces as a function of the built environment: A massive data approach. Landscape and Urban Planning. 181, 169–178 (2019).

  21. Zhang, Y. et al. Quantifying physical and psychological perceptions of urban scenes using deep learning. Land Use Policy. 111, 105762 (2021).

  22. Xie, Y. & Zhang, C. The long-term impact of the Communist Revolution on social stratification in contemporary China. Proceedings of the National Academy of Sciences of the United States of America. 116, 19392–19397 (2019).

  23. Li, X. et al. Mapping global urban boundaries from the global artificial impervious area (GAIA) data. Environmental Research Letters. 15, 094044 (2020).

  24. Bian, Y. & Li, L. The Chinese General Social Survey (2003–2008): Sample designs and data evaluation. Chinese Sociological Review. 45, 70–97 (2012).

  25. Chen, Y., Naidu, S., Yu, T. & Yuchtman, N. Intergenerational mobility and institutional change in 20th century China. Explorations in Economic History. 58, 44–73 (2015).

  26. Li, M. & Cao, J. Multi-generational educational mobility in China in the twentieth century. China Economic Review. 80, 101990 (2023).

  27. Shi, Z. et al. A data-driven framework for analyzing spatial distribution of the elderly cardholders by using smart card data. ISPRS International Journal of Geo-Information. 10, 728 (2021).

  28. Wang, D. & Li, S. Socio-economic differentials and stated housing preferences in Guangzhou, China. Habitat International. 30, 305–326 (2006).

  29. Cervero, R. & Kockelman, K. Travel demand and the 3Ds: Density, diversity and design. Transportation Research Part D: Transport and Environment. 2, 199–219 (1997).

  30. Sung, H., Lee, S. & Cheon, S. Operationalizing Jane Jacobs’s urban design theory: Empirical verification from the great city of Seoul, Korea. Journal of Planning Education and Research. 35, 117–130 (2015).

  31. Che, Y. et al. 3D-GloBFP: The first global three-dimensional building footprint dataset. Earth System Science Data. 16, 5357–5374 (2024).

  32. Huang, S., Tang, L., Hupy, J. P. & Shao, G. A commentary review on the use of normalized difference vegetation index (NDVI) in the era of popular remote sensing. Journal of Forestry Research. 32, 1–6 (2021).

  33. Liu, Z. et al. Swin Transformer: Hierarchical vision transformer using shifted windows. (IEEE/CVF International Conference on Computer Vision, 2021).

  34. Zhou, B. et al. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision. 127, 302–321 (2019).

  35. Paszkowski, W. & Sobiech, M. The modeling of the acoustic condition of urban environment using noise annoyance assessment. Environmental Modeling & Assessment. 24, 319–330 (2019).

  36. Chen, L., Zhao, L., Xiao, Y. & Lu, Y. Investigating the spatiotemporal pattern between the built environment and urban vibrancy using big data in Shenzhen, China. Computers, Environment and Urban Systems. 95, 101827 (2022).

  37. Lan, F., Gong, X., Da, H. & Wen, H. How do population inflow and social infrastructure affect urban vitality? Evidence from 35 large- and medium-sized cities in China. Cities. 100, 102454 (2020).

  38. Xia, C., Yeh, A. G. & Zhang, A. Analyzing spatial relationships between urban land use intensity and urban vitality at street block level: A case study of five Chinese megacities. Landscape and Urban Planning. 193, 103669 (2020).

  39. Lebakula, V. et al. LandScan global 30 arcsecond annual global gridded population datasets from 2000 to 2022. Scientific Data. 12, 495 (2025).

  40. Wilson, J. Q. & Kelling, G. L. Broken windows: The police and neighborhood safety. Atlantic Monthly. 249, 29–38 (1982).

  41. Sampson, R. J. & Raudenbush, S. W. Systematic social observation of public spaces: A new look at disorder in urban neighborhoods. American Journal of Sociology. 105, 603–651 (1999).

  42. Bader, M. D., Mooney, S. J., Bennett, B. & Rundle, A. G. The promise, practicalities, and perils of virtually auditing neighborhoods using Google Street View. The ANNALS of the American Academy of Political and Social Science. 669, 18–40 (2017).

  43. Hwang, J. & Naik, N. Systematic social observation at scale: Using crowdsourcing and computer vision to measure visible neighborhood conditions. Sociological Methodology. 53, 183–216 (2023).

  44. Hoeben, E. M., Steenbeek, W. & Pauwels, L. J. R. Measuring disorder: Observer bias in systematic social observations at streets and neighborhoods. Journal of Quantitative Criminology. 34, 221–249 (2018).

  45. Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. (IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023).

  46. Yao, Y. et al. Discovering the homogeneous geographic domain of human perceptions from street view images. Landscape and Urban Planning. 212, 104125 (2021).

  47. Salesses, P., Schechtner, K. & Hidalgo, C. A. The collaborative image of the city: Mapping the inequality of urban perception. PLOS ONE. 8, e68400 (2013).

  48. Zhang, F. et al. Measuring human perceptions of a large-scale urban region using machine learning. Landscape and Urban Planning. 180, 148–160 (2018).

  49. Tan, M. & Le, Q. EfficientNetV2: Smaller models and faster training. International Conference on Machine Learning. 139, 7102–7110 (2021).

  50. Rubin, D. B. Inference and missing data. Biometrika. 63, 581–590 (1976).

  51. van Buuren, S. & Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software. 45, 1–67 (2011).

  52. Rácz, A. & Gere, A. Comparison of missing value imputation tools for machine learning models based on product development cases studies. LWT-Food Science And Technology. 221, 117585 (2025).

  53. Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems. 24, 2546–2554 (2011).

  54. Nishio, M. et al. Computer-aided diagnosis of lung nodule using gradient tree boosting and Bayesian optimization. PLOS ONE. 13, e0195875 (2018).

  55. Echabarri, S., Do, P., Vu, H. C. & Bornand, B. Machine learning and Bayesian optimization for performance prediction of proton-exchange membrane fuel cells. Energy and AI. 17, 100380 (2024).

  56. Zhang, Y., Pan, Z., You, Y., Cai, L. & Qin, B. Datasets of community-level education percentile rank estimation in China. figshare. Dataset. https://doi.org/10.6084/m9.figshare.29654591 (2025).

  57. Zhang, Y., Pan, Z., You, Y., Cai, L. & Qin, B. XGBoost regressor for estimating community-level education percentile rank in China. figshare. Dataset. https://doi.org/10.6084/m9.figshare.29648798 (2025).

Acknowledgements

The authors would like to thank Guangwen Song of Guangzhou University for providing some of the street view images and for their technical support. We would like to express our gratitude to the editor and the anonymous reviewers for their valuable comments and suggestions.

Author information

Authors and Affiliations

  1. Department of Sociology, School of Humanities and Social Sciences, Fuzhou University, Fuzhou, 350108, China

    Yanji Zhang & Zhenyu Pan

  2. Department of Landscape Architecture, School of Architecture, South China University of Technology, Guangzhou, 510641, China

    Yongyi You

  3. Department of Sociology, University of Chicago, Chicago, IL, 60637, USA

    Liang Cai

  4. Department of Urban Planning and Management, School of Public Administration and Policy, Renmin University of China, Beijing, 100872, China

    Bo Qin

Contributions

Y.J. Zhang conceived the original idea and supervised the research. Z.Y. Pan and B. Qin collected the data and performed data cleaning. Z.Y. Pan and Y.Y. You developed the methodological framework, produced the dataset, and analyzed the results. Y.J. Zhang and L. Cai wrote the manuscript.

Corresponding author

Correspondence to Yongyi You.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhang, Y., Pan, Z., You, Y. et al. Community-level education percentile rank estimation in China using multi-source big data and machine learning. Sci Data (2026). https://doi.org/10.1038/s41597-026-06664-y

  • Received: 06 August 2025

  • Accepted: 20 January 2026

  • Published: 26 January 2026

  • DOI: https://doi.org/10.1038/s41597-026-06664-y

Scientific Data (Sci Data)

ISSN 2052-4463 (online)

© 2026 Springer Nature Limited
