Scientific Data
Large-scale modeling for housing condition prediction using machine learning algorithms
  • Data Descriptor
  • Open access
  • Published: 11 March 2026

  • Kyusik Kim (ORCID: orcid.org/0000-0003-3753-3196)1,2,
  • Tisha Holmes3,
  • Emily Powell4 &
  • …
  • Christopher K. Uejio1 

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Research data
  • Social sciences

Abstract

While housing price prediction is well studied, large-scale prediction of housing conditions remains underexplored because of data limitations. This paper addresses this gap by developing a machine-learning model to predict housing conditions across the United States. We integrated property-level data from the Warren Group with neighborhood characteristics from the U.S. Census Bureau’s American Community Survey and trained three gradient-boosting algorithms: CatBoost, LightGBM, and XGBoost. Although XGBoost achieved slightly higher balanced accuracy, CatBoost was selected as the final model because it was more resistant to overfitting. The final model’s predictions were aggregated to census tracts, ZIP code tabulation areas, and a 36.13 km² resolution hexagonal grid for national-scale spatial analysis. The resulting dataset lets researchers and practitioners analyze the geography of housing quality, with applications in urban planning, disaster management, community resilience, public health, and other fields.
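As a rough illustration of the quantities the abstract mentions, the sketch below computes balanced accuracy (the mean of per-class recall), a simple train-minus-validation gap as one possible proxy for overfitting, and a per-area mean of property-level scores. It is not the authors' implementation: the function names, the toy labels, the gap criterion, and the use of a plain mean for aggregation are all assumptions made for illustration, and none of the numbers come from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def overfit_gap(train_score, valid_score):
    """Training score minus validation score; a larger gap suggests more overfitting."""
    return train_score - valid_score


def aggregate_by_area(area_ids, scores):
    """Average property-level scores within each area (assumed: a simple mean)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for area, score in zip(area_ids, scores):
        sums[area] += score
        counts[area] += 1
    return {area: sums[area] / counts[area] for area in sums}


# Toy labels (hypothetical, not from the paper): the "poor" class is rare.
y_true = ["good", "good", "good", "good", "poor", "poor"]
y_pred = ["good", "good", "good", "poor", "poor", "good"]
score = balanced_accuracy(y_true, y_pred)  # (3/4 + 1/2) / 2 = 0.625

# Toy property-level scores keyed by a hypothetical area identifier.
area_means = aggregate_by_area(["A", "A", "B"], [1.0, 3.0, 4.0])
```

Under this reading, a model with a slightly lower validation score but a much smaller gap could reasonably be preferred, which matches the trade-off the abstract describes between XGBoost and CatBoost.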

Data availability

The Housing Condition Scores dataset is available at the Figshare repository https://doi.org/10.6084/m9.figshare.29606177.v1.

Code availability

The code used to create the dataset is available on GitHub (https://github.com/kim-kyusik/housing_condition_scores) and is compatible with Python 3.12.4.

Acknowledgements

The research reported in this publication was supported by the Gulf Research Program of the National Academies of Sciences, Engineering, and Medicine under award number SCON-10000677 and the Centers for Disease Control and Prevention Climate Ready Cities and States Initiative (NUE1EH001496-02-00).

Author information

Authors and Affiliations

  1. Florida State University, Department of Geography, Tallahassee, FL, USA

    Kyusik Kim & Christopher K. Uejio

  2. Kennesaw State University, Department of Geography and Anthropology, Kennesaw, GA, USA

    Kyusik Kim

  3. Florida State University, Department of Urban and Regional Planning, Tallahassee, FL, USA

    Tisha Holmes

  4. Florida State University, Center for Ocean-Atmospheric Prediction Studies, Tallahassee, FL, USA

    Emily Powell

Contributions

K. Kim and C. Uejio designed and conceptualized the research; K. Kim processed the data, implemented the ML algorithms for prediction, wrote the original draft, and prepared the visualizations; T. Holmes, E. Powell, and C. Uejio reviewed and edited the manuscript.

Corresponding author

Correspondence to Kyusik Kim.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Kim, K., Holmes, T., Powell, E. et al. Large-scale modeling for housing condition prediction using machine learning algorithms. Sci Data (2026). https://doi.org/10.1038/s41597-026-07012-w

  • Received: 24 July 2025

  • Accepted: 02 March 2026

  • Published: 11 March 2026

  • DOI: https://doi.org/10.1038/s41597-026-07012-w

Scientific Data (Sci Data)

ISSN 2052-4463 (online)
