Abstract
While housing price prediction is well-studied, the large-scale prediction of housing conditions remains underexplored due to data limitations. This paper addresses this gap by developing a machine-learning model to predict housing conditions across the United States. We integrated property-level data from the Warren Group with neighborhood characteristics from the U.S. Census Bureau’s American Community Survey and trained three gradient-boosting algorithms: CatBoost, LightGBM, and XGBoost. Despite XGBoost’s slightly higher balanced accuracy, CatBoost was selected as the best model due to its superior resistance to overfitting. The final model’s predictions were aggregated to census tracts, ZIP code tabulation areas, and a 36.13 km² resolution hexagonal grid for national-scale spatial analysis. The resulting dataset can serve as a resource for researchers and practitioners analyzing the geography of housing quality, with applications in urban planning, disaster management, community resilience, and public health.
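The trade-off between balanced accuracy and overfitting can be illustrated with a minimal sketch. Balanced accuracy is computed as the unweighted mean of per-class recall. The selection rule shown here (prefer the smallest train-validation gap among models within a tolerance of the best validation score) is an assumption for illustration, not the procedure reported by the authors; the function names, the `tolerance` parameter, and all scores are hypothetical.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the unweighted mean of per-class recall."""
    total = defaultdict(int)    # observations per true class
    correct = defaultdict(int)  # correctly predicted observations per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def select_model(scores, tolerance=0.02):
    """Pick the model with the smallest train-validation gap among near-best models.

    `scores` maps a model name to (train_score, validation_score).
    This rule and `tolerance` are assumptions for illustration only.
    """
    best_valid = max(valid for _, valid in scores.values())
    near_best = {
        name: train - valid
        for name, (train, valid) in scores.items()
        if best_valid - valid <= tolerance
    }
    return min(near_best, key=near_best.get)


# Hypothetical scores, NOT results from the paper.
illustrative_scores = {
    "XGBoost": (0.95, 0.81),
    "LightGBM": (0.90, 0.78),
    "CatBoost": (0.84, 0.80),
}

# A classifier that always predicts the majority class scores 0.75 on plain
# accuracy here but only 0.5 on balanced accuracy.
labels = ["good", "good", "good", "poor"]
always_good = ["good"] * 4
print(balanced_accuracy(labels, always_good))  # 0.5
print(select_model(illustrative_scores))       # CatBoost
```

With these made-up numbers, the model with the highest validation score also has the widest gap between training and validation performance, so the rule favors the model that generalizes more consistently.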
Data availability
The Housing Condition Scores dataset is available at the Figshare repository https://doi.org/10.6084/m9.figshare.29606177.v1.
Code availability
The code used to create the dataset is available on GitHub (https://github.com/kim-kyusik/housing_condition_scores) and is compatible with Python 3.12.4.
Acknowledgements
The research reported in this publication was supported by the Gulf Research Program of the National Academies of Sciences, Engineering, and Medicine under award number SCON-10000677 and the Centers for Disease Control and Prevention Climate Ready Cities and States Initiative (NUE1EH001496-02-00).
Author information
Contributions
K. Kim and C. Uejio designed and conceptualized the research; K. Kim processed the data, ran the ML algorithms for prediction, wrote the original draft, and produced the visualizations; T. Holmes, E. Powell, and C. Uejio reviewed and edited the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kim, K., Holmes, T., Powell, E. et al. Large-scale modeling for housing condition prediction using machine learning algorithms. Sci Data (2026). https://doi.org/10.1038/s41597-026-07012-w
DOI: https://doi.org/10.1038/s41597-026-07012-w


