Abstract
The global rise in diabetes mellitus poses a major public health concern, highlighting the urgent need for effective risk prediction models to support early identification and prevention strategies. Yet applying conventional statistical techniques to large-scale health surveillance data remains challenging due to memory limitations and high computational demands. One promising alternative, the divide and recombine (D&R) approach, partitions the data into smaller subsets, fits logistic models independently within each, and combines the estimates to approximate full-sample maximum likelihood results. While the D&R methodology and its theoretical properties are well established in the statistical literature, empirical validation on national-scale health surveillance data remains limited. This study provides a comprehensive empirical validation of the D&R strategy for fitting logistic regression models efficiently on massive datasets, demonstrating its computational scalability, statistical robustness, and reproducibility in a real-world public health setting. We utilize the Behavioral Risk Factor Surveillance System (BRFSS) data from 2014 to 2024, encompassing over 2.4 million observations and 16 demographic, behavioral, clinical, and socioeconomic predictors. Monte Carlo simulations using 5 million synthetic observations and 1,000 replications confirm that the D&R method matches the statistical efficiency of centralized estimation (relative efficiency > 99.8%) while reducing computation time by more than 52% and memory usage by 77–89%. Application to BRFSS data further demonstrates the framework’s practical scalability, successfully recovering well-established diabetes risk factors including age, body mass index, general health status, alcohol use, cardiovascular disease history, and smoking, consistent with the epidemiological literature.
These findings confirm that the D&R framework enables large-scale chronic disease modeling on standard computing infrastructure without reliance on high-performance systems, with direct implications for population health monitoring and preventive care resource allocation.
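The partition, fit, and combine steps described above can be illustrated with a minimal sketch. It is written in plain Python rather than the study's R code, uses a single simulated predictor, and assumes an information-weighted average as the combination rule, which is one common choice and not necessarily the rule used in this study or in the drglm package. The function names (`fit_logistic`, `divide_and_recombine`) and the simulated coefficients are hypothetical.

```python
import math
import random


def score_and_info(beta, xs, ys):
    """Score vector and Fisher information (X'WX) of logit P(y=1) = b0 + b1*x.

    The symmetric 2x2 information matrix [[a, b], [b, d]] is returned as (a, b, d).
    """
    b0, b1 = beta
    g0 = g1 = a = b = d = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        a += w
        b += w * x
        d += w * x * x
    return (g0, g1), (a, b, d)


def solve2(info, rhs):
    """Solve the symmetric 2x2 system [[a, b], [b, d]] z = rhs."""
    a, b, d = info
    det = a * d - b * b
    return ((d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - b * rhs[0]) / det)


def fit_logistic(xs, ys, iters=50, tol=1e-10):
    """Newton-Raphson maximum likelihood fit; returns (beta, information)."""
    beta = (0.0, 0.0)
    for _ in range(iters):
        score, info = score_and_info(beta, xs, ys)
        step = solve2(info, score)
        beta = (beta[0] + step[0], beta[1] + step[1])
        if abs(step[0]) + abs(step[1]) < tol:
            break
    _, info = score_and_info(beta, xs, ys)  # information at the final estimate
    return beta, info


def divide_and_recombine(xs, ys, n_blocks):
    """Fit each block separately, then combine the block estimates.

    Assumed combination rule: beta_DR = (sum_k I_k)^(-1) * sum_k (I_k beta_k),
    where I_k is the Fisher information of block k.
    """
    total = [0.0, 0.0, 0.0]
    rhs = [0.0, 0.0]
    for k in range(n_blocks):
        bx, by = xs[k::n_blocks], ys[k::n_blocks]  # every n_blocks-th row
        (b0, b1), (a, b, d) = fit_logistic(bx, by)
        total[0] += a
        total[1] += b
        total[2] += d
        rhs[0] += a * b0 + b * b1
        rhs[1] += b * b0 + d * b1
    return solve2(tuple(total), rhs)


# Simulated data (hypothetical true coefficients: intercept -1.0, slope 0.5).
random.seed(1)
n = 20000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(-(-1.0 + 0.5 * x))) else 0
      for x in xs]

full, _ = fit_logistic(xs, ys)           # centralized full-sample fit
dr = divide_and_recombine(xs, ys, 10)    # 10 blocks of 2,000 rows each
print("full-sample:", full)
print("D&R        :", dr)
```

Only one block needs to be held in memory at a time, and the block fits are independent of each other, so they could in principle be run in parallel. The recombination step needs only each block's coefficient vector and information matrix, not the raw rows.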
Data availability
The BRFSS data used in this study are publicly available from the Centers for Disease Control and Prevention at https://www.cdc.gov/brfss/. Analysis scripts specific to this study are publicly available via GitHub and Zenodo (see Code Availability statement below).
Code availability
The R code developed for this study, implementing the Divide and Recombine approaches for fitting logistic regression to large-scale health surveillance data, is openly available on GitHub at https://github.com/NayemMH/DR-Logistic-Regression-BRFSS and permanently archived on Zenodo (DOI: 10.5281/zenodo.19231359), with no restrictions on access or reuse. Readers are also encouraged to use the drglm R package (ref. 28), developed and published by the first author, which provides a fully documented, general-purpose implementation of the D&R framework for generalized linear models, including logistic regression, with additional features for out-of-memory data handling and parallel computation. The drglm package is freely available on the Comprehensive R Archive Network (CRAN) (DOI: 10.32614/CRAN.package.drglm) and can be installed directly in R.
References
Van Seventer, J.M. & Hochberg, N.S. Principles of infectious diseases: Transmission, diagnosis, prevention, and control. Int. Encycl. Public Health 22 (2017).
Kenworthy, N., Thomann, M. & Parker, R. From a global crisis to the ‘end of AIDS’: New epidemics of signification. Glob. Public Health 13(8), 960–971 (2018).
Zumla, A., Alagaili, A. N., Cotten, M. & Azhar, E. I. Infectious diseases epidemic threats and mass gatherings: Refocusing global attention on the continuing spread of the Middle East respiratory syndrome coronavirus (MERS-CoV). BMC Med. 14(1), 1–4 (2016).
World Health Organization: Noncommunicable Diseases. https://www.who.int/news-room/fact-sheets/detail/noncommunicable-diseases. Accessed 22 Mar 2023 (2023).
Bennett, J.E., Stevens, G.A., Mathers, C.D., Bonita, R., Rehm, J., Kruk, M.E., Riley, L.M., Dain, K., Kengne, A.P. & Chalkidou, K. NCD countdown 2030: worldwide trends in non-communicable disease mortality and progress towards sustainable development goal target 3.4. Lancet 392(10152), 1072–1088 (2018).
Centers for Disease Control and Prevention: What is Diabetes? https://www.cdc.gov/diabetes/basics/diabetes.html. Accessed 22 Mar 2023 (2023).
Centers for Disease Control and Prevention: Diabetes Fast Facts. https://www.cdc.gov/diabetes/basics/quick-facts.html. Accessed 22 Mar 2023 (2023).
Forbes, J. M. & Cooper, M. E. Mechanisms of diabetic complications. Physiol. Rev. 93(1), 137–188 (2013).
American Diabetes Association. Economic costs of diabetes in the US in 2017. Diabetes Care 41(5), 917–928 (2018).
American Diabetes Association: The Cost of Diabetes. https://diabetes.org/about-us/statistics/cost-diabetes. Accessed 22 June 2023 (2023).
Centers for Disease Control and Prevention: Diabetes and COVID-19. https://www.cdc.gov/diabetes/library/reports/reportcard/diabetes-and-covid19.html. Accessed 22 Mar 2023 (2023).
Kastora, S., Patel, M., Carter, B., Delibegovic, M. & Myint, P. K. Impact of diabetes on COVID-19 mortality and hospital outcomes from a global perspective: An umbrella systematic review and meta-analysis. Endocrinol. Diabetes Metabol. 5(3), 00338 (2022).
Saeedi, P. et al. Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas. Diabetes Res. Clin. Pract. 157, 107843 (2019).
Sun, H. et al. IDF diabetes atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045. Diabetes Res. Clin. Pract. 183, 109119 (2022).
Islam, M. R. et al. Evaluation of the United States COVID-19 vaccine allocation strategy. PLoS One 16(11), 0259700 (2021).
Kahn, R. et al. Age at initiation and frequency of screening to detect type 2 diabetes: A cost-effectiveness analysis. Lancet 375(9723), 1365–1374 (2010).
Herman, W. H. The cost-effectiveness of diabetes screening programs in adults: What do we know?. Diabetes Care 38(9), 1809–1816 (2015).
Hosmer, D. W. Jr., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression 3rd edn. (Wiley, 2013).
Steyerberg, E. W. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating 2nd edn. (Springer, 2019).
Noble, D., Mathur, R., Dent, T., Meads, C. & Greenhalgh, T. Risk models and scores for type 2 diabetes: Systematic review. BMJ 343, 7163 (2011).
Hidalgo, J. I. G. et al. Application of machine learning models for the estimation of diabetes likelihood. Diagnostics 10(11), 959 (2020).
Dey, D. The proper application of logistic regression model in complex survey data. BMC Med. Res. Methodol. 25, 1–12 (2025) (systematic review of methodological issues in logistic regression with complex survey data).
Ma, Q. Recent applications and perspectives of logistic regression modelling in healthcare. Theor. Nat. Sci. 36, 185–190 (2024) (review spanning 2018–2024 of logistic regression use in healthcare).
Luque, A., Carrasco, A., Martín, A. & Las Heras, A. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit. 91, 216–231 (2019).
Cleveland, W.S. & Hafen, R. Divide and recombine (D&R): Data science for large complex data. Stat. Anal. Data Min. 7(6) (2014).
Chen, X. & Xie, M.-G. A split-and-conquer approach for analysis of extraordinarily large data. Stat. Sin. 24(4), 1655–1684 (2014).
Lin, L. & Lumley, T. Aggregated estimating equation estimation. Stat. Interface 10(2), 263–277 (2017).
Nayem, M.M.H. Drglm: Fitting Linear and Generalized Linear Models in “Divide and Recombine” Approach to Large Data Sets. R Package Version 1.1. https://doi.org/10.32614/CRAN.package.drglm (2024).
Xi, R., Lin, N. & Chen, Y. Compression and aggregation for logistic regression analysis in data cubes. IEEE Trans. Knowl. Data Eng. 21(4), 479–492 (2008).
Hafen, R. Divide and Recombine: Approach for Detailed Analysis and Visualization of Large Complex Data (2016).
Guha, S. et al. Large complex data: Divide and recombine (D&R) with Rhipe. Statistics 1(1), 53–67 (2012).
Guha, S., Kidwell, P., Hafen, R.P. & Cleveland, W.S. Visualization databases for the analysis of large complex datasets. In Artificial Intelligence and Statistics. 193–200 (PMLR, 2009).
Rathi, R., Cook, D.J. & Holder, L.B. Serial partitioning approach to scaling graph-based knowledge discovery. PhD Thesis, University of Texas at Arlington (2004).
Centers for Disease Control and Prevention: Behavioral Risk Factor Surveillance System. https://www.cdc.gov/brfss/index.html. Accessed 22 Mar 2023 (2023).
Dinh, A., Miertschin, S., Young, A. & Mohanty, S. D. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med. Inform. Decis. Mak. 19(1), 1–15 (2019).
Hill-Briggs, F. et al. Social determinants of health and diabetes: A scientific review. Diabetes Care 44(1), 258–279 (2021).
Shriraam, V., Mahadevan, S. & Arumugam, P. Prevalence and risk factors of diabetes, hypertension and other non-communicable diseases in a tribal population in South India. Indian J. Endocrinol. Metab. 25(4), 313 (2021).
Asiimwe, D., Mauti, G. O. & Kiconco, R. Prevalence and risk factors associated with type 2 diabetes in elderly patients aged 45–80 years at Kanungu district. J. Diabetes Res. 2020, 1–5 (2020).
Ullah, Z., Saleem, F., Jamjoom, M., Fakieh, B., Kateb, F., Ali, A.M. & Shah, B. Detecting high-risk factors and early diagnosis of diabetes using machine learning methods. Comput. Intell. Neurosci. 2022 (2022).
Buchanan, T. A. & Xiang, A. H. Gestational diabetes mellitus. J. Clin. Invest. 115(3), 485–491 (2005).
Katsarou, A. et al. Type 1 diabetes mellitus. Nat. Rev. Dis. Prim. 3(1), 1–17 (2017).
Eisenbarth, G. S. Type I diabetes mellitus. N Engl. J. Med. 314(21), 1360–1368 (1986).
Astrup, A. & Finer, N. Redefining type 2 diabetes: ‘Diabesity’ or ‘obesity dependent diabetes mellitus’?. Obes. Rev. 1(2), 57–59 (2000).
Chatterjee, S., Khunti, K. & Davies, M. J. Type 2 diabetes. Lancet 389(10085), 2239–2251 (2017).
Centers for Disease Control and Prevention: Diabetes Basics. https://www.cdc.gov/diabetes/basics/index.html. Accessed 22 Mar 2023 (2023).
American Diabetes Association: Diagnosis and classification of diabetes mellitus. Diabetes Care 33(Supplement_1), 62–69 (2010).
Buuren, S. & Groothuis-Oudshoorn, K. MICE: Multivariate Imputation by Chained Equations in R. https://www.jstatsoft.org/v45/i03/ (2011).
Caspersen, C. J., Thomas, G. D., Boseman, L. A., Beckles, G. L. & Albright, A. L. Aging, diabetes, and the public health system in the United States. Am. J. Public Health 102(8), 1482–1497 (2012).
Ahima, R. S. Connecting obesity, aging and diabetes. Nat. Med. 15(9), 996–997 (2009).
Morley, J. E. Diabetes and aging: Epidemiologic overview. Clin. Geriatr. Med. 24(3), 395–405 (2008).
Enea, M. Speedglm: Fitting Linear and Generalized Linear Models to Large Data Sets. R Package Version 0.3-5. https://CRAN.R-project.org/package=speedglm (2023).
Robin, X. et al. pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12(1), 77. https://doi.org/10.1186/1471-2105-12-77 (2011).
American Diabetes Association Professional Practice Committee: Standards of care in diabetes—2023. Diabetes Care 46(Supplement_1), 1–291. https://doi.org/10.2337/dc23-Sint (2023).
Idler, E. L. & Benyamini, Y. Self-rated health and mortality: A review of twenty-seven community studies. J. Health Soc. Behav. 38(1), 21–37 (1997).
Baliunas, D. O. et al. Alcohol as a risk factor for type 2 diabetes. Diabetes Care 32(11), 2123–2132 (2009).
Willi, C., Bodenmann, P., Ghali, W. A., Faris, P. D. & Cornuz, J. Active smoking and the risk of type 2 diabetes: A systematic review and meta-analysis. JAMA 298(22), 2654–2664 (2007).
Szklo, M. & Nieto, F. J. Epidemiology: Beyond the Basics 3rd edn. (Jones & Bartlett Learning, 2014).
Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers. https://doi.org/10.1561/2200000016 (Now Publishers Inc, 2011).
Acknowledgements
We thank the Centers for Disease Control and Prevention for maintaining and providing public access to the BRFSS data. We are grateful to the millions of BRFSS participants whose voluntary contributions enable public health research. We also thank anonymous reviewers for their constructive feedback that substantially improved this manuscript.
Author information
Contributions
M.M.H. Nayem contributed to conceptualization, methodology, software, formal analysis, investigation, data curation, writing (original draft and review & editing), and visualization. S. Biswas contributed to conceptualization, methodology, writing (review & editing), supervision, and project administration. Both authors discussed the results, contributed to the final manuscript, and approved the final version for submission.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical compliance
This study utilized publicly available de-identified survey data from the Behavioral Risk Factor Surveillance System (BRFSS). All BRFSS procedures were approved by relevant institutional review boards at the Centers for Disease Control and Prevention and participating state health departments. All participants provided informed consent prior to participation. The current secondary data analysis was conducted in accordance with ethical standards for human subjects research and did not require additional IRB approval as it used publicly available, de-identified data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Nayem, M.M.H., Biswas, S.C. Divide and recombine approaches for fitting logistic regression to large-scale health surveillance data: application to diabetes risk prediction in BRFSS. Sci Rep (2026). https://doi.org/10.1038/s41598-026-46927-7
DOI: https://doi.org/10.1038/s41598-026-46927-7