Scientific Reports
Divide and recombine approaches for fitting logistic regression to large-scale health surveillance data: application to diabetes risk prediction in BRFSS
  • Article
  • Open access
  • Published: 03 April 2026

  • Md. Mahadi Hassan Nayem¹ &
  • Soma Chowdhury Biswas¹

Scientific Reports, Article number: (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Computational biology and bioinformatics
  • Diseases
  • Health care
  • Mathematics and computing
  • Medical research
  • Risk factors

Abstract

The global rise in diabetes mellitus poses a major public health concern, highlighting the urgent need for effective risk prediction models to support early identification and prevention strategies. Yet applying conventional statistical techniques to large-scale health surveillance data remains challenging because of memory limitations and high computational demands. One promising alternative, the divide and recombine (D&R) approach, partitions the data into smaller subsets, fits logistic models independently within each, and combines the estimates to approximate full-sample maximum likelihood results. While the D&R methodology and its theoretical properties are well established in the statistical literature, empirical validation on national-scale health surveillance data remains limited. This study provides a comprehensive empirical validation of the D&R strategy for fitting logistic regression models efficiently on massive datasets, demonstrating its computational scalability, statistical robustness, and reproducibility in a real-world public health setting. We use Behavioral Risk Factor Surveillance System (BRFSS) data from 2014 to 2024, encompassing over 2.4 million observations and 16 demographic, behavioral, clinical, and socioeconomic predictors. Monte Carlo simulations using 5 million synthetic observations and 1,000 replications confirm that the D&R method matches the statistical efficiency of centralized estimation (relative efficiency > 99.8%) while reducing computation time by more than 52% and memory usage by 77–89%. Application to BRFSS data further demonstrates the framework’s practical scalability, successfully recovering well-established diabetes risk factors including age, body mass index, general health status, alcohol use, cardiovascular disease history, and smoking, consistent with the epidemiological literature. These findings confirm that the D&R framework enables large-scale chronic disease modeling on standard computing infrastructure without reliance on high-performance systems, with direct implications for population health monitoring and preventive care resource allocation.
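
The partition, fit, and combine steps described in the abstract can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' R implementation. The abstract does not state which recombination rule is used, so the sketch assumes information-weighted averaging of the block estimates, which is one common choice in the D&R literature. The function names (`fit_logistic`, `divide_and_recombine`), the simulated data, and the block count are all hypothetical.

```python
import numpy as np


def fit_logistic(X, y, max_iter=50, tol=1e-10):
    """Maximum likelihood logistic regression via Newton-Raphson.

    Returns the coefficient vector and the observed information
    matrix X' W X evaluated at the final estimate.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(info, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, info


def divide_and_recombine(X, y, n_blocks):
    """Fit each block separately, then combine the block estimates.

    Assumed combination rule (not taken from the source):
    beta = (sum_k I_k)^{-1} sum_k I_k beta_k, where I_k is the
    information matrix of block k.
    """
    d = X.shape[1]
    info_sum = np.zeros((d, d))
    weighted_sum = np.zeros(d)
    blocks = zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks))
    for X_k, y_k in blocks:
        beta_k, info_k = fit_logistic(X_k, y_k)  # only this block is needed here
        info_sum += info_k
        weighted_sum += info_k @ beta_k
    beta = np.linalg.solve(info_sum, weighted_sum)
    cov = np.linalg.inv(info_sum)  # approximate covariance of the combined estimate
    return beta, cov


# Hypothetical simulated data: an intercept plus two standard-normal predictors.
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-1.0, 0.5, -0.25])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

beta_full, _ = fit_logistic(X, y)                      # centralized fit
beta_dr, cov_dr = divide_and_recombine(X, y, n_blocks=10)
se_dr = np.sqrt(np.diag(cov_dr))                       # approximate standard errors
```

Under these assumptions, `beta_dr` should land close to `beta_full`, while each block fit only needs its own rows in memory. A real analysis would read blocks from disk one at a time instead of splitting an in-memory array.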

Data availability

The BRFSS data used in this study are publicly available from the Centers for Disease Control and Prevention at https://www.cdc.gov/brfss/. Analysis scripts specific to this study are publicly available via GitHub and Zenodo (see Code Availability statement below).

Code availability

The R code developed for this study, implementing the Divide and Recombine approaches for fitting logistic regression to large-scale health surveillance data, is openly available on GitHub at https://github.com/NayemMH/DR-Logistic-Regression-BRFSS and permanently archived on Zenodo (DOI: 10.5281/zenodo.19231359), with no restrictions on access or reuse. Readers are also encouraged to use the drglm R package (ref. 28), developed and published by the first author, which provides a fully documented, general-purpose implementation of the D&R framework for generalized linear models, including logistic regression, with additional features for out-of-memory data handling and parallel computation. The drglm package is freely available on the Comprehensive R Archive Network (CRAN) (DOI: 10.32614/CRAN.package.drglm) and can be installed directly in R.

References

  1. Van Seventer, J. M. & Hochberg, N. S. Principles of infectious diseases: Transmission, diagnosis, prevention, and control. Int. Encycl. Public Health 22 (2017).

  2. Kenworthy, N., Thomann, M. & Parker, R. From a global crisis to the ‘end of aids’: New epidemics of signification. Glob. Public Health 13(8), 960–971 (2018).

  3. Zumla, A., Alagaili, A. N., Cotten, M. & Azhar, E. I. Infectious diseases epidemic threats and mass gatherings: Refocusing global attention on the continuing spread of the middle east respiratory syndrome coronavirus (MERS-COV). BMC Med. 14(1), 1–4 (2016).

  4. World Health Organization: Noncommunicable Diseases. https://www.who.int/news-room/fact-sheets/detail/noncommunicable-diseases. Accessed 22 Mar 2023 (2023).

  5. Bennett, J. E., Stevens, G. A., Mathers, C. D., Bonita, R., Rehm, J., Kruk, M. E., Riley, L. M., Dain, K., Kengne, A. P. & Chalkidou, K. NCD countdown 2030: worldwide trends in non-communicable disease mortality and progress towards sustainable development goal target 3.4. Lancet 392(10152), 1072–1088 (2018).

  6. Centers for Disease Control and Prevention: What is Diabetes? https://www.cdc.gov/diabetes/basics/diabetes.html. Accessed 22 Mar 2023 (2023).

  7. Centers for Disease Control and Prevention: Diabetes Fast Facts. https://www.cdc.gov/diabetes/basics/quick-facts.html. Accessed 22 Mar 2023 (2023).

  8. Forbes, J. M. & Cooper, M. E. Mechanisms of diabetic complications. Physiol. Rev. 93(1), 137–188 (2013).

  9. American Diabetes Association. Economic costs of diabetes in the US in 2017. Diabetes Care 41(5), 917–928 (2018).

  10. American Diabetes Association: The Cost of Diabetes. https://diabetes.org/about-us/statistics/cost-diabetes. Accessed 22 June 2023 (2023).

  11. Centers for Disease Control and Prevention: Diabetes and COVID-19. https://www.cdc.gov/diabetes/library/reports/reportcard/diabetes-and-covid19.html. Accessed 22 Mar 2023 (2023).

  12. Kastora, S., Patel, M., Carter, B., Delibegovic, M. & Myint, P. K. Impact of diabetes on covid-19 mortality and hospital outcomes from a global perspective: An umbrella systematic review and meta-analysis. Endocrinol. Diabetes Metabol. 5(3), 00338 (2022).

  13. Saeedi, P. et al. Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the international diabetes federation diabetes atlas. Diabetes Res. Clin. Pract. 157, 107843 (2019).

  14. Sun, H. et al. IDF diabetes atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045. Diabetes Res. Clin. Pract. 183, 109119 (2022).

  15. Islam, M. R. et al. Evaluation of the united states covid-19 vaccine allocation strategy. PLoS One 16(11), 0259700 (2021).

  16. Kahn, R. et al. Age at initiation and frequency of screening to detect type 2 diabetes: A cost-effectiveness analysis. Lancet 375(9723), 1365–1374 (2010).

  17. Herman, W. H. The cost-effectiveness of diabetes screening programs in adults: What do we know? Diabetes Care 38(9), 1809–1816 (2015).

  18. Hosmer, D. W. Jr., Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression 3rd edn. (Wiley, 2013).

  19. Steyerberg, E. W. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating 2nd edn. (Springer, 2019).

  20. Noble, D., Mathur, R., Dent, T., Meads, C. & Greenhalgh, T. Risk models and scores for type 2 diabetes: Systematic review. BMJ 343, 7163 (2011).

  21. Hidalgo, J. I. G. et al. Application of machine learning models for the estimation of diabetes likelihood. Diagnostics 10(11), 959 (2020).

  22. Dey, D. The proper application of logistic regression model in complex survey data. BMC Med. Res. Methodol. 25, 1–12 (2025) (systematic review of methodological issues in logistic regression with complex survey data).

  23. Ma, Q. Recent applications and perspectives of logistic regression modelling in healthcare. Theor. Nat. Sci. 36, 185–190 (2024) (review spanning 2018–2024 of logistic regression use in healthcare).

  24. Luque, A., Carrasco, A., Martín, A. & Las Heras, A. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit. 91, 216–231 (2019).

  25. Cleveland, W. S. & Hafen, R. Divide and recombine (D&R): Data science for large complex data. Stat. Anal. Data Min. 7(6) (2014).

  26. Chen, X. & Xie, M.-G. A split-and-conquer approach for analysis of extraordinarily large data. Stat. Sin. 24(4), 1655–1684 (2014).

  27. Lin, L. & Lumley, T. Aggregated estimating equation estimation. Stat. Interface 10(2), 263–277 (2017).

  28. Nayem, M.M.H. Drglm: Fitting Linear and Generalized Linear Models in “Divide and Recombine” Approach to Large Data Sets. R Package Version 1.1. https://doi.org/10.32614/CRAN.package.drglm (2024).

  29. Xi, R., Lin, N. & Chen, Y. Compression and aggregation for logistic regression analysis in data cubes. IEEE Trans. Knowl. Data Eng. 21(4), 479–492 (2008).

  30. Hafen, R. Divide and Recombine: Approach for Detailed Analysis and Visualization of Large Complex Data (2016).

  31. Guha, S. et al. Large complex data: Divide and recombine (D&R) with Rhipe. Statistics 1(1), 53–67 (2012).

  32. Guha, S., Kidwell, P., Hafen, R. P. & Cleveland, W. S. Visualization databases for the analysis of large complex datasets. In Artificial Intelligence and Statistics. 193–200 (PMLR, 2009).

  33. Rathi, R., Cook, D. J. & Holder, L. B. Serial partitioning approach to scaling graph-based knowledge discovery. PhD Thesis, University of Texas at Arlington (2004).

  34. Centers for Disease Control and Prevention: Behavioral Risk Factor Surveillance System. https://www.cdc.gov/brfss/index.html. Accessed 22 Mar 2023 (2023).

  35. Dinh, A., Miertschin, S., Young, A. & Mohanty, S. D. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med. Inform. Decis. Mak. 19(1), 1–15 (2019).

  36. Hill-Briggs, F. et al. Social determinants of health and diabetes: A scientific review. Diabetes Care 44(1), 258–279 (2021).

  37. Shriraam, V., Mahadevan, S. & Arumugam, P. Prevalence and risk factors of diabetes, hypertension and other non-communicable diseases in a tribal population in South India. Indian J. Endocrinol. Metab. 25(4), 313 (2021).

  38. Asiimwe, D., Mauti, G. O. & Kiconco, R. Prevalence and risk factors associated with type 2 diabetes in elderly patients aged 45–80 years at Kanungu district. J. Diabetes Res. 2020, 1–5 (2020).

  39. Ullah, Z., Saleem, F., Jamjoom, M., Fakieh, B., Kateb, F., Ali, A. M. & Shah, B. Detecting high-risk factors and early diagnosis of diabetes using machine learning methods. Comput. Intell. Neurosci. 2022 (2022).

  40. Buchanan, T. A. & Xiang, A. H. Gestational diabetes mellitus. J. Clin. Invest. 115(3), 485–491 (2005).

  41. Katsarou, A. et al. Type 1 diabetes mellitus. Nat. Rev. Dis. Prim. 3(1), 1–17 (2017).

  42. Eisenbarth, G. S. Type I diabetes mellitus. N. Engl. J. Med. 314(21), 1360–1368 (1986).

  43. Astrup, A. & Finer, N. Redefining type 2 diabetes: ‘Diabesity’ or ‘obesity dependent diabetes mellitus’? Obes. Rev. 1(2), 57–59 (2000).

  44. Chatterjee, S., Khunti, K. & Davies, M. J. Type 2 diabetes. Lancet 389(10085), 2239–2251 (2017).

  45. Centers for Disease Control and Prevention: Diabetes Basics. https://www.cdc.gov/diabetes/basics/index.html. Accessed 22 Mar 2023 (2023).

  46. American Diabetes Association: Diagnosis and classification of diabetes mellitus. Diabetes Care 33(Supplement_1), 62–69 (2010).

  47. Buuren, S. & Groothuis-Oudshoorn, K. MICE: Multivariate Imputation by Chained Equations in R. https://www.jstatsoft.org/v45/i03/ (2011).

  48. Caspersen, C. J., Thomas, G. D., Boseman, L. A., Beckles, G. L. & Albright, A. L. Aging, diabetes, and the public health system in the United States. Am. J. Public Health 102(8), 1482–1497 (2012).

  49. Ahima, R. S. Connecting obesity, aging and diabetes. Nat. Med. 15(9), 996–997 (2009).

  50. Morley, J. E. Diabetes and aging: Epidemiologic overview. Clin. Geriatr. Med. 24(3), 395–405 (2008).

  51. Enea, M. Speedglm: Fitting Linear and Generalized Linear Models to Large Data Sets. R Package Version 0.3-5. https://CRAN.R-project.org/package=speedglm (2023).

  52. Robin, X. et al. pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12(1), 77. https://doi.org/10.1186/1471-2105-12-77 (2011).

  53. American Diabetes Association Professional Practice Committee: Standards of care in diabetes—2023. Diabetes Care 46(Supplement_1), 1–291. https://doi.org/10.2337/dc23-Sint (2023).

  54. Idler, E. L. & Benyamini, Y. Self-rated health and mortality: A review of twenty-seven community studies. J. Health Soc. Behav. 38(1), 21–37 (1997).

  55. Baliunas, D. O. et al. Alcohol as a risk factor for type 2 diabetes. Diabetes Care 32(11), 2123–2132 (2009).

  56. Willi, C., Bodenmann, P., Ghali, W. A., Faris, P. D. & Cornuz, J. Active smoking and the risk of type 2 diabetes: A systematic review and meta-analysis. JAMA 298(22), 2654–2664 (2007).

  57. Szklo, M. & Nieto, F. J. Epidemiology: Beyond the Basics 3rd edn. (Jones & Bartlett Learning, 2014).

  58. Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers. https://doi.org/10.1561/2200000016 (Now Publishers Inc, 2011).

Acknowledgements

We thank the Centers for Disease Control and Prevention for maintaining and providing public access to the BRFSS data. We are grateful to the millions of BRFSS participants whose voluntary contributions enable public health research. We also thank anonymous reviewers for their constructive feedback that substantially improved this manuscript.

Author information

Authors and Affiliations

  1. Department of Statistics, University of Chittagong, Hathazari, Chattogram, 4331, Bangladesh

    Md. Mahadi Hassan Nayem & Soma Chowdhury Biswas

Contributions

M.M.H. Nayem contributed to conceptualization, methodology, software, formal analysis, investigation, data curation, writing (original draft and review & editing), and visualization. S.C. Biswas contributed to conceptualization, methodology, writing (review & editing), supervision, and project administration. Both authors discussed the results, contributed to the final manuscript, and approved the final version for submission.

Corresponding author

Correspondence to Md. Mahadi Hassan Nayem.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical compliance

This study utilized publicly available de-identified survey data from the Behavioral Risk Factor Surveillance System (BRFSS). All BRFSS procedures were approved by relevant institutional review boards at the Centers for Disease Control and Prevention and participating state health departments. All participants provided informed consent prior to participation. The current secondary data analysis was conducted in accordance with ethical standards for human subjects research and did not require additional IRB approval as it used publicly available, de-identified data.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Nayem, M.M.H., Biswas, S.C. Divide and recombine approaches for fitting logistic regression to large-scale health surveillance data: application to diabetes risk prediction in BRFSS. Sci Rep (2026). https://doi.org/10.1038/s41598-026-46927-7

  • Received: 10 December 2025

  • Accepted: 28 March 2026

  • Published: 03 April 2026

  • DOI: https://doi.org/10.1038/s41598-026-46927-7

Keywords

  • Big data
  • Divide and recombine
  • Generalized linear models
  • Diabetes prediction
  • BRFSS
  • Sequential partitioning
  • Computational efficiency
  • Large-scale validation

Scientific Reports (Sci Rep)

ISSN 2045-2322 (online)
