Introduction

The estimated global incidence of out-of-hospital cardiac arrest (OHCA) treated by emergency medical service (EMS) is 62.0 per 100,000 person-years, with a specific incidence of 53.1 per 100,000 person-years in North America1. Despite advances in pre- and post-resuscitation care, a recent systematic review showed that survival to hospital discharge after OHCA was extremely low, at 8.8%2. Development of an accurate system to predict the daily incidence of OHCA might provide more opportunity to prevent events and respond in the prehospital setting.

Several studies have shown associations between ambient temperature and cardiovascular events3,4,5,6 and between day of week or season and cardiovascular events7,8,9,10,11,12. However, many of those studies used conventional linear regression, which may not be suitable for handling large amounts of high-resolution meteorological data and may be difficult to apply in real-world practice. Machine learning (ML) can use advanced analytics to integrate multiple quantitative variables and identify associations not captured by conventional one-dimensional statistical approaches. Recently, our team developed an ML prediction model for daily OHCA incidence based on combined meteorological and chronological data with high accuracy for the Japanese population13. Moreover, given that several studies have demonstrated the association between social determinants and cardiovascular risk, we considered it valuable to incorporate sociodemographic variables into the prediction model14,15.

In this study, we developed and evaluated an ML prediction model for robust estimation of the daily OHCA incidence of cardiac origin in the U.S. population based on comprehensive meteorological, chronological, and sociodemographic data. In addition, applying invariant causal prediction (ICP), we identified variables that consistently contributed to the predicted OHCA incidence across all sub-regions stratified by sociodemographic status (population size, the proportion of people living in poverty, and the proportion of people with a high school diploma or higher), and reconstructed the prediction model for greater generalizability.

Results

Characteristics of the training and testing datasets

From the CARES registry, 421,531 EMS-treated OHCAs of non-traumatic cause between 2013 and 2019 were matched with meteorological data. Of these, 196,735 cases from 2013 to 2017 were included in the training dataset, and 156,615 cases from 2018 to 2019 were assigned to the testing dataset. Within the testing dataset, 119,455 cases were from internal areas (i.e., the same areas as the training data), and 37,160 cases were from external areas (i.e., areas not included in the training dataset). The characteristics of the datasets are summarized in Table 1. The median age at OHCA onset increased slightly from 64 years (IQR, 52–76) to 65 years (52–76), and the proportion of males increased from 61% to 62%; however, these differences were modest. The proportion of individuals living below the poverty level was 11% of the population throughout the study period.

Table 1 Characteristics of daily data in the training dataset (2013–2017) and testing dataset (2018–2019)

The median of the mean ambient temperature within a day decreased in the low-, intermediate-, and high-temperature regions. In the low-temperature region, the difference between maximum and minimum ambient temperatures within a day (diurnal temperature range) increased from 9.4 °C (6.8–12.1) to 9.6 °C (7.0–12.3) in the internal area and from 8.9 °C (6.2–11.6) to 9.6 °C (7.0–12.3) in the external area between 2013 and 2019. The diurnal temperature range was greater in higher-temperature regions. Relative humidity also increased throughout the study period in all regions.

Between 2013–2017 and 2018–2019, the incidence of OHCA per 100,000 person-years increased from 63.7 to 76.7 in internal areas and from 65.3 to 73.7 in external areas. The incidence of OHCA by meteorological condition is shown in Supplementary Fig. 1.

Model diversity

To select the optimal analytical algorithms for model development, we developed ML prediction models including all 34 variables, using the generalized additive model (GAM) modeling time series, eXtreme Gradient Boosting (XGBoost), CatBoost, and random forest. Predicted and observed incidence of OHCA with a cardiac origin for each model are shown in Fig. 1. All ML prediction models effectively identified days with significant increases in OHCA incidence at the nationwide level, demonstrating strong concordance between predicted and observed values. In both the training dataset and the testing dataset from internal areas, the predictive performance of all models was similar. In contrast, in the testing dataset from external areas, the XGBoost gradient boosting algorithm had the highest predictive performance among all models at the nationwide level (root mean squared error [RMSE], 0.032 [95% confidence interval {CI}: 0.030–0.033]; mean absolute error [MAE], 0.025 [0.024–0.026]; mean absolute percentage error [MAPE], 13.17% [12.13–14.22]), the state level (RMSE, 0.212 [0.125–0.272]; MAE, 0.144 [0.101–0.187]), and the agency level (RMSE, 1.324 [0.967–1.603]; MAE, 0.345 [0.308–0.383]) (Table 2).

Fig. 1: Observed versus predicted incidence of out-of-hospital cardiac arrest by various analysis methods.

The results obtained using each method, including GAM (A), XGBoost (B), CatBoost (C), and random forest (D), are presented. The light blue lines indicate the observed daily incidence of out-of-hospital cardiac arrests per 100,000 in the registry-participating areas. The yellow lines indicate the predicted daily incidence per 100,000 based on combined meteorological, chronological, and sociodemographic variables. GAM, generalized additive model; Jan, January; XGBoost, eXtreme Gradient Boosting.

Table 2 Performance of the prediction model by various analysis methods

Predictive performance of the model after ICP

Through the application of ICP to the XGBoost model, we identified 17 variables (ICP model) from the initial 34 (non-ICP model) that consistently contributed to the prediction of OHCA incidence across all sub-areas, as determined by deciles of agency population size, tertiles of the proportion living below the poverty level, and tertiles of the proportion with a high school diploma or higher (Supplementary Table 1). These 17 variables were mean ambient temperature, diurnal temperature range, mean wind speed, difference in wind speed, mean relative humidity, difference in relative humidity, mean precipitation, difference in precipitation, year, January, February, median age, proportion of men, proportion of Blacks, proportion of Asians, proportion with a high school diploma or higher, and proportion living below the poverty level.

Figure 2 and Table 3 show that the ICP model maintained high predictive accuracy at the nationwide level in the training dataset (RMSE, 0.022 [0.021–0.023]; MAE, 0.018 [0.017–0.019]; and MAPE, 11.42% [10.76–12.09]), in the testing dataset from internal areas (RMSE, 0.021 [0.019–0.024]; MAE, 0.017 [0.015–0.018]; and MAPE, 7.80% [7.22–8.38]), and in the testing dataset from external areas (RMSE, 0.033 [0.030–0.035]; MAE, 0.026 [0.024–0.028]; and MAPE, 13.92% [12.70–15.14]), as did the non-ICP model with the initial 34 variables. At the state and agency levels, the non-ICP and ICP models yielded similar predictive accuracy. To further clarify the predictive performance of this model at varying time intervals, we also evaluated 3-day-ahead and 7-day-ahead predictive performance. The model retained a satisfactory level of performance up to seven days in advance.

Fig. 2: Observed versus predicted incidence of out-of-hospital cardiac arrest in the invariant causal prediction model at varying time intervals.

A shows the results of the training dataset obtained using the ICP model. B and C present the testing results for same-day, 3-day-ahead, and 7-day-ahead predictions in internal and external settings, respectively. The light blue lines indicate the observed daily incidence per 100,000 of out-of-hospital cardiac arrests in the registry-participating areas. The yellow lines indicate the predicted daily incidence per 100,000 by the XGBoost gradient boosting model using predictors selected by ICP. ICP denotes invariant causal prediction; Jan, January.

Table 3 Performance of the invariant causal prediction model for out-of-hospital cardiac arrest at varying time intervals

Contribution of each predictor to the predicted value of OHCA incidence

The predictive importance of variables in the ICP model is shown in Fig. 3. With regard to meteorological variables, mean ambient temperature within a day was the variable most strongly contributing to the predicted OHCA incidence, followed by mean relative humidity, diurnal temperature range, the difference and mean of wind speed, and the difference and mean of precipitation. Among sociodemographic variables, the proportion living below the poverty level contributed most strongly to the predicted OHCA incidence, followed by the proportion with a high school diploma or higher, the proportions of Black and Asian persons, and median age.

Fig. 3: Importance of variables in a machine learning prediction model.

This figure shows a variable importance plot for meteorological variables (red), chronological variables (blue), and sociodemographic variables (black) in a machine learning prediction model using XGBoost. The yellow to purple dots in each row represent low to high normalized values of each predictor. The x-axis shows the Shapley value, indicating the variable’s impact on the model. Positive SHAP values tend to drive predictions toward more cases of OHCA, and negative SHAP values tend to drive predictions toward fewer cases of OHCA. * In the model, 2013 was considered year 0. OHCA denotes out-of-hospital cardiac arrest; SHAP, Shapley Additive Explanations; XGBoost, eXtreme Gradient Boosting.

Predictive performance based on annual average of daily mean ambient temperature

We compared the predictive performance of the XGBoost models stratified by the annual average of ambient temperature (Supplementary Table 2) at the aggregate level for all regions falling within each temperature category. The population was smallest in the low-temperature region, followed by the intermediate- and high-temperature regions. The predictive accuracy of both the non-ICP and ICP models was higher in the intermediate- and high-temperature regions than in the low-temperature region across both the training and testing datasets, consistent with the population sizes. However, there was little difference in predictive accuracy between summer and winter within the regions stratified by average ambient temperature (Supplementary Table 3).

Discussion

In this study, using an ML prediction model developed with a combination of meteorological, chronological, and sociodemographic variables, we successfully predicted the daily incidence of non-traumatic OHCAs in the United States with high precision at the nationwide, state, and agency levels. In addition, ICP identified meteorological and sociodemographic variables as consistently important predictors of daily OHCA incidence, ensuring nationwide generalizability regardless of the population sizes of the EMS agencies.

An association between ambient temperature and the incidence of cardiovascular events has been previously reported3,4,5,6,7,8,9,10. However, since these studies focused on ambient temperature or season alone, diversity in comprehensive meteorological, chronological, and sociodemographic variables was not considered. Recently, our team reported that an ML prediction model for OHCA incidence in Japan based on a comprehensive meteorological dataset and chronological variables had high predictive accuracy13. In the present study, when sociodemographic data were added to the prediction model, we achieved high predictive accuracy in the U.S. population. Model evaluation demonstrated a decline in predictive performance when applied across both external periods and areas, suggesting that areas participating in the OHCA registry may gain greater predictive accuracy when our ML models are implemented than areas not participating in the registry. Nevertheless, the model retained satisfactory performance under these external validation conditions.

By conducting ICP, we identified 17 of the 34 variables that contributed to the predicted non-traumatic OHCA incidence while preserving robust predictive accuracy regardless of the population sizes of the EMS agencies, thereby enhancing the generalizability of the prediction model. As a result, all chronological variables except year, January, and February were excluded from the prediction model by ICP. SHAP analysis revealed that sociodemographic variables contributed more strongly to the prediction of OHCA incidence than most meteorological factors. While differences in sociodemographic characteristics between counties might shape vulnerability to weather conditions, they are not directly relevant to daily fluctuations in OHCA incidence because those variables do not change throughout the year. This finding is consistent with epidemiological evidence that socioeconomic status and racial disparities can be related to cardiovascular risks and might also amplify the impact of environmental exposures14,16,17,18. In addition, meteorological predictors other than mean ambient temperature and relative humidity showed comparable contributions, suggesting that OHCA incidence is influenced by a complex interplay of weather factors, such as the heat index and wind chill, rather than by a single dominant variable. These results may emphasize the need for comprehensive inclusion of meteorological features in predictive models and also raise the hypothesis that social determinants of health might modulate the effects of climate on OHCA risk.

The predictive accuracy of the model was acceptable at the state level, albeit with diminished performance compared to the nationwide level. Our ML prediction model showed variation in predictive accuracy across states. Analyses stratified by average ambient temperature in the training and testing datasets showed that predictive performance was lowest in the low-temperature region, while the season variable (summer or winter) did not substantially change predictive accuracy. These results were partially explained by the population sizes of the participating areas. Collecting more samples may improve our ML model. In addition, populations residing in low-temperature areas might be more habituated and better able to cope with cold climates, for example through building insulation and lifestyle habits. Curriero et al. reported a latitude dependence of the temperature–mortality relationship in their analysis of 11 eastern cities in the United States19. More effective adaptation to colder temperatures was observed in cities further north. For practical use, the model needs further refinement to predict OHCA incidence within a medical catchment area.

Our prediction model retained a satisfactory level of performance up to seven days in advance, which holds potential value for proactive operational planning. Furthermore, the model’s predictive accuracy is inherently influenced by the quality of weather forecasts. Given that meteorological conditions can typically be predicted with reasonable accuracy up to seven days ahead, it is plausible that the model may sustain higher predictive performance over a similar temporal range. Beyond its predictive accuracy, our model has substantial potential for real-world application in EMS and hospital operations. In EMS, dynamic ambulance deployment informed by predictive analytics has been shown to reduce response times and improve patient outcomes20,21. Integrating our weather-sensitive predictions into such EMS operational workflows could enable proactive ambulance positioning before anticipated high-risk periods, thereby facilitating more rapid transport and enabling advanced post-arrest care. In the hospital setting, a systematic review highlighted that various preparedness activities, such as resource reallocation, can improve hospitals’ surge capacity during demand spikes22. Furthermore, a UK teaching hospital reported that its ML-based prediction pipeline for patient admissions substantially outperformed the conventional six-week rolling average benchmark in predictive accuracy, even after the COVID-19 outbreak23. Moreover, public health agencies could also leverage forecast outputs for targeted messaging campaigns, such as issuing warnings on predicted high-risk days, directed toward vulnerable populations. Therefore, these applications indicate that our model could be feasibly integrated into operational decision-making pipelines, strengthening both public health strategies and clinical preparedness in response to weather-related health risks.
Such system-level improvements have the potential not only to prevent cardiac arrest events but also to increase rates of prehospital return of spontaneous circulation and overall survival. A future prospective study to evaluate the effectiveness of this approach is needed.

This study has several inherent limitations. First, although the CARES registry is the largest database of OHCA in the United States, it covered approximately 53% of catchment areas as of 2022. The CARES registry includes only EMS-treated OHCA cases, which may introduce selection bias by excluding untreated cases. Thus, sociodemographic variables such as race and poverty rate may reflect biases in EMS activation or record keeping driven by socioeconomic factors; however, it represents the population most relevant for targeting public health interventions. Second, our data did not address the potential variability in patients’ preexisting medical conditions. Third, the 12 km resolution has limitations, particularly in areas with complex geography. In mountainous regions, for example, it tends to smooth elevation differences, potentially misrepresenting temperature extremes in valleys or at high altitudes24,25,26,27. Likewise, weather variability near coastlines or large water bodies, where conditions can change rapidly, is often averaged out, masking important local phenomena such as lake-effect snow or coastal heat stress. Nevertheless, a key advantage of 12 km data is its broad availability and consistency for public health applications. Fourth, the predictability of future OHCA events will depend on the accuracy of meteorological data. Finally, external testing in other developed countries was not performed.

In conclusion, an ML prediction model incorporating multiple meteorological, chronological, and sociodemographic variables could predict the incidence of non-traumatic OHCAs with high precision in the U.S. population. Through ICP, the variables were refined to focus on meteorological and sociodemographic factors, rather than chronological ones, while maintaining high predictive accuracy. This prediction model might be useful for public health prevention strategies in temperate regions.

Methods

The study was approved by the University of Michigan Hospital’s institutional review board (HUM00189913). The requirement for written informed consent was waived because the researchers only analyzed deidentified (anonymized) data.

Data source for OHCA data

OHCA data were provided by the Cardiac Arrest Registry to Enhance Survival (CARES)28, which is a prospective multicenter registry of patients with OHCA from 30 state-based registries, the District of Columbia, and more than 45 community sites in 16 additional U.S. states. It had a catchment area of approximately 175 million residents in 2021 (Fig. 4A). The design of the registry, which was established by the U.S. Centers for Disease Control and Prevention and Emory University, has been previously described28. Patient-level data were collected by EMS agencies using standardized international Utstein definitions for clinical variables and outcomes to ensure uniformity.

Fig. 4: Data sources and overview of the machine learning model for predicting out-of-hospital cardiac arrest incidence.

A illustrates the states and communities participating in the Cardiac Arrest Registry to Enhance Survival (CARES). B displays an example of daily maximum ambient temperature data on July 15, 2022, obtained from the North American Land Data Assimilation System (NLDAS). C summarizes the development process of the machine learning model used to predict out-of-hospital cardiac arrest incidence.

CARES includes non-traumatic OHCAs where resuscitative efforts were initiated by a 911 responder. The following patient information was collected and analyzed in this study: age, sex, etiology of arrest (i.e., presumed cardiac etiology, respiratory/asphyxia, drowning/submersion, electrocution, exsanguination/hemorrhage, drug overdose, or others), and location of cardiac arrest. Each patient in the CARES registry was geocoded to a U.S. county based on the ZIP code for the location of the OHCA through a crosswalk file from the U.S. Department of Housing and Urban Development. Data were submitted in two ways: via a data entry form on the CARES website (https://mycares.net/) or daily uploads from an EMS agency’s electronic patient care record system. The CARES analyst (R.A.-A.) reviewed records for completeness and accuracy. Due to the nature of the data resources, there were no missing values for developing and evaluating our prediction models.

Meteorological data

We analyzed meteorological data from the National Aeronautics and Space Administration (NASA) North American Land Data Assimilation System (NLDAS), which provides hourly gridded data with 12-km spatial resolution (Fig. 4B)29,30. Agencies such as NASA and NOAA maintain long-term archives at this resolution, allowing researchers to examine climate–health trends across time and large geographic areas24,25,26,27. For public health applications, the 12 km grid aligns reasonably well with many health datasets, which are typically reported at the county, ZIP code, or hospital level. This compatibility supports regional exposure assessment and facilitates studies linking environmental exposures, such as ambient temperature or humidity, to health outcomes including cardiovascular and respiratory conditions.

Meteorological variables included eight factors: the daily mean value and the difference between the maximum and minimum values within a day (diurnal range) of ambient temperature (°C), precipitation (mm), relative humidity (%), and wind speed (m/s) during the study period. These values were averaged over EMS agency areas. Due to the nature of the data resources, there were no missing values for these meteorological predictors.
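As a rough illustration of this preprocessing step, the daily mean and diurnal range can be derived from hourly gridded readings as follows. This is a minimal Python sketch, not the authors' actual pipeline; the hourly temperature values are invented for illustration.

```python
# Sketch: derive the daily mean and diurnal (max - min) range from 24
# hourly temperature readings for one EMS agency area. Values are toy data.
hourly_temp_c = [3.1, 2.7, 2.5, 2.9, 4.0, 5.8, 7.6, 9.1,
                 10.4, 11.2, 11.8, 12.0, 11.6, 10.9, 9.7, 8.2,
                 6.9, 5.8, 5.0, 4.4, 4.0, 3.6, 3.3, 3.0]

def daily_features(hourly):
    """Return (daily mean, diurnal range) from a day of hourly values."""
    mean = sum(hourly) / len(hourly)
    diurnal_range = max(hourly) - min(hourly)
    return round(mean, 2), round(diurnal_range, 2)

mean_t, dtr = daily_features(hourly_temp_c)
print(mean_t, dtr)  # the same reduction applies to humidity, wind, precipitation
```

The same two summaries (mean and within-day difference) would then be averaged over the grid cells covering each EMS agency area.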

Chronological and sociodemographic data

Chronological variables included 20 factors: year (2013 was considered year 0), month (January to December as categorical variables), and day of the week. Sociodemographic data were collected for six factors at the census tract level: median age (three categories: <38.8, 38.8–42.9, and ≥42.9 years, near its tertiles), proportion of men (three categories: <48.8%, 48.8–50.1%, and ≥50.1%, near its tertiles), race (proportion of Blacks [three categories: <1%, 1–5%, and ≥5%, near its tertiles] and proportion of Asians [three categories: <0.9%, 0.9–3%, and ≥3%, near its tertiles]), proportion of individuals with a high school diploma or higher (three categories: <60.5%, 60.5–65.3%, and ≥65.3%, near its tertiles), and proportion of individuals living below the poverty level (three categories: <7.9%, 7.9–13.1%, and ≥13.1%, near its tertiles). The sociodemographic data at the census tract level were merged into EMS agency areas. Where an EMS agency area was covered by multiple census tracts, we used the sociodemographic data from the tract with the highest number of OHCA cases. Due to the nature of the data resources, there were no missing values for these sociodemographic predictors.
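The tertile-style coding described above amounts to a simple three-way cut. A Python sketch, using the poverty-level cutpoints quoted in the text (the category labels are illustrative):

```python
def to_category(value, low_cut, high_cut, labels=("low", "middle", "high")):
    """Three-way coding: <low_cut, [low_cut, high_cut), and >=high_cut."""
    if value < low_cut:
        return labels[0]
    if value < high_cut:
        return labels[1]
    return labels[2]

# Poverty-level cutpoints from the text: <7.9%, 7.9-13.1%, >=13.1%
poverty_category = lambda pct: to_category(pct, 7.9, 13.1)
print(poverty_category(5.0), poverty_category(11.0), poverty_category(20.0))
# -> low middle high
```

The same function, with the respective cutpoints, covers median age, proportion of men, race proportions, and educational attainment.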

Data management and development of prediction models

In this study, we focused on EMS agencies that had data available in both periods: (A) at least part of January 1, 2013 to December 31, 2017, and (B) at least part of January 1, 2018 to December 31, 2019. By restricting the analysis to EMS agencies with data spanning both periods, we ensured comparability over time and reduced potential biases from agencies that appeared in only one timeframe, thereby strengthening the robustness of the evaluation. We then matched the CARES data with the meteorological, chronological, and sociodemographic data between January 1, 2013, and December 31, 2019. From this merged dataset, 80% of the data from January 1, 2013 to December 31, 2017 were used to construct the training dataset for developing the prediction model, while data from January 1, 2018 to December 31, 2019 were reserved as the testing dataset to assess the temporal generalizability of the model (Fig. 4C). In addition, the testing dataset was stratified into internal areas (i.e., the same areas as in the training data) and external areas (i.e., areas not included in the training dataset).

To develop prediction models for the daily incidence of OHCA, we used a GAM modeling time-series effects, the eXtreme Gradient Boosting (XGBoost) gradient boosting algorithm, CatBoost, and random forest31,32. As a reference model, we employed the GAM modeling time-series effects with a negative binomial distribution to analyze the temporal dynamics of event counts. The reference model incorporated long-term temporal trends, seasonal variation with a yearly cycle, and day-of-week effects, with population size included as an offset to model incidence rates. GAM can model non-linear associations of the predictors with the outcome by using spline functions. The tree-based models, XGBoost, CatBoost, and random forest, can also model non-linear associations between predictors and the outcome, and are inherently more robust to multicollinearity than GAM. We selected each model's hyperparameters by minimum RMSE while developing models in the training dataset between 2013 and 2017, using 5-fold cross-validation grouped by EMS agency. Using the hyperparameters with the minimum RMSE at the EMS agency level, we developed prediction models for the OHCA incidence per day per EMS agency. The candidate hyperparameters as well as the best-performing hyperparameters are provided in Supplementary Table 4.
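The agency-grouped cross-validation scheme keeps all records from one EMS agency inside a single fold, so validation agencies are never seen during training within a split. A Python sketch of that fold assignment, with invented agency IDs, to mimic (not reproduce) the tuning setup:

```python
import random

def group_kfold(groups, k=5, seed=0):
    """Assign each distinct group (here, EMS agency) to one of k folds."""
    uniq = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(uniq)
    fold_of = {g: i % k for i, g in enumerate(uniq)}
    return [fold_of[g] for g in groups]

# Toy records: (agency_id, day) pairs; each agency contributes many days.
records = [(f"agency_{a:02d}", d) for a in range(10) for d in range(3)]
folds = group_kfold([a for a, _ in records], k=5)

# Key property: every record from a given agency lands in the same fold.
by_agency = {}
for (a, _), f in zip(records, folds):
    by_agency.setdefault(a, set()).add(f)
assert all(len(s) == 1 for s in by_agency.values())
print("agencies per fold:",
      [sum(1 for fs in by_agency.values() if f in fs) for f in range(5)])
# -> agencies per fold: [2, 2, 2, 2, 2]
```

Hyperparameter grids would then be scored by the RMSE averaged over the five held-out folds.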

All of the aforementioned meteorological, chronological, and sociodemographic variables were used as initial predictors. The predictor variables were standardized based on the mean and standard deviation of each predictor in the training dataset. Log-transformed population size for each agency area was included as a predictor for CatBoost and random forest, and as an offset term for the GAM modeling time-series effects and the XGBoost algorithm. We assessed the predictive performance of the models using the testing dataset (i.e., data between 2018 and 2019 from all participating agency areas) stratified into internal and external areas. We used the mgcv package for R, version 1.8-42 (https://cran.r-project.org/web/packages/mgcv/index.html) for the GAM modeling time-series effects, the ranger package for R, version 0.15.1 (https://cran.r-project.org/web/packages/ranger/index.html) for the random forest model, the bonsai package for R, version 0.4.0 (https://cran.r-project.org/web/packages/bonsai/index.html) for the CatBoost model, and the xgboost package for R, version 1.7.7.1 (https://cran.r-project.org/web/packages/xgboost/index.html) for the XGBoost models.
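Standardization with statistics taken from the training dataset only (so the testing data never leak into the scaling) can be sketched as follows; this is a generic Python illustration with toy values, not the authors' R code:

```python
import math

def fit_scaler(train_col):
    """Mean and (population) standard deviation of a training-set column."""
    mu = sum(train_col) / len(train_col)
    sd = math.sqrt(sum((x - mu) ** 2 for x in train_col) / len(train_col))
    return mu, sd

def transform(col, mu, sd):
    """Z-score a column using previously fitted training statistics."""
    return [(x - mu) / sd for x in col]

train_temp = [2.0, 4.0, 6.0, 8.0]   # toy training values for one predictor
test_temp = [5.0, 10.0]             # scaled with TRAIN statistics only
mu, sd = fit_scaler(train_temp)
z_test = transform(test_temp, mu, sd)
print(mu, round(sd, 3), [round(z, 3) for z in z_test])
```

Fitting the scaler on the training split and reusing it on 2018–2019 data is what makes the temporal evaluation honest.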

The present study used ICP to identify key predictors of the number of cardiac arrest cases across environments defined by combinations of deciles of agency population size, tertiles of the proportion living below the poverty level, and tertiles of the proportion with a high school diploma or higher at the EMS agency level, based on previous evidence33,34. ICP identifies variables that consistently predict outcomes across different environments by assuming that the underlying causal structure remains constant even as environmental factors vary. Among socioeconomic determinants associated with cardiovascular events, the evidence regarding these two factors has been inconsistent14. Including too many variables as environmental factors would create an excessive number of strata, resulting in small sample sizes within each stratum and potentially overlooking useful predictors. Therefore, we limited the environmental factors to key sociodemographic variables. Predictor selection in the ICP analysis is achieved by statistical tests that evaluate candidate models. We applied the ICP approach three times: (1) to weather variables, (2) to sociodemographic variables, and (3) to calendar variables, varying the combination categories of the environmental factors, to identify important predictors of each type. We used the InvariantCausalPrediction package (version 0.8) for R (https://CRAN.R-project.org/package=InvariantCausalPrediction). Then, to clarify the robustness of this prediction model at varying time intervals, we also tested 3-day-ahead and 7-day-ahead predictive performance.
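The core idea of ICP (retain a predictor only if its relationship with the outcome is stable across environments) can be caricatured with a small Python sketch: fit a least-squares slope within each environment and keep the predictor when the slopes agree within a tolerance. This is a toy illustration of the invariance principle, not the hypothesis-testing procedure implemented in the InvariantCausalPrediction package.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def looks_invariant(envs, tol=0.2):
    """Crude invariance check: per-environment slopes agree within `tol`."""
    slopes = [slope(xs, ys) for xs, ys in envs]
    return max(slopes) - min(slopes) <= tol

# Toy environments (e.g., poverty-level tertiles). A stable predictor has
# roughly the same slope everywhere; an unstable one flips sign.
stable = [([0, 1, 2, 3], [0.0, 1.0, 2.1, 3.0]),
          ([0, 1, 2, 3], [0.1, 1.1, 2.0, 3.1])]
unstable = [([0, 1, 2, 3], [0.0, 1.0, 2.0, 3.0]),
            ([0, 1, 2, 3], [3.0, 2.0, 1.0, 0.0])]
print(looks_invariant(stable), looks_invariant(unstable))  # -> True False
```

The real ICP procedure replaces the tolerance heuristic with formal tests of the invariance null hypothesis over candidate predictor subsets.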

Primary and secondary outcomes

The primary outcome was the predictive accuracy of OHCA incidence per 100,000 per day at the nationwide level, based on RMSE, MAE, and MAPE, which are commonly used measures of predictive accuracy for a forecasting method. The secondary outcome was the predictive accuracy of OHCA incidence per 100,000 per day at the state level, which was limited to 24 state-based registries, and at the EMS agency level.

The prediction model developed in this study is based on meteorological factors, which influence entire regions, and is ultimately intended for implementation at a medical catchment area. Therefore, we evaluated model performance using continuous-outcome metrics that are appropriate for aggregated data, along with correlation-based statistics to assess concordance between observed and predicted regional incidence.
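For reference, the three accuracy metrics follow their standard definitions and can be computed as below (a Python sketch; the observed and predicted arrays are invented toy incidences, not study data):

```python
def rmse(obs, pred):
    """Root mean squared error."""
    return (sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)) ** 0.5

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def mape(obs, pred):
    """Mean absolute percentage error; observed values must be non-zero."""
    return 100 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)

# Toy daily incidences per 100,000 (observed vs. predicted)
obs = [0.20, 0.25, 0.18, 0.22]
pred = [0.21, 0.23, 0.20, 0.22]
print(round(rmse(obs, pred), 4),
      round(mae(obs, pred), 4),
      round(mape(obs, pred), 2))
```

RMSE penalizes large misses more than MAE, while MAPE expresses error relative to the observed incidence, which is why all three are reported together.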

Sample size calculation

We evaluated the sample size required for the testing dataset based on the precision of the RMSE, targeting a 95% CI relative half-width of ±10%. Because the data have a region × day structure, we accounted for within-region and within-day correlations using a design effect (DEFF) approximation35,36. This two-way DEFF approach aligns with the multi-way clustering framework but omits cross-cluster terms, providing a conservative estimate37. We estimated intraclass correlation coefficients (ICCs) from the training dataset (842,534 observations, 728 regions, 1826 days), yielding a substantial intra-region ICC (0.34) and a negligible intra-day ICC (0.0002); the resulting DEFF was approximately 399. To achieve the target precision (±10%), about 196 effective degrees of freedom are required; adding 33 model parameters yields an effective sample size of ~229. Accounting for the DEFF, this corresponds to approximately 91,120 observations needed for the testing dataset. Our testing datasets had 511,000 observations for the internal areas and 129,940 observations for the external areas. Thus, we confirmed that our testing datasets had a sufficient sample size to evaluate the developed prediction models. The details are shown in Supplementary Methods.
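The design-effect arithmetic can be reproduced approximately with the rounded values quoted above (a Python sketch; because the ICCs are rounded to two and four decimals here, the result differs slightly from the reported DEFF of ~399 and required n of ~91,120):

```python
# Rounded inputs quoted in the text
n_obs, n_regions, n_days = 842_534, 728, 1_826
icc_region, icc_day = 0.34, 0.0002

m_r = n_obs / n_regions  # average days per region (~1157)
m_d = n_obs / n_days     # average regions per day (~461)

# Two-way design effect: DEFF ~ 1 + (m_r - 1)*rho_r + (m_d - 1)*rho_d
deff = 1 + (m_r - 1) * icc_region + (m_d - 1) * icc_day

# Target: ~196 effective degrees of freedom plus p = 33 model parameters
n_eff_required = 196 + 33
n_required = n_eff_required * deff
print(round(deff, 1), round(n_required))
```

With these rounded inputs the sketch gives a DEFF just under 395 and a required n near 90,300, within about 1% of the reported figures.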

Sample size calculation

The prediction evaluation metric was the root mean squared error (RMSE). We defined the sample size required for the testing dataset as the number of observations necessary for the 95% confidence interval (CI) of the RMSE to have a relative half-width (half-width divided by the true value) within ±10%. Because the RMSE estimator is based on the error variance, its sampling distribution follows a chi-squared law.
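The degrees of freedom implied by this precision target can be sketched with a normal approximation, under which the relative half-width of the 95% CI for an RMSE estimated from ν independent squared errors is roughly z/√(2ν). This is a simplification for illustration; the exact chi-squared interval used in the text gives a slightly different (larger) requirement.

```python
import math

def required_df(rel_half_width=0.10, z=1.96):
    """Degrees of freedom needed so that the 95% CI for an RMSE has a
    relative half-width <= rel_half_width, using the normal
    approximation: half-width ~ z / sqrt(2 * nu)."""
    return math.ceil((z / rel_half_width) ** 2 / 2)

print(required_df())  # ~193 under this approximation
```

The approximation lands close to, but slightly below, the ν ≈ 196 reported below, which was presumably derived from the exact chi-squared distribution.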

Because our dataset had a day × region panel structure, correlations are expected within regions and within days. To avoid overestimating precision, we applied a design effect (DEFF) correction35,36. With two clustering dimensions, DEFF can be approximated as:

$$\mathrm{DEFF}\approx 1+(m_r-1)\rho_r+(m_d-1)\rho_d,$$
(1)

where $m_r$ is the average number of days per region, $m_d$ the average number of regions per day, and $\rho_r$, $\rho_d$ the intraclass correlation coefficients (ICCs). ICCs were estimated by fitting a linear mixed-effects model with region and day as random effects. The DEFF was then computed, and the effective sample size was defined as:
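As a simplified illustration of ICC estimation, the sketch below uses a method-of-moments one-way random-effects ANOVA estimator for a single grouping factor (e.g., region). This is not the crossed mixed-effects model described above, which estimates region and day components jointly, but it conveys the quantity being estimated.

```python
import numpy as np

def oneway_icc(values, groups):
    """Method-of-moments ICC for one grouping factor via a one-way
    random-effects ANOVA (simplification of the crossed mixed model)."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    k, n = len(labels), len(values)
    grand = values.mean()
    ss_b = ss_w = 0.0
    sizes = []
    for g in labels:
        v = values[groups == g]
        sizes.append(len(v))
        ss_b += len(v) * (v.mean() - grand) ** 2   # between-group sum of squares
        ss_w += ((v - v.mean()) ** 2).sum()        # within-group sum of squares
    ms_b = ss_b / (k - 1)
    ms_w = ss_w / (n - k)
    # effective group size for (possibly) unbalanced designs
    n0 = (n - sum(s ** 2 for s in sizes) / n) / (k - 1)
    var_between = max((ms_b - ms_w) / n0, 0.0)
    return var_between / (var_between + ms_w)
```

On synthetic data with equal between- and within-group variances (true ICC = 0.5), the estimator recovers a value near 0.5.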

$$n_{\mathrm{eff}}=N/\mathrm{DEFF}$$
(2)

with effective degrees of freedom $\nu_{\mathrm{eff}}=n_{\mathrm{eff}}-p$ ($p$: number of fixed-effect parameters).

In this study, to account for clustering in two dimensions (region × day), we adopted a sum-approximation DEFF35,36. This approach is consistent with the framework of multi-way clustering inference proposed by Cameron et al., but it ignores cross-cluster interaction terms and thus represents a conservative approximation37.

First, we evaluated the correlation structure of the training data (842,534 observations, 728 regions, 1826 days). On average, each region contained 1157 days, and each day included 461 regions. Variance component estimates from the linear mixed model indicated that the intra-region ICC was relatively high (0.34), whereas the intra-day ICC was negligible (0.0002). Based on these results, the DEFF was calculated to be approximately 399. To ensure that the relative half-width of the 95% CI for the RMSE remains within ±10%, a theoretical effective degrees of freedom of $\nu_{\mathrm{target}}\approx 196$ is required. Given the number of estimated model parameters ($p=33$), the required effective sample size was $\nu_{\mathrm{target}}+p\approx 229$. Taking the DEFF into account, the corresponding number of observations is

$$N_{\mathrm{required}}\approx (\nu_{\mathrm{target}}+p)\times \mathrm{DEFF}\approx 229\times 398.6\approx 91{,}120.$$
(3)

Therefore, using the ICC and DEFF estimated from the training data, we determined that the required sample size for the unobserved testing data is approximately 91,120 observations. Our testing datasets had 511,000 observations for the internal areas and 129,940 observations for the external areas. Thus, we confirmed that our testing datasets were large enough to evaluate the developed prediction models.
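The arithmetic above can be reproduced directly from the reported quantities. Note that plugging in the rounded ICCs quoted in the text (0.34 and 0.0002) gives a DEFF of about 394 and a required N of about 90,300, slightly below the reported 398.6 and 91,120; the gap reflects rounding of the ICC estimates.

```python
# Reported averages and (rounded) ICC estimates from the training data
m_r, m_d = 1157, 461          # mean days per region, mean regions per day
rho_r, rho_d = 0.34, 0.0002   # intra-region and intra-day ICCs

# Sum-approximation design effect: DEFF ~ 1 + (m_r-1)*rho_r + (m_d-1)*rho_d
deff = 1 + (m_r - 1) * rho_r + (m_d - 1) * rho_d

# Required observations: (target effective df + model parameters) * DEFF
nu_target, p = 196, 33
n_required = (nu_target + p) * deff

print(round(deff, 1), round(n_required))
```

Either way, the requirement is comfortably below the 511,000 internal and 129,940 external testing observations.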

Statistical analysis

The characteristics of the present dataset were summarized with medians and interquartile ranges (IQRs) for continuous variables, and numbers and percentages for categorical variables by area and day in the training and testing datasets.

We examined the concordance between the OHCA incidence predicted by the ML model and the observed incidence in the training and testing datasets. The predictive accuracy of the models was evaluated using the RMSE, MAE, and MAPE between values calculated with the prediction models and the observed daily OHCA incidence at the EMS agency, state, and national levels. RMSE and MAE reflect the average magnitude of the differences between predicted and observed values; both range from zero to infinity, and lower values indicate higher predictive performance. MAPE is the average of the absolute errors divided by the observed values; it also ranges from zero to infinity, lower values indicate higher predictive performance, and a MAPE below 10% is generally considered to indicate highly accurate prediction38. The formulas are as follows:

$${\rm{RMSE}}=\sqrt{\frac{1}{n}\mathop{\sum }\limits_{t=1}^{n}{({observed\; value}-{predicted\; value})}^{2}}$$
(4)
$$\mathrm{MAE}=\frac{1}{n}\mathop{\sum }\limits_{t=1}^{n}|{observed\; value}-{predicted\; value}|$$
(5)
$$\mathrm{MAPE}=\frac{1}{n}\mathop{\sum }\limits_{t=1}^{n}\frac{|{observed\; value}-{predicted\; value}|}{{observed\; value}}\times 100$$
(6)
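The three metrics are straightforward to compute; a minimal NumPy sketch (note that MAPE is undefined whenever an observed value is zero, which matters for sparse agency-level daily counts):

```python
import numpy as np

def rmse(obs, pred):
    """Root mean squared error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def mae(obs, pred):
    """Mean absolute error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(obs - pred)))

def mape(obs, pred):
    """Mean absolute percentage error; requires all observed values != 0."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(obs - pred) / obs) * 100)

obs, pred = [2.0, 4.0, 5.0], [2.5, 3.5, 5.5]
print(rmse(obs, pred), mae(obs, pred), mape(obs, pred))  # 0.5 0.5 ~15.83
```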

Additionally, we obtained point estimates and 95% CIs for the prediction accuracy metrics (RMSE, MAE, MAPE). For each metric, we constructed a constant-only regression model with the corresponding error series as the dependent variable and obtained robust standard errors with two-way clustering (by region and time). Estimation was conducted in R using the fixest package, version 0.11.1 (https://cran.r-project.org/web/packages/fixest/index.html). This approach accounts for spatiotemporal dependence in the data when constructing CIs.
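For a constant-only model, the two-way cluster-robust variance reduces to the Cameron–Gelbach–Miller combination V(region) + V(day) − V(region∩day) applied to the mean of the error series. A minimal NumPy sketch of this idea (the study itself used R's fixest, not this code):

```python
import numpy as np

def twoway_cluster_ci(errors, region, day, z=1.96):
    """Mean of `errors` with a 95% CI based on two-way (region, day)
    cluster-robust variance, per Cameron, Gelbach & Miller (2011)."""
    e = np.asarray(errors, float)
    n = len(e)
    mean = e.mean()
    d = e - mean  # residuals of the constant-only model

    def cluster_var(ids):
        # sum over clusters of (within-cluster residual sum)^2, over n^2
        sums = {}
        for resid, g in zip(d, ids):
            sums[g] = sums.get(g, 0.0) + resid
        return sum(s ** 2 for s in sums.values()) / n ** 2

    intersection = list(zip(region, day))
    var = cluster_var(region) + cluster_var(day) - cluster_var(intersection)
    se = float(np.sqrt(max(var, 0.0)))  # clamp rare negative estimates
    return mean, (mean - z * se, mean + z * se)
```

With each (region, day) cell observed once, the intersection term is simply the heteroskedasticity-robust variance.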

To identify important predictors of OHCA incidence in the model developed with ICP, we used Shapley Additive Explanations (SHAP) values, which summarize the contribution of each predictor to the predicted value of an instance39,40. For a given set of feature values, a SHAP value reflects how much a single variable, in the context of its interactions with other variables, contributes to the difference between the actual prediction and the mean prediction. As an additional analysis, we assessed the predictive accuracy of the ML model stratified into low-, intermediate-, and high-temperature areas, further divided into summer (June–August) and winter (December–February). Low-, intermediate-, and high-temperature regions were defined as regions with mean ambient temperature at or below the 25th percentile, between the 25th and 75th percentiles, and at or above the 75th percentile, respectively.
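The additive decomposition behind SHAP can be illustrated with an exact brute-force Shapley computation on a toy two-feature model (purely illustrative; the study applied the SHAP library to the trained XGBoost model, which scales this idea to many features via tree-specific algorithms). Features outside a coalition are marginalized by averaging the model over a background dataset:

```python
import itertools
import math
import numpy as np

def shapley_values(model, x, background):
    """Exact Shapley values for one instance `x`: average `model` over
    `background` rows for features outside each coalition."""
    n_feat = len(x)
    base = np.array(background, dtype=float)

    def value(coalition):
        # expected model output with coalition features fixed to x
        data = base.copy()
        for j in coalition:
            data[:, j] = x[j]
        return float(np.mean([model(row) for row in data]))

    phi = np.zeros(n_feat)
    for j in range(n_feat):
        others = [k for k in range(n_feat) if k != j]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                w = (math.factorial(len(S)) * math.factorial(n_feat - len(S) - 1)
                     / math.factorial(n_feat))
                phi[j] += w * (value(S + (j,)) - value(S))
    return phi

# Toy linear model: contributions are exactly the coefficients times x
phi = shapley_values(lambda row: 2 * row[0] + 3 * row[1],
                     x=[1.0, 1.0], background=[[0.0, 0.0]])
print(phi)  # [2. 3.]
```

The values satisfy the efficiency property: they sum to the prediction minus the background mean prediction (here, 5 − 0).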

All statistical analyses were performed with R statistical software, version 4.2.3 (https://www.R-project.org/). Missing values were imputed by the median for each continuous variable and treated as a separate "missing" category for each categorical variable. These missing-data procedures work well with XGBoost owing to the nature of decision-tree algorithms.
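The missing-data handling can be sketched as follows (a pandas illustration of the approach; the actual preprocessing was done in R, and the column names here are hypothetical):

```python
import numpy as np
import pandas as pd

def impute(df, continuous, categorical):
    """Median-impute continuous columns; route missing categorical
    values to an explicit 'missing' level."""
    out = df.copy()
    for col in continuous:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical:
        out[col] = out[col].astype(object).fillna("missing")
    return out

# Hypothetical example columns
df = pd.DataFrame({"temp": [1.0, np.nan, 3.0],
                   "season": ["winter", None, "summer"]})
print(impute(df, ["temp"], ["season"]))
```

Tree-based learners such as XGBoost then treat the "missing" level as just another split candidate, so no rows are dropped.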