Abstract
Estimation of exposure–response association is central to epidemiologic research. Although the advantages of machine learning (ML) techniques for modeling complex relationships are well-recognized, their use in epidemiologic studies is limited mainly because they do not provide direct estimates of associations, such as odds ratios (ORs). We suggest eight hybrid estimators of the OR that are functions of the output from a classifier, or their probability-calibrated form, multiplied by an adjustment factor that is ‘borrowed’ from logistic regression (LR). We also suggest two estimators based on partial dependence functions. We applied these estimators to output from LR, random forest (RF) and gradient boosting (GB) models to investigate associations between (1) temperature and respiratory or cardiovascular admissions and (2) prenatal exposure to temperature and overweight among infants. Most (87%) of the estimates produced by GB were within the LR 95% CI, but for RF the results were mixed: 0%, 60% and 13% of the estimates were within this CI for the Respiratory, Cardiovascular and Infants data, respectively. Additionally, GB-based CIs for the uncalibrated estimates were narrower by 13–59% compared to the LR CIs. These findings may enhance the integration between ML and epidemiologic research by providing interpretable results.
Introduction
Machine learning (ML) methods can handle complex non-linear input–output relations and potential interactions among the covariates. They often have high predictive performance and are robust to multicollinearity and outliers1,2. In recent years, ML models have been increasingly used in environmental epidemiology3,4: for prediction of exposures5,6,7, to predict health outcomes8,9,10, for variable selection11, in “auxiliary” analyses (e.g., to obtain propensity12 or survival13 scores), and in combinations of the above14.
Estimation of exposure–response association is central to epidemiologic research. In particular, traditional approaches such as logistic regression (LR) provide measures of association like the odds ratio (OR). However, ML techniques do not offer such direct association measures and only a few studies have dealt with this issue. In an early work, analytic OR estimates were provided for a two-layer neural network15. Since an analytic approach is not applicable to many of the ML models, an alternative approach is to estimate ORs using the output scores of such models. However, these scores are often not directly interpretable as class probability estimates16 and a common way to address this is probability-calibration of the scores. The present paper focuses on three representative calibration methods: Platt scaling17 that essentially transforms the output scores into probability estimates with logistic regression; isotonic regression18 that aims for a monotonic fit to the output scores; and the GUESS algorithm19 that is based on parametric estimation of the score distribution in each class.
Only a few studies attempted to estimate the OR using these methods. Continuous OR curves were estimated based on a GUESS-type calibration of predictions from a one-layer neural network combined with a kernel mapping2. In a study investigating associations between air pollution exposure and congenital heart defects, the authors used partial dependence functions (ParDepFuns) to estimate OR curves based on output from random forest (RF) and gradient boosting (GB) models20. The ParDepFun was suggested as a tool for understanding underlying relations between specific predictors and an outcome for “black box” models21,22 and provides marginal class probabilities.
Yet, to our knowledge, there is no rigorous investigation of possible estimates of ORs that are based on ML models. Importantly, if an estimate of the OR was available for analyses based on ML methods, epidemiology and other fields could have benefited from both the flexibility and accuracy of these models as well as the interpretability of the results. Here, we seek to gain insight by suggesting ten estimators of the OR using the output scores of ML algorithms, or their probability-calibrated form. We illustrate these methods by applying LR, RF and GB models to diverse epidemiologic datasets. The first dataset is from a case–control study investigating associations between low temperature and cardiovascular and respiratory admissions of elderly patients. The second is from a historical cohort study of association between prenatal exposure to temperature and overweight among infants.
Methods
The traditional estimator of OR as a function of the predictions
First we define the conditional OR. Let \(Y\) be a binary health outcome, \(X\) a binary exposure, and \(Z\) a covariate. We would like to estimate
We examine the traditional estimator obtained from a logistic regression (LR) model and show that it can be formulated in terms of the predictions from that model and an adjustment factor. Let \(p=E\left(Y|X,Z\right)=P\left(Y=1|X,Z\right)\) and consider a logistic model of the form \(\text{logit}\left(p\right)=\text{log}\left[p/(1-p)\right]=\alpha +\beta x+\gamma z\). Applying this model to a dataset with \(n\) subjects, the predicted value of \(p, {\widehat{p}}_{i},\) for subject \(i, i=1,\dots ,n\) is extracted from \(\text{logit}\left({\widehat{p}}_{i}\right)=\widehat{\alpha }+\widehat{\beta }{x}_{i}+\widehat{\gamma }{z}_{i},\) where \(\widehat{\alpha },\widehat{\beta } \) and \(\widehat{\gamma}\) are the parameter estimates, and the OR of interest is estimated by \(\text{exp}(\widehat{\beta })\). Without loss of generality, we index the subjects so that subjects \(1,\dots ,{n}_{1}\) form the exposed group (group \({E}_{1}\)) and subjects \({n}_{1}+1,\dots , n\) form the unexposed group (\({E}_{0}\)), such that \({n}_{0}=n-{n}_{1}.\) Then
where \({\overline{z} }_{1}\) and \({\overline{z} }_{0}\) are the mean values of \(Z\) in the exposed and unexposed groups, respectively.
Let \(G{M}_{k}\left({\widehat{o}}_{LR}\right)={\left\{\prod_{i\in {E}_{k}}\left[{\widehat{p}}_{i}/(1-{\widehat{p}}_{i})\right]\right\}}^{{1/n}_{k}}\) be the geometric mean of the predicted odds, \({\widehat{p}}_{i}/(1-{\widehat{p}}_{i}),\) in group \({E}_{k}, k=\text{0,1}\), then \(\text{log}\left[G{M}_{1}\left({\widehat{o}}_{LR}\right)/G{M}_{0}\left({\widehat{o}}_{LR}\right)\right]=\widehat{\beta }+\widehat{\gamma }({\overline{z} }_{1}-{\overline{z} }_{0}).\) Therefore,
where \({\widehat{A}}_{LR}\) is an adjustment factor and \({c\widehat{OR}}_{LR}\) is the “crude” OR. Therefore, the “traditional” OR is a product of two terms: The first term is an adjustment factor reflecting differences between the mean values of the covariates at different exposure levels and the second term is a ratio of the geometric means of individual odds (GMO) at two exposure levels. The generalization to additional covariates, including functions of \(Z\) like polynomials, and to an exposure variable with more than two levels is immediate. Equation (1) implies that the adjustment factor can also be estimated by \({\widehat{A}}_{LR}={\widehat{OR}}_{LR}/{c\widehat{OR}}_{LR}.\)
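Although the study's analyses were carried out in R, the decomposition in Eq. (1) can be checked numerically with a short Python sketch (the data and coefficient values below are simulated and purely illustrative): because the LR predictions are an exact logistic function of \(X\) and \(Z\), multiplying the GMO-based crude OR by the adjustment factor recovers \(\text{exp}(\widehat{\beta })\) up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.integers(0, 2, n)              # binary exposure
z = rng.normal(0.0, 1.0, n) + 0.5 * x  # covariate, imbalanced across exposure groups

# illustrative LR coefficients; the predictions below are an exact
# logistic function of x and z, which is all the identity requires
alpha, beta, gamma = -1.0, 0.7, 0.4
p = 1.0 / (1.0 + np.exp(-(alpha + beta * x + gamma * z)))
odds = p / (1.0 - p)

# crude OR: ratio of geometric means of the predicted odds (GMO)
log_gm1 = np.log(odds[x == 1]).mean()
log_gm0 = np.log(odds[x == 0]).mean()
c_or = np.exp(log_gm1 - log_gm0)

# adjustment factor for covariate imbalance between exposure groups
a_lr = np.exp(-gamma * (z[x == 1].mean() - z[x == 0].mean()))

or_lr = a_lr * c_or  # equals exp(beta) up to floating-point error
```

Note that the identity holds for any coefficient values, since it follows from the linearity of the fitted log-odds rather than from the fitting procedure itself.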
Robust estimator of the OR
The geometric means in Eq. (1) are undefined when an output score equals one and equal zero when any score equals zero. We suggest replacing output values of 1 or 0 with the second-largest or second-smallest values, respectively. Additionally, the GMO-based estimator can be unstable when a high proportion of predictions lies near zero or one. We therefore suggest more robust estimators based on the ratio of the odds of mean prediction (OMP), defined by \({c\widetilde{OR}}_{LR}=\left[{\overline{p} }_{1}/(1-{\overline{p} }_{1})\right]/ \left[{\overline{p} }_{0}/(1-{\overline{p} }_{0})\right]\) and \({\widetilde{OR}}_{LR}={\widehat{A}}_{LR}\times {c\widetilde{OR}}_{LR},\) where \({\overline{p} }_{1}\) and \({\overline{p} }_{0}\) are the average predictions for the exposed and unexposed groups, respectively. The GMO- and OMP-type estimators were inspired by the separate and combined ratio estimators in sampling theory23.
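A minimal numpy sketch of the two crude estimators, including the suggested replacement of exact 0/1 scores (function names are ours, not from the paper):

```python
import numpy as np

def replace_extremes(s):
    """Replace scores equal to 0 or 1 with the second-smallest/-largest values."""
    s = np.asarray(s, dtype=float).copy()
    interior = s[(s > 0) & (s < 1)]
    s[s == 0] = interior.min()
    s[s == 1] = interior.max()
    return s

def crude_or_gmo(s, x):
    """Ratio of geometric means of the individual odds between exposure groups."""
    odds = s / (1 - s)
    return np.exp(np.log(odds[x == 1]).mean() - np.log(odds[x == 0]).mean())

def crude_or_omp(s, x):
    """Ratio of the odds of the mean prediction between exposure groups."""
    p1, p0 = s[x == 1].mean(), s[x == 0].mean()
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

# toy scores: three exposed subjects, three unexposed, with one 1 and one 0
scores = replace_extremes([0.8, 0.8, 1.0, 0.5, 0.5, 0.0])
expo = np.array([1, 1, 1, 0, 0, 0])
```

With these toy scores both crude estimators equal 4, since after replacement the scores are constant within each exposure group.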
Hybrid estimator of the OR
In this section we extend the estimate in Eq. (1) to a general binary classifier. Let \(g=g\left(X,Z\right)\) be a binary classifier with well-calibrated output scores \({s}_{gi}=\widehat{g}\left(X={x}_{i},Z={z}_{i}\right),\ 0\le {s}_{gi}\le 1\), \(i=1,\dots ,n.\) We approximate the log-odds of \(s\) using an additive model of the form \(\text{logit}\left({s}_{gi}\right)=\widehat{\delta }+\widehat{\eta }x+\widehat{\varphi }\left(z\right),\) where \(\varphi\) is an arbitrary function. This approximation is in the spirit of generalized additive models and is reasonable for a binary exposure that has no interaction with \(Z\). Let \({\widehat{o}}_{gi}={s}_{gi}/{(1-s}_{gi})\) be the individual odds, \(G{M}_{k}\left({\widehat{o}}_{g}\right), k=\text{0,1}\) their geometric means in the unexposed and exposed groups and \({c\widehat{OR}}_{g}=\frac{G{M}_{1}\left({\widehat{o}}_{g}\right)}{G{M}_{0}\left({\widehat{o}}_{g}\right)}.\) Under the assumption that \(\frac{{\widehat{OR}}_{g}}{{c\widehat{OR}}_{g}}\approx \frac{{\widehat{OR}}_{LR}}{{c\widehat{OR}}_{LR}}={\widehat{A}}_{LR},\) the hybrid estimator is \({\widehat{OR}}_{g}={\widehat{A}}_{LR}\times {c\widehat{OR}}_{g}.\) The adjustment factor is a measure of exposure imbalance in the data. The assumption that it has similar values for the LR and ML models suggests that they have similar differences between the averages of the covariates (which may have been transformed) in the exposed and unexposed groups. The respective OMP-type estimators are \({c\widetilde{OR}}_{g}=\left[{\overline{s} }_{g1}/{(1-\overline{s} }_{g1})\right]/\left[{\overline{s} }_{g0}/{(1-\overline{s} }_{g0})\right]\) and \({\widetilde{OR}}_{g}={\widehat{A}}_{LR}\times {c\widetilde{OR}}_{g},\) where \({\overline{s} }_{g1}\) and \({\overline{s} }_{g0}\) are the average scores for the exposed and unexposed groups, respectively.
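As an illustration, the hybrid estimator combines the LR adjustment factor with the crude OR computed from the ML scores. The sketch below uses Python/scikit-learn stand-ins for the R packages used in the study, on simulated data; for brevity, extreme scores are clipped rather than replaced by the second-smallest/largest values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 3000
x = rng.integers(0, 2, n)              # binary exposure
z = rng.normal(0, 1, n) + 0.5 * x      # covariate, imbalanced by exposure
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.7 * x + 0.4 * z))))
X = np.column_stack([x, z])

def crude_or_gmo(s, x):
    """Ratio of geometric means of the individual odds between exposure groups."""
    odds = s / (1 - s)
    return np.exp(np.log(odds[x == 1]).mean() - np.log(odds[x == 0]).mean())

# adjustment factor 'borrowed' from an (effectively unpenalized) LR
lr = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
or_lr = np.exp(lr.coef_[0][0])
a_lr = or_lr / crude_or_gmo(lr.predict_proba(X)[:, 1], x)

# ML scores; clip exact 0/1 votes before taking logs
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
s = np.clip(rf.predict_proba(X)[:, 1], 1e-3, 1 - 1e-3)

or_hybrid = a_lr * crude_or_gmo(s, x)  # hybrid GMO-type estimate
```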
OR estimator based on partial dependence functions
The ParDepFun of \(h=h(X,Z)\) on the exposure \(X\) represents the relation between the exposure and the outcome after accounting for the average effect of the covariate \(Z\)21,22. It can be estimated by the marginal average of \(\widehat{h},\) \(\overline{\widehat{h} }\left(X\right)=\frac{1}{n}\sum_{i=1}^{n}\widehat{h}\left(X,{z}_{i}\right).\) If \(h\left(X,Z\right)=\delta +\eta x+\varphi \left(z\right),\) then the ParDepFun of \(X\) is estimated by \(\overline{\widehat{h} }\left(X\right)=\frac{1}{n}\sum_{i=1}^{n}(\widehat{\delta }+\widehat{\eta }X+\widehat{\varphi }\left({z}_{i}\right))=\widehat{\eta }X+\left(\widehat{\delta }+\overline{\widehat{\varphi } }\right),\) and has two values: \(\overline{\widehat{h} }\left(1\right)=\) \(\widehat{\eta }+\left(\widehat{\delta }+\overline{\widehat{\varphi } }\right)\) and \(\overline{\widehat{h} }\left(0\right)=\) \(\widehat{\delta }+\overline{\widehat{\varphi } }.\) Therefore, \(\widehat{\eta }\) can be estimated by \(\overline{\widehat{h} }\left(1\right)-\overline{\widehat{h} }\left(0\right).\) For \(h=\text{logit}(g)\) the GMO-type estimate of the OR is \(\widehat{OR}\left(ParDepFun\right)=\text{exp}\left[\overline{\widehat{h} }\left(1\right)-\overline{\widehat{h} }\left(0\right)\right].\) This estimate can be computed for any ML model as follows:
Step 1. Fit the classifier \(g\) to the original data.
Step 2. Create two new datasets: In data1, replace all values of \(X\) by 1 while keeping the original values of the outcome and all other covariates; in data0, replace all values of \(X\) by 0.
Step 3. Pass data1 and data0 through the classifier fitted in step 1 and obtain two sets of scores \({\left\{{s}_{1i}\right\}}_{i=1}^{n}\) and \({\left\{{s}_{0i}\right\}}_{i=1}^{n}\).
Step 4. For each set of predictions in step 3 average the logit(odds) of the predictions to obtain \(\overline{\widehat{h} }\left(1\right)\) and \(\overline{\widehat{h} }\left(0\right)\) and estimate the OR by \(\text{exp}\left[\overline{\widehat{h} }\left(1\right)-\overline{\widehat{h} }\left(0\right)\right].\)
To obtain an OMP-type estimate, we modify step 4 as follows:
Step 4′. For each set of predictions in step 3 average the predictions to obtain \(\overline{s }\left(1\right)\) and \(\overline{s }\)(0) and estimate the OR by \(\widetilde{OR}\left(ParDepFun\right)=\left\{\overline{s }(1)/[1-\overline{s }\left(1\right)]\right\}/\left\{\overline{s }(0)/[1-\overline{s }\left(0\right)]\right\}\).
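The steps above can be sketched as follows (Python/scikit-learn on simulated data; an LR classifier is used here so the GMO-type estimate can be checked against \(\text{exp}(\widehat{\beta })\), but any model with probability outputs can be substituted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000
x = rng.integers(0, 2, n)
z = rng.normal(0, 1, n) + 0.5 * x
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.7 * x + 0.4 * z))))
X = np.column_stack([x, z])

# Step 1: fit the classifier to the original data
clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# Step 2: set the exposure column to 1 (data1) and to 0 (data0)
X1, X0 = X.copy(), X.copy()
X1[:, 0], X0[:, 0] = 1, 0

# Step 3: pass both datasets through the fitted classifier
s1 = clf.predict_proba(X1)[:, 1]
s0 = clf.predict_proba(X0)[:, 1]

# Step 4 (GMO type): average the log-odds, then exponentiate the difference
logit = lambda s: np.log(s / (1 - s))
or_pdp_gmo = np.exp(logit(s1).mean() - logit(s0).mean())

# Step 4' (OMP type): average the scores first, then form the odds ratio
m1, m0 = s1.mean(), s0.mean()
or_pdp_omp = (m1 / (1 - m1)) / (m0 / (1 - m0))
```

For the LR classifier the GMO-type estimate reduces exactly to \(\text{exp}(\widehat{\beta })\), because the log-odds of the scores are linear in the exposure.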
Calibration of ML scores
A calibration method, \(f,\) attempts to transform ML scores into probability estimates that are consistent with the true conditional class probability24. We demonstrate three types of frequently used calibration methods:
Platt scaling17. This approach transforms the output scores by the sigmoid function \({f}_{i}=f\left({s}_{i}\right)={\left[1+\text{exp}(A+B{s}_{i})\right]}^{-1}.\) Additionally, outcome values of 0 and 1 are replaced by \(1/({N}_{-}+2)\) and \(({N}_{+}+1)/\left({N}_{+}+2\right)\), respectively, where \({N}_{-}\) and \({N}_{+}\) are the number of zero and one values, respectively.
IsoReg calibration18. An approach based on isotonic regression, which restricts \(f\left({s}_{i}\right)\) to be monotonic non-decreasing in \({s}_{i}\). A commonly used algorithm for estimating the isotonic regression is pair-adjacent violators25.
The GUESS algorithm19. This algorithm first fits two probability distribution functions to the original scores \(\text{P}\left({s}_{i} \right| Y=k), k=\text{0,1}.\) Then, using Bayes’ theorem, it computes the required conditional probability \(\text{P}\left( Y=1\right| {s}_{i}).\)
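A hedged sketch of the first two methods using scikit-learn on simulated, deliberately distorted scores (Platt scaling is approximated here by a plain logistic fit of the outcome on the scores, without the target-value replacement described above; GUESS, which requires parametric per-class score densities, is omitted):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 5000
# distorted scores: the true class probability is p, but the model reports p**2
p = rng.uniform(0.05, 0.95, n)
y = rng.binomial(1, p)
s = p ** 2

# Platt-style scaling: logistic fit of the outcome on the raw score
platt = LogisticRegression(max_iter=1000).fit(s.reshape(-1, 1), y)
f_platt = platt.predict_proba(s.reshape(-1, 1))[:, 1]

# IsoReg: monotone non-decreasing fit (pair-adjacent violators internally)
iso = IsotonicRegression(out_of_bounds="clip").fit(s, y)
f_iso = iso.predict(s)
```

Since the identity map is itself monotone, the isotonic fit can never have a larger in-sample Brier score than the raw scores.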
Evaluation and visualization. To evaluate the calibration methods we first divide the range of calibrated scores \({f(s}_{i})\) into 10 equal-frequency intervals. Then, we compute the relative frequency of the class with \(Y=1\), \({\overline{y} }_{j},\) and the mean of the calibrated scores, \({\overline{f\left( s\right)}}_{j},\) for each interval \(j, j=1,\dots ,10.\) We define the expected calibration error (ECE) and the maximum calibration error (MCE) by \({\text{ECE}} = \sum\nolimits_{j = 1}^{10} {\left| {\overline{y}_{j} - \overline{f\left( s \right)}_{j} } \right|} \,{\text{and}} \,{\text{MCE}} = \max_{{j \in \left\{ {1, \ldots ,10} \right\}}} \left| {\overline{y}_{j} - \overline{f\left( s \right)}_{j} } \right|\). We used the Brier score (BS) to evaluate the overall performance of the calibration: \(\text{BS}=\frac{1}{n}\sum_{i=1}^{n}({y}_{i}-{f(s}_{i}){)}^{2}\) and the area under the receiver-operating characteristic curve (AUC) to assess the diagnostic capability of a model. To visualize the results we inspected reliability plots, which plot \({\overline{y} }_{j}\) against \({\overline{f\left( s \right)}}_{j}\)25,26. When the scores are well calibrated, we expect that the points are aligned along a 45-degree line.
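These metrics can be computed with a few lines of numpy, following the definitions above (note that ECE is defined here as the sum, not the average, of the absolute bin errors):

```python
import numpy as np

def calibration_metrics(y, f, n_bins=10):
    """ECE, MCE and Brier score, with equal-frequency bins of the calibrated scores."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    order = np.argsort(f)
    bins = np.array_split(np.arange(len(f)), n_bins)  # equal-frequency intervals
    gaps = np.array([abs(y[order][b].mean() - f[order][b].mean()) for b in bins])
    ece = gaps.sum()   # defined in the text as a sum over bins, not an average
    mce = gaps.max()
    bs = np.mean((y - f) ** 2)
    return ece, mce, bs

# sanity check: scores that equal the outcomes exactly give zero error
y_demo = np.r_[np.zeros(500), np.ones(500)]
ece, mce, bs = calibration_metrics(y_demo, y_demo)
```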
Illustrative data
Admissions data. This case–control study27 examined association between ambient temperature and increased risk of hospitalization for respiratory and cardiovascular diseases among the elderly (> 60 years) who lived in southern Israel, 2004–2009. Cases, admitted for respiratory (n = 3437) or cardiovascular (n = 11,208) symptoms, were compared to patients admitted electively (n = 2366), mainly for cataract surgery (Table S1). Covariates based on hospital records included sex, age, year and season of hospitalization, and distance from an industrial park to residence. Locality-level data included ethnicity and urbanity; environmental variables comprised ambient temperature and air pollutants. Details about the study area and environmental data were described elsewhere28. Exposure was defined as the average of the daily minimal apparent temperature over the 7 days preceding admission smaller than the overall median. Tables S2 and S3 show the data distribution and LR model details, respectively. Analysis of a de-identified Admissions data provided by Soroka University Medical Center was approved by the Soroka Institutional Review Board. No informed consent was required. All methods were performed in accordance with the relevant guidelines and regulations.
Infants’ data. This population-based cohort assessed the relation between prenatal ambient temperature exposure and infant overweight. The study population, exposure measurement and covariates were described previously29. Briefly, for this methodological study, we included infants aged 2–2.5 years who visited a child health clinic in Israel and were born 2011–2018. The outcome was infant overweight defined by body mass index at age 2–2.5 years > 95th percentile according to WHO-standardization. Exposure was the average daily ambient temperature during pregnancy at the mother’s residence, divided into quintiles. Covariates included infant sex, year and month of birth, and region of residence. Of 692,666 observations with known outcome and exposure, we created a stratified sample of 160,000 infants (20,000 from each year of birth). Tables S4 and S5 show the data distribution and LR model details, respectively. Analysis of a de-identified Infants data provided by the Israeli Ministry of Health was approved by the Supreme Ethics Committee of the Israeli Ministry of Health. No informed consent was required. All methods were performed in accordance with the relevant guidelines and regulations.
Application and evaluation of the OR estimators
We examined three common machine-learning models that are considered good classifiers: LR, random forest (RF) and gradient boosting (GB). We used the output scores of these models to obtain 10 estimates of the OR for each model: Eight hybrid estimates and two ParDepFun-based estimates. The hybrid estimates were computed using uncalibrated scores and three sets of calibrated scores (Platt, IsoReg and GUESS), each calculated using either the GMO or OMP approach. Additionally, we computed GMO- and OMP-type estimates based on a ParDepFun, using uncalibrated scores. We illustrated the proposed estimators separately for the respiratory and cardiovascular admissions (20 estimates for each model) and for exposure quintiles 2 to 5 (Q2–Q5) versus Q1 for the Infants’ data (40 estimates).
To determine the tuning parameters for the RF and GB models we randomly sampled 70% of the observations of each examined dataset as the training data and the remaining observations as the testing data. When tuning the parameters, we targeted similar train- and test-data AUCs for all models to avoid overfitting. The hyperparameters we used for RF analysis of the Admissions data were \(B=700\) trees, \(m=3\) candidate features at each node partition and a maximal depth of \(d=64\) for each tree. For GB we chose \(B=700\) trees (boosting iterations, nrounds), a shrinkage parameter \(\eta =0.01\), and a depth of \(d=3\) for each tree. For the Infants’ data, RF parameters were \(B=500, m=3\) and no restriction on tree depth, and for GB, \(B=500\), \(\eta =0.01\) and \(d=3.\) The use of a train/test setup allowed us to control for overfitting of the ML models. However, once the hyperparameters were chosen, all OR estimates were based on the complete datasets according to the common epidemiological practice.
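The tuning check can be sketched on simulated data as follows (scikit-learn's GradientBoostingClassifier as a stand-in for the xgboost package used in the study; the hyperparameter values mirror those reported for GB on the Infants' data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 4000
x = rng.integers(0, 2, n)
z = rng.normal(0, 1, n) + 0.5 * x
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.7 * x + 0.4 * z))))
X = np.column_stack([x, z])

# 70/30 split, as in the paper; shallow trees and small shrinkage curb overfitting
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01,
                                max_depth=3).fit(Xtr, ytr)

# target similar train- and test-data AUCs to avoid overfitting
auc_train = roc_auc_score(ytr, gb.predict_proba(Xtr)[:, 1])
auc_test = roc_auc_score(yte, gb.predict_proba(Xte)[:, 1])
```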
To estimate the uncertainty of the OR estimates we used 5000 bootstrap samples to obtain a 95% percentile confidence interval (CI). The process for each sample included (a) fitting the LR, RF and GB models with the above chosen hyperparameters, (b) calculating the adjustment factor for the LR model, (c) calibrating the predictions for all models using the Platt, IsoReg and GUESS methods, and (d) calculating the OR estimates. We also calculated the bootstrap estimate of OR\(, {OR}^{*}\), as the average of the estimates obtained for the 5000 bootstrap samples.
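The bootstrap loop can be sketched as follows; for brevity, the estimator here is an unadjusted 2×2-table OR on simulated data rather than the full refit-calibrate-estimate pipeline described above (function names are ours):

```python
import numpy as np

def boot_percentile_ci(y, x, estimator, n_boot=1000, seed=0):
    """Percentile bootstrap 95% CI and bootstrap mean (OR*) for an OR estimator."""
    rng = np.random.default_rng(seed)
    n = len(y)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)   # resample subjects with replacement
        reps[b] = estimator(y[idx], x[idx])
    lo, hi = np.percentile(reps, [2.5, 97.5])
    return lo, hi, reps.mean()        # reps.mean() is the bootstrap estimate OR*

def crude_or(y, x):
    """Unadjusted 2x2-table OR, used here only as a stand-in estimator."""
    a = ((y == 1) & (x == 1)).sum(); b = ((y == 0) & (x == 1)).sum()
    c = ((y == 1) & (x == 0)).sum(); d = ((y == 0) & (x == 0)).sum()
    return (a * d) / (b * c)

rng = np.random.default_rng(6)
n = 2000
x = rng.integers(0, 2, n)
y = rng.binomial(1, np.where(x == 1, 0.4, 0.25))
lo, hi, or_star = boot_percentile_ci(y, x, crude_or)
```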
The LR model is among the most widely used models in epidemiology and provides direct estimates of the OR and its 95% CI. Therefore, although the results of ML and LR models are likely to be different, the epidemiologist will tend to look for consistency between ML- and LR-based estimates and will probably rely more on estimates that are not very different from the common LR estimates. We suggest three evaluation metrics that compare the proposed estimates and their CIs with the respective LR-based statistics and a fourth metric based on \({OR}^{*}/OR\):
Inclusion of \({OR}_{g}\) in LR CIs. We examine whether OR estimates are within the LR \((1-\alpha )\)-level CI, for the 68% and 95% levels. For comparability, we used the LR bootstrap interval in our evaluation.
Length of the 95% CI. Denote by \(\left(L,U\right)\) the 95% CI for an OR estimate. The length of the interval is \(U-L\). When it is divided by the length of the LR 95% CI we get the relative length of the interval.
Asymmetry of the 95% CI. Define \(A=\left(U-{OR}^{*}\right)/\left({OR}^{*}-L\right)\) as the asymmetry of the 95% CI. When it is divided by the asymmetry of the LR 95% CI we get the relative asymmetry.
The ratio \({OR}^{*}/OR.\) This ratio evaluates how close an estimate of OR is to the parameter it estimates.
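The last three metrics are simple arithmetic on the interval endpoints; a minimal sketch (function names are ours):

```python
def rel_length(l, u, l_lr, u_lr):
    """Length of a CI relative to the LR CI length."""
    return (u - l) / (u_lr - l_lr)

def asymmetry(l, u, or_star):
    """Asymmetry A = (U - OR*) / (OR* - L); values > 1 mean a heavier right tail."""
    return (u - or_star) / (or_star - l)
```

For example, a CI of (1.0, 3.0) with a bootstrap estimate of 1.5 has asymmetry 3, reflecting the right skew typical of OR sampling distributions.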
All statistical analyses were performed in R software version 4.4 (R Foundation for Statistical Computing, Vienna, Austria)30. RF and GB were implemented using the randomForest version 4.7.1.131 and xgboost version 1.7.8.132 packages, respectively. Bootstrap samples and statistics were created with the boot package version 1.3.3033.
Results
Performance of calibration
Table 1 shows the performance metrics for the uncalibrated (original) and calibrated predictions for the three ML models applied to the Admissions and Infants datasets. Calibration reduced the ECE values for RF and GB by a similar factor, with the highest reduction for IsoReg (ECE reduced on average by 92%) and the lowest gain for GUESS (35%). The BS and AUC metrics had similar patterns to those of the ECE and MCE but showed very low variability. Figure 1 shows the reliability plots. For RF, the IsoReg curve was the only one that aligned with the 45-degree line for all datasets.
Reliability diagrams for uncalibrated and calibrated output by model. Admissions data, n = 17,011, 2004–2009 and Infants’ data, n = 160,000, 2011–2018. IsoReg, isotonic regression.
Estimated ORs and CIs
Table S6 presents the number of predictions that were equal to zero or one and the respective second smallest/largest values that were used to estimate GMO-based ORs. For LR and GB the numbers were mostly negligible (< 0.2%). For RF output, the rates of 0/1 values for the ParDepFun and hybrid-uncalibrated estimates were high (except for the respiratory data). For example, the percentages of zeroes for the Infants’ data were 44.7–53.4% and 92.8–94.9% for the ParDepFun and hybrid-uncalibrated estimates, respectively.
Figures 2 and 3 show the point estimates and 95% CIs for each of the 10 estimates of OR for the Admissions and Infants data, respectively. Descriptively, the prominent finding was that the estimates of LR and GB were relatively close to each other and had similar CI lengths, while results of RF were mixed: The ORs for cardiovascular admissions and for the second exposure quintile (Q2) in the Infants cohort were mostly in line with the LR and GB estimates; for the Infants data Q4–Q5 only a few (1–2) estimates were close to the LR and GB estimates; and for the Respiratory and Q3 groups none of the ORs were close to those of the other models.
ORs and 95% CIs of admission for the average minimal daily apparent heat in the 7 days before admission (℃) below median (exposed) vs. above median by estimation method and model, Admissions data, n = 17,011, 2004–2009. The width of the shaded gray area is equal to the width of the bootstrap 95% CI for the LR estimate. Black arrows indicate truncated values. The x-axis is shown using a log scale. GMO, geometric mean of odds; GB, gradient boosting; IsoReg, isotonic regression; LR, logistic regression; OMP, odds of the mean prediction; ParDepFun, partial dependence functions; RF, random forest.
ORs and 95% CIs of infant obesity for quintiles of average daily ambient temperature during pregnancy (℃) by estimation method and model, Infants’ data, n = 160,000, 2011–2018. The width of the shaded gray area is equal to the width of the LR bootstrap 95% CI. Black arrows indicate truncated values. The x-axis is shown using a log scale. GMO, geometric mean of odds; GB, gradient boosting; IsoReg, isotonic regression; LR, logistic regression; OMP, odds of the mean prediction; ParDepFun, partial dependence functions; RF, random forest.
Figures 4 and 5 present findings on the evaluation metrics of the proposed ORs for the Admissions and Infants datasets, respectively (Appendices S1 and S2 include the detailed data). The salient results are:
Evaluation metrics of ORs and 95% CIs by study group, estimation method and model, Admissions data, n = 17,011, 2004–2009. The lines connecting the symbols have no quantitative meaning and are drawn only to aid in reading and interpreting the graphs. For ORs, the reference lines indicate the LR parametric CIs at 68% (two middle lines) and 95% levels (bottom and top lines). The y-axis is shown using a log scale. For CI length, the reference line indicates the length of the LR bootstrap CI. Length values above 3 have been truncated to allow for clearer presentation and the actual values are indicated on the graph. For CI asymmetry, the reference line indicates the asymmetry of the LR bootstrap CI. The y-axis for OR*/OR is shown using a log scale. GMO, geometric mean of odds; GB, gradient boosting; IsoReg, isotonic regression; LR, logistic regression; OMP, odds of the mean prediction; ParDepFun, partial dependence functions; RF, random forest.
Evaluation metrics of ORs and 95% CIs by quintile, estimation method and model, Infants’ data, n = 160,000, 2011–2018. The lines connecting the symbols have no quantitative meaning and are drawn only to aid in reading and interpreting the graphs. For ORs, the reference lines indicate the LR parametric CIs at 68% (two middle lines) and 95% levels (bottom and top lines). The y-axis is shown using a log scale. For CI length, the reference line indicates the length of the LR bootstrap CI. Length values above 0.3 have been truncated to allow for clearer presentation and the actual values are indicated on the graph. For CI asymmetry, the reference line indicates the asymmetry of the LR bootstrap CI. The y-axis for OR*/OR is shown using a log scale. GMO, geometric mean of odds; GB, gradient boosting; IsoReg, isotonic regression; LR, logistic regression; OMP, odds of the mean prediction; ParDepFun, partial dependence functions; RF, random forest.
Inclusion of \({OR}_{g}\) in LR CIs. Of the 60 estimates per model, 100%, 20% and 87% were within the LR 95% CI for LR, RF and GB, respectively. All RF hybrid estimates for Q3–Q5 of the Infants data, except OMP-uncalibrated, were below one. This result is attributed to the high rate of zero scores (> 92%) that were replaced by positive values when calibrated or when used in GMO-uncalibrated estimates. These transformations almost eliminated the differences in odds between exposure quintiles, while the OMP applied to raw scores maintained the variation between quintiles.
Length of the 95% CI. The averages (SDs) of the relative length were 1.05 (0.11), 2.02 (4.68) and 1.10 (0.42) for LR, RF and GB, respectively. Across datasets, all ParDepFun and uncalibrated CIs produced by GB were narrower than the LR 95% CI.
Asymmetry of the 95% CI. Most (86%) CIs exhibited asymmetry in the expected direction (> 1, heavier right tail). CIs based on ParDepFun and GUESS often had stronger positive or negative asymmetry than the other CIs.
The ratio \({OR}^{*}/OR\). The averages (SDs) of the ratio were: LR 1.00 (0.01), RF 1.04 (0.07) and GB 1.01 (0.03).
Overall, GMO- and OMP-based estimates had similar performance. For example, the averages of the relative CI length were: GMO (1.06, 2.25, 0.98) and OMP (1.04, 2.38, 0.95) for (LR, RF, GB), respectively.
Discussion
In this study, we proposed ten estimators of the OR that are based on uncalibrated and calibrated output scores obtained from machine learning models. We evaluated these estimators for three ML models using two datasets of different types. We found that the estimates based on the GB algorithm were generally consistent with the estimates of the standard LR model but that the results for RF were mixed and highly variable. Our findings also suggested that the GB estimates using the ParDepFun or the hybrid-uncalibrated methods had narrower CIs compared to LR. This does not necessarily imply that these estimates were more accurate, as CIs reflect uncertainty for a given model. The advantages of ML models in terms of their flexibility and prediction accuracy are well-recognized34. This research addressed the knowledge gap regarding the ability to obtain OR estimates from these models. We have shown that by carefully selecting the ML model, estimation method and calibration, reliable estimates of the OR can be obtained in ML-based analyses. This finding enhances the ability of epidemiologists and researchers in other fields to interpret the results of ML models.
We found only one recent study that attempted to estimate ORs based on ML models. A population-based birth cohort study in Beijing (n = 30,669 births, 2009–2012) investigated the association between maternal exposure to particulate matter and risk of congenital heart defects (CHD, n = 321 cases) using RF and GB models20. Using partial dependence plots, the authors estimated continuous OR curves, similar to our ParDepFun-OMP estimates. A positive association for particulate matter exposure < 100 µg/m3 was indicated for both models, with stronger associations and narrower CIs for GB in this range.
Our findings indicated that high performance of calibration did not necessarily predict good performance of the corresponding OR estimate. However, calibration may be important for obtaining reliable ML-based ORs particularly when using RF. To estimate the recurrence rate of lymphoma among 510 patients in Shanxi, China, 2011–2017, calibration for output from LR, RF, naïve Bayes (NB) and other models was applied using the Platt, IsoReg or shape-restricted polynomial regression (RPR) methods24. The ECE, MCE and BS values for LR were not improved by calibration, but the initial errors of RF and NB output were reduced to the LR error level after Platt or RPR calibration. In comparison, our RF analyses indicated that IsoReg outperformed Platt calibration.
A study of eight datasets with a binary outcome (n = ~ 8000 to ~ 40,000, % cases = 3 to 53%, # features = 14 to 200) examined predictions made by ten ML models, including LR, RF and boosted trees, before and after Platt and IsoReg calibration25. As expected, LR predictions were mostly well calibrated initially, while boosted trees output exhibited sigmoid-shaped reliability plots that approximated the diagonal line after calibration. The results for RF were less clear: three of the eight datasets had well-calibrated output before calibration and calibration reduced the log-loss for the remaining datasets.
GUESS calibration worked less well than the Platt and IsoReg methods for our data. Researchers in Seoul came to a similar conclusion for a cohort study (2016–2017) that predicted the likelihood of abnormality in chest radiography among 57,481 patients using a deep convolutional neural network. For example, the ratios between the calibrated and initial ECEs were 0.25, 0.36 and 0.53 for Platt, IsoReg and GUESS, respectively35.
We chose the RF model to illustrate our suggested OR estimates because it is considered one of the algorithms with high predictive capability and is one of the most used ML models36,37. However, we found that for the purpose of estimating ORs its performance was unstable and unpredictable. For example, although it is expected that RF would work better for balanced samples, none of the RF estimates for the Respiratory data (59.2% cases) were within the LR 95% CI, compared to 60% of the ORs for the Cardiovascular group (82.6% cases). It would be reasonable to assume that the RF votes would largely benefit from probability calibration because they are not constructed in a probabilistic manner, and indeed our calibration metrics supported this supposition. However, for OR estimation, our insight was more complex: while the ORs for the Respiratory group were improved by calibration, for the Cardiovascular data the uncalibrated estimates were consistent with those from LR and GB.
Several RF models produced high rates of 0 or 1 predictions, even after we tuned the hyperparameters in an attempt to reduce their number. The measures we used to deal with 0/1 predictions included (a) changing extreme values to the next smallest/largest value for GMO-based estimates, (b) using the OMP estimator, which relies on averaged predictions, and (c) calibrating the predictions, e.g., with Platt calibration, which has built-in handling of 0/1 values. The impact of these measures was mixed. For example, for the Infants' data (9.8% overweight), 93% of the RF votes were equal to zero. For the GMO-uncalibrated and calibrated estimates, the difference between exposure groups was small, yielding many protective (< 1) ORs; however, several OMP-uncalibrated estimates were within the LR CI, albeit with wider CIs. Potentially, increasing the balance between the classes, for example through minority oversampling with SMOTE38, could have improved the performance of RF. When using such techniques to estimate effect sizes in epidemiologic studies, it is advisable to examine the sensitivity of the estimates to resampling, as it may alter the underlying population distribution and the exposure–outcome relationship. For example, when we balanced the Infants data the ORs were very high (OR > 3).
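The built-in handling of 0/1 values mentioned in (c) comes from the fact that a fitted sigmoid never outputs exactly 0 or 1. A minimal sketch of Platt-style scaling, using scikit-learn's `LogisticRegression` on the raw score as a single feature as a stand-in for Platt's original Newton fit with target smoothing (all data and variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(scores_fit, labels_fit, scores_new):
    """Fit p = sigmoid(A*s + B) on held-out scores and labels, then map
    new scores to calibrated probabilities strictly inside (0, 1)."""
    lr = LogisticRegression(C=1e6)  # near-unregularized sigmoid fit
    lr.fit(np.asarray(scores_fit).reshape(-1, 1), labels_fit)
    return lr.predict_proba(np.asarray(scores_new).reshape(-1, 1))[:, 1]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
# Raw "votes" piled up near 0 and 1, loosely related to the outcome
s = np.clip(y * 0.9 + rng.normal(0, 0.3, 500), 0, 1)
p = platt_calibrate(s, y, [0.0, 0.5, 1.0])
```

Even raw votes of exactly 0 or 1 are mapped into the open interval (0, 1), which is what makes downstream logit-based OR estimators well defined.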
Limitations of this study should be considered. First, the ParDepFun and hybrid estimators are based on the assumption that there are no interactions between the exposure and the covariates22. Model-agnostic interpretability methods such as Shapley additive explanations (SHAP)39 can help in checking for interactions: SHAP provides interaction values that measure how much the joint presence of two features contributes beyond the sum of their separate effects. A common remedy in epidemiology when there is evidence of important interactions is to carry out an effect-modification analysis. In addition, the hybrid estimators assume that the adjustment factors of the LR and ML models are similar. For a model \(M,\) let \(\text{logit}\left({p}_{M}\right)={\alpha }_{M}+{\beta }_{M}x+{\varphi }_{M}\left(z\right)\) and let \({\Delta }_{M}={\overline{\varphi } }_{M1}\left(z\right)-{\overline{\varphi } }_{M0}\left(z\right)\) denote the exposure imbalance on the \({\varphi }_{M}\) scale. If \({\varphi }_{LR}\) approximates \({\varphi }_{M}\) for some ML classifier, the corresponding \(\Delta\) values are likely to be similar. Otherwise, the \(\Delta\) values may still be similar, since they are contrasts between averages, but meaningful variation is possible. This may be explored by checking whether \(\Delta_{LR}\) is sensitive to different specifications of the LR model (e.g., adding polynomials or nonlinear functions of a feature), especially for features with high exposure imbalance.
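Given a model's logit-scale predictions and its intercept and exposure coefficient, \(\Delta_M\) is simply the contrast in the recovered \(\varphi_M\) between exposure groups. A minimal numerical sketch with a known \(\varphi(z)=z\), so the target imbalance is known (all names and values illustrative):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

rng = np.random.default_rng(1)
n = 10_000
x = rng.integers(0, 2, n)                 # binary exposure indicator
z = rng.normal(0, 1, n) + 0.5 * x         # covariate imbalanced across exposure
alpha, beta = -1.0, 0.7
p = 1 / (1 + np.exp(-(alpha + beta * x + z)))  # model with phi(z) = z

# Recover phi on the logit scale and contrast its mean across exposure groups
phi_hat = logit(p) - alpha - beta * x
delta = phi_hat[x == 1].mean() - phi_hat[x == 0].mean()
# delta estimates the exposure imbalance E[z | x=1] - E[z | x=0] = 0.5
```

Comparing such a \(\Delta\) across model specifications is one concrete way to carry out the sensitivity check described above.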
Second, it has also been argued that partial dependence plots (PDPs) fail when the exposure is correlated with the covariates, because they may marginalize over improbable combinations of variable values40. A possible diagnostic tool for this problem is accumulated local effect (ALE) plots40: instead of averaging over all observations globally, ALE accumulates local differences in predictions within small intervals of the feature values. ALE curves are on the prediction scale, but they are centered so that their mean is zero. For a continuous feature, substantial differences between the shapes of a PDP and an ALE plot can indicate that the PDP is biased by correlation. For a binary covariate, disagreement between the contrasts PDP(1) − PDP(0) and ALE(1) − ALE(0) indicates that the PDP is susceptible to distortion.
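The PDP side of this binary-covariate diagnostic is straightforward to compute: set the feature to each level for every observation and contrast the averaged predictions. A minimal sketch with an arbitrary prediction function (all names illustrative; the ALE counterpart would come from a dedicated implementation and be compared against this value):

```python
import numpy as np

def pdp_binary_contrast(predict, X, col):
    """PDP(1) - PDP(0): set the binary column to 1 (then 0) for all rows
    and contrast the averaged predictions."""
    X1, X0 = X.copy(), X.copy()
    X1[:, col], X0[:, col] = 1, 0
    return predict(X1).mean() - predict(X0).mean()

# Toy additive model on columns [z, x]: f = 0.3*z + 0.8*x (no interaction)
predict = lambda X: 0.3 * X[:, 0] + 0.8 * X[:, 1]
rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(size=200), rng.integers(0, 2, 200)])
contrast = pdp_binary_contrast(predict, X, col=1)
# With no interaction, the contrast recovers the additive effect, 0.8
```

When the binary feature is strongly correlated with the covariates, the rows with the feature forced to the opposite level can be improbable, which is exactly the failure mode that a diverging ALE contrast would flag.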
Third, our evaluation of the OR estimates was primarily based on agreement with the results of the logistic regression model. The logistic model is a widely accepted standard tool for binary outcomes and is often used as the first model applied to the data. However, we cannot rule out that differences between ORs based on LR and those based on other ML models can arise from a misspecification of the LR model.
Fourth, we have demonstrated our suggested estimators on a few datasets, and we recognize that these data may not cover all scenarios observed in environmental epidemiology. Nevertheless, we used data with varying study designs, sample sizes, class imbalance and class discrimination, spanning typical OR magnitudes, to demonstrate the potential of OR estimation from ML models. Our findings indicate that it may be difficult to predict the usefulness of a specific model or method in advance, and it may be necessary to explore each scenario individually. Finally, the error metrics and reliability diagrams that we used to evaluate calibration depend on how the bins are determined and on the number of bins, which may reduce the reliability of these measures19,26. Since calibration in itself was not the target of this work, we used the basic error metrics and the common binning approach, but other types of metrics exist that are not based on binning26.
In conclusion, ML algorithms may aid researchers in identifying complex associations between exposure and health outcomes. Importantly, our results suggest that they can also provide interpretable measures of association such as the OR. To obtain reliable OR estimates, it is important that the ML output is probability-calibrated either by pre-selecting a well-calibrated algorithm (e.g., GB) or by choosing an appropriate calibration method.
Data availability
The Admissions data that support the findings of this study are available from Soroka Medical Center but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of the Soroka Institutional Review Board. The Infants data that support the findings of this study are available from the Israeli Ministry of Health but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of the Supreme Ethics Committee of the Israeli Ministry of Health. The methodological data generated and analysed during this study are included in this published article and its supplementary information files (Appendices S1 and S2).
References
Goldstein, B. A., Navar, A. M. & Carter, R. E. Moving beyond regression techniques in cardiovascular risk prediction: Applying machine learning to address analytic challenges. Eur. Heart J. 38(23), 1805–1814. https://doi.org/10.1093/eurheartj/ehw302 (2017).
Heine, J. J., Land, W. H. & Egan, K. M. Statistical learning techniques applied to epidemiology: A simulated case-control comparison study with logistic regression. BMC Bioinf. 12, 37–50. https://doi.org/10.1186/1471-2105-12-37 (2011).
Bellinger, C., Mohomed Jabbar, M. S., Zaïane, O. & Osornio-Vargas, A. A systematic review of data mining and machine learning for air pollution epidemiology. BMC Public Health 17(1), 907–925. https://doi.org/10.1186/s12889-017-4914-3 (2017).
Chen, S., Yu, J., Chamouni, S., Wang, Y. & Li, Y. Integrating machine learning and artificial intelligence in life-course epidemiology: Pathways to innovative public health solutions. BMC Med. 22(1), 354–364. https://doi.org/10.1186/s12916-024-03566-x (2024).
Carrión, D. et al. A 1-km hourly air-temperature model for 13 northeastern US states using remotely sensed and ground-based measurements. Environ. Res. 200, 111477. https://doi.org/10.1016/j.envres.2021.111477 (2021).
dos Santos, R. S. Estimating spatio-temporal air temperature in London (UK) using machine learning and earth observation satellite data. Int. J. Appl. Earth Obs. Geoinf. 88, 102066. https://doi.org/10.1016/j.jag.2020.102066 (2020).
Wei, Y. et al. Exposure-response associations between chronic exposure to fine particulate matter and risks of hospital admission for major cardiovascular diseases: Population based cohort study. BMJ 384, e076939. https://doi.org/10.1136/bmj-2023-076939 (2024).
Kassomenos, P., Petrakis, M., Sarigiannis, D., Gotti, A. & Karakitsios, S. Identifying the contribution of physical and chemical stressors to the daily number of hospital admissions implementing an artificial neural network model. Air Qual. Atmos Health 4(3), 263–272. https://doi.org/10.1007/s11869-011-0139-2 (2011).
Polezer, G. et al. Assessing the impact of PM2.5 on respiratory disease using artificial neural networks. Environ. Pollut. 235, 394–403. https://doi.org/10.1016/j.envpol.2017.12.111 (2018).
Wang, Q., Liu, Y. & Pan, X. Atmosphere pollutants and mortality rate of respiratory diseases in Beijing. Sci. Total Environ. 391(1), 143–148. https://doi.org/10.1016/j.scitotenv.2007.10.058 (2008).
Matta, K. et al. Associations between persistent organic pollutants and endometriosis: A multipollutant assessment using machine learning algorithms. Environ. Pollut. 260, 114066. https://doi.org/10.1016/j.envpol.2020.114066 (2020).
Dong, S. et al. Maternal exposure to black carbon and nitrogen dioxide during pregnancy and birth weight: Using machine-learning methods to achieve balance in inverse-probability weights. Environ. Res. 211, 112978. https://doi.org/10.1016/j.envres.2022.112978 (2022).
Behera, M. et al. Statistical learning methods as a preprocessing step for survival analysis: Evaluation of concept using lung cancer data. Biomed. Eng. 10, 97–111. https://doi.org/10.1186/1475-925X-10-97 (2011).
Schwartz, J., Wei, Y., Dominici, F. & Yazdi, M. D. Effects of low-level air pollution exposures on hospital admission for myocardial infarction using multiple causal models. Environ. Res. 232, 116203. https://doi.org/10.1016/j.envres.2023.116203 (2023).
Duh, M.-S., Walker, A. M. & Ayanian, J. Z. Epidemiologic interpretation of artificial neural networks. Am. J. Epidemiol. 147(12), 1112–1122. https://doi.org/10.1093/oxfordjournals.aje.a009409 (1998).
Silva Filho, T. et al. Classifier calibration: A survey on how to assess and improve predicted class probabilities. Mach. Learn. 112(9), 3211–3260. https://doi.org/10.1007/s10994-023-06336-7 (2023).
Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers (eds Smola, A. et al.) 61–74 (MIT Press, 1999).
Zadrozny, B. & Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 694–699 (ACM, 2002).
Schwarz, J. & Heider, D. GUESS: Projecting machine learning scores to well-calibrated probability estimates for clinical decision-making. Bioinformatics 35(14), 2458–2465. https://doi.org/10.1093/bioinformatics/bty984 (2019).
Ren, Z. et al. Maternal exposure to ambient PM10 during pregnancy increases the risk of congenital heart defects: Evidence from machine learning models. Sci. Total Environ. 630, 1–10. https://doi.org/10.1016/j.scitotenv.2018.02.181 (2018).
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001).
Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction 2nd edn. (Springer, 2009).
Cochran, W. G. Sampling Techniques (Wiley, 1977).
Fan, S. et al. Probability calibration-based prediction of recurrence rate in patients with diffuse large B-cell lymphoma. BioData Min. 14(1), 38–55. https://doi.org/10.1186/s13040-021-00272-9 (2021).
Niculescu-Mizil, A. & Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning - ICML ‘05 625–632 (ACM Press, 2005).
Huang, Y., Li, W., Macheret, F., Gabriel, R. A. & Ohno-Machado, L. A tutorial on calibration measurements and calibration models for clinical prediction models. J. Am. Med. Inform. Assoc. 27(4), 621–633. https://doi.org/10.1093/jamia/ocz228 (2020).
Nirel, R., Maimon, N., Fireman, E., Eyal, A. & Peretz, A. Living near a hazardous industrial site and admissions for respiratory and cardiovascular diseases among the elderly: A case–control study. In ISEE Conference Abstracts 2354 (2014).
Nirel, R. et al. Respiratory hospitalizations of children living near a hazardous industrial site adjusted for prevalent dust: A case–control study. Int. J. Hyg. Environ. Health 218(2), 273–279. https://doi.org/10.1016/j.ijheh.2014.12.003 (2015).
Alterman, N. et al. Ambient temperature and indicators of overweight during infancy and early childhood: A population-based historical cohort study. Environ. Res. https://doi.org/10.1016/j.envres.2025.121983 (2025).
R Core Team. R: A Language and Environment for Statistical Computing. Version 4.4.1 (R Foundation for Statistical Computing, 2024). https://www.r-project.org/ (accessed 20 January 2026).
Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2(3), 18–22 (2002).
Chen, T. et al. Xgboost: Extreme gradient boosting. R Package Version 0.4-2 1(4), 1–4 (2015).
Canty, A. & Ripley, B. boot: Bootstrap Functions (Originally by Angelo Canty for S). R package (2021). https://cran.r-project.org/web/packages/boot/index.html (accessed 20 January 2026).
Atias, D. et al. Machine learning in epidemiology: an introduction, comparison with traditional methods, and a case study of predicting extreme longevity. Ann. Epidemiol. 110, 23–33. https://doi.org/10.1016/j.annepidem.2025.07.024 (2025).
Hwang, E. J., Kim, H., Lee, J. H., Goo, J. M. & Park, C. M. Automated identification of chest radiographs with referable abnormality with deep learning: Need for recalibration. Eur. Radiol. 30(12), 6902–6912. https://doi.org/10.1007/s00330-020-07062-7 (2020).
Ohanyan, H. et al. Exposome-wide association study of body mass index using a novel meta-analytical approach for random forest models. Environ. Health Perspect. 132(6), 067007. https://doi.org/10.1289/EHP13393 (2024).
Cappelli, F., Castronuovo, G., Grimaldi, S. & Telesca, V. Random forest and feature importance measures for discriminating the most influential environmental factors in predicting cardiovascular and respiratory diseases. Int. J. Environ. Res. Public Health 21(7), 867–887. https://doi.org/10.3390/ijerph21070867 (2024).
Fernández, A., Garcia, S., Herrera, F. & Chawla, N. V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905. https://doi.org/10.1613/jair.1.11192 (2018).
Molnar, C. Interpretable Machine Learning (Leanpub, 2020).
Apley, D. W. & Zhu, J. Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc. B 82(4), 1059–1086. https://doi.org/10.1111/rssb.12377 (2020).
Acknowledgements
We thank Dr. Neora Alterman for assistance with the Infants’ data.
Funding
This study was funded by The Council for Higher Education of Israel through The Hebrew University Center for Interdisciplinary Data Science Research (CIDR). This grant also supports the open access publishing.
Author information
Authors and Affiliations
Contributions
Ronit Nirel, Efrat Morin and Raanan Raz contributed to the study conception and design. Material preparation and data collection were performed by Raanan Raz and Nimrod Maimon. Statistical analyses were performed by Ronit Nirel and Naor Bauman. The first draft of the manuscript was written by Ronit Nirel and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
For this study, informed consent has been waived by the Soroka Institutional Review Board (Admissions data) and by the Supreme Ethics Committee of the Israeli Ministry of Health (Infants’ data) due to the anonymity and retrospective nature of the study.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Nirel, R., Bauman, N., Morin, E. et al. Estimating the odds ratio from the output scores of machine learning models: possibilities and limitations. Sci Rep 16, 8922 (2026). https://doi.org/10.1038/s41598-026-38150-1