Abstract
Arboviral diseases such as dengue pose major public health challenges in endemic regions, notably in Norte de Santander (Colombia), where they place substantial pressure on healthcare services. We analyzed 8,814 confirmed dengue cases reported to the Colombian National Public Health Surveillance System (SIVIGILA) from January 2015 to June 2019 to investigate temporal dynamics and determinants of hospitalization. We applied a dual methodology based on: (i) machine-learning classifiers—logistic regression, random forest, and support vector machines—to predict hospitalization risk from symptom profiles and (ii) Cox models with time-varying coefficients to assess the timing of hospitalization as a function of socio-environmental and clinical predictors, accommodating non-proportional hazards. Our main findings are as follows. On average, patients sought medical attention about four days after symptom onset. Severe and non-severe cases had similar onset-to-hospitalization times, but severe cases were often admitted shortly after the appearance of key warning signs. Abdominal pain and low platelet count markedly increased the risk of hospitalization in classification models and were associated with higher hazards of earlier hospitalization in the time-to-event analysis, with vomiting likewise linked to earlier hospitalization. Among classifiers, random forest achieved the highest predictive accuracy, whereas logistic regression and Cox models yielded interpretable estimates of risk (odds ratios) and timing (time-varying hazard ratios). These findings highlight the value of early recognition of specific symptoms and the integration of machine learning with survival analysis to support proactive, resource-aware dengue management. All analyses were conducted in the R software.
Introduction
Dengue fever, caused by dengue virus (DENV), along with its severe manifestations, ranks among the most consequential arboviral diseases globally, with a mortality rate of approximately 2.5% in severe cases1,2. Transmitted primarily by female Aedes mosquitoes, notably Aedes aegypti, DENV is endemic to tropical and subtropical regions, including Colombia2. Its symptoms range from mild fever, headache, arthralgia, myalgia, nausea, rash, retro-orbital pain, and vomiting to bleeding, shock, and organ failure in severe cases3,4,5.
Recent studies have examined multiple aspects of DENV, such as early biomarkers of severe forms, global prevalence patterns, rapid diagnostic tools, preventive strategies, and the effects of climate change on its transmission dynamics6,7,8,9,10.
Prior studies have characterized DENV warning signs and progression timelines4,11, regional transmission and burden heterogeneity12, and modeling approaches spanning machine-learning (ML) classifiers and survival analysis13,14,15,16,17,18,19. One can build on these methods by integrating interpretable ML with time-varying Cox models to jointly quantify hospitalization risk and timing using medical surveillance data. Recent advances in multivariate statistical modeling and tensor decomposition techniques have also shown promise in structuring high-dimensional epidemiological data20,21.
Geographic and environmental factors play a critical role in the spread of arboviral diseases, highlighting the necessity of epidemiological monitoring and intervention strategies.
In endemic regions like Colombia, DENV imposes a substantial economic and healthcare burden, particularly in low-income areas where it is classified as a neglected tropical disease2,22,23. The economic impact of arboviral infections encompasses both direct and indirect costs, with the estimated cost of DENV in Colombia ranging from US$129.9 to US$167.8 million between 2010 and 201224. In the Department of Norte de Santander (NS), Colombia, where DENV incidence is high25, urbanization and climatic conditions favorable to Aedes aegypti proliferation exacerbate these challenges, which include difficulties in vector control, limitations in epidemiological surveillance, and increased strain on healthcare systems. Comparative studies across countries, including Indonesia, Kenya, Thailand, Vietnam, and other Colombian regions, highlight similar socio-economic constraints12,26,27,28,29,30. Control measures, such as vector management programs, public awareness campaigns, and vaccine development, remain critical to curbing disease spread31.
The co-circulation of arboviruses like chikungunya, dengue, and Zika in endemic areas complicates clinical diagnosis due to overlapping symptoms, posing challenges for effective healthcare management. Accurate prediction of hospitalizations is essential for optimizing resource allocation. In the Americas, including Colombia, multiple studies have explored arboviral prevalence and survival analysis techniques to understand disease patterns, timelines, and healthcare demands11,14,15,16,17,32,33. The COVID-19 pandemic highlighted vulnerabilities in public health systems, emphasizing the need for proactive strategies to manage arboviral outbreaks and the use of artificial intelligence/ML models34,35,36.
Temporal dynamics are critical for understanding disease progression and anticipating healthcare needs. Metrics such as the time from symptom onset to hospitalization affect patient outcomes3,5. Delayed hospitalization is associated with severe complications and increased mortality risks in DENV37. Monitoring these delays, influenced by socio-environmental predictor variables (or simply predictors) and symptom severity, enables a proactive approach to resource allocation37,38,39.
Despite extensive research, few studies have jointly quantified how symptomatology and socio-environmental context shape both the risk (probability) and the timing (time to event) of hospitalization, revealing a gap in the literature. This gap is particularly salient in NS, Colombia, where socio-environmental conditions sustain DENV transmission. To address this gap, the objective of the present study is to analyze DENV cases from NS, focusing on pathways to hospitalization in order to explain and predict hospitalization risk due to DENV in Colombia. We integrate three ML classification models, namely logistic regression (LR), random forest (RF), and support vector machine (SVM)40,41,42,43,44,45,46, to predict hospitalization from clinical and socio-environmental predictors. We use time-to-event models via Cox regression with time-varying effects to assess how these predictors affect hospitalization timing through hazard ratios (HRs). This integrated modeling combines the strength of flexible learners (RF/SVM) with the interpretability of LR and Cox regression, linking risk and urgency in routine surveillance settings.
Specifically, the contributions of this study are: (i) integrating interpretable statistical modeling (LR) and flexible ML classifiers (RF/SVM) with time-varying Cox regression to quantify who is hospitalized and when hospitalization occurs within a single analytic framework; (ii) operationalizing this integration on a large real-world surveillance dataset of 8,814 confirmed NS-resident cases by using transparent preprocessing and Monte Carlo cross-validation (CV), with uncertainty reported from resampling-based performance metrics as well as confidence intervals (CIs); and (iii) translating model outputs—adjusted odds ratios (ORs) using LR, variable-importance profiles (utilizing RF), and time-varying HRs (employing Cox regression)—into triage-oriented guidance, including selection of decision thresholds under local resource constraints.
The remainder of this article is organized as follows. Section “Methodology” describes the study area, data sources, and the methodology. In Section “Results”, we present the findings obtained in the present investigation. In Section “Discussion”, the epidemiological implications are discussed. Section “Conclusions” states concluding remarks, limitations, and suggestions for future research.
Methodology
In this section, we detail the methodology used to analyze both the risk and timing of hospitalization in DENV cases. First, we provide an overview of the data sources and their preparation steps. Next, we present our classification strategies, which involve LR, RF, and SVM. Subsequently, we describe the survival analysis for modeling delays, with emphasis on the rationale for employing Cox regression with time-varying coefficients. Then, we explain how the predictive classification models complement the time-to-event analysis to provide a comprehensive view of hospitalization risk and timing.
Study area
The Department of NS is located in the Colombian Andean region, bordering Venezuela; see Fig. 1. The department is divided into six subregions, each characterized by distinct demographic and environmental attributes; see Table 1 for specific details. For clarity, these subregions are designated as Norte (N), Occidental (OCC), Oriental (ORI), Central (C), Sur-Oriental (SORI), and Sur-Occidental (SOCC). According to the most recent Population and Housing Census conducted by the National Administrative Department of Statistics (Departamento Administrativo Nacional de Estadística, DANE, in Spanish) in 2018, NS had 1,491,689 inhabitants, with 50.7% females. The age demographics are as follows: 10% in early childhood (0–5 years), 10% in childhood (6–11 years), 12% in adolescence (12–18 years), 14% in early adulthood (19–26 years), 42% in adulthood (27–59 years), and 12% in older adulthood (60+ years). The inhabitants of NS accounted for approximately 3.1% of the Colombian total population in that year47,49.
The urban population constitutes 78.68%, while 21.32% reside in rural areas, including town centers and dispersed settlements. Additionally, approximately 38% of the population has been affected by the country’s armed conflict, and since 2015, migratory flows from Venezuela have added social pressure on public resources48.
Some subregions, particularly ORI, N, and C, present environmental conditions conducive to Aedes aegypti proliferation, mainly due to favorable temperature and altitude ranges. Furthermore, social practices such as water storage and waste accumulation near residential areas contribute to mosquito breeding sites3.
Consequently, NS reported the highest DENV case numbers among all Colombian departments in 2018, with a severe DENV case-fatality rate of 14.7%50. In 2019, the capital city of Cúcuta, located in the ORI subregion, registered the highest case count among municipalities. Currently, DENV prevention and control are prioritized within the Healthy Living and Transmissible Diseases Program of the NS Development Plan for 2020-202351.
Dengue virus data
We analyzed anonymized DENV case records from the Colombian National Public Health Surveillance System (SIVIGILA) for NS, covering January 2015 to June 2019. The dataset is case-based, with one row per reported case and columns describing demographics, residence, occupation, symptoms, clinical indicators, and event dates including symptom onset, first consultation, and hospitalization, recorded in the ISO 8601 format (year-month-day). Missing entries were encoded as N/A (not applicable) according to the anonymized export specifications. The SIVIGILA dataset contained 16,625 reported DENV cases, including both non-severe and severe.
We restricted the analysis to cases confirmed by clinical evaluation, laboratory testing, or epidemiological linkage, yielding 9,161 confirmed DENV cases. Among these cases, 8,814 (96.2%) were residents of NS and 347 (3.8%) were non-residents diagnosed within the Department of NS. The primary analysis focused on the 8,814 confirmed cases corresponding to residents of NS. Both hospitalized and non-hospitalized cases were included as outcome labels for model training and validation, thereby avoiding class-conditional sampling bias.
The input variables included demographic, socio-environmental, temporal, and clinical predictors sourced from the anonymized SIVIGILA framework. Demographic and socio-environmental factors encompassed sex, age, age category (categorical), occupation (ISCO-08 classification), ethnicity, and socioeconomic level, in addition to residence-related variables like subregion, municipality, and settlement type (urban, rural, dispersed rural). Temporal predictors were obtained from event records and encompassed the notification date, consultation date, symptom-onset date, hospitalization date, epidemiological week, and year. These predictors enabled the models to encompass a wide range of patient traits, from mild to severe manifestations3,49,51. The cleaned anonymized dataset and the accompanying R scripts have been made publicly available on GitHub (https://github.com/alexacl95/NorteSantanderDengue). The original microdata remain subject to the Colombian national data access policies.
The variables considered included demographic and socio-environmental predictors such as subregion of residence, sex, age group, settlement type, and occupation, as well as temporal variables such as dates of first medical consultation and hospitalization. Symptom data were also incorporated to provide a comprehensive overview of the clinical presentation of DENV cases in the region. Detailed descriptions of the SIVIGILA dataset and its variables are available in3.
For analytical purposes, the dataset was divided into two subsets: (i) a classification subset to model the probability of hospitalization as a binary outcome, based on symptom profiles and selected socio-environmental predictors, and (ii) a survival subset to model delays from symptom onset to hospitalization, with time-to-event (hospitalization) as the main outcome. For the survival analysis (onset \(\rightarrow\) hospitalization), non-hospitalized patients were right-censored at the earlier of 30 June 2019 or 15 days post-onset. Records with observed onset \(\rightarrow\) hospitalization times greater than 15 days (\(T_{\textrm{oh}}>15\)) were excluded from the primary analysis to align with this administrative censoring window, and a sensitivity analysis assessed alternative cutoffs; for details about a regression model with censored response, see52.
Data preprocessing and quality control
We implemented a reproducible data preprocessing method prior to modeling that included the following steps (also illustrated in the workflow of Fig. 2):
- Step 1: Restrict inclusion to confirmed DENV cases (clinical evaluation, laboratory testing, or epidemiological linkage), excluding non-confirmed records.
- Step 2: Deduplicate records using unique case identifiers and event dates.
- Step 3: Validate event chronology (symptom onset \(\le\) consultation \(\le\) hospitalization), allowing same-day events and direct hospitalizations without a consultation date, removing inconsistent records.
- Step 4: Compute delays (in days) between calendar dates, excluding implausible delays greater than 15 days in the primary analysis (consistent with DENV clinical windows) and assessing alternative cutoffs in sensitivity analysis; see “Time-to-event analysis: modeling delays” for the corresponding survival-censoring rule.
- Step 5: Encode symptoms and clinical/laboratory indicators according to SIVIGILA reporting forms3,49,51.
- Step 6: Standardize age into predefined groups and predictor levels (sex, subregion, settlement type, and occupation).
- Step 7: Address missing data using complete-case analysis for the main models (reporting sample sizes per analysis) and multiple-imputation sensitivity analysis to assess robustness.
- Step 8: Assess multicollinearity and redundancy using the Pearson coefficient for binary–binary associations, Spearman or polychoric correlations for ordinal variables, the Cramér coefficient for nominal variables, and variance inflation factors (VIFs) after dummy encoding.
- Step 9: Prevent information leakage by fitting parameter-based transformations (such as scaling and imputation) within each training split during Monte Carlo resampling, applying global steps only once before splitting, and standardizing continuous predictors (such as age) for scale-sensitive algorithms (such as SVM with a radial basis function, RBF, kernel), with RFs being unaffected.
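Steps 3 and 4 can be illustrated with a minimal sketch. The authors worked in R; the Python below is only an illustration, and the field names (`onset`, `consultation`, `hospitalization`), the helper names, and the toy records are hypothetical rather than taken from the SIVIGILA export.

```python
from datetime import date

MAX_DELAY_DAYS = 15  # plausibility window used in the primary analysis (Step 4)

def delay_days(start, end):
    """Days between two calendar dates, or None if either date is missing."""
    if start is None or end is None:
        return None
    return (end - start).days

def chronology_ok(onset, consultation, hospitalization):
    """Step 3: onset <= consultation <= hospitalization; missing later dates are skipped."""
    if onset is None:
        return False
    dates = [d for d in (onset, consultation, hospitalization) if d is not None]
    return all(a <= b for a, b in zip(dates, dates[1:]))

def clean(records):
    """Keep records with a valid chronology and delays inside the 15-day window."""
    kept = []
    for r in records:
        if not chronology_ok(r["onset"], r["consultation"], r["hospitalization"]):
            continue
        t_oc = delay_days(r["onset"], r["consultation"])
        t_oh = delay_days(r["onset"], r["hospitalization"])
        if any(t is not None and t > MAX_DELAY_DAYS for t in (t_oc, t_oh)):
            continue
        kept.append({**r, "t_oc": t_oc, "t_oh": t_oh})
    return kept

# Toy records (invented for illustration only).
records = [
    {"id": 1, "onset": date(2018, 3, 1), "consultation": date(2018, 3, 4),
     "hospitalization": date(2018, 3, 5)},   # consistent chronology
    {"id": 2, "onset": date(2018, 3, 10), "consultation": date(2018, 3, 8),
     "hospitalization": None},               # consultation before onset
    {"id": 3, "onset": date(2018, 4, 1), "consultation": None,
     "hospitalization": date(2018, 4, 20)},  # delay above 15 days
    {"id": 4, "onset": date(2018, 5, 1), "consultation": None,
     "hospitalization": date(2018, 5, 1)},   # same-day direct hospitalization
]
cleaned = clean(records)
```

In this sketch, same-day events and direct hospitalizations without a consultation date pass the check, in line with Step 3, whereas records 2 and 3 are dropped.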
Classification methods: predicting the risk of hospitalization
In parallel to modeling delays18,19,53,54, we aimed to predict whether a patient with suspected DENV infection would ultimately require hospitalization, using clinical and socio-environmental predictors4. This prediction differs from time-to-event survival analysis in that it seeks the risk of hospitalization (binary outcome) rather than modeling when hospitalization occurs. The output variable was the binary indicator of hospitalization, defined as “yes” if the case had a recorded date of hospitalization, and “no” otherwise. This definition allowed us to assess the model’s ability to discriminate between DENV cases requiring hospitalization and those treated in outpatient care. To this end, as mentioned, we employed three widely used classification models: LR, RF, and SVM. Each model was trained and evaluated using Monte Carlo CV; see Section “Cross-validation, performance metrics, and statistical assessment” for details.
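LR is a generalized linear model for binomial responses with a logit link, offering simplicity and interpretability. Let \(Y_i \in \{0,1\}\) indicate whether patient i is hospitalized (1) or not (0), with \(y_i\) denoting its observed value, and \(\varvec{x}_i=(1,x_{i1},\dots ,x_{ik})^\top\) the observed vector of k predictors. The LR model is specified as
\[\Pr (Y_i=1\mid \varvec{x}_i)=\frac{\exp (\varvec{x}_i^\top \varvec{\beta })}{1+\exp (\varvec{x}_i^\top \varvec{\beta })},\quad i=1,\dots ,n.\]
The unknown regression parameters \(\varvec{\beta }=(\beta _0,\beta _1,\dots ,\beta _k)^\top\) are estimated by the maximum likelihood method, which maximizes the log-likelihood function given by
\[\ell (\varvec{\beta })=\sum _{i=1}^{n}\left[ y_i\,\varvec{x}_i^\top \varvec{\beta }-\log \left( 1+\exp (\varvec{x}_i^\top \varvec{\beta })\right) \right] .\]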
As a result, \(\exp ({\widehat{\beta }}_j)\), for \(j=1,\dots ,k\), is the OR associated with a one-unit increase in \(x_{ij}\), holding all other predictors constant. For indicator-coded categorical variables, this OR compares the given category to its reference level, a property often leveraged in epidemiological research4. Although unpenalized LR can be fit by using the maximum likelihood method, one may include a penalty to mitigate overfitting in high-dimensional settings, such as ridge regression (maximizing \(\ell (\varvec{\beta })-\lambda (\Vert \varvec{\beta }\Vert _2)^2/2\)) or LASSO (maximizing \(\ell (\varvec{\beta })-\lambda \Vert \varvec{\beta }\Vert _1\)), typically without penalizing the intercept. In our CV, we compared unpenalized and penalized LRs, detecting minimal differences, which are consistent with other infectious disease studies19.
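The quantities discussed above (logit probability, log-likelihood, ridge-penalized objective, and OR) can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R code; the function names, the toy design matrix, and the coefficient values are assumptions made for the example.

```python
import math

def hosp_probability(beta, x):
    """Pr(Y=1 | x) under the logit link; x carries a leading 1 for the intercept."""
    eta = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(beta, X, y):
    """Binomial log-likelihood summed over patients."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = hosp_probability(beta, x_i)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return total

def ridge_objective(beta, X, y, lam):
    """Log-likelihood minus lam * ||beta||_2^2 / 2, leaving the intercept unpenalized."""
    penalty = 0.5 * lam * sum(b * b for b in beta[1:])
    return log_likelihood(beta, X, y) - penalty

def odds_ratio(beta_j):
    """OR for a one-unit increase in the j-th predictor."""
    return math.exp(beta_j)

# Toy data: intercept plus one binary symptom indicator (invented values).
X = [[1, 0], [1, 1], [1, 1], [1, 0]]
y = [0, 1, 1, 0]
ll_null = log_likelihood([0.0, 0.0], X, y)   # every fitted probability equals 0.5
or_example = odds_ratio(math.log(2.0))       # a log-odds of log(2) doubles the odds
```

A coefficient of \(\log 2\) on a symptom indicator therefore corresponds to a doubling of the odds of hospitalization relative to the reference level, all else held constant.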
RF is an ensemble that builds M decision trees, each trained on a bootstrap sample of size n (with replacement) of the original data. At each node of the tree, a random subset of predictors of size \(m_{\text {try}}\le k\) is considered for splitting, which decorrelates trees and reduces variance relative to a single tree. For patient i, we compute the class probability as the average across trees, given by
\[{\widehat{p}}_i^{\textrm{RF}}=\frac{1}{M}\sum _{m=1}^{M}p_i^{(m)},\]
where \(p_i^{(m)} = \Pr (Y_i=1\mid \varvec{x}_i; \,\text {tree }m)\) is the corresponding probability for patient i in tree m. In addition, we classify via \({\widehat{Y}}_{\textrm{RF}}(i)= 1_{\{{\widehat{p}}_i^{\textrm{RF}}\ge \tau \}}\) (with default \(\tau =0.5\), which can alternatively be tuned in the CV), where \(1_{B}\) is the indicator function of the set B. Variable importance was measured by averaging the decrease in node impurity, as quantified by the Gini index across all splits in the forest. However, this measure is known to be potentially biased, as it may overemphasize predictors with higher cardinality or greater variability. Key hyperparameters include the number of trees M, the value of \(m_{\text {try}}\), and node-size/depth constraints. We performed grid search within the training split using internal Monte Carlo resampling to optimize out-of-sample F1-score (with class stratification).
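The averaging and thresholding rule can be sketched as follows. This is a minimal Python illustration under the assumption that each tree already returns a probability for the patient; the function names, seed, and per-tree probabilities are hypothetical.

```python
import random

def bootstrap_indices(n, seed):
    """Indices of one bootstrap sample of size n, drawn with replacement."""
    rng = random.Random(seed)
    return rng.choices(range(n), k=n)

def rf_probability(tree_probs):
    """Forest probability for one patient: the mean of the per-tree probabilities."""
    return sum(tree_probs) / len(tree_probs)

def rf_classify(tree_probs, tau=0.5):
    """Classify as hospitalized (1) when the forest probability reaches the threshold tau."""
    return 1 if rf_probability(tree_probs) >= tau else 0

# Per-tree probabilities for a single patient (invented values, M = 3 trees).
tree_probs = [0.25, 0.5, 0.75]
p_hat = rf_probability(tree_probs)
label_default = rf_classify(tree_probs)          # default threshold tau = 0.5
label_strict = rf_classify(tree_probs, tau=0.6)  # a stricter, tuned threshold
sample = bootstrap_indices(10, seed=1)
```

Raising \(\tau\) trades sensitivity for specificity, which is why the threshold may be tuned in the CV rather than fixed at 0.5.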
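SVM seeks a decision boundary that maximizes the margin between hospitalized (\(Y_i=1\)) and non-hospitalized cases (\(Y_i=0\)), for each patient i, after mapping inputs \(\varvec{x}_i\) to a higher-dimensional feature space via a kernel K. We use the RBF kernel formulated as
\[K(\varvec{x}_i,\varvec{x}_j)=\exp \left( -\gamma \Vert \varvec{x}_i-\varvec{x}_j\Vert _2^2\right) ,\quad \gamma >0,\]
with bandwidth parameter \(\gamma\) controlling the smoothness of the decision boundary. In the primal formulation of mathematical programming, SVM solves the problem stated as
\[\min _{\varvec{w},b,\varvec{\xi }}\ \frac{1}{2}\Vert \varvec{w}\Vert _2^2+C\sum _{i=1}^{n}\xi _i\quad \text {subject to}\quad t_i\left( \varvec{w}^\top \phi (\varvec{x}_i)+b\right) \ge 1-\xi _i,\quad \xi _i\ge 0,\quad i=1,\dots ,n,\]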
where \(\phi\) is the feature map implied by K, w is the weight vector defining the separating hyperplane, b the bias term, \(\varvec{\xi }=(\xi _1,\dots ,\xi _n)\) the slack variables, and \(C > 0\) the regularization parameter controlling the penalty for misclassifications. Note that the SVM is conventionally formulated with labels \(y_i \in \{-1,1\}\), but for consistency with binary outcomes coded as \(\{0,1\}\), one can equivalently re-encode the labels as \(t_i = 2y_i - 1\), which recovers the traditional \(\{-1,1\}\) formulation. Equivalently, the dual problem (solved in practice) depends only on kernel evaluations. Predictors were standardized within each training fold. We performed a grid search over \((C,\gamma )\) on the training split using internal Monte Carlo resampling and selected the pair that maximized the F1-score (primary metric). When calibrated probabilities were required, the SVM decision function was converted to probabilities via Platt scaling, fitted on the training split (with internal resampling) and then applied once to the held-out test split to avoid leakage.
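Three ingredients of this paragraph lend themselves to a short illustration: the RBF kernel, the label re-encoding \(t_i=2y_i-1\), and the sigmoid used in Platt scaling. The Python sketch below is illustrative only; the function names are hypothetical, and the Platt parameters `a` and `b` would in practice be estimated on the training split rather than supplied by hand.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) for two predictor vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def recode_label(y):
    """Map a {0, 1} outcome to the {-1, +1} coding used by the SVM formulation."""
    return 2 * y - 1

def platt_probability(decision_value, a, b):
    """Platt scaling: map an SVM decision value f to 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

k_same = rbf_kernel([0.2, 1.0], [0.2, 1.0], gamma=0.5)  # identical inputs
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)   # squared distance of 2
labels = [recode_label(y) for y in (0, 1)]
p_boundary = platt_probability(0.0, a=-2.0, b=0.0)      # a point on the decision boundary
```

Larger values of \(\gamma\) make the kernel decay faster with distance, giving a more flexible (less smooth) boundary, which is why \(\gamma\) is tuned jointly with C.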
We prioritized LR for interpretability (adjusted ORs), RF for capturing non-linear effects and interactions with mixed-type predictors, and SVM with an RBF kernel for margin-based classification in settings with a moderate number of predictors relative to the sample size. We did not adopt deep neural networks, given the tabular structure and available sample size, as well as the need for transparent, well-calibrated triage decisions; nor did we use Gaussian mixture models55, since class boundaries in this feature space are unlikely to be well approximated by Gaussian components.
Cross-validation, performance metrics, and statistical assessment
All ML classifiers (LR, RF, and SVM) were evaluated using iterated random subsampling validation, also known as Monte Carlo CV. We conducted 1000 independent iterations; in each iteration, the sample was randomly split into a training set and a test set, with the test proportion fixed at 30% of the sample. Each iteration therefore involved the following:
- A training set, used to fit the model and to run internal resampling for tuning and calibration; and
- A held-out test set, used once for independent out-of-sample evaluation.
Note that the training set was used only for fitting, internal resampling-based tuning, calibration, and thresholding, whereas the test set remains fully held out and is evaluated once at the end to provide an independent out-of-sample assessment of generalization. The elements defined in Table 2 were used to measure performance.
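The splitting scheme can be sketched as pre-generated index sets, which also supports reproducibility. This is a minimal Python illustration, not the authors' R implementation; the seed value and function name are hypothetical, and the sketch draws simple (non-stratified) random splits.

```python
import random

def monte_carlo_splits(n, n_iter=1000, test_frac=0.30, seed=2019):
    """Pre-generate (train_idx, test_idx) pairs for Monte Carlo CV."""
    rng = random.Random(seed)          # fixed seed so the splits can be regenerated
    n_test = round(test_frac * n)
    splits = []
    for _ in range(n_iter):
        idx = list(range(n))
        rng.shuffle(idx)               # fresh random permutation per iteration
        test_idx = sorted(idx[:n_test])
        train_idx = sorted(idx[n_test:])
        splits.append((train_idx, test_idx))
    return splits

splits = monte_carlo_splits(n=100)
train_idx, test_idx = splits[0]
```

Unlike k-fold CV, the test sets of different iterations may overlap, so a case can appear in several test sets across the 1000 iterations.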
Based on Table 2, the performance metrics40,41 computed on the test sets are presented in Table 3.
Sensitivity, specificity, and F1-score respectively capture: (i) the proportion of true hospitalized patients correctly identified by the model; (ii) the proportion of non-hospitalized patients correctly classified; and (iii) a balance between PPV and sensitivity, mitigating both FP and FN. All these metrics simultaneously provide a comprehensive view of the model’s performance, quantifying the trade-off between FN—critical to avoid missed severe cases—and FP—which may strain limited resources.
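As an illustration, these metrics can be computed from the four confusion-matrix counts. The Python sketch below assumes the standard definitions (TP, FP, TN, and FN denoting true/false positives/negatives); the function name and the counts are invented for the example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (all denominators assumed non-zero)."""
    sensitivity = tp / (tp + fn)   # hospitalized cases correctly identified
    specificity = tn / (tn + fp)   # non-hospitalized cases correctly identified
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f1": f1,
        "accuracy": accuracy,
    }

# Invented counts for one held-out test split.
metrics = classification_metrics(tp=40, fp=10, tn=40, fn=10)
```

Because the F1-score is the harmonic mean of PPV and sensitivity, it drops sharply when either of the two is low, which makes it a reasonable tuning criterion when both missed hospitalizations and unnecessary referrals carry a cost.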
Predictions from all iterations were pooled to construct confusion matrices, with rows normalized to highlight the balance between FN and FP. Performance metrics were summarized across iterations, and definitive model selection prioritized a balance of high F1-score, sensitivity, and specificity, aligning with clinical priorities in DENV-endemic regions4.
To assess uncertainty, we summarized ML performance across the 1000 randomized train–test splits generated through Monte Carlo CV, reporting the median (second quartile, Q2) together with the first and third quartiles (Q1 and Q3) for each metric. In parallel, time-to-event results were summarized via time-varying HRs with 95% confidence bands (see Sections “Cox regression model with partial-likelihood function” and “Non-proportional hazards and time-varying effects”), while delay-time models were compared using the values of the log-likelihood, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC).
In terms of statistical testing, group comparisons of delay distributions employed the Wilcoxon rank-sum test, while within-patient comparisons (such as onset \(\rightarrow\) consultation versus consultation \(\rightarrow\) hospitalization among hospitalized patients) used the Wilcoxon signed-rank test. Proportional-hazards assumptions were evaluated using the Schoenfeld residual tests (global and predictor specific). Reported statistical significance and effect estimates are provided in Section “Results”.
Overall, the combination of LR’s interpretability, RF’s ability to capture complex non-linear interactions, and the SVM’s strong margin-based decision boundaries provides a robust predictive toolkit for identifying DENV patients at risk of hospitalization. This prediction layer complements the survival analysis by identifying who is likely to require hospitalization, naturally aligning with models that estimate when hospitalization is expected to occur; see “Time-to-event analysis: modeling delays” and “Cox regression model with partial-likelihood function”.
Time-to-event analysis: modeling delays
Following the CV framework and statistical assessment outlined above, we turn to time-to-event analysis, to characterize hospitalization delays and their probability distributions.
Our delay-time analysis considered three intervals: (i) onset \(\rightarrow\) first consultation (\(T_{\textrm{oc}}\)); (ii) onset \(\rightarrow\) hospitalization (\(T_{\textrm{oh}}\)); and (iii) consultation \(\rightarrow\) hospitalization (\(T_{\textrm{ch}}\)). For descriptive summaries, we also report the time to first healthcare contact, defined as \(T_{\min }=\min \{T_{\textrm{oc}},T_{\textrm{oh}}\}\).
In relation to the time origin and censoring, follow-up started at symptom onset. The event of interest was the first hospitalization. Patients who were not hospitalized were right-censored at the earlier of 30 June 2019 or day 15 post-onset, consistent with the clinical plausibility window used in data quality control. Records with observed \(T_{\textrm{oh}}>15\) days were excluded from the primary analysis, while sensitivity analysis assessed alternative cutoffs.
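The censoring rule can be written out as a small function that returns the follow-up time and event indicator for each case. This Python sketch is illustrative; the function and constant names are hypothetical, and dates are assumed to be already validated.

```python
from datetime import date

STUDY_END = date(2019, 6, 30)  # administrative end of the study period
WINDOW_DAYS = 15               # clinical plausibility window

def survival_record(onset, hospitalization=None):
    """Return (time_in_days, event) for one case, or None if excluded.

    event = 1 for an observed hospitalization, 0 for right-censoring.
    """
    if hospitalization is not None:
        t_oh = (hospitalization - onset).days
        if t_oh > WINDOW_DAYS:
            return None            # excluded from the primary analysis
        return (t_oh, 1)
    follow_up = (STUDY_END - onset).days
    return (min(follow_up, WINDOW_DAYS), 0)

hospitalized = survival_record(date(2018, 3, 1), date(2018, 3, 5))
censored_at_window = survival_record(date(2018, 3, 1))
censored_at_study_end = survival_record(date(2019, 6, 25))
excluded = survival_record(date(2018, 3, 1), date(2018, 3, 20))
```

The study-end date matters only for cases with onset in the last two weeks of June 2019; for all earlier cases the 15-day window is the binding censoring time.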
Consistent with prior work on modeling healthcare delays18,19,53,54,56, we fitted two continuous probability distributions (lognormal and Weibull) for delays and one discrete distribution (negative binomial, NB57,58) for integer day counts. The continuous distributions capture the right-skewed shape of the delays, whereas the NB distribution additionally accommodates overdispersion relative to the Poisson model. Because health complications caused by DENV generally arise within 7–10 days4, delays exceeding 15 days were treated as likely data-entry errors and excluded from the primary analysis. The survival analysis adopted the same 15-day window via administrative censoring as described above. Model fit was assessed using visual diagnostics (such as histograms and empirical cumulative distribution functions, CDFs, against fitted curves) and quantitative criteria (based on the values of log-likelihood, AIC, and BIC).
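To make the comparison criteria concrete, the sketch below fits a lognormal distribution by maximum likelihood and computes AIC and BIC. It is a simplified Python illustration that assumes fully observed, strictly positive delays (no censoring; zero-day delays would need a shift or a discrete model such as the NB); the function names and toy delays are hypothetical.

```python
import math

def fit_lognormal(delays):
    """Closed-form lognormal MLE; returns (mu, sigma, log-likelihood)."""
    logs = [math.log(t) for t in delays]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((v - mu) ** 2 for v in logs) / n
    loglik = sum(
        -v - 0.5 * math.log(2 * math.pi * sigma2) - (v - mu) ** 2 / (2 * sigma2)
        for v in logs
    )
    return mu, math.sqrt(sigma2), loglik

def aic(loglik, n_params):
    """Akaike information criterion: 2p - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: p * log(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2 * loglik

# Invented delays in days, for illustration only.
delays = [1, 2, 4, 8]
mu, sigma, loglik = fit_lognormal(delays)
aic_lognormal = aic(loglik, n_params=2)
bic_lognormal = bic(loglik, n_params=2, n_obs=len(delays))
```

Lower AIC or BIC indicates a better trade-off between fit and complexity; the same two functions apply to the Weibull and NB fits once their log-likelihoods are available.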
Cox regression model with partial-likelihood function
Building on the delay-time definitions above, we now model hospitalization timing using the Cox proportional hazards model56,59,60. CIs for HRs are obtained under the Cox partial-likelihood function. Recall that \(T_{\textrm{oh}}>0\) denotes the time from symptom onset to hospitalization, and let \(\varvec{x}_i=(x_{i1},\dots ,x_{ik})^\top\) be the predictors for patient i. The function \(\lambda (t\mid \varvec{x}_i)\) represents the instantaneous risk of hospitalization at time t, conditional on not yet being hospitalized and on \(\varvec{x}_i\), and is defined as
\[\lambda (t\mid \varvec{x}_i)=\lim _{\Delta t\rightarrow 0^{+}}\frac{\Pr (t\le T_{\textrm{oh}}<t+\Delta t\mid T_{\textrm{oh}}\ge t,\varvec{x}_i)}{\Delta t}.\]
Under the proportional hazards formulation, we have that
\[\lambda (t\mid \varvec{x}_i)=\lambda _0(t)\exp (\varsigma _1x_{i1}+\cdots +\varsigma _kx_{ik}),\tag{1}\]
where \(\lambda _0(t)\) is the unspecified baseline hazard and \(\varsigma _j\) are log–HRs. An estimate \({\widehat{\varsigma }}_j>0\), for \(j=1,\dots ,k\), indicates an increased rate of hospitalization (shorter expected time), whereas \({\widehat{\varsigma }}_j<0\) indicates a reduced hazard (delayed time to hospitalization)60.
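The proportional hazards regression coefficients \(\varsigma =(\varsigma _1,\dots ,\varsigma _k)^\top\) are estimated by maximizing the Cox partial-likelihood function given by
\[L_p(\varsigma )=\prod _{i=1}^{D}\frac{\exp (\varsigma ^\top \varvec{x}_i)}{\sum _{l\in R_i}\exp (\varsigma ^\top \varvec{x}_l)},\]
where the product runs over the D observed hospitalization events, \(\varvec{x}_i\) denotes the predictors of the patient hospitalized at the i-th event, and \(R_i\) is the risk set at the time of the i-th event59,60. Maximizing \(\ell _p(\varsigma )=\log (L_p(\varsigma ))\) yields \({\widehat{\varsigma }}\), with ties in event times being handled with traditional approximations (such as Breslow or Efron). Because \(\lambda _0(t)\) is left unspecified, the Cox regression model is semi-parametric56.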
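A direct evaluation of the partial log-likelihood helps to show how risk sets enter the estimation. The Python sketch below uses the Breslow convention for ties and is meant only as an illustration; in practice the authors relied on the R survival package. The function name and toy data are hypothetical.

```python
import math

def cox_partial_loglik(coefs, times, events, X):
    """Cox partial log-likelihood with the Breslow convention for tied event times.

    coefs  : list of log-HRs, one per predictor
    times  : follow-up time for each patient
    events : 1 if hospitalization was observed, 0 if right-censored
    X      : list of predictor vectors, one per patient
    """
    eta = [sum(c * v for c, v in zip(coefs, x)) for x in X]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation just before time t_i.
        denom = sum(math.exp(eta[l]) for l, t_l in enumerate(times) if t_l >= t_i)
        loglik += eta[i] - math.log(denom)
    return loglik

# Toy data: three patients, one binary predictor (invented values).
times = [1, 2, 3]
X = [[0], [1], [0]]
ll_all_events = cox_partial_loglik([0.0], times, [1, 1, 1], X)
ll_with_censoring = cox_partial_loglik([0.0], times, [1, 0, 1], X)
ll_two_patients = cox_partial_loglik([math.log(2.0)], [1, 2], [1, 1], [[1], [0]])
```

With all coefficients at zero, each event contributes minus the log of its risk-set size, so the three-event example evaluates to \(-\log 6\); the baseline hazard never appears, which is what makes the model semi-parametric.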
Non-proportional hazards and time-varying effects
The assumption of proportional hazards stated in (1) requires that the HR, namely \(\exp (\varsigma _j)\), remains constant over time. In practice, some DENV symptoms (such as abdominal pain) may intensify during later stages of the illness, thereby violating this assumption4. We assessed proportionality using tests based on the Schoenfeld residuals60 and identified time-dependent effects for certain predictors. Accordingly, we extended the Cox model to incorporate time-varying coefficients as
\[\lambda (t\mid \varvec{x}_i)=\lambda _0(t)\exp (\varsigma _1(t)x_{i1}+\cdots +\varsigma _k(t)x_{ik}),\tag{2}\]
where each \(\varsigma _j(t)\), for \(j=1,\dots ,k\), is modeled as a function of time t. Several representations are available for \(\varsigma _j(t)\), including piecewise-constant functions, spline-based methods, and kernel smoothers60. We report HR(t) with 95% pointwise confidence bands, with estimation details provided in Section “Cox regression model with partial-likelihood function”. Conceptually, \(\varsigma _j(t)\) captures how the effect of a given symptom on the risk of hospitalization evolves over time since symptom onset. Allowing \(\varsigma _j(t)\) to vary over time reflects the natural course of DENV more accurately, as certain warning signs (such as vomiting) may be more predictive of rapid deterioration during days 4–6 than during other periods18,19,54.
By fitting the time-varying model established in (2), we identify the periods during which specific predictors increase the HR, given by \(\textrm{HR}(t)=\exp (\varsigma _j(t))\) for predictor j, thereby guiding clinicians and public health officials in determining critical windows for interventions, such as timely hospitalization or intensified monitoring. Thus, unlike parametric survival models (based on lognormal and Weibull distributions), the semi-parametric Cox model, extended to accommodate non-proportional hazards, allows for a more nuanced analysis of the relationships between symptom onset, disease progression, and the timing of hospitalization. This analysis complements the classification-based methods described in Section “Classification methods: predicting the risk of hospitalization”, which identify who is likely to require inpatient care, by clarifying when such care is most likely to be needed.
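The simplest of the representations mentioned for \(\varsigma _j(t)\), a piecewise-constant function, can be sketched as follows. This Python example is purely illustrative: the cut points (days 3 and 6) and coefficient values are hypothetical and are not estimates from the study.

```python
import math
from bisect import bisect_right

def piecewise_hr(t, cut_points, coefs):
    """HR(t) = exp(coef(t)) for a piecewise-constant log-HR.

    cut_points : increasing interval boundaries in days since onset
    coefs      : one log-HR per interval, so len(coefs) == len(cut_points) + 1
    """
    if len(coefs) != len(cut_points) + 1:
        raise ValueError("need exactly one coefficient per interval")
    return math.exp(coefs[bisect_right(cut_points, t)])

# Hypothetical log-HRs for a warning sign on [0, 3), [3, 6), and [6, inf) days.
cut_points = [3, 6]
coefs = [0.0, math.log(2.0), 0.0]
hr_day2 = piecewise_hr(2, cut_points, coefs)
hr_day4 = piecewise_hr(4, cut_points, coefs)
hr_day7 = piecewise_hr(7, cut_points, coefs)
```

In this hypothetical setting the symptom doubles the hazard of hospitalization only between days 3 and 6, which is the kind of critical window that a constant HR would average away.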
Software and implementation
All data preprocessing, statistical modeling, and validation procedures were conducted using the R statistical environment (version 4.2.2)61. The analysis utilized specialized R packages: caret and e1071 for CV, hyperparameter tuning, and SVM modeling; randomForest for training and evaluating RF models42; survival for fitting Cox proportional hazards models (including time-varying coefficients)60; and flexsurv for parametric survival modeling using the lognormal and Weibull distributions56.
Hyperparameters—including C and \(\gamma\) for SVM, the number of trees for RF, and penalization terms for LR—were optimized via grid search within a Monte Carlo CV framework. Model diagnostics included residual analysis (based on Schoenfeld and martingale residuals) to assess the assumption of proportional hazards in Cox models, partial-likelihood deviance for model-fit evaluation, and values of log-likelihood, AIC, and BIC for assessing parametric survival models. This provided a systematic and reproducible framework for evaluating both classification performance and survival-based inference.
As specified above, all classifiers (LR, RF, SVM) were trained and evaluated using Monte Carlo cross-validation with 1000 randomized train–test split iterations. Within each training split, hyperparameter search used the following functions: caret::train (argument tuneGrid) to define candidate grids and orchestrate internal resampling on the training split; for LR, glm (and glmnet in a penalized-LR sensitivity analysis); for RF, randomForest::randomForest with arguments mtry, ntree, and nodesize; and for SVM-RBF, e1071::svm with hyperparameters C and \(\gamma\) (arguments cost and gamma). The selection criterion during internal resampling was the F1-score, and the best-scoring configuration was retained.
When calibrated probabilities were required, SVM decision values were calibrated via Platt scaling fitted on the training split and then applied to the corresponding held-out test split (no leakage). Engines were glm for unpenalized LR (with glmnet being used in a penalized-LR sensitivity analysis), randomForest for RF, and e1071 for SVM. Reproducibility was ensured with fixed seeds and pre-generated random split indices. All preprocessing (encoding, scaling, and imputation in sensitivity analysis) was fit only on training data and applied to the held-out test set to prevent leakage.
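The leakage-free resampling scheme described above can be sketched as follows. This is an illustrative Python outline, not the authors' R code: the seed, the 30% test fraction, and the helper names (make_splits, fit_scaler, apply_scaler) are assumptions introduced here, and the toy feature is not study data.

```python
import random
import statistics


def make_splits(n, n_iter, test_frac=0.3, seed=2024):
    """Pre-generate (train_idx, test_idx) pairs with a fixed seed.

    test_frac and seed are assumed values chosen for illustration.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n * test_frac))
    splits = []
    for _ in range(n_iter):
        idx = list(range(n))
        rng.shuffle(idx)
        splits.append((sorted(idx[n_test:]), sorted(idx[:n_test])))
    return splits


def fit_scaler(values):
    """Estimate centering and scaling parameters on training data only."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mu, sd


def apply_scaler(values, params):
    """Apply previously fitted parameters without re-estimating them."""
    mu, sd = params
    return [(v - mu) / sd for v in values]


x = [float(v) for v in range(20)]          # toy feature, not study data
splits = make_splits(len(x), n_iter=5)     # the study used 1000 iterations
for train_idx, test_idx in splits:
    params = fit_scaler([x[i] for i in train_idx])            # fit on train
    x_test = apply_scaler([x[i] for i in test_idx], params)   # apply to test
```

Generating the split indices once, from a fixed seed, lets every classifier be evaluated on identical partitions, and fitting the scaler inside the loop keeps test-set information out of preprocessing.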
Algorithm 1 summarizes the steps of the dual methodology, and Fig. 3 presents a flow diagram of the proposed workflow, enabling a joint analysis of hospitalization risk due to DENV. This methodology facilitates the estimation of both the probability and the timing of hospitalization, offering a comprehensive framework for analyzing hospitalization dynamics.
Flowchart of data preprocessing/quality control, CV without leakage, model families (LR/RF/SVM; time-varying Cox regression), detailed steps (2.1–2.3) per model, and integration; where, in step 2.3.A, “margin” refers to the distance between an observation and the model’s decision boundary in SVM, while SHAP (SHapley Additive exPlanations) is a modern model-agnostic interpretability technique derived from cooperative game theory, used here for SVM to identify which symptoms or socio-environmental factors most increase the predicted risk of hospitalization.
Next, in Section “Results”, we present the empirical results for both models—classification and time-to-event—together with uncertainty summaries and interpretability outputs.
Results
This section presents the results of our analysis of DENV cases in NS, Colombia, from 2015 to 2019, highlighting temporal trends, patient characteristics, and hospitalization risk. We also examine delays in seeking medical care and apply Cox regression models with time-varying coefficients to evaluate how socio-environmental and clinical predictors influence the timing of hospitalization.
DENV trends in Norte de Santander, Colombia
Figure 4 shows the time-series of weekly reported DENV cases in NS, revealing two distinct phases: (i) an endemic period from mid-2016 to mid-2018, and (ii) two epidemic outbreaks—one in 2015 and a more severe one in 2019. During these phases, a total of 8,814 confirmed cases were reported, of which 156 were classified as severe DENV infections. The majority of cases (6,358; 72.1%) occurred in the ORI subregion, likely due to its higher population density and elevated average temperatures. The OCC subregion reported the second largest number of cases (1,751; 19.9%).
In contrast, the SOCC subregion recorded only 29 DENV cases, with no cases classified as severe, while the C subregion reported 53 cases, only one of which was severe, as shown in Table 4.
In terms of hospitalization, more than half of the DENV-infected patients (5,495 cases; 62.3%) required inpatient care due to disease progression. Severe DENV cases accounted for 2.8% of these hospitalized patients: all severe cases and 61.7% of non-severe cases resulted in hospitalization.
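These percentages follow directly from the reported counts. The short Python check below is illustrative only (the authors worked in R); it uses the counts stated in the text together with the statement that all severe cases were hospitalized, and the variable names are introduced here.

```python
total_cases = 8814        # confirmed cases (from the text)
severe_cases = 156        # severe cases (from the text)
hospitalized = 5495       # hospitalized cases (from the text)

# All severe cases were hospitalized, so the remaining admissions are non-severe.
hosp_non_severe = hospitalized - severe_cases
non_severe_cases = total_cases - severe_cases

pct_hospitalized = 100 * hospitalized / total_cases
pct_severe_among_hosp = 100 * severe_cases / hospitalized
pct_non_severe_hosp = 100 * hosp_non_severe / non_severe_cases

print(round(pct_hospitalized, 1),
      round(pct_severe_among_hosp, 1),
      round(pct_non_severe_hosp, 1))
```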
Table 4 summarizes the annual number of DENV cases and prevalence (per 100,000 inhabitants) across subregions from 2015 to 2019. All subregions experienced year-to-year variation in prevalence over the study period. Overall, the ORI subregion accumulated the highest case counts, followed by OCC, N, SORI, C, and SOCC, with SOCC reporting the fewest cases. The subregion with the highest prevalence varied across years: OCC recorded the highest prevalence in 2015–2017 (233.5, 329.1, and 63.2 per 100,000, respectively), whereas ORI led in 2018–2019 (308.9 and 99.2 per 100,000).
Socio-environmental characteristics of patients
Table 5 summarizes the socio-environmental characteristics of patients infected with DENV (including severe cases) by sex, age group, settlement type, and occupation. The proportions of male and female patients were similar across non-severe and severe DENV cases. More than half of the patients in both groups were in the early childhood (0–5 years) or childhood (6–11 years) categories. Approximately 90% of cases originated from municipal seats, and around 91.5% of DENV cases (92.9% among severe cases) involved patients in elementary occupations.
Hospitalization of patients based on symptomatology
Tables 6 and 7 present the symptom profiles and key clinical findings for patients diagnosed with DENV, including those with severe manifestations. Fever was reported in all patients, followed by common symptoms such as myalgia, arthralgia, and headache. Notably, abdominal pain and vomiting were more prevalent among patients with severe DENV, suggesting that these symptoms may serve as potential indicators of disease severity.
Table 8 reports performance metrics for the LR, RF, and SVM models in predicting hospitalization based on symptom and clinical factors. Metrics include accuracy, F1-score, NPV, PPV, sensitivity, and specificity.
To compare the performance of the three classification models in greater detail, Fig. 5 presents box plots for the primary metrics used in their evaluation. These plots are based on 1000 CV iterations per model, providing a robust assessment of consistency across accuracy, F1-score, sensitivity, specificity, NPV, and PPV.
For completeness, Table 9 shows confusion matrices normalized by row (rows sum to 1) for the LR, RF, and SVM models. Each matrix is reconstructed from the median sensitivity and specificity reported in Table 8 (diagonal cells are the median TP/TN rates and off-diagonals are the complements). This display makes the FN/FP balance explicit without assuming class prevalence.
In clinical terms, the median FN rate (missed hospitalized cases) is lowest for RF (0.1281) compared with LR (0.1916) and SVM (0.1904), highlighting RF’s sensitivity advantage under the reported CV medians.
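The reconstruction of a row-normalized confusion matrix from sensitivity and specificity can be sketched as below. This Python fragment is an illustration under stated assumptions, not the authors' code: sensitivities are derived from the FN rates quoted above, whereas the specificity value is a hypothetical placeholder, because the Table 8 medians are not reproduced in this section.

```python
def row_normalized_confusion(sensitivity, specificity):
    """Rows: actual class (hospitalized, not hospitalized); each row sums to 1.

    Columns: predicted hospitalized, predicted not hospitalized.
    """
    return [
        [sensitivity, 1.0 - sensitivity],   # TP rate, FN rate
        [1.0 - specificity, specificity],   # FP rate, TN rate
    ]


# Median FN rates quoted in the text; sensitivity = 1 - FN rate.
fn_rate = {"LR": 0.1916, "RF": 0.1281, "SVM": 0.1904}
sensitivity = {m: 1.0 - fn for m, fn in fn_rate.items()}

# HYPOTHETICAL specificity, used only to complete the example; the actual
# medians are reported in Table 8.
example_specificity = 0.80

for model, sens in sensitivity.items():
    print(model, row_normalized_confusion(sens, example_specificity))
```

Because each row is normalized separately, the matrix depends only on the two rates and not on the proportion of hospitalized patients in the test split.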
While all models showed similar specificity, they diverged in sensitivity and NPV, that is, in how reliably they identify patients who require hospitalization. Among the three models, RF consistently achieved the best overall performance (except for PPV), suggesting that it may offer the most effective predictions when prioritizing accuracy over interpretability. Nevertheless, LR retains an important advantage in interpreting the direct influence of each predictor on hospitalization risk, which can be crucial where the rationale behind a model decision is as important as the decision itself.
Figure 6 summarizes feature importance from the RF classifier, quantified by the mean decrease in the Gini index across the forest (larger values indicate greater discriminative power). The top-ranked predictors were low platelet count (thrombocytopenia), abdominal pain, and vomiting (637.25, 611.25, and 291.64, respectively), consistent with clinical warning signs for severe DENV. Additional contributors included diarrhea (106.01), retro-orbital pain (71.15), drowsiness (40.38), and rash (38.06), albeit with lower importance scores.
Regarding model explainability, we report adjusted ORs with 95% CIs for LR (Table 10), variable importance for RF (Fig. 6), and time-varying HR(t) with confidence bands for the Cox model (Fig. 10). Together, these outputs link who is at higher risk (via ORs from LR and variable importance from RF) with when hospitalization becomes more likely (via HR(t) from Cox regression), supporting threshold selection for triage and resource allocation.
The RF model’s ability to prioritize features by predictive contribution makes it a valuable asset for clinical decision support. By uncovering complex interactions among symptoms, it facilitates targeted interventions for patients at higher risk of hospitalization—particularly important where resources are constrained.
For a more granular interpretation of how patient symptoms and socio-environmental predictors influence the risk of hospitalization, we fitted an LR model and reported adjusted ORs. To screen for redundancy and potential multicollinearity prior to LR estimation, we computed the Pearson correlation (equivalently, the phi coefficient) for binary–binary associations and Cramér’s V for nominal–nominal and nominal–binary relationships. The binary–binary correlations ranged from \(-0.15\) to 0.39 (Q1 \(=-0.01\), Q2 \(=0.06\), Q3 \(=0.09\)), with the strongest pairs being abdominal pain with vomiting (0.39), abdominal pain with low platelet count (0.31), and diarrhea with vomiting (0.26). Cramér’s V ranged from 0.01 to 0.22 (Q1 \(=0.03\), Q2 \(=0.05\), Q3 \(=0.08\)), with the largest associations being \(V=0.22\) (age group versus headache), \(V=0.13\) (age group versus rash), and \(V=0.11\) (age group versus retro-orbital pain). These low-to-moderate associations suggest weak dependency among predictors, reducing concern about severe multicollinearity.
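For readers unfamiliar with these association measures, the following Python sketch (illustrative; the study used R) computes the signed phi coefficient for a 2 × 2 table and Cramér’s V for a general contingency table. The function names and the toy counts are assumptions made for this example and do not come from the study data.

```python
import math


def cramers_v(table):
    """Cramér's V from an r x c contingency table (list of rows of counts)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


def phi(table):
    """Signed phi coefficient for a 2 x 2 table [[a, b], [c, d]].

    Equals the Pearson correlation between the two binary variables.
    """
    (a, b), (c, d) = table
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


# Toy 2 x 2 table (symptom A present/absent by symptom B present/absent);
# the counts are made up for illustration.
toy = [[30, 20], [10, 40]]
print(round(phi(toy), 3), round(cramers_v(toy), 3))
```

For a 2 × 2 table, Cramér’s V equals the absolute value of phi, so the two measures sit on a comparable scale when screening mixed binary and nominal predictors.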
Table 10 reports adjusted ORs with 95% CIs from the definitive LR model, quantifying how each predictor is associated with the odds of hospitalization while holding other predictors constant (relative to the stated reference categories). ORs greater than 1.0 indicate higher odds of hospitalization, whereas values below 1.0 indicate lower odds. For example, abdominal pain (OR \(=8.4232\), 95% CI: 7.1791–9.9096) and low platelet count (OR \(=9.3092\), 95% CI: 7.9653–10.9110) are strongly associated with increased odds of hospitalization, marking them as key warning signs. Other symptoms, such as vomiting (OR \(=2.9380\), 95% CI: 2.3819–3.3879), mucosal bleeding (OR \(=2.4036\), 95% CI: 1.5660–3.7765), diarrhea (OR \(=1.7371\), 95% CI: 1.3865–2.1856), and high hematocrit (OR \(=1.6429\), 95% CI: 1.1025–2.5125), also increase the risk of hospitalization. However, their effect sizes are smaller compared with abdominal pain or low platelet count. For hypotension (OR \(=2.1205\), 95% CI: 0.9654–5.1912), the CI includes 1.0, indicating uncertainty in direction or magnitude. Conversely, rash (OR \(=0.6710\), 95% CI: 0.5834–0.7716), headache (OR \(=0.5985\), 95% CI: 0.5104–0.7012), and retro-orbital pain (OR \(=0.5702\), 95% CI: 0.4926–0.6598) show adjusted ORs below 1.0. Drowsiness also shows an adjusted OR less than 1.0 (OR \(=0.2312\), 95% CI: 0.1611–0.3328) once correlated warning signs (such as abdominal pain or low platelet count) are accounted for. This adjusted association does not contradict the higher crude frequency of drowsiness among severe cases reported in Table 6. The outcome modeled here is hospitalization (not clinical severity), and most hospitalized patients are non-severe, while drowsiness often co-occurs with other markers that more directly drive hospitalization.
Consistently, the time-varying Cox regression model indicates that the hazard associated with drowsiness is less than 1.0 for much of the early course (Q1 HR \(=0.65\), median HR \(=0.93\)) and only exceeds 1.0 later (Q3 HR \(=1.31\); see Table 12 and Fig. 10). Therefore, we do not interpret drowsiness as “protective” but rather as a symptom whose independent contribution to hospitalization risk is attenuated once correlated warning signs are considered.
A note on interpretation: adjusted ORs reflect conditional associations given all predictors in the model and may therefore diverge from crude symptom frequencies (see Table 6). In particular, drowsiness frequently co-occurs with stronger warning signs, and thus its independent adjusted association with hospitalization can be less than one even when its crude frequency is higher among severe cases.
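The link between an LR coefficient and its adjusted OR can be made concrete with a small Python sketch (illustrative; the authors used R). It assumes Wald-type intervals, which may differ slightly from the interval method behind Table 10, and the function names are introduced here. The example back-computes the log-odds coefficient and an approximate standard error from the OR reported for abdominal pain.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(beta, se, z=Z_95):
    """Odds ratio and Wald confidence limits from a log-odds coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def implied_beta_se(or_hat, lower, upper, z=Z_95):
    """Recover the coefficient and an approximate SE from a reported OR and CI.

    Assumes the interval is symmetric on the log scale (Wald-type).
    """
    beta = math.log(or_hat)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Abdominal pain, as reported in the text: OR = 8.4232 (95% CI 7.1791-9.9096).
beta, se = implied_beta_se(8.4232, 7.1791, 9.9096)
or_hat, lo, hi = odds_ratio_ci(beta, se)
print(f"beta = {beta:.3f}, SE = {se:.3f}, OR = {or_hat:.2f} ({lo:.2f}, {hi:.2f})")
```

The reconstructed limits are close to, but not identical with, the reported ones, which is expected if the published interval was not computed as an exact Wald interval.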
Analysis of delays in seeking medical care
Table 11 provides summary statistics for delays between symptom onset and healthcare contact for patients with DENV and severe DENV. This summary includes the following intervals: (i) time from symptom onset to the initial medical consultation; (ii) time from symptom onset to hospitalization; and (iii) time from consultation to hospitalization. Each interval is reported with the mean, standard deviation (SD), median, interquartile range (IQR), and range, offering a view of central tendency and variability in both non-severe and severe DENV cases. Figure 7 shows histograms of consultation and hospitalization delays, revealing a skewed distribution of these delays, in which most patients seek care or deteriorate shortly after symptom onset.
Group comparisons of delay distributions used the Wilcoxon rank–sum test (two-sided, \(\alpha =0.05\)), whereas within-patient comparisons used the Wilcoxon signed–rank test. At \(\alpha =0.05\), no statistically significant differences were found in the time to initial consultation or in the time to hospitalization between non-severe and severe DENV cases, suggesting that disease severity did not substantially affect the timing of first medical contact. By contrast, among patients who were hospitalized, onset-to-consultation times were significantly longer than consultation-to-hospitalization times (\(W=15{,}536{,}992\), p-value = 0.0003; signed–rank test). Furthermore, hospitalized patients exhibited shorter intervals from symptom onset to first consultation than non-hospitalized patients (\(W=7{,}325{,}610\), p-value \(<0.0001\); rank–sum test), indicating a more urgent care-seeking trajectory.
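As an illustration of the rank–sum statistic, the Python sketch below computes W for two independent samples using midranks for ties, following the convention in which W is the rank sum of the first sample minus its minimum possible value. It is a didactic outline with made-up delays and hypothetical function names; it does not compute p-values and does not reproduce the W values reported above.

```python
def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_w(x, y):
    """W = (sum of ranks of x in the pooled sample) - n_x (n_x + 1) / 2."""
    ranks = midranks(list(x) + list(y))
    n_x = len(x)
    return sum(ranks[:n_x]) - n_x * (n_x + 1) / 2


# Toy delays in days (made up): group x tends to have shorter delays than y.
x = [2, 3, 3, 5]
y = [4, 6, 7, 3]
print(rank_sum_w(x, y), rank_sum_w(y, x))
```

The two statistics always sum to the product of the sample sizes, so a small W for the first group indicates that its values tend to rank below those of the second group.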
Figure 8 shows the empirical CDFs together with three fitted models, namely lognormal and Weibull (continuous) and NB (discrete), for three key intervals: (i) onset \(\rightarrow\) consultation, (ii) onset \(\rightarrow\) hospitalization, and (iii) consultation \(\rightarrow\) hospitalization. Across intervals, when delays were treated as integer day counts, the NB model achieved the lowest AIC and BIC, while among continuous specifications, the Weibull distribution outperformed the lognormal distribution. Model comparisons based on AIC and BIC were conducted between discrete and continuous distributions, with cross-model assessment guided by visual diagnostics. For a broader view of delay patterns, Fig. 9 reports pooled estimates from representative models (Weibull for the continuous case and NB for the discrete case).
Plots of empirical CDFs for each delay-time with fitted lognormal, NB, and Weibull models, for intervals: (a) onset \(\rightarrow\) consultation; (b) onset \(\rightarrow\) consultation among hospitalized cases; (c) onset \(\rightarrow\) hospitalization; and (d) consultation \(\rightarrow\) hospitalization, with panels (a–c) excluding outpatient cases, based on Colombian DENV data.
Plots of the frequency distributions of delays for: (a) the continuous Weibull model and (b) the discrete NB model based on Colombian DENV data. The delays shown include consultation time (stratified by hospitalization status) and, for hospitalized cases only, the hospitalization and consultation-to-hospitalization delays.
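The information criteria used in the delay-model comparison can be sketched in Python as follows (illustrative only; the study fitted its models in R). To keep the example free of numerical optimization, it compares a lognormal model with an exponential model, which is a Weibull with shape fixed at 1 and serves here as a stand-in; the Weibull and NB fits used in the study would require iterative estimation. The delays and function names are made up for this example.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2 * log_lik


def lognormal_loglik(data):
    """Maximized log-likelihood of a lognormal model (closed-form MLE)."""
    logs = [math.log(v) for v in data]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((v - mu) ** 2 for v in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * sigma2) - 0.5 * n


def exponential_loglik(data):
    """Maximized log-likelihood of an exponential model (Weibull, shape 1)."""
    n = len(data)
    rate = n / sum(data)
    return n * math.log(rate) - n


delays = [1, 2, 2, 3, 3, 4, 4, 5, 6, 9]   # toy positive delays in days
n = len(delays)
for name, ll, k in (("lognormal", lognormal_loglik(delays), 2),
                    ("exponential", exponential_loglik(delays), 1)):
    print(f"{name}: AIC = {aic(ll, k):.2f}, BIC = {bic(ll, k, n):.2f}")
```

Lower values indicate a better trade-off between fit and complexity; BIC penalizes additional parameters more heavily than AIC once the sample exceeds about eight observations.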
Influence of socio-environmental and symptom predictors on delay to hospitalization
To assess the effects of socio-environmental and clinical predictors on the time from symptom onset to hospitalization, we fitted a Cox regression model with non-proportional hazards. Schoenfeld residuals revealed violations of the assumption of proportional hazards, prompting the incorporation of time-dependent HRs to more accurately capture the dynamic effects of each predictor. By allowing coefficients to vary over time, the model captures potential shifts in risk across different stages of the disease. Figure 10 presents the time-varying coefficients and CIs for each predictor included in the definitive model.
The results highlight pronounced time-dependent effects for several predictors, confirming that a traditional Cox regression (with fixed coefficients) would be unsuitable. CIs widen at later time periods for some predictors because fewer observations remain in the tail of the delay distribution, which reduces the reliability of estimates in those periods. Table 12 summarizes the quartiles and extremes of coefficients and HRs for each predictor. HRs above 1.0 indicate an elevated hazard (shorter time to hospitalization), while HRs below 1.0 indicate a lower hazard (longer time to hospitalization). For time-varying coefficients, summaries (minimum, Q1, median, Q3, maximum) are computed across the time grid (days since onset). For multi-level predictors (for example, subregion and age group), effects are reported relative to their reference category, with category-specific HR(t) curves shown in Fig. 10. This figure displays wider confidence bands after Q3 for most predictors, reflecting data sparsity at later times, and thus HR estimates in these tails should be interpreted with caution.
In particular, for mucosal bleeding, we observed an extremely high HR value (approaching 3,892) in the last days of the follow-up window (around day 15). This elevated HR reflects the small number of cases with that symptom at later time points, which leads to numerical instability and wider CIs. Thus, while the model highlights the importance of mucosal bleeding as a severe symptom, this extreme HR value should be interpreted with caution, as it may partly reflect data sparsity rather than a literal increase of such magnitude.
For instance, symptoms such as vomiting (HR range: approximately 1.26–1.36), low platelet count (1.33–2.02), and abdominal pain (1.27–1.77) show an increasing hazard of hospitalization over time. Practically speaking, patients presenting with abdominal pain have at least a 27% higher instantaneous hazard of earlier hospitalization, with the excess hazard rising to about 77% when the symptom appears in the later stages of illness. Similarly, a low platelet count during the first week corresponds to roughly a 44% higher hazard of prompt hospitalization, and this hazard can more than double after day 10.
In contrast, other symptoms appear to delay or reduce the urgency of hospitalization. For example, rash (HR range: 0.64 to 0.99) is linked to a 1–36% lower instantaneous hazard of hospitalization, suggesting that although rash can be a marker of disease severity, it does not necessarily prompt an urgent inpatient response. Likewise, hepatomegaly (minimum HR = 0.16; maximum HR ≈ 1.46), retro-orbital pain (0.75 through 1.13), and drowsiness (0.49 through 1.31) show HRs below 1.0 over at least half of the time grid, implying a reduced hazard of early hospitalization during portions of the disease course. It is important to emphasize that HRs below 1.0 do not imply clinical irrelevance. Instead, they suggest that the symptom is associated with delayed or less urgent progression, particularly in the short term.
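The percentage statements above are a direct transformation of the HRs: the relative change in the instantaneous hazard is \(100\times (\textrm{HR}-1)\). A short Python illustration (the study itself used R; the function name is introduced here), using the HR ranges quoted in the text:

```python
def pct_change_in_hazard(hr):
    """Percent change in the instantaneous hazard implied by a hazard ratio."""
    return 100.0 * (hr - 1.0)


# HR ranges as quoted in the text (lower and upper end of each range).
hr_ranges = {
    "vomiting": (1.26, 1.36),
    "low platelet count": (1.33, 2.02),
    "abdominal pain": (1.27, 1.77),
    "rash": (0.64, 0.99),
}

for symptom, (low, high) in hr_ranges.items():
    print(f"{symptom}: {pct_change_in_hazard(low):+.0f}% to "
          f"{pct_change_in_hazard(high):+.0f}%")
```

Negative values correspond to HRs below 1.0, that is, a lower hazard of hospitalization at that point in the disease course rather than a lower cumulative probability.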
In addition, certain predictors have HR values close to 1.0, namely diarrhea (Q1 = 0.96, Q2 = 1.03, Q3 = 1.22), subregion (Q1 = 0.89, Q2 = 1.01, Q3 = 1.11), and age group (Q1 = 0.97, Q2 = 1.05, Q3 = 1.11), indicating effects that neither strongly accelerate nor delay hospitalization. Together, these results highlight the stronger influence of specific clinical symptoms, relative to socio-environmental predictors, in driving the urgency of hospitalization.
Consistent with other studies12, retro-orbital pain and vomiting frequently signal severe DENV progression. Furthermore, symptoms such as rash, while still clinically relevant, may correlate with a slower path to hospitalization. By synthesizing socio-environmental and clinical predictors, the Cox non-proportional hazards model provides valuable insights for triaging cases and optimizing limited medical resources. Patients presenting with high-hazard symptoms can be prioritized for early intervention, potentially improving clinical outcomes and resource allocation in regions with a high DENV burden.
Figure 11 offers an overview of the key results, including temporal trends, patient demographics, clinical presentations, and the main predictors driving hospitalization risk and timing.
Discussion
This section places our findings in a broader epidemiological and socio-environmental context, emphasizing how factors such as climate, settlement patterns, and migratory flows influence DENV transmission in NS. We also discuss how these findings, together with the predictive modeling results, can guide more effective and timely public health interventions.
Main results
DENV transmission in NS during 2015–2019 exhibited two epidemic peaks: one in 2015–2016 and another starting in 2019, with most cases concentrated in the ORI subregion and over half (62.3%) requiring hospitalization. Modeling identified abdominal pain, vomiting, and low platelet count (thrombocytopenia) as key risk factors for hospitalization, with the RF model achieving the highest accuracy. Delay distributions were right-skewed, with the NB (discrete, integer day counts) and Weibull (continuous) models providing the best fit, while the time-varying Cox analysis revealed increasing risk of hospitalization over time for abdominal pain and thrombocytopenia.
Epidemiological context and case distribution
DENV has been endemic across much of the Americas for several decades, whereas newer arboviruses such as chikungunya and Zika emerged in the region around 2015 (ref. 14). In Colombia, Aedes aegypti mosquitoes serve as the primary vectors for these viruses and are present in approximately 80% of the national territory. Environmental conditions such as elevated temperatures, rainfall variability, and household water storage practices contribute to the proliferation of Aedes aegypti mosquitoes62,63,64.
Our investigation in NS identified two epidemic peaks: one between 2015 and 2016, and another starting in 2019, with the latter being more pronounced. These peaks are consistent with other documented DENV outbreaks during the same period3,65. In 2019, NS was listed among the Colombian departments with the highest DENV incidence66, a pattern likely driven by climatic conditions (notably elevated temperatures) and increased water storage practices.
Such patterns, especially during El Niño–Southern Oscillation events, can intensify mosquito breeding and trigger DENV outbreaks65, as observed in the Department of Antioquia3. According to Colombian surveillance data, the epidemic curve for 2019 exceeded the historical baseline (2011–2018), prompting a state of alert in regions like the ORI subregion, which contains the capital city of Cúcuta65. As NS borders Venezuela, migration across this frontier adds further challenges to DENV surveillance and control. Underreporting of DENV is likely more common among transient populations who may not seek formal diagnostic testing65, complicating accurate surveillance.
The Ocaña municipality, located in the OCC subregion (commonly referred to as the western area of NS), has been labeled a moderate-risk zone. It serves as an important corridor between Cúcuta and Bucaramanga in Colombia and has been proposed as a possible entry point for arboviruses from Venezuela into the broader Santander region67. Across NS, the ORI subregion concentrated the largest share of reported DENV cases, followed by the OCC subregion. Both subregions also showed elevated prevalence in several years; see Fig. 1 and Table 4.
Settlement type was likewise relevant: urban municipal seats accounted for a higher absolute number of reported cases—consistent with patterns reported for Antioquia and other Colombian regions—reflecting the concentration of cases in urban areas. In addition, intermittent piped water and frequent household water storage in urban neighborhoods create peri-domestic breeding sites that favor the persistence of Aedes aegypti mosquitoes.
These patterns show how population density and urban infrastructure—along with domestic water-storage practices—can amplify DENV transmission in specific local contexts.
Comparison with the Department of Antioquia, Colombia
To contextualize the findings from NS, we compared them with those from a similar study conducted in the Department of Antioquia3, which examined socio-environmental and epidemiological predictors associated with DENV. Both regions experience endemic DENV and show comparable outbreak patterns at the national level. Nevertheless, distinct hospitalization rates, symptom profiles, and delays in seeking care suggest that regional differences can strongly influence disease dynamics and clinical outcomes.
According to ref. 3, certain subregions in Antioquia, such as Bajo Cauca and Magdalena Medio, reported lower hospitalization rates for non-severe DENV (48.4% and 39.9%, respectively) than those observed in NS for non-severe cases (61.7%; 62.3% overall when severe and non-severe are combined). These differences may reflect variation in healthcare infrastructure, accessibility, and public health awareness. Likewise, the shorter median delays before medical evaluation and hospitalization in Antioquia point to more streamlined health services or effective community outreach programs, relative to NS.
Ref. 3 further identifies vomiting and abdominal pain as markers of severe DENV in Antioquia, mirroring our results in NS. Relative frequencies for other symptoms (for example, retro-orbital pain, drowsiness) differed across regions, which may reflect variation in clinical reporting practices, demographic composition, or underlying health conditions (comorbidity profiles). Although quantitative data on hospital bed availability, migratory flows, and climate variables are not available for a thorough comparison, contextual factors appear to play a substantial role. Antioquia, being more economically developed, generally benefits from better-resourced healthcare facilities and broader awareness campaigns.
In contrast, NS faces socio-economic hurdles, including marked rural–urban disparities and high migratory inflows from Venezuela, which can delay patient care-seeking and exert pressure on healthcare resources. Cultural attitudes toward medical care also appear to play a role. In Antioquia, greater health literacy or stronger trust in formal healthcare could encourage earlier medical consultation. Residents of NS, by comparison, may rely more heavily on traditional remedies or exhibit lower trust in institutional healthcare, potentially contributing to higher hospitalization rates when disease severity increases.
Furthermore, differing intensities of local public health and vector control programs can shape transmission and clinical outcomes. Overall, this broad comparison underscores that public health measures need to be adapted to regional realities. Despite similarities in climate and vector ecology, regional differences in healthcare capacity, socio-economic conditions, and cultural norms highlight the need for tailored strategies to reduce delays in consultation and improve DENV management. Future work incorporating quantitative data on hospital resources, population mobility, and environmental parameters would provide a more definitive foundation for cross-regional comparisons.
Predictive modeling and symptom analysis
We first characterized delays. Across intervals, when delays were treated as integer day counts, the NB model achieved the lowest AIC and BIC. Among continuous specifications, the Weibull model outperformed the lognormal model. AIC and BIC were compared and cross-model assessment was guided by visual diagnostics; see Figs. 8 and 9. These assessments are consistent with overdispersed right-skewed delay distributions. The pronounced right tail implies that summary measures beyond the mean (such as medians and upper quantiles69) are informative for operational planning. Socio-environmental patterns (see Table 5) showed broadly similar sex distributions—with a moderate female predominance among severe cases—while urban municipal seats accounted for the largest absolute number of reported cases, consistent with other Colombian settings. Younger patients—particularly children (0–11 years)—were more frequently affected by both non-severe and severe DENV.
Symptom profiles indicated high prevalence of fever, myalgia, and headache, whereas severe cases more commonly presented with abdominal pain, vomiting, and thrombocytopenia. In prediction models, these symptoms were consistently influential: LR yielded elevated ORs for hospitalization associated with abdominal pain (OR \(\approx 8.42\), 95% CI: 7.18–9.91) and vomiting (OR \(\approx 2.94\), 95% CI: 2.38–3.39); see Table 10. RF variable-importance analysis likewise prioritized abdominal pain and low platelet count, while the Cox regression linked selected symptoms to differences in time-to-hospitalization; see Table 12. We observed time-varying effects: abdominal pain was associated with an increased risk of hospitalization over time (HR\((t)\approx 1.27\)–3.11 across time windows), whereas rash (HR\((t) \approx 0.64\)–0.99) was associated with a lower instantaneous hazard. Thus, the LR and Cox regression models address complementary questions—who is hospitalized (odds) and when hospitalization occurs (hazard)—whereas, for key symptoms, both indicate elevated risk with time-modulated magnitude.
Integrating the LR, RF, and Cox regression models provides complementary perspectives. The RF model captures non-linearities and high-order interactions among symptoms; the LR model yields adjusted ORs that clarify each predictor’s contribution; and the Cox regression models time-to-hospitalization via HRs, enabling assessment of temporal patterns and checks for non-proportional hazards. Together, these models support data-informed triage and resource allocation by identifying higher-risk patients and characterizing when hospitalization is most likely.
Model behavior can vary across settings because care-seeking patterns, symptom prevalence, surveillance completeness, and health-system capacity differ. Decision thresholds are typically aligned with local resource constraints, and decision-analytic quantities such as net benefit can be used to examine the trade-off between FN and FP. When variable codings or clinical workflows differ, local re-estimation of model parameters may be needed; for the Cox component, this includes re-estimating the baseline hazard and reassessing the proportional-hazards assumption.
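To make the net-benefit idea concrete, the Python sketch below (illustrative; not part of the authors' R analysis) evaluates the standard decision-curve formula, net benefit = TP/n − (FP/n) × p_t/(1 − p_t), where p_t is the probability threshold at which a patient would be flagged for hospitalization. The counts are made up and the function names are assumptions.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit at probability threshold p_t: TP/n - (FP/n) * p_t / (1 - p_t)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of flagging every patient, used as a reference strategy."""
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy counts (made up): 1000 patients, 600 truly hospitalized,
# 500 true positives and 100 false positives at the chosen threshold.
nb_model = net_benefit(tp=500, fp=100, n=1000, threshold=0.2)
nb_all = net_benefit_treat_all(prevalence=0.6, threshold=0.2)
print(round(nb_model, 3), round(nb_all, 3))
```

A model is useful at a given threshold only if its net benefit exceeds that of the flag-everyone and flag-no-one (net benefit 0) strategies; the threshold weights each FP by the odds p_t/(1 − p_t).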
Performance is commonly summarized along two dimensions. First, discrimination reflects how well the model separates outcomes and is often reported using the area under the receiver operating characteristic curve and the area under the precision–recall curve. Second, calibration reflects the agreement between predicted and observed risks and is often described by the calibration intercept (ideal \(\approx 0\)) and calibration slope (ideal \(\approx 1\)), together with the Brier score (mean squared error of predicted probabilities, where lower values indicate better overall accuracy). Because surveillance landscapes can evolve, prospective monitoring may help to identify dataset shift over time.
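Both dimensions can be computed with a few lines of code. The Python sketch below (illustrative; the study used R) implements the Brier score and the area under the ROC curve through its rank interpretation, namely the probability that a randomly chosen hospitalized patient receives a higher predicted risk than a randomly chosen non-hospitalized patient. The labels and probabilities are toy values, and the function names are introduced here.

```python
def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def auc_roc(y_true, p_hat):
    """AUC as P(score of a positive > score of a negative); ties count 1/2."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


# Toy outcomes (1 = hospitalized) and predicted probabilities, made up.
y = [1, 0, 1, 0, 1]
p = [0.9, 0.2, 0.6, 0.7, 0.8]
print(round(brier_score(y, p), 3), round(auc_roc(y, p), 3))
```

The pairwise AUC computation is quadratic in the sample size, which is acceptable for a sketch; rank-based formulas are preferable for large test sets.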
Our symptom ranking and time-to-hospitalization patterns are consistent with reports from international cohorts using ML and survival models, in which thrombocytopenia, abdominal pain, and vomiting emerge as warning signs and delays follow right-skewed distributions14,15,16,17,18,19. Although direct benchmarking of performance metrics is limited by heterogeneity in features and definitions, the qualitative concordance of symptomatology and delay structure supports the external plausibility of our findings.
Limitations of the study
Despite the insights gained from our analysis, some limitations should be acknowledged. Potential underreporting is a concern, particularly in border areas with mobile or migrant populations. Recording delays and recall errors may affect symptom-onset timing and other clinical fields, introducing uncertainty in progression or hospitalization times.
Missing data persist despite traditional data-cleaning procedures and could bias estimates if missingness is not at random. Selection bias is possible because analyses focus on confirmed cases within the surveillance system, potentially excluding patients who did not seek care or were misdiagnosed. The study window (January 2015–June 2019) bounds generalizability to that period. In addition, the exclusion of records with delays greater than 15 days, applied as a data-quality safeguard, may skew delay distributions if a non-negligible fraction of true late presenters exists. Methodologically, results may be sensitive to non-proportional hazards, variable-importance biases in the RF model, and the need for probability calibration in classifiers. These aspects were evaluated in the present analysis and should be re-checked in external deployments.
Conclusions
This study presented an in-depth analysis of DENV dynamics in NS, Colombia, examining how socio-environmental conditions, symptomatology, and delays contribute to variations in hospitalization. The analysis highlighted the relevance of regional factors—such as urbanization and settlement type—in shaping the distribution of cases and the probability and timing of hospitalization. Over the study period, the ORI subregion accumulated the largest number of cases, while prevalence leaders varied by year—the OCC subregion in 2015–2017 and the ORI subregion in 2018–2019—consistent with climatic and demographic conditions favorable to Aedes aegypti proliferation.
Across delay intervals, the NB model provided the lowest AIC and BIC (within discrete distributions of integer day counts), while the Weibull model outperformed the lognormal model (within continuous distributions). Visual diagnostics supported these results, consistent with overdispersed right-skewed delay distributions. On average, patients sought medical attention about four days after symptom onset, and severe DENV cases were often hospitalized shortly after consultation—frequently within a single day.
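The information criteria used for this comparison follow the usual definitions, \(\text{AIC} = 2k - 2\log L\) and \(\text{BIC} = k\log n - 2\log L\), where \(k\) is the number of parameters, \(n\) the sample size, and \(L\) the maximized likelihood. The sketch below illustrates the computation on hypothetical delay values. For brevity it contrasts a lognormal fit with an exponential fit, both of which have closed-form maximum likelihood estimates; the exponential model is a stand-in chosen for simplicity and was not among the models compared in the study, whose Weibull and NB fits require iterative estimation.

```python
import math

def lognormal_loglik(delays):
    """Maximized lognormal log-likelihood (closed-form MLE of mu, sigma^2)."""
    logs = [math.log(x) for x in delays]
    n = len(logs)
    mu = sum(logs) / n
    sigma2 = sum((v - mu) ** 2 for v in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * sigma2) - 0.5 * n

def exponential_loglik(delays):
    """Maximized exponential log-likelihood (rate = 1 / sample mean)."""
    n = len(delays)
    rate = n / sum(delays)
    return n * math.log(rate) - n

def aic(loglik, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik

# Hypothetical onset-to-consultation delays in days (not study data)
delays = [1, 2, 2, 3, 3, 4, 4, 4, 5, 6, 7, 9, 12]
n = len(delays)
fits = [("lognormal", lognormal_loglik(delays), 2),
        ("exponential", exponential_loglik(delays), 1)]
for name, ll, k in fits:
    print(f"{name}: AIC={aic(ll, k):.2f}, BIC={bic(ll, k, n):.2f}")
```

Lower values of either criterion indicate a better balance between fit and complexity, and BIC penalizes additional parameters more heavily than AIC once \(n\) exceeds about seven observations.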
By integrating the LR, RF, SVM, and Cox regression models, we developed a framework that captures both the probability and the timing of hospitalization. The LR and RF models identified symptoms such as abdominal pain and low platelet count as strong predictors of hospitalization risk, while the SVM model achieved competitive predictive performance (with lower interpretability). The time-varying Cox regression model also showed that these symptoms are associated with a higher instantaneous hazard (earlier hospitalization), with the magnitude modulated over time. In contrast, symptoms such as rash and (for substantial portions of the time course) retro-orbital pain and drowsiness were related to a lower short-term hazard, indicating less urgent progression. Despite offering valuable insights, our modeling has limitations. Reliance on reported cases may underestimate the true burden, particularly among migrant or underserved populations. The absence of finer-grained environmental predictors (for example, microclimate and household water storage) may limit predictive precision. Excluding records with very long delays (greater than \(15\) days) may also omit true late presenters, potentially skewing the tail behavior of delay distributions. Addressing these limitations in future work could improve accuracy and generalizability.
In addition, regional differences, such as those observed between NS and Antioquia or the influence of migratory inflows across the Venezuelan border, should be explicitly considered in future validation work, since they can shape both care-seeking behavior and hospitalization patterns.
Future research should examine co-circulation with other arboviruses, including chikungunya and Zika, given clinical overlap and planning implications. The delays were fitted using the lognormal and Weibull distributions68. However, several studies indicate that a close competitor to these two distributions, named the Birnbaum-Saunders model, has often provided better fits to lifetime data69,70,71. Therefore, the applicability of this distribution, as well as general multivariate regression frameworks, should be further examined72,73, especially in projection pursuit-based procedures for both dimension reduction and anomaly detection in robust analytical survival models74,75. Moreover, a bibliometric update of the literature is currently lacking, and its publication could improve the state of the art and inform additional lines of research76. Incorporating real-time environmental data (for example, rainfall, temperature, vector indices) and conducting external validation with local recalibration can enhance outbreak forecasting and clinical triage tools.
In conclusion, our findings support public health authorities by showing how data-driven strategies can optimize resource allocation and improve clinical outcomes in DENV-endemic regions. Refining and expanding the predictive models presented in this study—while validating them externally—can strengthen health-system preparedness for evolving arboviral risks.
References
Dhar-Chowdhury, P. et al. Dengue seroprevalence, seroconversion and risk factors in Dhaka, Bangladesh. PLoS Negl. Trop. Dis. 11, e0005475 (2017).
Mitra, A. K. & Mawson, A. R. Neglected tropical diseases: epidemiology and global burden. Trop. Med. Infect. Dis. 2, 36 (2017).
Ortiz, S. et al. Identification of hazard and socio-demographic patterns of dengue infections in a Colombian subtropical region from 2015 to 2020: Cox regression models and statistical analysis. Trop. Med. Infect. Dis. 8, 30 (2023).
Wiwanitkit, V. Dengue fever: diagnosis and treatment. Expert Rev. Anti Infect. Ther. 8, 841–845 (2010).
Wang, S. F. et al. Severe dengue fever outbreak in Taiwan. Am. J. Trop. Med. Hyg. 94, 193–197 (2016).
Priye, A. et al. A smartphone-based diagnostic platform for rapid detection of Zika, chikungunya, and dengue viruses. Sci. Rep. 7, 44778 (2017).
Simo, F. B. N. et al. Dengue virus infection in people residing in Africa: a systematic review and meta-analysis of prevalence studies. Sci. Rep. 9, 13626 (2019).
Selvarajoo, S. et al. Knowledge, attitude and practice on dengue prevention and dengue seroprevalence in a dengue hotspot in Malaysia: A cross-sectional study. Sci. Rep. 10, 9534 (2020).
Moallemi, S., Lloyd, A. R. & Rodrigo, C. Early biomarkers for prediction of severe manifestations of dengue fever: a systematic review and a meta-analysis. Sci. Rep. 13, 17485 (2023).
Barcellos, C., Matos, V., Lana, R. M. & Lowe, R. Climate change, thermal anomalies, and the recent progression of dengue in Brazil. Sci. Rep. 14, 5948 (2024).
Villar, L. A., Rojas, D. P., Besada-Lombana, S. & Sarti, E. Epidemiological trends of dengue disease in Colombia (2000–2011): A systematic review. PLoS Negl. Trop. Dis. 9, e0003499 (2015).
Stanaway, J. D. et al. The global burden of dengue: an analysis from the Global Burden of Disease Study 2013. Lancet Infect. Dis. 16, 712–723 (2016).
Nelson, W. B. Applied Life Data Analysis (Wiley, Hoboken, NJ, US, 2005).
Desjardins, M. R. et al. Knowledge, attitudes, and practices regarding dengue, chikungunya, and Zika in Cali, Colombia. Health Place 63, 102339 (2020).
Lima, E. C. B., Montarroyos, U. R., Magalhães, J. J. F., Dimech, G. S. & Lacerda, H. R. Survival analysis in non-congenital neurological disorders related to dengue, chikungunya and Zika virus infections in Northeast Brazil. Revista do Instituto de Medicina Tropical de São Paulo 62 (2020).
Qureshi, H. et al. Prevalence of dengue virus in Haripur district, Khyber Pakhtunkhwa, Pakistan. J. Infect. Public Health 16, 1131–1136 (2023).
Maneerattanasak, S. et al. Prevalence of dengue, Zika, and chikungunya virus infections among mosquitoes in Asia: A systematic review and meta-analysis. Int. J. Infect. Dis. 148, 107226 (2024).
Rodionov, I. V. On estimation of Weibull-tail and log-Weibull-tail distributions for modeling end-to-end delay. In Distributed Computer and Communication Networks 302–314 (Springer, New York, NY, US, 2019).
Velasco, H. et al. Modeling the risk of infectious diseases transmitted by Aedes aegypti using survival and aging statistical analysis with a case study in Colombia. Mathematics 9, 1488 (2021).
Martin-Barreiro, C. et al. A new algorithm for computing disjoint orthogonal components in the parallel factor analysis model with simulations and applications to real-world data. Mathematics 9, 2058 (2021).
Martin-Barreiro, C. et al. A new algorithm for computing disjoint orthogonal components in the three-way Tucker model. Mathematics 9, 203 (2021).
Daumerie, D., Peters, P. & Savioli, L. Working to overcome the global impact of neglected tropical diseases: first WHO report on neglected tropical diseases. World Health Organization Volume 1, (2010).
Engels, D. & Zhou, X. N. Neglected tropical diseases: an effective global response to local poverty-related disease priorities. Infect. Dis. Poverty 9, 17 (2020).
Rodriguez, R. C. et al. The burden of dengue and the financial cost to Colombia, 2010–2012. Am. J. Trop. Med. Hyg. 94, 1065–1072 (2016).
Castrillón, J. C., Castaño, J. C. & Urcuqui, S. Dengue in Colombia: Ten years of evolution (in Spanish). Rev. Chilena Infectol. 32, 142–149 (2015).
Lee, J. S. et al. A multi-country study of the economic burden of dengue fever: Vietnam, Thailand, and Colombia. PLoS Negl. Trop. Dis. 11, e0006037 (2017).
Lee, J. S. et al. A multi-country study of the economic burden of dengue fever based on patient-specific field surveys in Burkina Faso, Kenya, and Cambodia. PLoS Negl. Trop. Dis. 13, e0007164 (2019).
Turner, H. C. et al. An economic evaluation of Wolbachia deployments for dengue control in Vietnam. PLoS Negl. Trop. Dis. 17, e0011356 (2023).
Zimmermann, I. R., Fernandes, R. R. A., Costa, M. G. S., Pinto, M. & Peixoto, H. M. Simulation-based economic evaluation of the Wolbachia method in Brazil: A cost-effective strategy for dengue control. Lancet Region. Health Am. 35, (2024).
Marchant, C., Leiva, V., Cavieres, M. F. & Sanhueza, A. Air contaminant statistical distributions with application to PM10 in Santiago, Chile. Rev. Environ. Contam. Toxicol. 223, 1–31 (2013).
Mayer, S. V., Tesh, R. B. & Vasilakis, N. The emergence of arthropod-borne viral diseases: A global perspective on dengue, chikungunya and Zika fevers. Acta Trop. 166, 155–163 (2017).
Cardona-Ospina, J. A. et al. Dengue and COVID-19, overlapping epidemics? An analysis from Colombia. J. Med. Virol. 93, 522–527 (2020).
Oviedo-Pastrana, M., Méndez, N., Mattar, S., Arrieta, G. & Gomezcaceres, L. Epidemic outbreak of Chikungunya in two neighboring towns in the Colombian Caribbean: a survival analysis. Arch. Public Health 75, (2017).
Sasmono, R. T. & Santoso, M. S. Movement dynamics: reduced dengue cases during the COVID-19 pandemic. Lancet Infect. Dis. 22, 570–571 (2022).
Gheibi, Z., Boroomand, M. & Soltani, A. Comparing the trends of vector-borne diseases (VBDs) before and after the COVID-19 pandemic and their spatial distribution in Southern Iran. J. Trop. Med. 2023, 7697421 (2023).
Sardar, I., Akbar, M. A., Leiva, V., Alsanad, A. & Mishra, P. Machine learning and automatic ARIMA/Prophet models-based forecasting of COVID-19: Methodology, evaluation, and case study in SAARC countries. Stoch. Env. Res. Risk Assess. 37, 345–359 (2023).
Tejo, A. M., Hamasaki, D. T., Menezes, L. M. & Ho, Y. L. Severe dengue in the intensive care unit. J. Intensive Med. 4, 16–33 (2024).
Carras, M. et al. Associated risk factors of severe dengue in Reunion Island: A prospective cohort study. PLoS Negl. Trop. Dis. 17, e0011260 (2023).
GBD 2015 Mortality and Causes of Death Collaborators. Global, regional, and national life expectancy, all-cause mortality, and cause-specific mortality for 249 causes of death, 1980-2015: a systematic analysis for the Global Burden of Disease Study 2015. The Lancet 388, 1459–1544 (2016).
Hosmer, D. W. & Lemeshow, S. Applied Logistic Regression (Wiley, New York, NY, US, 2000).
Agresti, A. An Introduction to Categorical Data Analysis (Wiley, Hoboken, NJ, US, 2007).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Kocev, D., Vens, C., Struyf, J. & Dzeroski, S. Tree ensembles for predicting structured outputs. Pattern Recogn. 46, 817–833 (2013).
Nick, T. G. & Campbell, K. M. Logistic regression. In Methods in Molecular Biology 273–301 (Humana Press, Totowa, NJ, US, 2007).
Awad, M. & Khanna, R. Support vector machines for classification. In Efficient learning machines: Theories 39–66 (Apress, Berkeley, US, 2015).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Departamento Administrativo Nacional de Estadística (DANE). Censo Nacional de Población y Vivienda 2018. Departamento Administrativo Nacional de Estadística, Bogotá, Colombia, 2019. Available at: https://www.dane.gov.co/index.php/estadisticas-por-tema/demografia-y-poblacion/censo-nacional-de-poblacion-y-vivenda-2018. Accessed on 18 January 2025.
United Nations Development Programme. Norte de Santander: Challenges for sustainable development (in Spanish). United Nations Development Programme, 2019. Available at: https://www.undp.org/es/colombia/publications/norte-de-santander-retos-y-desafios-para-el-desarrollo-sostenible. Accessed on 18 January 2025.
Sectional Health Secretariat of Norte de Santander. Health situation analysis (in Spanish). Update 2021. Available online: https://ids.gov.co/2021/DIMENSIONES_SP/ASIS_NORTE_DE_SANTANDER_2021_MIN.pdf, 2021. Accessed on 18 January 2025.
National Institute of Health. Dengue Event Report (in Spanish), Colombia, 2018. Version 4, 2019. Available at: https://www.ins.gov.co/buscador-eventos/Informesdeevento/Dengue_2019.pdf. Accessed on 18 January 2025.
Government of the Norte de Santander. Development Plan for Norte de Santander 2020-2023: “Más oportunidades para todos” (in Spanish). Government of the Norte de Santander, 2020. Available online: https://ids.gov.co/2020/PLANES/PDD/PDD_NdS_2020-2023.pdf. Accessed on 18 January 2025.
Barros, M., Galea, M., Gonzalez, M. & Leiva, V. Influence diagnostics in the tobit censored response model. Stat. Methods Appl. 19, 379–397 (2010).
Aragao, G. M. F., Corradini, M. G., Normand, M. D. & Peleg, M. Evaluation of the Weibull and log normal distribution functions as survival models of Escherichia coli under isothermal and non-isothermal conditions. Int. J. Food Microbiol. 119, 243–257 (2007).
Hung, H. N., Lin, Y. B., Lu, M. K. & Peng, N. F. A statistical approach for deriving the short message transmission delay distributions. IEEE Trans. Wireless Commun. 3, 2345–2352 (2004).
Kotz, S., Leiva, V. & Sanhueza, A. Two new mixture models related to the inverse Gaussian distribution. Methodol. Comput. Appl. Probab. 12, 199–212 (2010).
Lawless, J. F. Statistical Models and Methods for Lifetime Data (Wiley, Hoboken, NJ, US, 2003).
Johnson, N. L., Kemp, A. W. & Kotz, S. Univariate Discrete Distributions (Wiley, New York, NY, US).
Korkmaz, M. C., Leiva, V. & Martin-Barreiro, C. The continuous Bernoulli distribution: Mathematical characterization, fractile regression, computational simulations, and applications. Fractal and Fractional 7, 386 (2023).
Cox, D. R. Regression models and life-tables. J. Royal Stat. Soc. B 34, 187–202 (1972).
Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model (Springer, New York, NY, US, 2000).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2022).
Peinado, S. A., Aliota, M. T., Blitvich, B. J. & Bartholomay, L. C. Biology and transmission dynamics of Aedes flavivirus. J. Med. Entomol. 59, 659–666 (2022).
Li, H. H. et al. Mechanical transmission of dengue virus by Aedes aegypti may influence disease transmission dynamics during outbreaks. EBioMedicine 94, 104668 (2023).
Robison, A., Young, M. C., Byas, A. D., Rückert, C. & Ebel, G. D. Comparison of chikungunya virus and Zika virus replication and transmission dynamics in Aedes aegypti mosquitoes. Am. J. Trop. Med. Hyg. 103, 869–875 (2020).
International Federation of Red Cross. Colombia: Dengue Outbreak Emergency Plan of Action (EPoA) Dref N° MDRCO016 - Colombia. 2019. Available at: https://reliefweb.int/report/colombia/colombia-dengue-outbreak-emergency-plan-action-epoa-dref-n-mdrco016. Accessed on 18 January 2025.
Jiménez-Silva, C. L. et al. Evolutionary history and spatio-temporal dynamics of dengue virus serotypes in an endemic region of Colombia. PLoS ONE 13, e0203090 (2018).
Ocazionez-Jiménez, R. E., Ortiz-Baez, A. S., Gomez-Rangel, S. Y. & Miranda-Esquivel, D. R. Virus dengue serotipo 1 (VDEN-1) de Colombia: su contribución a la ocurrencia del dengue en el departamento de Santander. Biomedica 33, 22–30 (2013).
Johnson, N. L., Kotz, S. & Balakrishnan, N. Continuous Univariate Distributions: Vol. 1 (Wiley, New York, NY, US, 1994).
Sanchez, L., Leiva, V., Galea, M. & Saulo, H. Birnbaum-Saunders quantile regression and its diagnostics with application to economic data. Appl. Stoch. Model. Bus. Ind. 37, 53–73 (2021).
Johnson, N. L., Kotz, S. & Balakrishnan, N. Continuous Univariate Distributions: Vol. 2 (Wiley, New York, NY, US, 1994).
Mazucheli, M., Leiva, V., Alves, B. & Menezes, A. F. B. A new quantile regression for modeling bounded data under a unit Birnbaum-Saunders distribution with applications in medicine and politics. Symmetry 13, 682 (2021).
Johnson, N. L., Kotz, S. & Balakrishnan, N. Continuous Multivariate Distributions: Models and Applications (Wiley, New York, NY, US).
Díaz-García, J. A., Galea, M. & Leiva, V. Influence diagnostics for elliptical multivariate linear regression models. Commun. Stat. Theory Methods 32, 625–641 (2003).
Ortiz, S. & Becerra, O. On a Stahel-Donoho estimator with skewness-based random projection directions. Chil. J. Stat. 15, 110–125 (2024).
Farcomeni, A. & Viviani, S. Robust estimation for the Cox regression model based on trimming. Biom. J. 53, 956–973 (2011).
Leiva, V., Castro, C., Vila, R. & Saulo, H. Unveiling patterns and trends in research on cumulative damage models for statistical and reliability analyses: Bibliometric and thematic explorations with data analytics. Chil. J. Stat. 15, 81–109 (2024).
Acknowledgements
The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions, which helped to improve the quality of this article.
Funding
This research has been partially supported by Ministerio de Ciencia, Tecnología e Innovación de Colombia, projects I) Formación de Capital Humano de Alto Nivel – Universidad EAFIT – Corte 2 Nacional, BPIN 2020000100778 (H.V.), II) Convocatoria 909-2 2022 (S.O. and A.C.-L.), III) 1421-918-91772 (S.O.) and Universidad EAFIT, grant number 954-000002 (A.C.-L.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the article.
Author information
Contributions
H.V.: Literature review, conceptualization, methodology, data analysis, writing—original draft; S.O.: Literature review, conceptualization, methodology, data analysis, writing—original draft; A. C.-L.: Data collection, data analysis, writing—original draft; C. C.: Methodology, data analysis, writing—review and editing; C. M.-B.: Project supervision, methodology, data analysis, writing—review and editing; V. L.: Project supervision, methodology, data analysis, writing—final review. All authors have read and approved the final article.
Ethics declarations
Data and code availability
De-identified data, preprocessing scripts, and analysis notebooks reproducing all main tables and figures are available at https://github.com/alexacl95/NorteSantanderDengue.
Use of AI tools declaration
The authors declare that they have not used artificial intelligence (AI) tools in the creation of this article.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Velasco, H., Ortiz, S., Catano-Lopez, A. et al. Integrating machine learning and time-to-event models to explain and predict risk of hospitalization due to dengue in Colombia. Sci Rep 15, 38847 (2025). https://doi.org/10.1038/s41598-025-22681-0