Introduction

Infectious diseases continue to pose a significant challenge to global health, causing millions of deaths annually and disrupting societies1. An outbreak is defined as an unexpected surge in disease cases that exceeds the expected level within a particular geographic area or demographic group2. Early and precise prediction of such outbreaks is essential for public health authorities to deploy effective control and prevention strategies, thereby mitigating impacts on public health and resources3. However, the complicated dynamics of infectious diseases, combined with limitations in available data, pose significant challenges to traditional prediction methods. Although outright prevention of outbreaks may not always be feasible, it is possible to prepare adequately for them. The development and deployment of intelligent predictive models are advancing rapidly, enhancing both individual healthcare and global disease management. Modern machine learning models facilitate the monitoring and forecasting of disease case numbers. Increasingly, these models support medical professionals in diagnosis and treatment and aid public health strategies designed to prevent disease spread globally4. A key step in mobilizing healthcare responses to impending outbreaks is visualizing and evaluating outbreak predictions5. Despite significant advancements in healthcare, epidemics remain capable of causing extensive morbidity and mortality. Machine learning (ML) has emerged as a powerful tool, capable of analyzing diverse datasets to identify complex relationships among the factors that influence disease dynamics. ML algorithms excel at identifying patterns and trends within complex datasets that may escape human detection. Nevertheless, obstacles such as data scarcity, task complexity, and the robustness of evaluation methodologies persist.
By continuously monitoring risk factors such as seasonal variations and travel patterns, ML enables the timely prediction of outbreaks, allowing for early interventions that allocate resources effectively and prevent minor outbreaks from intensifying into major epidemics. The analytic hierarchy process (AHP), a multi-criteria decision-making (MCDM) approach developed by Thomas L. Saaty, organizes factors into a hierarchical structure, proving invaluable for decision-makers navigating complex scenarios. This approach assists in defining priorities and assigning weights to various factors by establishing relevant criteria and constructing a pairwise comparison matrix to ensure consistency in decision making6. The AHP utilizes a fundamental scale ranging from 1 to 9 to assess the relative importance of two items, facilitating the calculation of each factor’s weight in the decision-making process7. The accuracy and potential biases within the decision-making process are verified through the consistency index (CI) and consistency ratio (CR); a CR, that is, the ratio of the CI to the random consistency index (RI), exceeding 0.1 indicates inconsistency8. Understanding the primary risk factors for infectious diseases is crucial for mitigating outbreaks. A comprehensive analysis of socioeconomic, demographic, geographic, climatic, behavioral, and health-related factors reveals their significant impact on disease patterns2. Variables such as temperature, humidity, seasonal changes, hygiene practices, population age distribution, and urban economic conditions play a critical role in influencing disease transmission and spread9. In real-world scenarios, such as infectious disease outbreak prediction, constructing models from scratch or collecting new training data is often exorbitantly costly or impractical. Thus, minimizing data collection efforts is essential.
Transfer learning, which utilizes knowledge from one domain to solve related issues in another, addresses this by transferring models, weights, or features from one disease context to another10,11. This study uses Zika, Chikungunya, and Dengue as case studies to demonstrate the approach, presenting an enhanced model for infectious disease outbreak prediction. By integrating the Analytical Hierarchy Process with Machine Learning, the model systematically identifies and prioritizes risk factors, considers their interconnections, and applies these insights across multiple diseases via transfer learning to enhance prediction accuracy and reliability.

Literature review

Recent studies that apply the analytic hierarchy process (AHP) in the context of disease or outbreak prediction include the following:

G Harsha et al. (2023) employ both the analytic hierarchy process (AHP) and the fuzzy-AHP (F-AHP) to determine dengue risk zones in Kerala. Their methodology incorporates geo-environmental and demographic variables, including normalized difference vegetation index (NDVI), land surface temperature (LST), topographic wetness index (TWI), land use land cover (LULC), elevation, normalized difference built-up index (NDBI), household density, and population density. The findings indicate that the F-AHP model outperforms the traditional AHP in predictive capability, achieving an AUC of 0.971. Additionally, the zones classified as extremely high and high risk account for approximately 82.87% of the reported dengue cases12.

E. Badillo–Rivera et al. (2020) investigate the social and environmental determinants influencing the transmission of the SARS-CoV-2 virus. Utilizing the Analytic Hierarchy Process (AHP) and Geographic Information System (GIS), they identify susceptible regions within Peru. The study determines a consistency ratio (CR) of 0.032, indicating reliable analytical outcomes. Results demonstrate that approximately 68% of Peruvian regions fall into ‘high’ or ‘very high’ risk categories for the spread of the SARS-CoV-2 virus13.

A Fariza et al. (2021) introduce a novel approach for assessing diphtheria susceptibility in East Java Province, Indonesia, by integrating the analytic hierarchy process (AHP) with the natural breaks classification system. This methodology classifies diphtheria vulnerability into three distinct tiers: low, medium, and high. Utilizing the combined AHP and natural breaks classification approach, the study reports an accuracy of 0.776.

R Mahato et al. (2020) employ the analytic hierarchy process (AHP) in conjunction with Geographic Information Systems (GIS) to identify potential COVID-19 risk zones across several Indian states. Their findings suggest an elevated risk of increased infection rates particularly in the central regions of India. Key factors contributing to this risk include the number of confirmed cases, the proportion of the population living below the poverty line, and the percentage of urban population7.

Sk Ajim Ali et al. (2019) apply the analytic hierarchy process (AHP) alongside geographic information systems (GIS) to analyze mosquito-borne disease distribution, incorporating various environmental factors as decision-making criteria. These factors include surface temperature, Normalized Difference Vegetation Index (NDVI), land cover, vegetation, land elevation, and slope. The study’s findings underscore a high degree of consistency in decision-making, with the consistency ratio (CR) remaining below 0.1, indicating robust reliability. Among the evaluated factors, water bodies are identified as presenting the highest risk for disease transmission8.

SA Ali et al. (2018) integrate environmental data to delineate dengue risk zones in Kolkata, India, using a two-stage approach involving geographic information systems (GIS) and the analytic hierarchy process (AHP). The first stage involves using GIS to analyze environmental factors, followed by the application of AHP in the second stage to assess the correlation between these environmental elements and dengue risk zones. The study concludes that increased building densities, population densities, and high concentrations of people in confined areas significantly contribute to the rise in dengue fever incidence14.

An analysis of prior research reveals that existing studies predominantly focus on predicting specific diseases using open-source libraries like scikit-learn for feature selection. However, these libraries offer predefined, general-purpose procedures that may not fully accommodate the unique attributes of diverse datasets, potentially leading to inefficient feature selection and data transformation. This study seeks to overcome these challenges by implementing the analytic hierarchy process (AHP) for more precise feature selection, thus aiming to bridge the identified gaps and enhance the accuracy of disease prediction models. Moreover, the scarcity of labeled data presents a significant obstacle in predictive modeling for disease epidemics. To address this, the researchers’ approach incorporates transfer learning, utilizing models pre-trained on large, related datasets to make effective use of existing data and improve performance on targeted tasks.

Research methodology

Mosquito bites are a major route for transmitting diseases that can lead to epidemics, particularly in tropical and subtropical regions15. Accordingly, this paper introduces a methodological model that integrates the Analytic Hierarchy Process (AHP) for feature prioritization with advanced ensemble machine-learning techniques to anticipate potential outbreaks of epidemic diseases. The researchers specifically focus on Dengue, Zika, and Chikungunya, which are among the fastest-growing viral diseases globally, transmitted by female Aedes mosquitoes15. The proposed model consists of six layers: the data source layer, preprocessing layer, feature engineering layer, data splitting layer, modelling layer, and evaluation layer, as illustrated in Figure 1. To identify common risk factors associated with infectious disease outbreaks, this study employs a predefined search technique across major online databases including PubMed/Medline, Scopus, and CINAHL, using search terms such as outbreak*, epidemic*, pandemic*, emerging disease*, and re-emerged disease*.

Fig. 1 Proposed model for epidemic prediction using AHP and ensemble machine learning.

Data source layer (data acquisition)

Various datasets are utilized to implement the proposed model. Initially, climate datasets covering 2007 to 2017 are collected from the NASA Langley Research Center, DANE (National Administrative Department of Statistics of Colombia), and SIVIGILA (National Public Health Surveillance System, Colombian National Institute of Health).

These datasets encompass both climate and socioeconomic data, detailed as follows:

Climatic data: this dataset provides extensive information on climatic variables for each municipality, including average temperature (tavg), minimum temperature (tmin), maximum temperature (tmax), average humidity (havg), wind speeds (maximum (wsmax), minimum (wsmin), and average (wsavg)).

Socioeconomic data: this includes sociodemographic indicators critical for public health analysis, such as illiteracy, low educational achievement, multidimensional poverty index (mpi), child labor, school absence, informal work, school lag, population, lack of health insurance, and dependency rate.

Upon integration of these data, the final dataset comprises 1716 entries, focusing on the diseases dengue, chikungunya, and zika, each distinguished by 27 unique features.

Data pre-processing layer

This layer encompasses exploratory data analysis (EDA), which addresses initial errors, missing values, and inconsistencies in the dataset. The EDA process transforms the data into a suitable format for feature engineering and predictive modeling, ensuring that the dataset is cleaned and standardized for subsequent analysis stages.

Data cleaning

During this phase, the researchers address error detection, including the identification of negative case counts and misclassified disease types. Additionally, outliers are removed by calculating the interquartile range (IQR = Q3 − Q1) for each variable. Values are treated as outliers if they fall below Q1 − 1.5 × IQR or exceed Q3 + 1.5 × IQR.
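As a minimal illustration of the IQR rule described above, the following sketch filters a single variable. The function name `remove_iqr_outliers`, the sample values, and the choice of the inclusive quartile convention are assumptions made for illustration; the study does not state which quartile convention was used.

```python
from statistics import quantiles


def remove_iqr_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # The quartile convention is an assumption; other conventions shift the bounds slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical weekly case counts containing one extreme value.
cases = [10, 12, 11, 13, 12, 11, 95]
cleaned = remove_iqr_outliers(cases)
```

In this toy series the quartiles are Q1 = 11 and Q3 = 12.5, so the accepted range is 8.75 to 14.75 and only the value 95 is dropped.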

Handling missing values

To preserve the integrity of the dataset for analysis, it is essential to address the missing values using appropriate correction or imputation techniques. For instance, the missing values for the average temperature (Tavg) feature are replaced with the mean temperature calculated from the available temperature records.

Data transformation

In this phase, Min–Max normalization is employed to scale continuous data, including population metrics, temperature, and precipitation, to a uniform range between 0 and 1. Additionally, one-hot encoding is applied to convert categorical variables, such as “Municipality,” into a numerical format represented by binary indicators.
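The imputation and transformation steps described above can be sketched in a few lines. This is an illustrative sketch only: the helper names, the toy temperature values, and the municipality labels are hypothetical and are not taken from the study's dataset.

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode categorical labels as binary indicator rows."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical average temperatures with one missing record.
tavg = [24.0, None, 28.0, 26.0]
imputed = mean_impute(tavg)
scaled = min_max_scale(imputed)

# Hypothetical municipality labels.
categories, encoded = one_hot(["Cali", "Medellin", "Cali"])
```

Note that imputing before scaling keeps the filled value inside the observed range, so it maps to a point within [0, 1] like any other record.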

Feature engineering layer

Feature Engineering is a critical step in developing predictive models. It involves selecting the most relevant features and constructing new variables to enhance both model accuracy and interpretability. This process encompasses feature selection and feature extraction, both of which are vital for effectively preparing the data for modeling.

Feature selection

Feature selection involves identifying and selecting the most pertinent characteristics for forecasting outbreaks16. In this phase, a semi-automated process is employed to optimize feature selection. The study’s predictive model integrates the systematic consistency of the analytic hierarchy process (AHP) with expert domain knowledge to calculate weights for all diseases collectively. This approach utilizes the combined insights from various diseases to identify the most influential features, ensuring their relevance across different contexts. The steps of the AHP model are as follows17:

Step 1: Determine the goal of the AHP model.

Identify the most influential risk factors of infectious disease outbreaks.

Step 2: Determine criteria/sub-criteria.

Identify criteria such as climate factors, population demographics, and socioeconomic elements. These criteria are further decomposed into various components that influence the spread of disease.

Step 3: Expert input and pairwise comparisons:

Expert manual input

Expert manual input serves as the foundation for the data used in the AHP model. Domain experts evaluate potential features based on their knowledge and familiarity with the epidemiological factors influencing disease outbreaks. They assign initial weights reflecting the significance of each feature’s impact on disease transmission. Subsequently, these manual input weights are normalized using the following equation:

$$\text{Normalized Weight} = \frac{\text{Given Weight}}{\text{Sum of all Given Weights}}$$

Conducting pairwise comparisons

In this step, pairwise comparisons are employed to derive standardized, objective weights for each feature based on the manually assigned weights. Utilizing the AHP technique, pairwise comparisons are made among all criteria and sub-criteria. Each pair of factors is evaluated to determine which is more significant, specifically in the context of outbreak prediction18,19.

Step 4: Calculate weights and check consistency.

The weights are calculated so that each factor’s relative importance is accurately represented. A consistency check is then performed to validate the judgments, ensuring the reliability of the weight assignments.

Step 5: Conduct a consistency ratio (CR) evaluation.

The CR is evaluated to verify the reliability and consistency of expert inputs and the pairwise comparison process18.

Step 6: Extract the weighted priorities of the selected risk factors.
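The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the study's implementation: the expert scores and the 3 × 3 pairwise comparison matrix are hypothetical, the priorities are obtained with a simple power iteration for the principal eigenvector, and the random consistency index values are the commonly tabulated ones attributed to Saaty.

```python
# Commonly tabulated random consistency index (RI) values by matrix size.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def normalize_weights(given):
    """Normalized weight = given weight / sum of all given weights."""
    total = sum(given)
    return [g / total for g in given]


def ahp_priorities(matrix, iterations=100):
    """Approximate the principal eigenvector and lambda_max by power iteration."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [p / total for p in product]
    product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(product[i] / weights[i] for i in range(n)) / n
    return weights, lambda_max


def consistency_ratio(lambda_max, n):
    """CR = CI / RI, with CI = (lambda_max - n) / (n - 1)."""
    if n <= 2:  # matrices of size 1 or 2 are always consistent
        return 0.0
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


# Hypothetical expert scores for three criteria.
normalized = normalize_weights([4, 3, 1])

# Hypothetical pairwise judgments on the 1 to 9 scale
# (rows and columns: climate, demographic, socioeconomic).
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
weights, lambda_max = ahp_priorities(pairwise)
cr = consistency_ratio(lambda_max, len(pairwise))
```

For this hypothetical matrix the CR is well below the 0.1 limit, so the judgments would be accepted; a CR above 0.1 would normally send the pairwise comparisons back to the experts for revision.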

figure a

In the final AHP model, a formula is developed for calculating risk factor weights. This formula includes multiple factors, each multiplied by a weight derived from the correlation coefficient matrix. To accurately assess each factor’s significance, the AHP weights are calculated by integrating expert opinions with the correlation coefficient matrix. This approach enhances the model’s ability to rank epidemiological risk factors effectively, thereby improving its capacity to predict potential disease outbreaks. Table 1 below presents the weight of each risk factor, ordered from the highest rank.

Table 1 Risk factors ranks.

Based on the rankings of risk factors shown in Fig. 2, it is concluded that the highest-ranked factors, namely barriers to health services, dependency rate, and lack of health insurance, play a critical role in disease outbreaks. Other notable factors, such as barriers to childhood services, precipitation, and informal employment, reflect important socioeconomic and environmental influences. Conversely, lower-ranked factors like population, temperature (Tavg, Tmax, Tmin), and inadequate excreta disposal have a comparatively lesser impact on predicting outbreaks. This ranking aids in prioritizing the most influential factors, thereby enhancing efforts in disease prevention and control.

Fig. 2 Distributions of AHP weights and ranks for risk factors.

Feature extraction

Feature extraction involves transforming raw data into a format more suitable for modeling, a crucial step in enhancing the dataset’s predictive power for disease outbreaks. In this study, the dataset initially lacked classification and a defined target variable. To address this issue, the 75th percentile method is used to establish an outbreak threshold, effectively distinguishing between normal conditions and outbreak situations and creating a target variable from disease incidence data. This statistical approach analyzes historical disease occurrence rates to identify a critical value that indicates an unusually high incidence of disease. Setting the threshold at the 75th percentile ensures that only the top 25% of data points, representing unusually high incidences, are classified as outbreaks. To locate the 75th percentile in the ordered dataset, the researchers use the standard percentile rank formula:

$$P = \frac{n + 1}{{100}} \times 75$$

where:

P = Position (rank) of the 75th percentile within the ordered observations.

n = Total number of observations.

After establishing the threshold, the final outbreak target is calculated, a binary variable indicating the presence or absence of an outbreak. If disease incidence exceeds the defined threshold, the outbreak target is set to 1 (indicating an outbreak); otherwise, it is set to 0 (indicating no outbreak).
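The thresholding and labeling described above can be sketched as follows. The incidence values are hypothetical, and the linear interpolation used when the rank is not a whole number is an assumption, since the text gives only the rank formula.

```python
def percentile_threshold(values, pct=75):
    """Return the value at rank P = (n + 1) * pct / 100 in the sorted data."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) * pct / 100  # 1-based rank P
    if position <= 1:
        return float(data[0])
    if position >= n:
        return float(data[-1])
    lower = int(position)           # whole part of the rank
    fraction = position - lower     # fractional part, used for interpolation
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


def outbreak_labels(values, threshold):
    """Label 1 (outbreak) when incidence exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical incidence records for one disease.
incidence = [2, 5, 3, 40, 8, 1, 55, 4, 6, 7, 9]
threshold = percentile_threshold(incidence)
labels = outbreak_labels(incidence, threshold)
```

With 11 observations the rank is P = 12 × 0.75 = 9, so the threshold is the ninth smallest value (9) and only the records with incidence 40 and 55 are labeled as outbreaks.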

Data splitting layer

In this phase, the dataset is split into 80% for training and 20% for testing to assess the model’s predictive accuracy on unseen data. This approach allows the researchers to build and validate a model initially trained on Dengue data, which is then evaluated with data from Zika and Chikungunya.

Modeling layer

After feature selection and data preprocessing, transfer learning is applied to use Dengue data as a base model for forecasting Zika and Chikungunya outbreaks. The model is developed by using Random Forest, XGBoost, and Gradient Boosting algorithms. Additionally, an ensemble technique is employed to combine predictions from these individual models, enhancing overall predictive performance. The algorithm below details the outbreak prediction process using transfer learning. Note that the dataset is imbalanced due to unequal distribution of target classes.

figure b
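The 80/20 split and the combination of base learners can be sketched with scikit-learn. This is a sketch under several assumptions: the data are synthetic, soft voting is used as the combination rule because the exact ensemble configuration is not given at this point in the text, and XGBoost is left out to keep the example dependency-light (an `xgboost.XGBClassifier` could be added as a third entry in `estimators`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the outbreak dataset (about 25% positives).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=42,
)

# 80/20 split; stratifying preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
```

A stacking ensemble (for example scikit-learn's `StackingClassifier`) would replace the averaging step with a meta-learner trained on the base models' outputs.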

To generalize the researchers’ predictive model for Chikungunya and Zika outbreaks across different geographical regions and potentially other diseases, transfer learning techniques are employed. Transfer learning is particularly valuable when the training dataset is insufficient for developing highly accurate models, as is the case in this study.

The model is first trained on a comprehensive dataset of Dengue outbreaks; Dengue shares its transmission vector (Aedes mosquitoes) and its influencing socioeconomic and environmental factors with Chikungunya and Zika. This pre-training phase enables the model to learn essential patterns and relationships from the extensive Dengue data, capturing critical features such as climatic conditions, socioeconomic indicators, and historical disease incidence rates.

After pre-training the model, the learned features, patterns, and insights derived from the Dengue data are extracted. These features serve as a foundational knowledge base, enabling the model to transfer valuable insights to the target tasks of predicting Chikungunya and Zika outbreaks.

In the fine-tuning phase, the extracted features and parameters from the pre-trained Dengue model are used to initialize the models for predicting Chikungunya and Zika outbreaks. This process involves adjusting the final layers to fit the new tasks and retraining them with the target data. Fine-tuning enables the model to refine its understanding, focusing on the unique characteristics of Chikungunya and Zika while preserving the generalized knowledge acquired from Dengue.
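For tree-based learners, one possible way to approximate the pre-train and fine-tune sequence described above is scikit-learn's `warm_start` option, which keeps the trees already fitted on the source data and adds further trees fitted on the target data. The sketch below is an assumption about how such a transfer could be realized, not the study's procedure; the data are synthetic and the `make_synthetic` helper is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)


def make_synthetic(n_samples, shift):
    """Synthetic stand-in for a disease dataset; `shift` mimics a domain difference."""
    X = rng.normal(size=(n_samples, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + shift * X[:, 2] > 0.7).astype(int)
    return X, y


X_source, y_source = make_synthetic(600, shift=0.0)  # plays the role of Dengue
X_target, y_target = make_synthetic(150, shift=0.3)  # plays the role of Zika or Chikungunya

# Pre-training: fit 100 boosting stages on the larger source dataset.
model = GradientBoostingClassifier(n_estimators=100, warm_start=True, random_state=0)
model.fit(X_source, y_source)

# Fine-tuning: keep the 100 source trees and add 50 trees fitted on target data.
model.set_params(n_estimators=150)
model.fit(X_target, y_target)

# In-sample accuracy on the target data, shown only as a sanity check.
target_accuracy = model.score(X_target, y_target)
```

The number of added stages controls how strongly the target data reshape the source model; in practice it would be chosen on a held-out target set rather than fixed in advance.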

Modeling evaluation layer

The modeling evaluation layer is a crucial step in the prediction pipeline. At this layer, the performance of each model is evaluated using metrics commonly used in outbreak prediction tasks, namely accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC), to select the model that provides the highest classification accuracy. Apart from AUC, these metrics are calculated from the entries of the confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). They are defined mathematically as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here, a true positive (TP) is a positive sample that is correctly classified (a correctly predicted outbreak), and a true negative (TN) is a negative sample that is correctly classified (a correctly predicted non-outbreak). A false positive (FP) is a negative sample mistakenly classified as positive (a non-outbreak classified as an outbreak), and a false negative (FN) is a positive sample mistakenly classified as negative (an outbreak classified as a non-outbreak). The AUC value evaluates overall performance across all classification thresholds, indicating the model’s ability to distinguish between classes; a higher AUC indicates better model performance. After all models are evaluated, the final model is selected as the one that best combines precision, recall, AUC, and F1 score, along with the highest classification accuracy.
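The confusion-matrix metrics can be computed directly from the four counts, using the standard harmonic-mean definition of the F1-score. The counts below are hypothetical and are not taken from the study's confusion matrices.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for a test set of 100 samples.
metrics = classification_metrics(tp=20, fp=10, tn=65, fn=5)
```

With these counts, precision is 20/30, recall is 20/25, and accuracy is 85/100. On an imbalanced test set, accuracy can stay high even when recall is low, which is why the metrics are read together.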

Results

The study uses four predictive models to predict disease outbreaks: Random Forest, XGBoost, ensemble technique, and Gradient Boosting. The study results are summarized in the tables below according to each disease.

Chikungunya outbreak prediction

Table 2 provides an overview of the performance evaluation of different models for predicting outbreaks of Chikungunya.

Table 2 Chikungunya performance evaluation.

The Ensemble model achieves the highest accuracy (93.31%) and precision (0.57), indicating reliable predictions and a balanced performance across metrics. Although the Gradient Boosting model achieves slightly lower accuracy (92.44%) than the Ensemble, it remains highly reliable, with the highest recall (0.77) and F1 score (0.78), making it a strong competitor for detecting true positive cases. Despite a slightly lower precision (0.51), it exhibits the best AUC (0.86), indicating excellent discrimination ability. The Random Forest model also shows high recall (0.63) but has a lower precision (0.5), resulting in a higher rate of false positives. The XGBoost model exhibits good overall performance with balanced metrics (accuracy 92.44%, precision 0.52, recall 0.51, and F1 score 0.57); however, it does not outperform the other models on any specific metric. Confusion matrices are created for each model used in this investigation to give a visual depiction of prediction performance. These matrices show the number of true positives, true negatives, false positives, and false negatives that each model predicts, allowing for a better understanding of each model’s strengths and weaknesses in detecting Chikungunya outbreaks. The following figures illustrate the confusion matrices for Random Forest, XGBoost, Gradient Boosting, and the Ensemble model, offering insights into how each model distinguishes between outbreak and non-outbreak cases. Figure 3 shows that the Random Forest model correctly identifies 300 instances of non-outbreaks but misclassifies 17 actual outbreaks as non-outbreaks. This explains its lower precision, with more false positives affecting its accuracy and overall performance.

Fig. 3 Random forest confusion matrix – Chikungunya.

The XGBoost model, as shown in Fig. 4, correctly predicts 301 non-outbreaks but misclassifies 16 actual outbreaks, leading to slightly better precision than Random Forest, although it still struggles with recall.

Fig. 4 XGBoost confusion matrix – Chikungunya.

As shown in Fig. 5, the GradientBoosting model successfully identifies 21 outbreak cases while maintaining a balance between precision and recall. Its higher recall indicates superior performance in detecting actual outbreaks compared to the other models.

Fig. 5 Gradient Boosting confusion matrix – Chikungunya.

Figure 6 shows that the Ensemble model delivers the best overall performance with fewer misclassifications. By combining the strengths of the other models, it achieves the highest accuracy and maintains a balanced trade-off between precision and recall.

Fig. 6 Ensemble confusion matrix – Chikungunya.

Zika outbreak prediction

Table 3 provides an overview of the performance evaluation of different models for predicting outbreaks of Zika.

Table 3 Zika performance evaluation.

The Random Forest model demonstrates strong performance in predicting Zika outbreaks, with an accuracy of 96.51% and a precision of 0.6667. However, the model also shows a relatively low recall of 0.4, which indicates that it misses a number of actual outbreak cases. This is evident in Fig. 7, where the confusion matrix reveals 14 false positives and 3 false negatives.

Fig. 7 Random forest confusion matrix – Zika.

The XGBoost model demonstrates balanced performance, achieving an accuracy of 95.93%, a precision of 0.5217, and a recall of 0.8667. This balance is evident in Fig. 8, where the confusion matrix shows 10 false positives and 2 false negatives, highlighting the model’s effectiveness in maintaining accuracy while minimizing misclassifications.

Fig. 8 XGBoost confusion matrix – Zika.

In contrast, the Gradient Boosting model has a slightly lower accuracy of 95.93% but maintains high recall (0.8) and an F1 score of 0.6316. The confusion matrix in Fig. 9 demonstrates the model’s trade-off between false positives and false negatives, with 12 false positives and 3 false negatives.

Fig. 9 Gradient Boosting confusion matrix – Zika.

The Ensemble model, which combines the strengths of the individual models, delivers the most robust performance, achieving the highest accuracy of 96.80%, a precision of 0.7, and a recall of 0.4667. Figure 10 illustrates the confusion matrix for the Ensemble model, showing an effective balance of false positives (9) and false negatives (2), confirming its superior predictive power and reliability.

Fig. 10 Ensemble confusion matrix – Zika.

Discussion

Vector-borne diseases, transmitted by blood-feeding insects such as mosquitoes, ticks, and fleas, pose a significant global health threat. Diseases like Dengue, Chikungunya, and Zika cause substantial morbidity and mortality, particularly in tropical and subtropical regions. Early prediction and detection of outbreaks are crucial for implementing timely public health interventions.

In this study, the researchers apply advanced machine learning techniques to predict Chikungunya and Zika outbreaks using a combination of transfer learning and the analytical hierarchy process (AHP) for feature selection. Initially, Random Forest, XGBoost, and Gradient Boosting are used, followed by an ensemble learning model to achieve the most accurate results. Transfer learning is leveraged to enhance performance by transferring knowledge from one task (Dengue prediction) to a related task (Chikungunya and Zika prediction).

To improve the model’s overall performance and adaptability, the ensemble approach combines the predictive strengths of multiple models to determine the most appropriate for Zika and Chikungunya. The ensemble technique also addresses the misclassification of positive and negative instances, mitigating overfitting and improving robustness and accuracy. The integration of AHP for feature selection further enhances the model’s ability to rank influential risk factors, boosting predictive performance by weighing and assessing their significance.

In transfer learning, pre-trained models are retrained with new datasets, combining prior knowledge with new data. This significantly enhances the model’s generalizability and robustness in predicting Zika and Chikungunya using information from the Dengue dataset. The class imbalance in the dataset is addressed using synthetic minority over-sampling technique (SMOTE), reducing bias towards the majority class and improving overall predictive accuracy by effectively classifying both majority and minority classes.

The performance analysis confirms that the ensemble model is the most effective model for forecasting infectious disease outbreaks in this study.

Chikungunya

As shown in Fig. 11, the Random Forest model for predicting Chikungunya outbreaks achieves an accuracy of 92.15%, with a precision of 51.61% and a recall of 62.96%. While this model performs solidly, its precision indicates a moderate rate of false positives. The Gradient Boosting model, depicted in the same figure, achieves slightly higher accuracy (92.44%) and excels in recall (77.78%), demonstrating strength in identifying true positives, with a high F1 score of 62.69%. XGBoost reaches a similar accuracy (91.86%) with a precision of 48.71%, showing balanced performance across all metrics. The Stacking Ensemble model, which combines the strengths of these individual models, provides the highest overall performance with an accuracy of 93.31%, a precision of 57%, and a balanced recall of 63%. This illustrates how the ensemble approach effectively integrates the strengths of different models, enhancing prediction accuracy and robustness. Figure 13 displays the ROC curves, confirming that the Gradient Boosting model achieves the best area under the curve (AUC = 0.86) for Chikungunya outbreak prediction, highlighting its superior discrimination between outbreak and non-outbreak cases.

Fig. 11 Chikungunya model comparison.

Zika

For predicting Zika outbreaks, as illustrated in Fig. 12, the XGBoost model excels with the highest accuracy among the individual models (96.51%) and an AUC of 0.9181, indicating strong discriminative ability between outbreak and non-outbreak cases. The Random Forest model, while slightly trailing in accuracy (96.22%), maintains a commendable AUC of 0.8848, demonstrating its robustness. Although the Gradient Boosting model has slightly lower accuracy (95.93%), it achieves high recall (80%), indicating its effectiveness in identifying actual outbreak cases. However, the Stacking Ensemble model, shown in Fig. 13, outperforms the individual models with the highest accuracy (96.80%) and balanced metrics (Precision = 70.27%, Recall = 86.67%), illustrating the effectiveness of the ensemble approach in consistently achieving high performance across various metrics for Zika predictions. These results, underscored by the ROC curves in Fig. 13, highlight the critical role of ensemble techniques in robust predictions, particularly for Chikungunya and Zika outbreaks.

Fig. 12

Zika model comparison.

Fig. 13

ROC curves.
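The AUC values quoted for the ROC curves can be read as the probability that a randomly chosen outbreak case receives a higher predicted score than a randomly chosen non-outbreak case. The sketch below computes AUC directly from that definition; the function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 values (1 = outbreak); scores: predicted scores.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): two non-outbreak and two outbreak cases.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for an illustration; a rank-based formulation would scale better on large test sets.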

To validate the differences in model performance, especially between the ensemble model and XGBoost, whose results differ only slightly, a statistical analysis was conducted using the Friedman test, a non-parametric method commonly applied to compare three or more paired groups across multiple conditions, focusing on key evaluation metrics to assess the ensemble model against the other methods20. The test ranks the models’ performance on each metric (accuracy, precision, recall, and F1 score) within each disease (Chikungunya and Zika). For both diseases the p-values were below 0.05, indicating significant differences among the models’ performances across the metrics: for Chikungunya, the Friedman test statistic was 9.77 (p-value = 0.0206), and for Zika it was 12.00 (p-value = 0.0074). The ensemble model consistently received higher rankings than XGBoost and the other models on accuracy and the other criteria, which supports its robustness and reliability. In particular, it demonstrated consistent improvements in accuracy and precision compared to XGBoost, with the statistical tests confirming these differences to be significant.
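As an illustration of the ranking procedure described above, the following Python sketch computes a Friedman statistic from a table of scores and converts it to a p-value. It assumes four models compared over four metrics, so the chi-square reference distribution has three degrees of freedom. The function names and the example score table are hypothetical, and no tie-correction factor is applied. Under these assumptions, the closed-form tail probability gives p-values that match those quoted in the text for the reported statistics.

```python
import math


def friedman_statistic(scores):
    """Friedman chi-square statistic without tie correction.

    scores[i][j] is the score of model j on metric i (higher is better).
    Models are ranked within each metric; tied scores share the average rank.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
            for t in range(i, j + 1):
                rank_sums[order[t]] += avg_rank
            i = j + 1
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical score table: 4 metrics (rows) x 4 models (columns), with the
# models ordered identically on every metric. These are NOT the study's values.
example_scores = [
    [0.90, 0.91, 0.92, 0.93],
    [0.48, 0.51, 0.52, 0.57],
    [0.60, 0.61, 0.62, 0.63],
    [0.53, 0.55, 0.57, 0.60],
]
stat = friedman_statistic(example_scores)
print(f"statistic = {stat:.2f}, p = {chi2_sf_df3(stat):.4f}")

# p-values implied by the statistics quoted in the text, assuming df = 3.
for reported_stat in (9.77, 12.00):
    print(f"statistic = {reported_stat:.2f}, p = {chi2_sf_df3(reported_stat):.4f}")
```

With only four metrics per disease, the chi-square approximation is coarse, so p-values obtained this way are best treated as indicative.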

Conclusion

In this study, the potential of machine learning models in predicting outbreaks of vector-borne diseases such as Chikungunya and Zika is demonstrated. Using historical disease incidence data, the analytic hierarchy process (AHP) is applied for feature selection and transfer learning is employed to enhance model performance. The analysis involves Random Forest, XGBoost, Gradient Boosting, and ensemble models, comparing their effectiveness in accurately forecasting outbreaks. The results indicate that the ensemble model achieves the highest accuracy for Zika outbreaks, with an accuracy of 96.80% and an AUC of 0.9197. This high level of performance underscores the model’s capability to discern between outbreak and non-outbreak cases effectively. For Chikungunya, the Gradient Boosting model demonstrates the highest recall (77.78%) and a strong F1 score (62.69%), although it has a slightly lower accuracy (92.44%) than the ensemble approach. For the same disease, the ensemble model provides the best balance between precision (57.0%) and recall (63.0%), and together with its Zika results this demonstrates its reliability in outbreak prediction across both diseases.

Studies that concentrate on particular diseases often have a more limited scope, which restricts the applicability of their findings to a wider population or different health conditions. Future research should therefore explore the application of these models to other diseases and regions to ensure robustness and scalability. One of the study’s key contributions is showing that the proposed model can learn general features and patterns related to vector-borne diseases, thereby improving predictive accuracy and robustness. In addition, the integration of the analytic hierarchy process (AHP) in outbreak prediction enhances the feature selection process by incorporating expert opinions, which are often more insightful than automated methods. By leveraging AHP, the study concentrates on critical risk factors that might be overlooked by the automated feature selection offered in open-source libraries, leading to more robust predictions. The results show that the highest-ranked factors include barriers to health services, dependency rate, lack of health insurance, barriers to childhood services, precipitation, informal work, school absence, and elevation. This approach builds confidence and trust among public health stakeholders while also improving the overall quality of the predictions. Finally, employing AHP in feature selection helps improve disease outbreak preparedness and response tactics.

While many studies rely solely on clinical profiles and often overlook environmental factors, this research incorporates both environmental and socio-economic variables. Future work will aim to further refine these models by integrating additional data sources and advancing deep learning techniques. It is also proposed to enhance the model by including clinical patient profiles, such as medical histories.

Limitations

This study faced certain constraints, primarily related to data availability, with limited labeled datasets for Chikungunya and Zika. The AHP feature selection method successfully identified significant risk factors associated with disease outbreak prediction. Nevertheless, the study lacked comprehensive clinical data at the patient level, which could have added new information and enhanced the predictive power of the model. Additionally, while AHP proved valuable for prioritizing features, its broader application in public health for outbreak prediction remains an area open for future exploration and validation across diverse infectious diseases. Furthermore, the ensemble approach in this study utilized a limited number of base models (Random Forest, Gradient Boosting, and XGBoost). While these models are robust and widely used, incorporating additional or more diverse models, such as neural networks or other advanced techniques, could potentially improve the ensemble’s performance and generalizability. Finally, clinical patient profiles, lifestyle factors, and the presence of chronic and non-communicable diseases were not included in the study. Because these elements may have a substantial impact on outbreak forecasts, they will be examined in future studies to improve the model’s comprehensiveness and predictive ability.