Introduction

Lumpy skin disease virus (LSDV) poses a serious threat to cattle production, causing both acute and subacute illness in cattle and water buffalo. All breeds are susceptible, with lactating cows and calves at higher risk1. Monitoring risk factors such as deworming methods, vaccination, grazing patterns, and the use of disinfectants and fly repellents can aid in determining their impact on LSD risk2. Additionally, factors including breed, age, season, water supply, feeding methods, importation of breeding stock, and exposure to other species such as birds and insects play important roles in the prevalence of LSD3. From an economic standpoint, LSD represents a serious threat to cattle-dependent economies, notably in Asia and Africa. The disease reduces dairy productivity, and outbreaks result in significant losses due to abortions, weight loss, and reduced fertility. The World Organization for Animal Health has classified LSD as a notifiable disease, requiring timely reporting4. In many countries, vaccination is the primary means of controlling and preventing LSD5,6. However, effective preventive measures are still limited. Restricting the movement of sick cattle, implementing quarantine, and culling infected animals are strongly advised7.

Early and precise detection is critical for effective epidemic management and mitigation. This can be achieved by integrating advances in computer vision and artificial intelligence8. Modelling of LSD risk contributes significantly to addressing challenges in LSD epidemiology and control, particularly in the areas of risk factors, disease transmission, diagnosis and forecasting, and intervention techniques9. Machine learning (ML) techniques such as Artificial Neural Networks (ANN), Decision Trees (DTs), and Random Forest (RF) can considerably improve the accuracy of LSD prediction based on geographical and climate features. These tools could help build targeted monitoring and awareness initiatives, as well as preventive measures like vaccination campaigns, in areas prone to LSDV infection10.

However, class imbalance, which occurs when the majority class greatly outnumbers the minority class, poses a significant barrier to ML prediction accuracy11. This issue is particularly prevalent in veterinary medicine, where rare outcomes such as mortality are significantly underrepresented, further impairing a model’s ability to learn and predict these minority classes accurately12. The impact of class imbalance is more severe in multi-class classification than in binary classification, and the challenges of multi-class imbalanced classification have therefore attracted growing attention in recent years13. Consequently, building powerful ML algorithms with high accuracy requires careful attention to class imbalance, which degrades data quality14. A widely adopted strategy to address class imbalance is resampling, which aims to balance the dataset either by reducing the majority class (undersampling) or expanding the minority class (oversampling).
Undersampling removes instances from the majority class, thereby improving computational efficiency but potentially causing the loss of valuable information and introducing bias, especially in complex datasets15. Common methods include random undersampling and Tomek Links, which eliminates overlapping majority samples. In contrast, oversampling increases minority class representation by duplicating existing instances, preserving data integrity but risking overfitting15,16. To mitigate this, Chawla et al.17 introduced the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples to enhance generalization. Choosing the appropriate resampling method depends on multiple factors, including the dataset’s structure, size, and characteristics, as well as the research objectives. As highlighted in18, resampling effectiveness is influenced not just by imbalance ratios but also by the intrinsic nature of the data, emphasizing the need for context-specific strategies.

Ensemble learning is considered one of the most effective strategies for addressing class imbalance in machine learning tasks19. By aggregating predictions from multiple models, it reduces forecasting errors and improves accuracy20. Consequently, it has been regarded as one of the most effective ML methods21. Bagging and boosting are two powerful ensemble learning methods that improve prediction accuracy by combining multiple models. Bagging, or bootstrap aggregation, involves training several models on randomly selected subsets of the training data and aggregating their outputs through majority voting or averaging. In contrast, boosting trains models sequentially, with each model focusing on correcting the errors of its predecessor by assigning greater weight to misclassified instances. By leveraging the strengths of multiple learners, both bagging and boosting enhance predictive performance and model robustness22.
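A minimal sketch of the two random resampling strategies discussed above (the study itself was carried out in R; this illustration uses Python and NumPy, with class ratios chosen to mimic an imbalanced three-class problem):

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate minority-class rows until every class matches the majority count."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        extra = rng.choice(c_idx, size=n_max - len(c_idx), replace=True)
        idx.append(np.concatenate([c_idx, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

def random_undersample(X, y, rng):
    """Drop majority-class rows until every class matches the minority count."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
                          for c in classes])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
y = np.array([0] * 62 + [1] * 32 + [2] * 6)   # mimics 0.62 / 0.32 / 0.06 ratios
X = rng.normal(size=(len(y), 4))

X_ros, y_ros = random_oversample(X, y, rng)
X_rus, y_rus = random_undersample(X, y, rng)
print(np.bincount(y_ros))   # [62 62 62]
print(np.bincount(y_rus))   # [6 6 6]
```

SMOTE differs from plain oversampling in that it interpolates new synthetic points between minority-class neighbors rather than duplicating existing rows.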

In the veterinary field, ensemble ML models have emerged as powerful tools for improving predictive accuracy and robustness, particularly in veterinary epidemiology, where complex and imbalanced datasets are common. Table 1 summarizes recent studies that have applied ensemble techniques for predicting livestock disease, highlighting the models used, methods for handling data imbalance, and key findings of those studies. However, a clear research gap remains: most studies have focused on binary classification problems and have not systematically evaluated ensemble models in multiclass imbalance scenarios, especially for LSD. To the best of our knowledge, no prior study has comprehensively compared the performance of various ensemble models, including Decision Tree (DT), Random Forest (RF), Adaptive Boosting (AdaBoost), Gradient Boosting (GBoost), and eXtreme Gradient Boosting (XGBoost), specifically for multiclass predictions of LSD. Moreover, the impact of widely used resampling techniques such as the Synthetic Minority Over-sampling Technique (SMOTE), Random Oversampling (ROS), and Random Undersampling (RUS), within these ensemble frameworks, has not been fully explored. This study aims to fill this gap by conducting a thorough comparative analysis of these ensemble models, combined with different resampling strategies, on a real-world multiclass imbalanced LSD dataset. Specifically, it seeks to answer the following research questions:

Table 1 Recent applications of ensemble machine learning models in veterinary disease prediction with a focus on class imbalance handling.
  1. Are there significant differences in predictive performance between bagging and boosting algorithms?

  2. Does addressing data imbalance improve predictive performance, and which resampling technique is most effective?

  3. Can hyperparameter tuning enhance model performance even when data remains imbalanced?

The key contributions of this study are summarized as follows:

  • This study provides a comprehensive comparative evaluation of five ensemble learning algorithms (DT, RF, AdaBoost, GBoost, and XGBoost) for the prediction of LSD on a real-world, multiclass-imbalanced dataset.

  • This study systematically investigates the impact of three distinct resampling techniques (SMOTE, ROS, and RUS) on the performance of these models in addressing class imbalance for LSD prediction.

  • This study identifies the RF algorithm combined with ROS (RF-ROS) as the most effective approach for predicting LSD under the studied conditions, particularly for the critical minority “Dead” class.

  • This study offers insights into the effectiveness of hyperparameter tuning in improving the performance of ensemble models on both imbalanced and resampled LSD datasets.

  • The study emphasizes the importance of translating ML results into interpretable insights for practical use in real-world veterinary settings. The application of SHAP analysis proved effective, revealing that vaccination status is the most significant predictor of LSD risk.

Materials and methods

Source of the dataset

This study included data from a total of 1041 cows across 6 governorates, collected between June 2020 and October 2022. The animals were sourced as follows:

  • Field Outbreaks: Cattle from 31 herds were included if the herds experienced suspected lumpy skin disease (LSD) outbreaks during the study period. Herds were identified through notifications from local veterinary authorities and active surveillance programs. All animals within these herds underwent a clinical examination.

  • Veterinary Clinic Admissions: An additional 275 cases were included from cattle admitted to the Zagazig University Veterinary Clinic in Sharkia governorate, Egypt. These admissions were either referrals from field veterinarians or direct presentations by owners for suspected LSD.

Sampling approach

A census sampling approach was used. In each affected herd and clinic admission group, all available animals were examined and included based on clinical presentation and laboratory confirmation. No random or systematic sampling was applied.

Inclusion criteria

  • Cattle were included if they belonged to herds with at least one animal showing clinical signs consistent with LSD (e.g., skin nodules, fever, lymphadenopathy) during the outbreak period.

  • For clinic cases, only animals presenting with clinical suspicion of LSD were considered.

  • Both field and clinic cases were further classified based on clinical outcome at the time of data collection:

    • Dead: Animals that died as a direct result of LSD, confirmed by clinical history and, where possible, post-mortem findings.

    • Diseased: Animals showing clinical signs of LSD but surviving at the time of data collection.

    • Healthy: Animals from the same herds or clinic admissions that showed no clinical signs of LSD during the outbreak period and tested negative by PCR.

Case confirmation

All suspected LSD cases (both dead and diseased) were confirmed by a combination of clinical diagnosis and laboratory testing. Skin nodule biopsies and nasal swabs were collected and tested for LSDV DNA using PCR at the Virology Department of the Animal Health Research Institute, Dokki, Giza, following established protocols as described in a previous study27.

Ethical compliance and consent

All methods were conducted in accordance with the relevant guidelines and regulations, including those of the Zagazig University Animal Care and Use Committee (Permit No. ZU-IACUC/2/F/114/2022) and the ARRIVE guidelines. All procedures involving animals were explained to and approved by the cattle owners, and informed consent was obtained prior to data collection.

Feature engineering and data preprocessing

Data on demographic and management variables (breed, sex, age, season, feeding/watering system, introduction of new cattle, vaccination status) were collected for each animal using standardized questionnaires and farm records. The data were accessed through a data-sharing agreement with the study author. Both the laboratory analytical output and the necessary questionnaire response data were recorded, coded, and filtered in Microsoft Excel before being imported into R. We utilized the R programming language along with the following packages for data processing and model development: tidyverse28, readxl29, randomForest30, caret31, xgboost32, adabag33, and gbm34.

The clinical cases (categorized as Healthy, Diseased, or Dead) were used as the multiclass target variable. We used both univariable and multivariable multinomial logistic regression to identify the key factors influencing lumpy skin disease (LSD). Variables with a P-value < 0.05 were considered statistically significant and retained as important predictors. The data revealed severe class imbalance, with class proportions of 0.06 (Dead), 0.32 (Diseased), and 0.62 (Healthy). The predictor features are presented in Table 2. Because the data are categorical, we preprocessed them with one-hot encoding: OneHotEncoder converts the categorical features into binary form, producing a sparse [0,1] matrix that was then fed into the models. The dataset was then split into two subsets: 80% for training and 20% for testing the model’s predictive performance. To address the class imbalance in the training data, we applied three resampling techniques: Synthetic Minority Over-sampling Technique (SMOTE), Random Over-sampling (ROS), and Random Under-sampling (RUS).
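The encoding and splitting steps can be sketched as follows (the study used R; this Python sketch uses scikit-learn, and the category values shown are illustrative placeholders, not the actual Table 2 levels):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# Illustrative categorical rows: breed, season, vaccination status.
rows = np.array([["Native", "Summer", "Vaccinated"],
                 ["Mixed",  "Winter", "Unvaccinated"],
                 ["Native", "Winter", "Unvaccinated"],
                 ["Mixed",  "Summer", "Vaccinated"]])
y = np.array(["Healthy", "Diseased", "Dead", "Healthy"])

# OneHotEncoder returns a sparse 0/1 matrix by default, as in the text.
enc = OneHotEncoder()
X = enc.fit_transform(rows)
print(X.shape)   # (4, 6): one binary column per category level

# 80/20 train/test split of the encoded matrix.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
```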

Table 2 Distribution of LSD clinical outcomes across predictor categories.

Hyperparameter tuning procedure

To tune the classification algorithms, a customized grid search was used, and the sets of hyperparameter values were evaluated using 10-fold cross-validation (10-fold CV) repeated 5 times. The ranges of the hyperparameter values and their justification are presented in Table 3. After obtaining the optimal hyperparameter values, each classification model was trained and tested, and the accuracy, precision, recall, F1 score, and ROC-AUC were extracted.
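A sketch of this tuning procedure (the study used R; this illustration uses scikit-learn on a toy dataset, with a shrunken 3-fold × 2-repeat scheme for speed and placeholder grid values rather than the actual Table 3 grids):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Toy three-class dataset standing in for the LSD data.
X, y = make_classification(n_samples=120, n_classes=3, n_informative=5,
                           random_state=0)

# Repeated stratified k-fold CV scores each hyperparameter combination.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100],
                                "max_depth": [3, None]},
                    cv=cv, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_)
```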

Table 3 Selected hyperparameter values and tuning justification based on repeated 10-fold cross-validation.

Ensemble learning algorithms

Five ML algorithms, including DT, RF, AdaBoost, GBoost, and XGBoost, were trained to predict the clinical case of lumpy skin disease. Their performance was evaluated using metrics derived from the confusion matrix to determine the best model. Each model was assessed both with default parameters and after hyperparameter tuning, and evaluated before and after balancing the training set.

Decision tree

Classification and Regression Tree (CART) is a non-parametric tree-structured recursive partitioning method that hierarchically organizes the most influential variables to predict a response. This method works by recursively partitioning the data based on predictor-response relationships, forming a tree-like structure of decision rules. The root node initiates the process, followed by internal nodes representing further splits, and leaf nodes representing final classifications. The algorithm iteratively seeks optimal splits to maximize predictive accuracy35.

In our study, we applied the DT to predict a multiclass LSD status response variable \(Y_{(\text{lumpy skin disease case})}\) on the basis of p risk predictors: \(X_{(\text{age})}\), \(X_{(\text{sex})}\), \(X_{(\text{season})}\), \(X_{(\text{breed})}\), \(X_{(\text{vaccination status})}\), \(X_{(\text{grazing system})}\), \(X_{(\text{introduction of new cattle})}\), observed on a learning sample of N units.

While growing, the CART algorithm performs binary recursive partitioning of the N data instances into increasingly homogeneous subsets (nodes). At each internal node t, all possible splits \(s \in S\) across the covariates are evaluated, and the best split is chosen to maximize the reduction in impurity:

$$\Delta I(s,t)=i(t)-P(t_L)\,i(t_L)-P(t_R)\,i(t_R)$$

Where:

i(t) is the impurity measure at node t; \(t_L\) and \(t_R\) are the resulting left and right child nodes; and \(P(t_L)\), \(P(t_R)\) are the proportions of observations falling into \(t_L\) and \(t_R\), respectively.

The CART algorithm uses the Gini impurity index to select the best split variable. For a dataset D with m categories, the impurity is measured by the Gini index as:

$$Gini(D)=1-\sum_{i=1}^{m}P_i^{2}$$

where \(P_i\) is the probability that a record in D belongs to class \(C_i\), estimated by \(\frac{|C_{i,D}|}{|D|}\)36. The sum is computed over the m classes.
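A worked check of the two formulas above, computing a node's Gini impurity and the impurity reduction ΔI(s, t) for a candidate split (Python sketch with illustrative class labels):

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_i P_i^2 over the class proportions in the node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent, left, right):
    """Impurity reduction: i(t) - P(t_L) * i(t_L) - P(t_R) * i(t_R)."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = ["H"] * 6 + ["D"] * 4            # 6 Healthy, 4 Diseased
left, right = ["H"] * 6, ["D"] * 4        # a perfect split into pure children
print(round(gini(parent), 2))             # 0.48
print(round(split_gain(parent, left, right), 2))  # 0.48: impurity drops to zero
```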

The recursive partitioning process continues until no further meaningful splits can be made. To avoid overfitting, the fully grown tree is pruned using a cost-complexity criterion (CP):

$$C_{\alpha}(T)=C(T)+\alpha\,|\tilde{T}|$$

where C(T) represents the overall misclassification error of the tree, aggregated from the individual misclassification errors c(t) at each node, \(|\tilde{T}|\) is the number of terminal nodes, and α ≥ 0 is a penalty parameter controlling tree complexity. This pruning helps select the most predictive and generalizable subtree, often based on cross-validation performance36.

Random forest

RF is widely recognized as one of the most effective ensemble methods, largely due to its simplicity and high predictive performance37. It employs bootstrap aggregation (bagging) to combine multiple decision trees, enhancing the overall predictive performance. The feature with the lowest Gini index is selected as the optimal feature for data splitting:

$$Gini\ index(x)=1-\sum_{i=1}^{n}x_i^{2}$$

Notably, RF handles missing data well, efficiently manages imbalanced datasets to reduce errors, and aids in determining variable importance for classification. The algorithm produces its final prediction through majority voting among the trees, and its strength stems from this collective voting of the trees within the forest. RF is also well suited to high-dimensional datasets with many features: it reduces variance by averaging deep decision trees built from different subsets of the training data. While this strategy may introduce some bias and reduce interpretability, it often yields a significant improvement in model performance38. Despite being accurate, RF is often considered a black-box model due to its limited interpretability, as the ensemble of deep trees makes it difficult to isolate individual variable effects. In this study, RF was implemented using the “randomForest” package (version 4.6–12). In RF, each base learner (i.e., decision tree) has access to a random subset of feature vectors39, defined as follows:

$$x=(x_1,x_2,\dots,x_p),$$

where p is the dimensionality of the feature vector available to the base learner. The main goal is to find a prediction function f(x) that predicts the target Y by minimizing the expected value of a loss function:

$$L\left(Y,f(x)\right),$$

Here, L is known as the loss function, and the goal is to minimize the expected value of the loss. For classification applications, zero-one loss is a common choice. The function is defined as follows:

$$L\left(Y,f(x)\right)=I\left(Y\neq f(x)\right)=\begin{cases}0, & \text{if } Y=f(x),\\ 1, & \text{otherwise}\end{cases}$$

To create an ensemble, a set of base learners comes together. The base learners are defined as follows:

$$h_1(x),\,h_2(x),\,\dots,\,h_J(x),$$

For classification applications, the voting will be based on the following equation:

$$f(x)=\underset{y}{\operatorname{argmax}}\sum_{j=1}^{J}I\left(y=h_j(x)\right)$$

The fundamental RF algorithm steps are summarized as:

figure a

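The bagging-and-voting mechanism described above can be sketched as follows (Python/scikit-learn illustration on the standard iris data, not the LSD dataset; a hand-rolled loop is used to make the bootstrap and voting steps explicit):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Grow each tree on a bootstrap sample, considering a random feature
# subset (sqrt of p) at each split, as in RF.
trees = []
for _ in range(25):
    boot = rng.integers(0, len(X), size=len(X))     # bootstrap indices
    trees.append(DecisionTreeClassifier(max_features="sqrt", random_state=0)
                 .fit(X[boot], y[boot]))

# Majority vote: f(x) = argmax_y sum_j I(y = h_j(x)).
votes = np.stack([t.predict(X) for t in trees])      # (n_trees, n_samples)
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((pred == y).mean())   # training accuracy of the ensemble
```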

Adaptive boosting (AdaBoost)

AdaBoost, the first boosting implementation, is a valuable boosting algorithm that uses shallow decision trees as base classifiers. It iteratively reweights training data to focus on previously misclassified samples, improving the model without compromising earlier classifiers. This method creates accurate, flexible models in a short amount of time40.

The original AdaBoost algorithm was initially designed for binary classification problems, where the base classifiers predict the probability of a target class. In this method, the weight of each instance is adjusted in proportion to its probability of being correctly predicted and inversely proportional to the error of the classifier. In addition, the contribution of each classifier to the final prediction for a new example is weighted by its accuracy during the training phase. Along with this method, a multi-class variant, called AdaBoost.M1, was proposed in41. Algorithm 1 shows the pseudo-code of AdaBoost.M1. In this version, only the weights of the correctly predicted instances are decreased, as shown in Line 8. This decrease is still related to the error made by the base learner (Line 5). The predictions of each classifier are still weighted by their accuracy, as seen in Line 6 and Line 15.

figure b

AdaBoost.M1 algorithm
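The weight-update step described above (decreasing only the correctly classified instances by β = err/(1 − err), then renormalizing) can be checked numerically; this is an illustrative Python sketch of a single boosting round, not the paper's implementation:

```python
import numpy as np

w = np.full(5, 0.2)                       # uniform initial weights
correct = np.array([True, True, True, True, False])

err = w[~correct].sum()                   # weighted error = 0.2
beta = err / (1 - err)                    # 0.25
w[correct] *= beta                        # shrink only correct instances
w /= w.sum()                              # renormalize to a distribution
print(np.round(w, 3))   # the misclassified instance now carries half the mass
```

After one round, the next base learner is forced to concentrate on the hard (misclassified) instance, which is the core boosting idea.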

Gradient boosting machine (GBM)

Gradient Boosting Machine (GBM) is a powerful ensemble learning technique that builds models in a stage-wise and additive manner. Each stage of the algorithm fits a new base learner to the residual errors of the combined ensemble learned so far. Conceptually, the process can be interpreted as performing steepest descent optimization with respect to a specified differentiable loss function.

One of the key strengths of GBM is its flexibility, as it can be applied to both regression and classification tasks with any loss function that is differentiable. In classification problems, GBM typically fits an additive logistic regression model, where the loss function is often the negative binomial log-likelihood for binary classification or the multinomial deviance for multi-class classification37.

The general form of the GBM additive model is:

$$F_M(x)=\sum_{m=1}^{M}\gamma_m h_m(x)$$

where \(h_m(x)\) is the mth weak learner (i.e., a decision tree) whose contribution is controlled by the learning rate \(\gamma_m\). The model is built iteratively in a forward stage-wise fashion:

$$F_m(x)=F_{m-1}(x)+\gamma_m h_m(x)$$

At each iteration, the weak learner \(h_m(x)\) is chosen so that the loss function L is minimized. To achieve this, the model becomes:

$$F_m(x)=F_{m-1}(x)+\underset{h}{\operatorname{argmin}}\sum_{i=1}^{n}L\left(y_i,\,F_{m-1}(x_i)+h(x_i)\right)$$

The improvement of minimization is guided by the negative gradient of the loss function with respect to the current prediction function \(\:{F}_{m-1}\).

$$F_m(x)=F_{m-1}(x)-\gamma_m\sum_{i=1}^{n}\nabla_F L\left(y_i,\,F_{m-1}(x_i)\right)$$

The optimal step length \(\gamma_m\) is determined by the following equation:

$$\gamma_m=\underset{\gamma}{\operatorname{argmin}}\sum_{i=1}^{n}L\left(y_i,\,F_{m-1}(x_i)-\gamma\frac{\partial L\left(y_i,F_{m-1}(x_i)\right)}{\partial F_{m-1}(x_i)}\right)$$

This procedure is generally applicable to both regression and classification tasks; the only difference lies in the choice of the loss function42.
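For squared loss, the negative gradient is simply the residual, so the stage-wise procedure above reduces to repeatedly fitting a small tree to the current residuals. A toy Python sketch (scikit-learn stumps, with an assumed fixed learning rate γ = 0.1 rather than the optimal step length):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

F = np.zeros_like(y)                      # F_0 = 0
gamma = 0.1
for _ in range(100):
    residual = y - F                      # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += gamma * stump.predict(X)         # F_m = F_{m-1} + gamma * h_m

print(round(float(np.mean((y - F) ** 2)), 4))   # training MSE shrinks toward 0
```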

For multi-class problems, GBM approximates an additive function \(F_l(x)\) for each class l, guided by the following loss function:

$$\mathcal{L}\left\{y_{il},F_l(x)\right\}_{1}^{L}=-\sum_{l=1}^{L}y_{il}\log p_l(x)$$

where L is the number of classes, \(y_{il}\) takes the value 1 when sample i belongs to class l and 0 otherwise, and \(p_l(x)\) is the probability of x for class l. This probability is estimated as follows:

$$p_l(x)=\frac{e^{F_l(x)}}{\sum_{j=1}^{L}e^{F_j(x)}}$$
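The softmax mapping above from per-class scores \(F_l(x)\) to probabilities \(p_l(x)\) can be checked directly (Python sketch; the scores are arbitrary illustrative values):

```python
import numpy as np

def class_probs(scores):
    """Softmax: p_l(x) = exp(F_l(x)) / sum_j exp(F_j(x))."""
    e = np.exp(scores - scores.max())     # shift for numerical stability
    return e / e.sum()

p = class_probs(np.array([2.0, 0.5, -1.0]))
print(np.round(p, 3))
print(p.sum())   # probabilities sum to 1
```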

Extreme gradient boosting (XGBoost)

Extreme Gradient Boosting (XGBoost) is an advanced, optimized implementation of gradient boosting algorithms, particularly designed for performance and scalability43. While traditional Gradient Boosting Decision Trees (GBDT) rely on the first-order derivative (gradient) of the loss function, XGBoost leverages both the first and second-order derivatives by performing a second-order Taylor expansion of the loss function. This allows for more precise and efficient model optimization.

Each tree in XGBoost is trained on the residuals from the previous iteration, with the goal of progressively minimizing the overall prediction error. Trees are still added sequentially, as in classical GBDT, but XGBoost parallelizes the split-finding computation within each tree, enabling significant computational efficiency. XGBoost works as follows: for a given dataset with n examples and m features, defined as D = {(xi, yi)} where |D| = n, \(x_i\in{\mathbb{R}}^{m}\), \(y_i\in\mathbb{R}\), the tree ensemble model predicts the output using the sum of K additive functions:

$$\hat{y}_i=\Phi(x_i)=\sum_{k=1}^{K}f_k(x_i),\quad f_k\in\mathcal{F}$$

Here, \(\mathcal{F}=\left\{f(x)=w_{q(x)}\right\}\), where \(q:{\mathbb{R}}^{m}\to T\) represents the structure of each tree, T is the number of leaves, and \(w\in{\mathbb{R}}^{T}\) holds the scores on the leaves. Each \(f_k\) denotes an independent Classification and Regression Tree (CART), and the final prediction is obtained by summing the scores from the corresponding leaves. To learn these functions, XGBoost minimizes a regularized objective function:

$$\mathcal{L}(\phi)=\sum_{i}l\left(\hat{y}_i,y_i\right)+\sum_{k}\Omega\left(f_k\right)$$

Here, l is a differentiable convex loss function, and the regularization term is defined as:

$$\Omega(f)=\gamma T+\frac{1}{2}\lambda\left\|w\right\|^{2}$$

This regularization helps control model complexity, encouraging simpler trees and reducing overfitting. At each iteration t, a new function ft is added to improve the current model, and the objective becomes:

$$\mathcal{L}^{(t)}=\sum_{i=1}^{n}l\left(y_i,\,\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega\left(f_t\right)$$

where \(\hat{y}_i^{(t-1)}\) is the prediction for instance i at iteration t−1, and \(l\left(y_i,\hat{y}_i^{(t-1)}\right)\) is the training loss function44.
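To illustrate why the second-order expansion matters: for a single leaf, the standard XGBoost derivation (from the original XGBoost paper, not shown in the text above) gives the closed-form optimal weight \(w^{*}=-G/(H+\lambda)\), where G and H sum the per-instance gradients and Hessians in that leaf. A numeric Python sketch with squared loss and illustrative values:

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])             # targets falling in one leaf
pred = np.array([0.3, 0.3, 0.3])          # current predictions F^(t-1)

g = pred - y                              # gradient of squared loss
h = np.ones_like(y)                       # Hessian of squared loss is 1
lam = 1.0                                 # L2 regularization weight lambda

w_star = -g.sum() / (h.sum() + lam)       # optimal leaf weight w* = -G / (H + lambda)
print(round(w_star, 3))                   # 0.275: the leaf's additive update
```

The λ term in the denominator shrinks the leaf weight, which is how the regularization above damps overly aggressive updates.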

Evaluation metrics

A confusion matrix was constructed to evaluate the performance of the multiclass classification algorithms. From the classification outcomes, several performance metrics were calculated, including accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC). In the confusion matrix, correctly classified instances are recorded as true positives (TP) and true negatives (TN). A false positive (FP) occurs when a negative instance is incorrectly classified as positive, while a false negative (FN) occurs when a positive instance is incorrectly classified as negative. The efficiency of the classifier is evaluated and calculated using the following formulas:

$$Precision=\frac{TP}{TP+FP}$$
$$Recall=\frac{TP}{TP+FN}$$

The formula of the F-measure, also known as the F1 score, is defined as:

$$F\text{-}measure=2\times\frac{precision\times recall}{precision+recall}$$

Finally, the total classification accuracy is calculated using the following formula:

$$Accuracy=\frac{TP+TN}{TP+FN+TN+FP}\times 100\%$$
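A worked example of the four metrics above for a single binary confusion matrix with assumed counts TP = 40, FP = 10, FN = 5, TN = 45:

```python
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)                       # 40 / 50 = 0.8
recall = TP / (TP + FN)                          # 40 / 45 ≈ 0.889
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + FN + TN)       # 85 / 100 = 0.85

print(round(precision, 3), round(recall, 3), round(f1, 3), accuracy)
```

For the multiclass setting used in this study, these quantities are computed per class (one-vs-rest) and then averaged.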

Results

In this study, the effect of class imbalance on classification performance was investigated using multiclass imbalanced LSD data, along with an evaluation of the effectiveness of various resampling techniques in addressing this issue.

Comparing the performance of the models under the default condition and after tuning

The comparison between default and tuned conditions highlights the significant impact of hyperparameter tuning on model performance. Table 4 shows that, with default settings, training accuracy ranges from 72.5% to 84.86%, with XGBoost achieving the highest accuracy (84.86%). However, test accuracies are lower due to overfitting concerns. All models show high recall for predicting healthy cases, but recall for the “dead” class is low, with RF achieving the highest recall at 0.35 on the test set. Notably, RF consistently outperformed other models, achieving 83.65% test accuracy, perfect precision for the “dead” class (1.00), and the highest AUC value (0.92). Conversely, DT and GBoost show inconsistent precision, including undefined values (NA) for the “dead” class, and GBoost struggles with both sensitivity and precision, highlighting its difficulty in handling class imbalance.

Table 4 Evaluation metrics of ensemble machine learning algorithms using imbalanced data under default and tuned hyperparameter settings.

After tuning the models with optimized hyperparameters, RF achieved the highest overall test accuracy (85.58%), while XGBoost demonstrated marked improvement across multiple metrics, particularly in predicting the minority “Dead” class. Although all models maintained high sensitivity for healthy cases after tuning, led by XGBoost at 0.94, the sensitivity for diseased cases varied from 0.59 (DT) to 0.89 (RF). Predicting the dead class remained challenging; AdaBoost performed the worst (test sensitivity of 0.25), while XGBoost improved to 0.62, outperforming RF (0.33) and other models. Both AdaBoost and XGBoost recorded the highest average AUCs (0.93), while DT exhibited the lowest AUC values. Overall, XGBoost stood out as the top-performing model, achieving high accuracy, recall, and F1-scores across classes. Nevertheless, RF proved to be the most balanced model, delivering the highest test accuracy and precision while maintaining robust performance across all metrics. Despite these improvements, detecting minority classes remains challenging, emphasizing the ongoing need for effective data balancing strategies to develop clinically reliable predictive models.
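The per-class sensitivities discussed above come from a multiclass confusion matrix. As a minimal sketch with hypothetical counts (chosen only to mirror the three LSD outcome classes, not the study's actual matrix), per-class recall can be computed as follows:

```python
# Rows = true class, columns = predicted class (hypothetical counts).
classes = ["Healthy", "Diseased", "Dead"]
cm = [
    [90,  8,  2],   # true Healthy
    [10, 70,  5],   # true Diseased
    [ 3,  5,  8],   # true Dead
]

def per_class_recall(cm):
    # Recall for class i = correct predictions on row i / row total
    return [row[i] / sum(row) for i, row in enumerate(cm)]

recalls = dict(zip(classes, per_class_recall(cm)))
# The minority "Dead" class has far fewer samples, so its recall is
# the most sensitive to class imbalance.
```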

Effect of resampling methods with and without tuning

To address class imbalance in the dataset, we evaluated three resampling techniques: Random Over-Sampling (ROS), Random Under-Sampling (RUS), and the Synthetic Minority Over-sampling Technique (SMOTE). Among them, ROS emerged as the most effective, consistently improving both training and test performance across models. Random Forest (RF) showed the most stable and balanced behavior regardless of the resampling strategy. In contrast, XGBoost, despite achieving the highest training accuracy, was prone to overfitting under RUS. SMOTE, although theoretically robust, introduced synthetic noise that led to inconsistent generalization (Fig. 1). Given its superior performance, ROS was used to develop the final models. As shown in Table 5, RF achieved the best overall results, with training and test accuracies of 87.75% and 80.29%, respectively, indicating strong generalizability. While XGBoost attained the highest training accuracy (88%), its test accuracy declined sharply to 72.28%, reinforcing the presence of overfitting. Decision Tree yielded the lowest performance, with training and test accuracies of 70.29% and 66.35%, respectively, reflecting limited generalization and sensitivity to noise. These results highlight that while data balancing can enhance performance, model responses vary: RF generalized well, but models like XGBoost and DT remained susceptible to overfitting or underfitting, underscoring the need for model-specific strategies such as hyperparameter tuning beyond resampling alone.
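ROS and RUS are simple enough to sketch directly; the following is a minimal pure-Python illustration of both (SMOTE is omitted here because it requires nearest-neighbour interpolation, for which libraries such as imbalanced-learn are typically used in practice):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """ROS: duplicate minority-class samples at random until every
    class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    Xb, yb = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, lbl in enumerate(y) if lbl == cls]
        for _ in range(target - n):
            i = rng.choice(idx)
            Xb.append(X[i])
            yb.append(cls)
    return Xb, yb

def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class equal in size to the
    minority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    Xb, yb = [], []
    for cls in counts:
        idx = [i for i, lbl in enumerate(y) if lbl == cls]
        for i in rng.sample(idx, target):
            Xb.append(X[i])
            yb.append(cls)
    return Xb, yb

# Toy imbalanced data: 6 "healthy" vs. 2 "dead" samples
X = [[i] for i in range(8)]
y = ["healthy"] * 6 + ["dead"] * 2
Xo, yo = random_oversample(X, y)   # both classes -> 6 samples each
Xu, yu = random_undersample(X, y)  # both classes -> 2 samples each
```

Note that ROS only duplicates existing minority samples, which is why it can inflate training accuracy, while RUS discards majority-class information, which is consistent with the weaker generalization observed under RUS above.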

Fig. 1
figure 1

Comparison of the three resampling methods under different conditions, with and without hyperparameter tuning.

Table 5 Evaluation metrics of ensemble machine learning algorithms using ROS data under default and tuned hyperparameter settings.

When combined with resampling, hyperparameter tuning substantially improved model performance. Notably, RF and XGBoost achieved the highest test accuracies (82% and 81.25%) while maintaining strong training accuracy (88.8%). Although AdaBoost and GBoost showed moderate gains post-tuning, they still lagged behind, suggesting limitations in adapting to the dataset’s complexity. All models performed well in identifying healthy cases (recall > 0.78); however, only RF and XGBoost achieved high recall across all classes. RF notably reached a recall of 0.89 for the “Dead” class, emphasizing its superior capacity to detect minority outcomes. Both models also exhibited high precision (≥ 0.57) and strong AUCs (0.95–0.98 training; 0.93 test), reflecting robust class discrimination. In contrast, DT and AdaBoost struggled with the “Dead” class, showing low sensitivities (0.12 and 0.56, respectively) and reduced generalization. Poor precision in DT (0.44), GBoost (0.44), and AdaBoost (0.28), along with low F1-scores for DT (0.19) and AdaBoost (0.37), further indicate their difficulty in balancing precision and recall even after ROS. These findings suggest that although resampling and tuning enhance overall performance, certain models, particularly DT, AdaBoost, and GBoost, may still require further optimization through alternative resampling techniques or hybrid strategies to more effectively address severe class imbalance.

Significant impact of hyperparameter tuning in balanced and imbalanced scenarios

The effect of hyperparameter tuning on model performance was assessed using paired t-tests comparing accuracy and AUC before and after tuning across imbalanced and balanced datasets.

For the imbalanced dataset, accuracy significantly improved for DT (p = 0.003), RF (p = 0.018), GBoost (p < 0.001), and XGBoost (p < 0.001), while AdaBoost showed no significant change (p = 0.137). AUC improvements were significant for AdaBoost (p < 0.001), GBoost (p < 0.001), and XGBoost (p < 0.001), whereas DT (p = 0.186) and RF (p = 0.585) did not exhibit significant changes. In the balanced dataset, tuning significantly enhanced accuracy for GBoost (p < 0.001) and XGBoost (p < 0.001), but not for DT (p = 0.160), RF (p = 0.074), or AdaBoost (p = 0.307). Regarding AUC, significant gains were found for RF (p = 0.005) and AdaBoost (p < 0.001), while DT (p = 0.322) and XGBoost (p = 1.000) remained statistically unchanged.
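The paired t-statistic underlying these comparisons can be computed directly from per-fold metric values before and after tuning. The sketch below uses only the standard library and hypothetical per-fold accuracies (illustration only, not the study's fold-level values):

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t-statistic for per-fold metrics before vs. after tuning.
    Returns (t, degrees of freedom); a p-value would then be read from
    the t-distribution (e.g. via scipy.stats.t.sf)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-fold accuracies for one model (illustration only)
acc_default = [0.78, 0.80, 0.79, 0.81, 0.77]
acc_tuned   = [0.82, 0.83, 0.81, 0.85, 0.80]
t_stat, df = paired_t(acc_default, acc_tuned)
```

Because the test is paired on the same cross-validation folds, it controls for fold-to-fold variability and isolates the effect of tuning itself.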

These findings underscore the importance of hyperparameter tuning in enhancing ensemble model performance, particularly for boosting algorithms, while also demonstrating the inherent robustness of Random Forest, which performed strongly under default settings and achieved further improvements in both accuracy and AUC after tuning across balanced and imbalanced datasets.

Computational complexity of the implemented models

The computational complexity of the algorithms was influenced by the experimental platform used in our study, which consisted of a Windows 10 64-bit operating system, 4 GB of RAM, an Intel® Core™ i5-7200U CPU @ 2.50 GHz, and R software version 4.4.1 (2024-06-14 ucrt). The system’s memory limitations and CPU processing speed were key factors in the observed computational demands. Hyperparameter tuning, combined with 10-fold cross-validation repeated five times, introduced significant computational overhead.

Distinct variations in computational cost were observed across models. Decision Trees demonstrated the lowest demands, with training times ranging from 14 to 26 min and moderate memory usage (~ 281 MB). Random Forest (RF) required slightly more resources, with training times of 23–28 min and memory usage reaching 351 MB when applied to balanced data. In contrast, boosting algorithms exhibited significantly higher computational costs. XGBoost required over 3 h to train on imbalanced data and an additional hour for balanced data, with a peak memory usage of 735 MB. GBoost required approximately 1.15 h, while AdaBoost and its balanced counterpart took about 1 h and 1.5 h, respectively, with memory usage around 300 MB. These results indicate that simpler tree-based models demand fewer computational resources, whereas boosting algorithms, although often yielding superior performance, require substantially more time and memory. Furthermore, dataset balancing notably increased computational costs for boosting models, particularly for XGBoost. The extensive hyperparameter search during cross-validation further amplified training time across all models.
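Training time and peak memory of the kind reported above can be captured by instrumenting the training call. The sketch below is a generic stdlib-based profiler, with a stand-in workload in place of an actual model fit; note that Python's tracemalloc tracks only Python-heap allocations, not native memory used by compiled libraries such as XGBoost's C++ core:

```python
import time
import tracemalloc

def profile(fn, *args, **kwargs):
    """Run fn and report elapsed wall-clock time (seconds) and peak
    Python heap usage (MB) during the call."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1024 ** 2

# Stand-in for a model-training call (hypothetical workload)
def train_model(n):
    return sum(i * i for i in range(n))

result, seconds, peak_mb = profile(train_model, 100_000)
```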

Feature importance analysis using the random forest ensemble model

We conducted SHAP (SHapley Additive Explanations) analysis on the best-performing ensemble model, Random Forest. The resulting summary plot (Fig. 2) provides valuable insights into the relative importance and directional influence of the features used. By evaluating the magnitude and direction of each variable’s contribution, we gain a deeper understanding of the mechanisms that enhance model performance. Notably, the Neethling vaccine type showed the strongest positive association with the healthy class, whereas the communal feeding system was closely linked to disease presence. Seasonal patterns also emerged, with winter associated with higher mortality risks compared to summer and autumn. Age was a critical determinant, as animals under one year of age were more susceptible to infection and mortality. While breed and age contributed to the model’s predictions, their influence was less significant than that of vaccination and feeding practices. These detailed findings uncover key risk factors shaping LSD outcomes and offer a data-driven basis for designing targeted, risk-based control strategies. A comprehensive summary of these contributions and the novelty of our findings is presented in Table 6.

Fig. 2
figure 2

SHAP values addressing the impact of features. Each point represents a sample. The higher the SHAP value, the higher the risk of LSD, and vice versa.

Table 6 Key contribution and novelty of our approach against the used methods.
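SHAP values generalize Shapley values from cooperative game theory. As a hedged illustration (the feature names and weights below are hypothetical, not estimates from this study), exact Shapley values for a tiny additive risk model can be computed by averaging each feature's marginal contribution over all feature orderings; SHAP libraries use efficient approximations of this same quantity:

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values: average each feature's marginal
    contribution over all orderings (feasible only for a handful
    of features)."""
    phi = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            phi[f] += value_fn(frozenset(present)) - before
    return {f: v / len(orders) for f, v in phi.items()}

# Toy additive "risk model": each present feature adds a fixed amount
# of predicted LSD risk (hypothetical weights, illustration only).
weights = {"unvaccinated": 0.4, "communal_grazing": 0.2, "age<1yr": 0.1}

def risk(subset):
    return sum(weights[f] for f in subset)

phi = shapley_values(list(weights), risk)
# For a purely additive model, each Shapley value equals the
# feature's weight, and the values sum to the full-model prediction.
```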

Discussion

Lumpy Skin Disease presents a major threat to livestock health and food security. Despite advancements in disease management, accurate prediction of LSD outbreaks remains a challenge. Ensemble learning techniques, such as bagging and boosting, offer promising solutions for improving predictive performance. However, limited research has systematically compared the performance of bagging and boosting models for LSD classification, particularly in the context of highly imbalanced, multiclass data. To the best of the authors’ knowledge, no prior study has employed ensemble ML algorithms to forecast the risk of LSD using multiclass imbalanced data and evaluated their performance under different resampling approaches. This study addresses this gap by evaluating and comparing the predictive capabilities of bagging and boosting methods, investigating the effects of hyperparameter tuning, and assessing the effects of three resampling techniques: SMOTE, Random Oversampling (ROS), and Random Undersampling (RUS).

The data exhibited a significant class imbalance, a common challenge in machine learning. Some ensemble learning algorithms, particularly Random Forest (RF), XGBoost, and AdaBoost, performed well under default imbalanced conditions. These results are consistent with previous research showing that conventional ML algorithms often underperform on imbalanced datasets, while ensemble approaches tend to offer improved performance45. Similarly, Zhu et al.46 demonstrated that ensemble algorithms enhance predictive accuracy in medical datasets. In contrast, other algorithms were more adversely affected, showing reduced predictive accuracy and generalization when trained on imbalanced data. Notably, the Decision Tree model performed poorly, which aligns with findings by Mienye and Sun45, who observed that DTs perform adequately on balanced data but deteriorate under class imbalance. This performance drop has been further attributed by Silaghi and Mathew47 to the DT’s tendency to overfit the majority class by favoring splits that maximize information gain while neglecting minority classes.

To mitigate the detrimental effects of class imbalance, three resampling techniques (Random Oversampling, Random Undersampling, and SMOTE) were evaluated. Our results indicated that ROS consistently outperformed the other methods in terms of model accuracy. This finding aligns with previous research by Kamalov et al.48, who highlighted the effectiveness and computational efficiency of ROS compared to more complex techniques like SMOTE. This reinforces the notion that simpler methods can sometimes yield better results than more sophisticated approaches. In contrast, SMOTE was found to enhance the accuracy of imbalanced LSD data in a study by Venkata Pratyusha Kumari49. Similarly, Kim and Hwang50 reported that ROS and SMOTE outperformed other resampling techniques, while undersampling often led to decreased performance. Overall, oversampling appeared generally more effective than undersampling for improving classification outcomes. However, as discussed by Kovács51, the effectiveness of resampling strategies can vary depending on the degree of class imbalance and the specific method used. Cieslak et al.52 likewise emphasized that oversampling tends to outperform undersampling in scenarios involving severe imbalance.

Moreover, fine-tuning significantly improved model performance, particularly for minority classes. The most substantial gains were observed in DT, AdaBoost, and GBoost models, which initially exhibited poor performance under imbalanced conditions. Hyperparameter optimization proved essential, though its impact varied across algorithms. These findings are consistent with those of Probst et al.53, who reported that Gradient Boosting Machines, unlike Random Forests, exhibit considerable variability in performance depending on hyperparameter configurations, necessitating more strategic tuning. Similarly, a previous study54 emphasized that hyperparameter tuning effectively mitigates overfitting and enhances deep learning model performance. In contrast, Carreira-Perpiñán and Zharmagambetov55 noted that although RF, AdaBoost, and GBoost are generally considered robust to hyperparameter selection, some level of tuning is often necessary to achieve optimal performance, depending on dataset-specific characteristics.

Building on the results of class imbalance and tuning effects, an overall evaluation of the five ensemble algorithms revealed that RF consistently achieved the highest performance across all scenarios, whether on imbalanced or balanced datasets, and under both tuned and default settings. This demonstrates its robustness and effectiveness in addressing data-related challenges. RF particularly excelled in precision and F1 score, even within the minority class. These findings align with those of Mirzaeian et al.56, who reported that RF outperformed other ensemble models such as XGBoost, GBoost, and AdaBoost. Additionally, RF’s strong performance in handling imbalanced data is consistent with the previous study38. Among the boosting algorithms, XGBoost ranked second, reinforcing its status as a strong alternative, particularly for datasets with uneven class distributions, as noted previously by Fitriyani et al.57. Moreover, we observed notable improvements in accuracy, precision, and recall for both RF and XGBoost after hyperparameter tuning and data balancing, further supporting the results of previous research45. Collectively, these results highlight RF and XGBoost as the most robust and high-performing models across various conditions, consistent with the conclusions drawn by Gurcan and Soylu58. Overall, these findings underscore the critical importance of carefully selecting the ensemble model, the tuning approach, and the resampling strategy when addressing classification tasks involving imbalanced data.

While tuning plays a vital role in optimizing model performance by ensuring the most effective parameter settings, it often entails high computational costs. This is particularly evident in boosting algorithms like XGBoost, which are more resource-intensive compared to bagging methods like RF. Computational complexity is influenced by several factors, including dataset size and structure, algorithm type, number of iterations, and hardware limitations. These observations are consistent with the findings of a previous study59, which emphasized that selecting optimal hyperparameters for both bagging and boosting techniques is a challenging and time-consuming process, yet crucial for enhancing classification performance. Moreover, our findings underscore that computational complexity can impact model performance, aligning with the conclusions of Ziolkowski60. Prior studies have also highlighted that both computational complexity and model performance can be improved through the use of resampling techniques and feature selection. For example, Khan et al.61 used the NearMiss method to address class imbalance, improving both reliability and computational efficiency by reducing dataset size and minimizing overfitting risks. Similarly, previous research62 demonstrated improved accuracy and reduced computational cost through optimal feature selection, which involved eliminating redundant or noisy variables. These insights underscore the importance of maintaining a balance between accuracy and computational efficiency.

To enhance the real-world interpretability of our ensemble learning models and to gain deeper insights into the factors contributing to LSD risk, we employed SHAP value analysis, as recommended by Gurcan and Soylu58. The results identified vaccination status as the most influential predictor. Animals vaccinated with the Neethling vaccine were more likely to be classified as healthy, whereas unvaccinated animals were more frequently classified as dead. This finding reflects the strong protective effect of the Neethling LSDV strain and its association with improved health outcomes, consistent with previous research63. In contrast, the Sheeppox vaccine demonstrated lower efficacy in reducing LSD morbidity. This may be attributed to the higher viral doses typically used in heterologous Sheeppox virus vaccines, which, although considered safe, are less effective in cattle than homologous vaccines, as noted in earlier studies64,65. Other key risk factors identified include grazing systems, the introduction of new animals, season, breed, and age. Communal grazing and new animal introductions significantly increased LSD risk, consistent with Selim et al.3. Seasonally, LSD prevalence peaked in autumn, followed by summer, aligning with previous findings1, which attribute this trend to warm, humid climates favorable for vector activity. However, other researchers66 reported a higher prevalence in winter. In our study, mortality was notably higher during winter, potentially due to stress-related factors and management challenges. This is supported by EFSA67, which suggested that winter LSD cases may result from vector-independent transmission routes and delays in outbreak reporting. Regarding demographic factors, sex was not a significant risk predictor, consistent with Selim et al.3. Age showed mixed associations; while young calves (< 1 year) were highly susceptible68,69, some studies indicated a higher risk in older cattle3,66.
Conversely, other authors indicated that neither sex nor age was significantly related to LSD risk prediction70,71.

Based on the insights derived from the SHAP analysis, several practical disease management strategies can be proposed for more effective LSD control. The identification of vaccination status as the most influential risk factor underscores the importance of prioritizing effective vaccination campaigns, particularly with Neethling-based vaccines, which have demonstrated superior protection. In contrast, the limited effectiveness of the Sheeppox vaccine highlights the need to adopt more efficacious, homologous alternatives. Focused immunization efforts should target unvaccinated animals and young calves in high-risk areas, as informed by risk modeling. The association between communal grazing and increased LSD risk underscores the importance of promoting controlled grazing systems and raising farmer awareness. Additionally, strict quarantine measures, including disease testing and sourcing from reputable suppliers, are essential to mitigate risks linked to the introduction of new animals. Given the seasonal rise in LSD cases during autumn and summer, enhanced surveillance and vector control during these periods is warranted. Overall, these findings demonstrate how machine learning outputs can be translated into actionable, field-level recommendations, reinforcing the value of explainable AI in veterinary disease management.

Limitations and future work

While this study demonstrates promising results in predicting LSD outcomes using ensemble ML techniques, several limitations should be acknowledged to provide a balanced perspective. The dataset, comprising 1,041 samples, while informative, may limit generalizability, especially across larger or more diverse populations. Real-world variability, including environmental, management, and breed differences, was not fully captured, suggesting the need for future studies with more extensive datasets to improve robustness and applicability. The study focused on SMOTE, ROS, and RUS for resampling, but advanced techniques like Tomek Links, NearMiss, or hybrid strategies could further enhance performance in addressing class imbalance. Moreover, the current evaluation was limited to five ensemble algorithms; incorporating more diverse modeling approaches, including stacking ensembles or deep learning architectures, could offer a more comprehensive understanding of model behavior across different scenarios. Hyperparameter tuning, conducted via grid search, was computationally intensive. This, along with algorithmic design and hardware limitations, affected scalability. Future work should explore more efficient optimization methods, such as Bayesian Optimization, to reduce computational cost and enhance model scalability across larger datasets or resource-constrained environments. Another limitation is the classification of “Dead” cases, which may be subject to misclassification bias due to field diagnosis constraints. Moreover, the current framework focused solely on clinical outcomes, excluding broader economic impacts of LSD, such as reductions in milk yield, fertility, and culling rates. Future studies should aim to integrate these factors to provide a more holistic understanding of LSD’s impact. Moving forward, future work should also expand the dataset to include additional risk factors, geographic diversity, and post-resampling data cleaning techniques.
Addressing these limitations will improve model precision and robustness, facilitating more informed, data-driven decisions in livestock health management.

Conclusion

This study developed and evaluated ensemble machine learning models for predicting LSD in livestock. The findings demonstrate that ensemble algorithms, particularly Random Forest and XGBoost, can effectively predict LSD occurrence, even in the presence of class imbalance. Model performance was significantly enhanced through hyperparameter tuning and 10-fold cross-validation. The study highlights that tuning must be tailored to the algorithm and data characteristics. Boosting methods, known for their sensitivity to hyperparameters, showed the greatest gains, indicating their dependency on careful parameter optimization. Meanwhile, bagging methods like RF exhibited more stable performance but still benefited from tuning in specific contexts.

Among resampling techniques, SMOTE and ROS outperformed RUS in managing class imbalance, contributing to more reliable model outcomes. The analysis identified key risk factors for LSD, including vaccination status (with the Neethling vaccine showing higher effectiveness), communal grazing, recent animal introductions, seasonal patterns (peaking in autumn and summer), breed susceptibility, and younger age groups. Notably, while vector-borne transmission remains central, vector-independent transmission, especially during winter, also plays a role. By analyzing various risk factors, these models can assist farmers and decision-makers in implementing targeted prevention and control strategies. The models demonstrate significant potential to improve the accuracy of LSD predictions.