Abstract
Human Immunodeficiency Virus (HIV) is a retrovirus that weakens the immune system, increasing vulnerability to infections and cancers. HIV spreads primarily through sharing needles, mother-to-child transmission during childbirth or breastfeeding, and unprotected sexual intercourse. Early diagnosis and treatment are therefore crucial to prevent progression from HIV to AIDS, which is associated with higher mortality. This study introduces a machine learning-based framework for the classification of HIV infection, which is crucial for limiting disease progression and transmission risk and for improving long-term health outcomes. First, the challenge posed by an imbalanced dataset is addressed using the Synthetic Minority Over-sampling Technique (SMOTE), chosen over two alternative methods based on its superior performance. Additionally, we enhance dataset quality by removing outliers using the interquartile range (IQR) method. A comprehensive two-step feature selection process is employed, reducing the 22 original features to 12 critical variables. We evaluate several machine learning models, identifying the Random Forest Classifier (RFC) and Decision Tree Classifier (DTC) as the most effective, as they demonstrate higher classification performance than the other models. By integrating these models into a voting classifier, we achieve an overall accuracy of 89%, a precision of 90.84%, a recall of 87.63%, and an F1-score of 89.21%. The model undergoes validation on multiple external datasets with varying instance counts, reinforcing its robustness. Furthermore, an analysis focusing solely on CD4 and CD8 cell counts, which are essential laboratory measurements for HIV monitoring, demonstrates an accuracy of 87%, emphasizing the significance of these clinical features for the classification task. These outcomes underscore the potential of combining machine learning techniques with critical clinical data to enhance the accuracy of HIV infection classification, ultimately contributing to improved patient management and treatment strategies. The findings also highlight the scalability of the approach, showing that it can be efficiently adapted for large-scale use across various healthcare environments, making it suitable for deployment in both high- and low-resource settings.
Introduction
Acquired Immune Deficiency Syndrome (AIDS), caused by the Human Immunodeficiency Virus (HIV), continues to present a formidable global health crisis. HIV attacks the immune system, specifically targeting CD4 + T cells, which leads to immunodeficiency and exposes individuals to opportunistic infections and eventual mortality if untreated. Despite advancements in antiretroviral therapy (ART), HIV remains a public health challenge due to the high rates of new infections and insufficient global progress toward eradication1. The World Health Organization (WHO) estimates that 38.4 million people were living with HIV at the end of 2021, with a significant portion of these cases located in low- and middle-income countries2. The UNAIDS “Fast-track” initiative set ambitious 95–95–95 targets to combat the HIV epidemic by 2030: 95% of people living with HIV knowing their status, 95% of diagnosed individuals receiving ART, and 95% of those on ART achieving viral suppression1,3.
The power of machine learning (ML) in transforming various fields, especially healthcare, is becoming increasingly evident. Deep learning models, for example, are significantly enhancing medical diagnostics, demonstrated by their effectiveness in analyzing medical images for highly critical conditions like cardiac amyloidosis and neurological disorders, leading to improved accuracy and quicker diagnosis4,5. Beyond medical imaging, machine learning’s utility extends to optimizing processes in other health-related areas, such as enhancing the nutritional quality and sensory properties of food products6. Furthermore, advanced ML models are setting new benchmarks for early and accurate detection of diseases like cancer, showcasing their broad applicability and impact7. ML also offers promising solutions in the fight against HIV/AIDS. ML algorithms can uncover patterns within challenging datasets, which is particularly useful for forecasting HIV infection risk and analyzing disease progression8. To improve patient outcomes and reduce mortality rates, early detection and prediction of disease progression are critical9. With the ability to predict disease progression and identify those at the highest risk, ML could play a transformative role in efforts to control and eventually end the HIV/AIDS epidemic. Despite the increasing use of ML in disease classification, several challenges remain. A persistent issue in HIV datasets is class imbalance, where the scarcity of HIV-positive cases skews classifiers toward the majority class (HIV-negative), compromising sensitivity, a critical metric for infectious disease detection. For instance, in HIV screening programs, imbalance ratios as severe as 5:100 are common due to the low incidence of new infections relative to the tested population, as observed in large-scale surveillance studies10. Additionally, medical datasets are often high-dimensional and noisy, making feature selection a critical challenge, as irrelevant features can degrade model performance and interpretability. Scalability is another concern, as models trained on small or region-specific datasets may not generalize well to larger, more diverse populations. Finally, the computational inefficiencies and need for extensive feature sets in real-world clinical settings hinder the integration of ML models into fast, cost-effective diagnostic workflows. Therefore, the impact of dataset size variation on model accuracy is a critical factor that needs to be addressed for robust performance in real-world applications.
To address these challenges, this study proposes an ML-based approach to classify HIV infection using effective augmentation and noise handling techniques. The initial dataset contained 50,000 instances and was highly imbalanced. It comprises two classes, HIV-infected and non-infected, with the infected class containing roughly half as many instances as the non-infected class. To address this, we applied the Synthetic Minority Over-sampling Technique (SMOTE), an oversampling method, to balance the dataset11. SMOTE generates synthetic data points by interpolating between existing minority class samples, effectively balancing the dataset without simply duplicating instances. This approach enhances the model’s ability to learn decision boundaries for the minority class, improving its sensitivity to correctly classify HIV-infected instances while minimizing overfitting. In addition to data imbalance, the dataset contains outliers across all 22 features, with six features exhibiting a particularly high number of extreme outliers. Outliers in medical datasets can significantly distort statistical distributions and impair model performance by skewing decision boundaries. To address this, we employed the interquartile range (IQR) method, which is robust to the non-Gaussian distributions often observed in healthcare data. The IQR method identifies outliers based on the spread of the middle 50% of the data, providing a reliable approach for removing extreme values without excluding informative data points. This preprocessing step helps the model generalize better, particularly when applied to external datasets with varying data distributions. To further refine the dataset, we applied a two-stage feature selection process: Recursive Feature Elimination (RFE) was used first, followed by feature ranking based on Median Absolute Deviation (MAD). This process revealed that 12 features significantly influenced the classification task, and these were selected for further analysis. For the classification, we evaluated ten ML models: Random Forest Classifier (RFC), Decision Tree Classifier (DTC), Gradient Boosting (GB), LightGBM, XGBoost, Multilayer Perceptron (MLP), Naïve Bayes (NB), Logistic Regression (LR), AdaBoost (AB), and K-Nearest Neighbor (KNN). The models were tested using the complete set of 22 features and the selected 12 features. Based on their superior performance, RFC and DTC were chosen to form a voting classifier. This combined model achieved an accuracy of 89% in classifying HIV. To assess the model’s scalability, we validated its performance on four additional datasets containing similar features: datasets of 2,139, 5,000, and 15,000 instances, and a merged dataset of 72,139 instances. The model consistently performed well across all datasets, achieving comparable results using both the full 22 features and the selected 12 features, demonstrating its robustness against feature redundancy and dataset variation. Moreover, this consistent performance across varying dataset sizes highlights the model’s adaptability to different data distributions.
Additionally, the minimal performance gap between the full and reduced feature sets underscores its ability to extract relevant patterns without over-relying on redundant features.
Furthermore, an additional performance analysis is conducted using only laboratory test data, specifically the CD4 and CD8 counts at baseline and at 20 weeks (cd40, cd420, cd80, cd820). CD4 and CD8 cell counts are key laboratory measurements and an integral component of monitoring HIV progression3,12,13. They are used to assess the degree of immunosuppression in HIV-positive individuals and guide treatment decisions. Notably, the model achieved an accuracy of 87% when tested on these four lab test features alone, just 2% lower than when using all 22 or the 12 selected features. This highlights the significance of these lab test features in HIV classification. Overall, this study demonstrates the efficacy of the proposed voting classifier and emphasizes the importance of laboratory test data in HIV classification. Despite eliminating 18 features, the model maintained strong performance, reinforcing the relevance of CD4 and CD8 cell counts in the classification process. The major contributions of this study are as follows:
- Data augmentation using SMOTE and outlier elimination with the IQR method to address class imbalance and improve data quality.
- A two-stage feature selection process using Recursive Feature Elimination (RFE) and Median Absolute Deviation (MAD) to identify the most influential features, enhancing model performance.
- Development of a robust ML-based voting classifier combining the Random Forest Classifier (RFC) and Decision Tree Classifier (DTC), achieving high classification accuracy.
- Comprehensive performance evaluation on datasets of varying sizes to investigate the impact of dataset size on model accuracy and scalability.
- Demonstration of the significance of laboratory test features (CD4 and CD8 counts at baseline and at 20 weeks) in HIV classification, achieving competitive performance with a minimal feature set and highlighting the model’s potential for practical clinical applications.
Moreover, this study explicitly investigates the impact of dataset size variation on model performance. By systematically evaluating ML models on datasets of different sizes, this research provides valuable insights into how dataset characteristics influence HIV classification accuracy. The findings contribute to the ongoing effort to develop scalable, data-efficient ML models for real-world HIV diagnosis, emphasizing the importance of balanced datasets, robust preprocessing techniques, and key laboratory test features in achieving reliable HIV classification.
The rest of this paper is organized as follows: Sect. 2 reviews related works, Sect. 3 describes the dataset and preprocessing, Sect. 4 details the models, Sect. 5 presents the experimental results, Sect. 6 presents the discussion and limitations, and Sect. 7 concludes with future directions.
Literature review
The utilization of machine learning techniques in research on AIDS has been investigated in diverse contexts, especially in the realms of mortality prediction, disease progression, and associated health complications. Sara Domínguez-Rodríguez et al.14 conducted a comparative analysis of seven algorithms aimed at predicting one-year mortality and clinical progression to AIDS among a cohort of infants diagnosed with HIV in South Africa and Mozambique. The Random Forest (RF) algorithm was identified as the most effective performer, achieving an accuracy of 82.8%, a sensitivity rate of 78%, and an area under the curve (AUC) of 0.73. This algorithm demonstrated superior sensitivity and specificity relative to its counterparts. In a related study, G.B. Silva Júnior et al.15 investigated the occurrence of acute kidney injury (AKI) in patients with AIDS, utilizing the RIFLE classification system to forecast clinical outcomes. Their findings indicated that AKI was present in 37% of the studied population, with patients categorized into various risk levels, thereby underscoring the significance of predictive modeling in assessing clinical outcomes associated with AIDS. Bankole Olatosi et al.16 conducted an investigation into the utilization of linked integrated health data pertaining to individuals living with HIV (PLWH) for the purpose of predicting future HIV care status. Their study demonstrated that a Bayesian Network model yielded superior performance metrics, achieving an F1 score of 0.87 and an Area Under the Curve (AUC) of 0.94, alongside commendable precision and recall values. Furthermore, Huaxin Pang et al.17 incorporated deep learning methodologies by introducing an attention mechanism that allocates varying weights to different symptomatic features. This framework exhibited consistent enhancements over conventional approaches, thereby reinforcing the significance of feature weighting in augmenting model efficacy for AIDS classification. Additional relevant research has explored diverse dimensions of AIDS diagnosis and prognosis. Yvonne A.L. Lim et al.18 investigated the presence of Cryptosporidium oocysts and Giardia cysts in fecal specimens collected from HIV/AIDS patients. Their study employed small subunit (SSU) sequencing techniques, successfully identifying the species Cryptosporidium parvum in 64% of the analyzed samples. This investigation underscores the interplay between AIDS and other co-infectious diseases in the context of disease advancement. In a computationally oriented study, You-Shyang Chen et al.19 employed a hybrid rough set classifier to analyze the factors that influence adherence to Highly Active Antiretroviral Therapy (HAART). The integration of both linear and nonlinear feature selection methodologies yielded enhanced classification outcomes, thereby demonstrating the potential of hybrid models to augment research findings in the context of acquired immunodeficiency syndrome (AIDS). Furthermore, Paulo E Rauber et al.20 underscored the significance of dimensionality reduction techniques within predictive modeling, advocating for projection-based visual analytics to improve classification effectiveness. Their research laid the groundwork for the application of low-dimensional mappings to refine feature selection processes and enhance classification accuracy. Moreover, the use of deep learning models was further investigated by Margaret C. 
Steiner et al.21, who determined that convolutional neural networks (CNNs) emerged as the most effective architecture for the identification of drug resistance mutations (DRMs), thereby highlighting the importance of biologically relevant features in optimizing model performance. V. Mageshwari et al.22 explored AIDS-related discussions on Twitter, achieving high accuracy (94.65%) and sensitivity (94.56%) in tweet identification. Their methodology showed the relevance of social media data in AIDS-related research, particularly for monitoring public health trends. Kuifang Du et al.23 developed a deep learning system to detect AIDS-related conditions from ultra-wide-field (UWF) fundus images, achieving impressive AUC scores across multiple clinical datasets. Their work highlights the potential of computer vision in enhancing AIDS diagnosis through medical imaging. Yufeng Zhao et al.24 examined a dataset where AIDS patients were characterized by a list of symptoms and labeled with Traditional Chinese Medicine (TCM) syndromes. Their model achieved an average precision of 0.7736 and recall of 0.7111, showcasing the integration of alternative medical approaches in AIDS research. In addition to machine learning approaches for AIDS classification, various studies have explored optimization techniques and adaptive AI strategies that enhance predictive accuracy and computational efficiency in healthcare and beyond. El-Sayed M. El-Kenawy et al.25 have discussed various hybrid optimization strategies in healthcare AI, integrating metaheuristic algorithms to enhance diagnostic accuracy and decision-making efficiency. These approaches, including hybrid swarm-based and evolutionary methods, optimize medical image analysis, disease prediction, and patient outcome assessment by balancing exploration and exploitation in search spaces. The findings highlight significant improvements over traditional models, demonstrating the potential of hybrid techniques in advancing intelligent healthcare solutions. Mona Ahmed Yassen et al.26 have explored advancements in scalable machine learning, focusing on adaptive models that efficiently process dynamic datasets in renewable energy forecasting. Their study integrates deep learning and hybrid optimization techniques to enhance prediction accuracy while addressing challenges like data volatility and computational complexity. The findings underscore the effectiveness of AI-driven frameworks in managing large-scale, evolving datasets for sustainable energy solutions. El-Sayed M. El-Kenawy27 et al. have examined resilient modeling frameworks designed to function effectively under fluctuating resource constraints, emphasizing adaptive AI-driven strategies for healthcare forecasting. Their work integrates machine learning with metaheuristic optimization to enhance predictive accuracy while ensuring computational efficiency in environments with limited data availability. The study highlights the necessity of scalable, interpretable models that maintain robustness despite variations in resource allocation. While these studies focus on optimization techniques and adaptive AI strategies in diverse domains, their insights into hybrid models, computational efficiency, and predictive accuracy provide a broader perspective on enhancing machine learning applications, including in medical diagnostics like HIV classification. 
In the pursuit of improving classification tasks like HIV diagnosis, leveraging cutting-edge artificial intelligence (AI) techniques is crucial for improving diagnostic accuracy and understanding complex disease patterns. Aiesha Mahmoud Ibrahim et al.28 explored the potential of AI methods for early diagnosis of Parkinson’s disease (PD), reviewing various deep learning and machine learning techniques that enhance prediction accuracy by processing large datasets. Similarly, Ali Mahmoud Ali et al.29 proposed an advanced feature selection method using Cat Swarm Optimization (CSO) and Support Vector Machines (SVM), achieving a significant 100% accuracy in cancer subtype classification, which greatly outperforms existing models and contributes to the interpretability of cancer research. In addition, another work by Ali Mahmoud Ali et al.30 focused on reviewing AI applications in omics data analysis, highlighting the challenges and solutions for preprocessing, dataset validation, and model testing, thus offering invaluable insights for future applications in omics-based disease classification.
This study distinguishes itself from the presented literature in several key aspects. Unlike prior works that broadly investigate AIDS-related predictions such as mortality or disease progression, our research specifically focuses on HIV classification, directly addressing the diagnostic challenge. Furthermore, our approach emphasizes rigorous data preprocessing with SMOTE and IQR to handle class imbalance and outliers, alongside feature selection techniques that are not explicitly detailed in the reviewed studies. Notably, we developed and validated a robust voting classifier and uniquely investigated the impact of dataset size on model scalability, offering insights into model performance across varying data volumes. Finally, a crucial differentiator is our emphasis on the clinical relevance of readily available lab features (CD4, CD8), demonstrating strong classification performance with these minimal features and highlighting the potential for practical clinical application, a focus less pronounced in the existing literature. Therefore, this study provides a unique and clinically relevant approach to HIV classification, distinct from the broader predictive modeling trends prevalent in the current literature.
Methodology
The principal methodology employed in this research encompasses several critical phases, advancing from gathering data to assessing the model. The stages are elaborated upon in Fig. 1.
The methodology of this study begins with a dataset comprising 50,000 instances and 22 attributes. To address class imbalance, SMOTE is applied, generating synthetic samples for the minority class to create a well-balanced dataset and ensuring that the classification model does not become biased toward the majority class. Following this, outlier removal is performed to eliminate extreme values that could distort model training, thereby enhancing the reliability and integrity of the dataset. To standardize feature scales and improve model convergence, Min-Max Scaling is employed for data normalization. Next, feature selection is conducted using two robust techniques: RFE and MAD. Based on these methods, the 12 most significant features are chosen for the final classification process. To build a high-performing model, ten machine learning classifiers are initially evaluated. Among these, RFC and DTC demonstrate the highest accuracy, making them the most suitable candidates for further optimization. Given their strong individual performance, a Voting Classifier is designed using a hard voting strategy, which aggregates predictions from both classifiers to improve overall classification accuracy and robustness. The performance of the proposed Voting Classifier is evaluated in multiple ways. First, it is tested using both the original 22-feature dataset and the refined 12-feature subset to assess the impact of feature selection on classification performance. Furthermore, to investigate the model’s scalability and robustness, experiments are conducted using datasets of varying sizes, specifically 2,139, 5,000, 15,000, 50,000, and 72,139 data points. These evaluations demonstrate that the Voting Classifier maintains consistent performance across different dataset sizes, proving its reliability in handling large-scale data. Additionally, to explore the model’s effectiveness in clinical settings, classification is performed using only four key clinical lab test features: cd40, cd420, cd80, and cd820. Despite the significant reduction in input features, the Voting Classifier continues to exhibit stable and reliable classification performance. This result underscores the model’s ability to function effectively even with minimal but crucial medical indicators, highlighting its potential applicability in resource-constrained healthcare environments for HIV classification.
Dataset description
The dataset utilized in this investigation comprises healthcare statistics and categorical data about 2,139 patients diagnosed with acquired immunodeficiency syndrome (AIDS), initially released in 199631. The dataset was created by the AIDS Clinical Trials Group (ACTG) of the National Institute of Allergy and Infectious Diseases (NIAID). Additionally, the General Research Center units funded by the National Center for Research Resources contributed to the creation and maintenance of this dataset. The dataset includes 34,494 instances of non-infected patients and 15,506 instances of infected patients. It includes demographic and clinical information such as age, weight, gender, race, treatment types, laboratory results, and behavioral aspects, including a history of intravenous drug use and homosexual conduct. It contains no missing values and has not undergone any preprocessing. The dataset reflects medical history elements, including hemophilia and previous antiretroviral treatments. Furthermore, it incorporates variables related to treatment, identifying whether the patients received particular medications, such as Zidovudine (ZDV), Didanosine (ddI), or Zalcitabine (Zal), along with specifications of the treatment group to which they were assigned. Laboratory results, specifically CD4 and CD8 lymphocyte counts, are documented at baseline and after 20 weeks. Additional information about the attributes is provided in Table 1.
Performance status is measured through the Karnofsky scale, which ranges from 0 to 100. Additionally, the dataset includes information on time-to-failure or censoring alongside a binary classification indicating whether patients were infected with AIDS. This extensive array of variables affords a comprehensive understanding of the patients’ clinical profiles and treatment responses, facilitating detailed analysis and classification.
Feature explanation
The dataset encompasses various attributes that are crucial for predicting HIV infection. Age and weight serve as fundamental baseline measurements, where age can significantly affect disease progression and treatment efficacy, while weight is measured in kilograms at baseline, as weight loss can be an indicator of disease progression. Gender is categorized as female ‘0’ or male ‘1’, as it may influence disease susceptibility and progression, while race is classified as White ‘0’ or non-white ‘1’, accounting for potential genetic or socioeconomic factors that might impact disease outcomes. The dataset documents several risk factors, such as hemophilia, marking its presence ‘1’ or absence ‘0’ as a known risk factor for HIV transmission. The dataset records homosexual activity as yes ‘1’ or no ‘0’, a known risk factor, and notes the history of intravenous drug use as yes ‘1’ or no ‘0’, another significant factor in HIV transmission. The treatment regimen (trt) indicates specific protocols, ranging from ZDV only ‘0’ to various combinations of antiretroviral medications. The ‘oprior’ and ‘preanti’ variables capture previous antiretroviral therapy exposure, providing valuable information on prior treatment history. Antiretroviral history is categorized as naive or experienced, with stratification based on the duration of prior therapy, offering insights into potential treatment responses and resistance patterns. These treatment-related features are essential for understanding patient history and predicting potential outcomes. CD4 and CD8 T-lymphocytes play crucial roles in the immune system and are key markers for HIV/AIDS progression and treatment efficacy. The dataset captures these values at two critical timepoints.
CD4 T-cell counts (cd40 and cd420)
CD4 + T-cells, also known as T-helper cells, are the primary targets of HIV infection32. These cells coordinate the immune response by stimulating other immune cells to fight infection. The dataset tracks CD4 counts at baseline (cd40) and after 20 weeks of treatment (cd420). Normal CD4 counts in healthy adults range from 500 to 1,500 cells/mm³33. In HIV-positive individuals, CD4 counts typically decline over time, with counts below 200 cells/mm³ indicating progression to AIDS34. The baseline measurement (cd40) provides crucial information about the patient’s immune status at the start of treatment and is a strong predictor of disease progression. The 20-week follow-up measurement (cd420) is particularly valuable as it indicates the patient’s response to antiretroviral therapy. An increase in CD4 count from baseline to 20 weeks suggests treatment effectiveness, while a continued decline may indicate treatment failure or resistance35. The rate of CD4 count change between these two timepoints (cd40 to cd420) can also predict clinical outcomes and inform treatment decisions.
CD8 T-cell counts (cd80 and cd820)
CD8 + T-cells, or cytotoxic T-lymphocytes, are immune cells that identify and destroy virus-infected cells36. Unlike CD4 cells, CD8 cells typically increase during HIV infection as the immune system attempts to control the virus. The dataset includes CD8 counts at baseline (cd80) and after 20 weeks (cd820). Normal CD8 counts range from 150 to 1,000 cells/mm³37, but in HIV infection, they often become elevated, sometimes exceeding 1,500 cells/mm³38. The baseline CD8 count (cd80) provides information about the initial immune activation state, while the 20-week measurement (cd820) helps track changes in immune response. The CD8 response is complex; while these cells are crucial for controlling HIV replication, persistently elevated CD8 counts can indicate chronic immune activation, which is associated with faster disease progression39. The ratio of CD4 to CD8 cells (calculated from cd40:cd80 and cd420:cd820) is another valuable metric, with a lower ratio generally indicating more advanced disease.
Data preprocessing
The data preprocessing phase involved multiple critical steps to prepare the dataset for effective modeling. To understand the class balance, we first examined the distribution of the target variable ‘infected’, which takes two values, 0 and 1. Additionally, we checked for duplicate records and found no duplicate entries in the dataset. We also verified that the dataset contained no null values, as all subsequent operations proceeded without requiring any missing-value handling, confirming a complete dataset.
SMOTE (Synthetic Minority Over-sampling Technique)
To address the class imbalance issue in the target variable ‘infected,’ we applied the SMOTE technique. SMOTE works by generating synthetic samples for the minority class instead of merely duplicating existing samples. It creates new instances by selecting a data point from the minority class and generating synthetic samples along the line joining that point to its nearest neighbors40. This process helps to balance the dataset by increasing the representation of the minority class, thus preventing the model from being biased toward the majority class. By applying SMOTE, we ensured that both the infected and non-infected classes were equally represented, which significantly improved the model’s ability to learn and predict effectively. After applying SMOTE, the number of instances in the infected class was increased to 34,494, adding an additional 18,988 samples (See Fig. 2). As a result, the overall dataset after augmentation consists of 68,988 instances. This balanced dataset ensures equal representation of both classes, improving the model’s ability to learn from both outcomes.
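To make this step concrete, the following minimal sketch uses the imbalanced-learn implementation of SMOTE; the file name, random seed, and variable names are illustrative assumptions rather than details taken from the original study.

```python
# Hedged sketch of the SMOTE balancing step (file name and seed are illustrative).
import pandas as pd
from imblearn.over_sampling import SMOTE

df = pd.read_csv("aids_clinical_trials.csv")          # hypothetical file name
X, y = df.drop(columns=["infected"]), df["infected"]

# SMOTE interpolates between minority-class neighbours instead of duplicating rows.
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print(y.value_counts())      # original: 34,494 non-infected vs. 15,506 infected
print(y_res.value_counts())  # balanced: 34,494 instances per class
```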
Outlier elimination
Then, we moved on to identify and address outliers in the dataset. Figure 3 illustrates the initial visualization of the data distribution using box plots. Using the Interquartile Range (IQR) method, features with significant spread (IQR > 2) were identified, which included all 22 features.
As visualized in Fig. 3, our feature distributions revealed significant outliers, particularly in ‘time’, ‘cd40’, ‘cd420’, ‘cd80’, and ‘cd820’, with orange dots scattered far beyond the main blue box in the higher value range. These outliers, especially in the ‘cd’ markers, reflect extreme concentrations of immune cell surface proteins in HIV infection and pose a risk to the reliability and generalizability of our model. IQR-based outlier removal generally enhances predictive accuracy in medical classification tasks by reducing overfitting to anomalies41,42. So, we used an IQR-based outlier handling method, which effectively filters out noise while preserving valid biological extremes, ensuring that our classifier focuses on the core, clinically meaningful trends in HIV biomarkers.
To handle these outliers, we combined IQR-based capping of extreme values with a Quantile Transformer configured for a normal output distribution. The Quantile Transformer maps the original data to a normal distribution, which both reduces the influence of extreme values and makes the data more amenable to models that assume Gaussian-like inputs. This approach is particularly advantageous when the data do not follow a standard normal distribution and when robust performance in the presence of outliers is desired. For the capping step, we calculated the 25th and 75th percentiles of each feature and defined upper and lower limits as 1.5 times the IQR beyond these percentiles; values outside these limits were capped to prevent outliers from disproportionately influencing the model. The effectiveness of this outlier treatment can be observed in Fig. 4. The post-processing visualization demonstrates a more normalized distribution of values across all features, with significantly reduced outlier presence and more consistent value ranges.
After comparing Figs. 3 and 4, the impact of outlier management is clearly visible. In Fig. 3, numerous orange dots representing outliers are scattered far beyond the box plot whiskers. However, in Fig. 4, these distant orange dots have significantly diminished and are now concentrated much closer to, or within, the box plot whiskers. This demonstrates IQR’s effectiveness in identifying and managing extreme data points. Overall, the application of IQR has led to a more refined dataset representation for the HIV infection classification task, where the data points are now more concentrated within the typical ranges defined by the box plots. This indicates a reduction in the influence of outliers, ensuring a more consistent and reliable dataset for training the machine learning model.
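A minimal sketch of this outlier treatment is given below, assuming `X_res` is the balanced feature DataFrame from the SMOTE sketch; the 1.5 × IQR factor follows the description above, while the function and variable names are illustrative.

```python
# Hedged sketch: IQR-based capping followed by an optional quantile transform.
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

def cap_outliers_iqr(frame, factor=1.5):
    """Clip each column to [Q1 - factor*IQR, Q3 + factor*IQR]."""
    capped = frame.copy()
    for col in capped.columns:
        q1, q3 = capped[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        capped[col] = capped[col].clip(q1 - factor * iqr, q3 + factor * iqr)
    return capped

X_capped = cap_outliers_iqr(X_res)

# Optional mapping to a normal output distribution, as described in the text.
qt = QuantileTransformer(output_distribution="normal", random_state=42)
X_qt = pd.DataFrame(qt.fit_transform(X_capped), columns=X_capped.columns)
```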
After treating outliers, we performed feature selection using two methods: Recursive Feature Elimination (RFE) with Logistic Regression and Median Absolute Deviation (MAD) scoring. Based on these analyses, we reduced the dataset from 22 to 12 features by eliminating less significant predictors such as ‘hemo’, ‘oprior’, ‘z30’, ‘gender’, ‘str2’, ‘strat’, ‘symptom’, ‘treat’, and ‘offtrt’. Finally, we applied MinMaxScaler to normalize all features to a range between 0 and 1, ensuring a consistent scale across all predictors for optimal model performance.
Feature selection
This study utilized two feature selection methods: Recursive Feature Elimination (RFE) and feature ranking based on Median Absolute Deviation (MAD). RFE identifies complex relationships between features by recursively eliminating the least significant ones, ensuring model-driven feature ranking. However, it is sensitive to outliers. To address this, we applied MAD, which is robust to extreme values, ensuring stable feature selection. MAD is outlier-resistant, which is crucial in medical datasets where noisy or extreme values can distort statistical measures, ensuring that important features are not overlooked. This two-step approach ensures that both global feature relevance and local feature variability are considered, offering a more comprehensive and reliable feature selection process compared to using either method alone or other conventional techniques like correlation-based or variance threshold methods.
While alternative feature selection methods such as LASSO and PCA are commonly used, they have certain limitations in this context. LASSO tends to eliminate correlated features, which is problematic in medical datasets where multiple biomarkers may be jointly informative43. PCA transforms features into principal components, reducing interpretability, a crucial factor in clinical decision-making44. In contrast, RFE captures complex feature relationships, while MAD ensures robustness to outliers, making their combination particularly suitable for medical datasets. Prior studies45,46,47 have also demonstrated the effectiveness of RFE and MAD in similar tasks, reinforcing our choice.
Recursive feature elimination (RFE)
RFE is a wrapper-based feature selection method that works with any machine learning model capable of assigning importance to each feature, typically models like decision trees, support vector machines (SVMs), and linear regression models. RFE functions based on backward feature elimination, whereby it iteratively constructs models and discards the least significant features following the feature weights (or coefficients) determined by the model48. This approach is especially advantageous in high-dimensional datasets, wherein irrelevant or redundant features may adversely affect model efficacy, extend computation duration, and result in overfitting. Figure 5 illustrates the results of RFE in this dataset.
Based on the RFE ranking chart shown in Fig. 5, we can select features with a ranking of 10 or higher as influential for the classification task. First, the highest RFE ranking achieved by any feature, observed for ‘cd820’ and ‘cd80’, is slightly above 20 and represents the upper limit of feature importance in this context. Second, a significant number of features exhibit RFE rankings between 15 and 20, as seen with features like ‘cd420’, ‘cd40’, ‘preanti’, and others in the top half of the chart. This concentration in the upper range suggests a set of features with moderate to high influence. Choosing a threshold of 10 allows us to effectively capture a substantial portion of these moderately to highly ranked features. By setting this cutoff, we retain the top 12 features: ‘cd820’, ‘cd80’, ‘cd420’, ‘cd40’, ‘preanti’, ‘karnof’, ‘drugs’, ‘homo’, ‘wtkg’, ‘age’, ‘trt’, and ‘time’, all of which clearly surpass a ranking of 10. Conversely, features falling below this threshold, such as ‘race’, ‘gender’, and ‘hemo’, exhibit considerably lower RFE rankings, indicating their relatively diminished importance or weaker contribution to the model’s performance. Therefore, a threshold of 10 serves as a reasonable and visually informed cut-off for selecting the set of features that are demonstrably more influential based on their RFE ranking values in Fig. 5.
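The following minimal sketch shows how the RFE step could be reproduced with scikit-learn, assuming the capped feature matrix `X_capped` and balanced labels `y_res` from the earlier sketches; the estimator settings are assumptions, and requesting 12 features mirrors the threshold discussed above.

```python
# Hedged sketch of RFE with a Logistic Regression estimator.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=12)
rfe.fit(X_capped, y_res)

selected = X_capped.columns[rfe.support_]
print(list(selected))  # expected to include cd820, cd80, cd420, cd40, preanti, ...
```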
Feature ranking with median absolute deviation (MAD)
MAD is a statistical metric employed to assess the average divergence of data points from the arithmetic mean of a dataset. This measure indicates the dispersion or variability present within a particular feature. MAD is frequently employed for unsupervised feature selection, wherein the objective is to prioritize features based on their variability or significance within the dataset, devoid of utilizing labeled outcomes49. Features that exhibit elevated levels of variability are generally indicative of increased informational content and are likely to exert a more pronounced influence on model performance. It can be expressed mathematically as:
$$\mathrm{MAD}=\frac{1}{n}\sum_{i=1}^{n}\left|x_{i}-x^{\prime}\right| \quad (1)$$
where \(x_{i}\) is an individual data point, \(x^{\prime}\) is the mean of the feature, and \(n\) is the total number of observations. Figure 6 shows the results of MAD in this dataset.
As depicted in Fig. 6, the MAD scores indicate that the preanti and time variables exhibit the greatest variability. This observation implies a substantial capacity for these variables to elucidate significant distinctions within the dataset. Conversely, features exhibiting low MAD scores may offer limited informational value; however, their removal must be approached cautiously to prevent the inadvertent exclusion of potentially beneficial features. Therefore, we selected the top six features, as there is a discernible drop in MAD score after the sixth feature, ‘cd80’, as visually evident in Fig. 6.
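A minimal sketch of this ranking is shown below, computing the deviation score defined in Eq. (1) column-wise on the capped feature DataFrame from the earlier sketches; the printed ordering is what the text above leads us to expect, not a guaranteed output.

```python
# Hedged sketch: rank features by mean absolute deviation from the feature mean.
mad_scores = (X_capped - X_capped.mean()).abs().mean().sort_values(ascending=False)
top_six = mad_scores.head(6).index.tolist()
print(top_six)  # expected: preanti, time, cd420, cd40, cd820, cd80
```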
In this two-step feature selection process, RFE first identified 12 important features, while MAD then pinpointed the six most impactful ones: preanti, time, cd420, cd40, cd820, and cd80. These top MAD features are also present within the 12 features selected by RFE, which strongly validates their significance in the classification task. MAD highlights features with the most substantial individual predictive power, complementing RFE’s model-centric feature ranking. This overlap strengthens confidence in these six features as highly relevant for classification. The convergence of both RFE and MAD on these features provides a robust and well-justified final feature set. Thus, we selected 12 features from RFE to capture a broader range of relevant variables, while the six features identified by MAD, including four key clinical lab test data, were highlighted for their strong individual predictive power and significant relevance to HIV classification.
Normalization with MinMaxScaler
Lastly, to enhance the model’s performance, we used MinMaxScaler to normalize all features in the dataset to a consistent scale between 0 and 1. Normalization is crucial because it ensures that each feature contributes equally to the model’s predictions, preventing certain features from dominating due to their larger magnitudes. The MinMaxScaler works by subtracting the minimum value of each feature and then dividing by the range of that feature. This technique transforms all features into a uniform range, making it easier for the model to converge during training and leading to better overall performance.
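A minimal sketch of this normalization step is given below, assuming `selected` holds the 12 features retained by RFE in the earlier sketch.

```python
# Hedged sketch: rescale each retained feature to [0, 1].
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X_selected = X_capped[selected]
scaler = MinMaxScaler()                    # applies (x - min) / (max - min) per column
X_scaled = pd.DataFrame(scaler.fit_transform(X_selected), columns=X_selected.columns)
```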
Model
For the classification task of this study, we utilized ten machine learning models. The selection of these models was driven by their diverse strengths, each contributing to the robustness and scalability of the classification task. Tree-based models like RFC, DTC, GB, LightGBM, and XGBoost are well-suited for handling complex, non-linear relationships and interactions within the data. These models excel in capturing intricate patterns and are less sensitive to noisy data, making them highly effective for large datasets and real-world applications. Their ensemble nature (in RFC, GB, and XGBoost) further enhances their robustness, as combining multiple weak learners reduces overfitting and increases generalization. On the other hand, models like LR and NB are linear and probabilistic, respectively, offering simplicity and interpretability, which is valuable for understanding feature importance and making decisions in settings where transparency is critical. MLP and KNN bring flexibility in capturing both linear and non-linear relationships, with MLP being effective for more complex datasets due to its deep learning nature. Each model’s ability to handle different data structures and learn from them efficiently makes them widely adopted for classification tasks, particularly in large-scale applications where scalability and computational efficiency are paramount. Upon conducting a comprehensive evaluation of the performance of these models, it became apparent that the RFC and DTC consistently exhibited superior accuracy. This noteworthy performance underscores their efficacy in identifying and capturing the underlying patterns within the dataset. Acknowledging the complementary strengths inherent in both RFC and DTC, we opted to employ an ensemble approach that integrates the predictions of these models. Subsequently, we developed a Voting Classifier as our proposed model. This ensemble methodology not only harnesses the individual strengths of the RFC and DTC but also significantly enhances the overall robustness and accuracy of the classification task.
Random forest classifier (RFC)
The primary goal of RFC is to enhance the accuracy and robustness of decision trees by reducing overfitting, which is a common problem with single decision trees50. RFC generates numerous decision trees utilizing random subsets of the training dataset. Each individual tree is constructed from a distinct bootstrap sample; consequently, certain instances may be replicated while others may be excluded entirely. This methodology is referred to as bagging, or Bootstrap Aggregating. In order to enhance randomness and diversity within the tree ensembles, the RFC utilizes a random subset of features when determining the optimal split at each node, rather than evaluating all available features. This methodological approach aids in minimizing the correlation among the individual trees and consequently enhances the overall performance of the model. RFC generates an extensive array of decision trees (denoted as T1, T2, …, Tn) by employing bootstrapped samples derived from the training dataset51,52. Each individual tree undergoes training in isolation, utilizing a randomly chosen subset of the data alongside a randomly selected subset of features51. The final prediction of the RFC is made by majority voting across all the trees, as presented in Eq. 2:
$$\widehat{y}=\text{mode}\left\{T_{1}\left(x\right),T_{2}\left(x\right),\ldots,T_{n}\left(x\right)\right\} \quad (2)$$
In this equation, \(x\) is the input feature vector and \(T_{i}\left(x\right)\) is the class predicted by the i-th tree. Additionally, this algorithm provides an understanding of the significance of features, which is determined by the reduction in Gini impurity (or the enhancement of information gain) resulting from the splits that incorporate the feature across all decision trees.
Decision tree classifier (DTC)
The Decision Tree Classifier constitutes a fundamental algorithm within machine learning, applicable to classification and regression objectives53. This algorithm operates by partitioning the input space into distinct regions predicated on feature values, thereby facilitating decision-making through a hierarchical, tree-like framework. A significant benefit of decision trees lies in their interpretability. The configuration of a decision tree can be graphically represented, facilitating comprehension of the decision-making process based on various features53. This level of interpretability is especially pertinent in fields where transparency is paramount, such as healthcare and finance.
DTC uses measures like Gini impurity or entropy to decide how to split the data at each node. These measures assess how mixed or pure a dataset is at a given split, influencing the tree’s ability to classify data efficiently. Gini Impurity: It quantifies the likelihood of incorrectly classifying a randomly chosen element if it were labeled according to the class distribution in a subset. The Gini impurity is computed as:
$$Gini\left(D\right)=1-\sum_{i=1}^{C}p_{i}^{2} \quad (3)$$
here \(p_{i}\) is the proportion of class \(i\) in dataset \(D\) and \(C\) is the number of classes. A lower Gini impurity value indicates that a node contains more homogeneous class distributions, meaning less uncertainty in classification54.
Entropy measures the amount of uncertainty or disorder within a dataset. It can be represented as:
$$Entropy\left(D\right)=-\sum_{i=1}^{C}p_{i}\log_{2}p_{i} \quad (4)$$
A high entropy value signifies more disorder, meaning the data contains a mix of multiple classes, while an entropy of zero indicates a perfectly pure subset55. Decision trees aim to reduce entropy by maximizing information gain, which is the difference in entropy before and after a split.
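To make the two split criteria concrete, the short sketch below evaluates them for an illustrative class-probability vector; the node distribution is hypothetical and not taken from the fitted trees.

```python
# Hedged sketch: Gini impurity and entropy for a hypothetical node distribution.
import numpy as np

def gini(p):
    return 1.0 - np.sum(np.square(p))

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

probs = np.array([0.7, 0.3])        # e.g. 70% non-infected, 30% infected at a node
print(gini(probs), entropy(probs))  # 0.42 and roughly 0.881
```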
In the context of classification, the prediction is derived by navigating through the decision tree from the root node to a terminal node56, guided by the feature values of the input.
DTCs are extensively applied across diverse domains, such as medical diagnostics, financial modeling, and customer segmentation, attributed to their user-friendly nature and proficiency in managing both categorical and continuous data.
Logistic regression (LR)
Logistic Regression constitutes a statistical framework predominantly employed for the resolution of binary classification challenges57. It was initially posited in the early twentieth century and has since established itself as a cornerstone methodology within the realms of statistics and machine learning. In contrast to conventional linear regression, which forecasts continuous variables, logistic regression estimates the likelihood that a particular observation is associated with a designated categorical class. This is achieved by employing the logistic function, which transforms the predicted values into a probability scale ranging from 0 to 158. The logistic function, commonly referred to as the sigmoid function, provides a transformation for any real number, constraining its output to the interval [0, 1]. This function can be represented as:
$$P\left(Y=1|X\right)=\frac{1}{1+e^{-\left(\beta_{0}+\beta_{1}X_{1}+\cdots+\beta_{n}X_{n}\right)}} \quad (5)$$
where \(Y\) is the dependent variable, \(X\) is the feature vector, and \(\beta_{n}\) are the coefficients to be learned. Logistic regression is extensively employed across various disciplines59, including the medical field for the diagnostic classification of diseases, marketing for forecasting consumer response, and the social sciences for the evaluation of binary outcome variables.
AdaBoost (AB)
The fundamental concept underlying AdaBoost is the integration of numerous weak classifiers, typically in the form of decision stumps, to formulate a singular, robust classifier. A weak learner is characterized as a classifier that exhibits performance marginally superior to random chance60. The weight for each sample is updated based on the misclassification error. This can be represented as:
$$w_{i}^{(t+1)}=w_{i}^{(t)}\exp\left(\alpha_{t}\,\mathbb{1}\left[h_{t}\left(x_{i}\right)\ne y_{i}\right]\right),\qquad \alpha_{t}=\frac{1}{2}\ln\left(\frac{1-\epsilon_{t}}{\epsilon_{t}}\right) \quad (6)$$
where \(\epsilon_{t}\) is the weighted error of the weak learner \(h_{t}\) at iteration \(t\), and \(\alpha_{t}\) is the weight assigned to that learner. While boosting methods can be prone to overfitting, AdaBoost is relatively robust when used with simple classifiers, especially when the number of iterations is carefully controlled.
K-Nearest neighbors (KNN)
KNN is one of the simplest and most intuitive machine learning algorithms. It is a non-parametric method, meaning it does not make any assumptions about the underlying data distribution61. The K-Nearest Neighbors algorithm employs various metrics, including Euclidean distance, to ascertain the proximity between data points.
$$d\left(x_{i},x_{j}\right)=\sqrt{\sum_{k=1}^{m}\left(x_{ik}-x_{jk}\right)^{2}} \quad (7)$$
where \(x_{i}\) and \(x_{j}\) are feature vectors for the points being compared and \(m\) is the number of features. The choice of distance metric (e.g., Euclidean, Manhattan, Minkowski) can affect the performance of KNN62. Euclidean distance is commonly used, but other metrics can be more appropriate for certain types of data. The k-Nearest Neighbors algorithm exhibits a pronounced sensitivity to the scale of the dataset. Consequently, it is frequently imperative to either standardize or normalize the feature variables to ensure that each attribute has an equitable influence on the computation of distance metrics.
Proposed model explanation
The Voting Classifier constitutes an ensemble learning methodology that integrates various models to enhance the aggregate efficacy of classification initiatives63. This approach is especially proficient when disparate models discern distinct patterns within the dataset.
Voting Mechanism: In the context of a voting classifier, multiple foundational classifiers, such as the Random Forest Classifier and Decision Tree Classifier, are trained autonomously on an identical dataset. During the prediction phase, each classifier casts a vote for a particular class label, with the ultimate decision being ascertained through either the predominant class (hard voting) or the mean of the predicted probabilities (soft voting)64.
Hard Voting: The equation for hard voting is represented as Eq. 8:
$$\widehat{y}=\text{mode}\left\{C_{1}\left(x\right),C_{2}\left(x\right),\ldots,C_{m}\left(x\right)\right\} \quad (8)$$
where \(C_{j}\left(x\right)\) is the class label predicted by the j-th base classifier for instance \(x\).
Soft Voting: The equation for soft voting is represented as Eq. 9:
$$\widehat{y}=\underset{c}{\arg\max}\,\frac{1}{m}\sum_{j=1}^{m}p_{j}\left(c|x\right) \quad (9)$$
where \(p_{j}\left(c|x\right)\) is the predicted probability of class \(c\) for instance \(x\) given by the j-th classifier. The efficacy of the voting classifier is predominantly contingent upon the heterogeneity of the constituent classifiers65. The amalgamation of various models facilitates the ensemble in capitalizing on their distinct advantages while mitigating their inherent limitations. By averaging the predictions from multiple models, the voting classifier can reduce the risk of overfitting to the training data, making it more robust to variations in the input. Ensemble methods, specifically voting classifiers, are extensively employed across a multitude of fields such as image classification, text classification, and medical diagnostic processes, owing to their capacity to attain superior accuracy in comparison to solitary predictive models.
Due to the superior performance of the RFC and DTC models over the other ML models in the classification task, they were selected for the voting classifier model. The DTC is a simple yet powerful model that captures decision boundaries effectively and works well on datasets with non-linear relationships. However, it is susceptible to overfitting when the dataset is noisy or lacks sufficient training samples66. In contrast, the RFC is an ensemble model that reduces overfitting by averaging multiple decision trees, thus improving the overall model’s stability and generalization ability67. This characteristic makes RFC more robust and less prone to errors compared to individual decision trees. In the Voting Classifier, the complementary nature of RFC and DTC enhances the ensemble’s performance. The DTC’s ability to make quick, decisive predictions is balanced by RFC’s aggregated decision-making process, which mitigates the risk of overfitting. The voting classifier handles feature interactions differently, with DTCs modeling deep, specific interactions and RFCs capturing broader patterns across features. By combining the predictions, the ensemble approach reduces the model’s variance and improves its ability to generalize across different subsets of the dataset. This combined approach leverages both the sharp decision boundaries of DTCs and the smoother boundaries of RFCs, potentially improving performance on complex datasets and making it particularly effective for this task.
Model configurations and hyperparameters
In this study, we evaluated ten different machine learning models for the classification task, including RFC, DTC, GB, LightGBM, XGBoost, MLP, NB, LR, AB, and KNN. For the RFC, we set the number of estimators to 100 and used the Gini impurity criterion, with no maximum depth restriction to allow the model to grow until the stopping criteria are met. In the case of the DTC, we employed the default settings, as the model’s hyperparameters such as criterion and depth were sufficient for achieving good results in preliminary tests. For GB, we used a learning rate of 0.1, a maximum depth of 3, and 100 estimators to balance model performance and training time. Similarly, LightGBM was configured with a learning rate of 0.1, and the number of leaves was set to 31, optimizing for speed and accuracy in large datasets. The XGBoost model was configured with a learning rate of 0.1, 100 estimators, and a maximum depth of 3 to capture nonlinear relationships without overfitting. For the MLP, the hidden layer size was set to 100, with ReLU activation and the Adam optimizer to enable efficient training. NB was used with its default Gaussian distribution, as this approach is simple and effective for certain classification problems. Logistic Regression was configured with the ‘lbfgs’ solver and a maximum iteration of 100 to ensure convergence during optimization. AdaBoost was implemented using a base DTC with 50 estimators and a learning rate of 1.0. Lastly, the KNN model was set with 5 neighbors and used the Minkowski distance metric to evaluate proximity between data points.
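A minimal sketch of these configurations in scikit-learn-style code is given below; defaults are assumed wherever the text does not state a value, and LightGBM and XGBoost come from their respective packages rather than scikit-learn.

```python
# Hedged sketch of the ten baseline configurations described above.
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

models = {
    "RFC": RandomForestClassifier(n_estimators=100, criterion="gini", max_depth=None),
    "DTC": DecisionTreeClassifier(),
    "GB": GradientBoostingClassifier(learning_rate=0.1, max_depth=3, n_estimators=100),
    "LightGBM": LGBMClassifier(learning_rate=0.1, num_leaves=31),
    "XGBoost": XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=3),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), activation="relu", solver="adam"),
    "NB": GaussianNB(),
    "LR": LogisticRegression(solver="lbfgs", max_iter=100),
    "AB": AdaBoostClassifier(n_estimators=50, learning_rate=1.0),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="minkowski"),
}
```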
For our proposed model, we employed a Voting Classifier, combining the RFC and DTC models with hard voting. This method aggregates the predictions of both models by selecting the class that receives the majority vote, which helps improve robustness by reducing the risk of overfitting or bias toward any one model. The selection of RFC and DTC was based on their complementary learning mechanisms, enabling the ensemble to capture various aspects of the dataset. The hard voting mechanism enhances generalization, making the model more scalable and suitable for practical applications. This combination of diverse machine learning models ensures a comprehensive and balanced approach between model interpretability and scalability, which is critical for real-world deployment.
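The sketch below shows how such a hard-voting ensemble could be assembled, assuming the scaled features `X_scaled` and balanced labels `y_res` from the earlier preprocessing sketches; the 80/20 train/test split and random seeds are illustrative assumptions.

```python
# Hedged sketch of the proposed RFC + DTC hard-voting ensemble.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_res, test_size=0.2, stratify=y_res, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("rfc", RandomForestClassifier(n_estimators=100, criterion="gini",
                                       max_depth=None, random_state=42)),
        ("dtc", DecisionTreeClassifier(random_state=42)),
    ],
    voting="hard",  # majority vote over the two base classifiers
)
voting_clf.fit(X_train, y_train)
print(accuracy_score(y_test, voting_clf.predict(X_test)))
```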
Result analysis
For the proposed approach, ten ML models were evaluated using several performance metrics, with the RFC, DTC, LR, AB, and KNN emerging as the strongest. In the first experiment all 22 features were used; the models were then re-evaluated on the 12 selected features. The dataset contains 50,000 instances. The reported metrics (test accuracy, sensitivity, precision, specificity, recall, MCC, NPV, FPR, FDR, FNR, and F1 score) are derived from the confusion matrix, with the F1 score being the most informative for comparing the models. The voting classifier, built from the two best models, RFC (86.49%) and DTC (79.97%), achieves an excellent test accuracy of 89% and an AUC of 94%, and its validation accuracies remain close to these peak values for both accuracy and AUC.
Evaluation of class imbalance mitigation techniques
In this section, we evaluate various techniques employed to address class imbalance in our dataset, comparing their impact on classification accuracy. The methods assessed include SMOTE, Random Over-sampling, and Adaptive Synthetic Sampling (ADASYN). The performance of these techniques is compared against the baseline model without any class imbalance mitigation.
The results presented in Table 2 indicate that applying SMOTE outperforms other evaluated methods in mitigating class imbalance, achieving an accuracy of 89.00%. In contrast, the baseline model without any mitigation achieved an accuracy of 84.69%, highlighting the negative impact of class imbalance on model performance. Random Over-sampling, which involves duplicating minority class instances68, resulted in an accuracy of 86.49%. This method improves performance compared to the baseline but is less effective than SMOTE. Similarly, ADASYN, which generates synthetic samples based on the feature space of minority class instances69, achieved an accuracy of 85.55%. While it offers some improvement over the baseline, it does not surpass the performance of SMOTE. These findings suggest that SMOTE is the most effective technique among those evaluated for addressing class imbalance in our dataset. Its ability to generate synthetic samples that enhance the representation of the minority class contributes to improved model performance. In contrast, methods like Random Over-sampling and ADASYN, while beneficial, do not provide the same level of improvement, possibly due to their propensity to introduce noise or overfitting by duplicating or inadequately generating minority class samples. Therefore, based on our analysis, SMOTE is recommended as the preferred method for addressing class imbalance in similar classification tasks.
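For illustration, the three resampling strategies could be compared along the following lines, assuming the imbalanced-learn package; the synthetic imbalanced data is a stand-in for the HIV dataset, and in practice resampling is applied to the training split only, so that no synthetic samples leak into the test set.

```python
# Compare SMOTE, random over-sampling, and ADASYN on a synthetic imbalanced dataset.
from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=12, weights=[0.95, 0.05], random_state=42)
print("original:", Counter(y))

samplers = {"SMOTE": SMOTE(random_state=42),
            "RandomOverSampling": RandomOverSampler(random_state=42),
            "ADASYN": ADASYN(random_state=42)}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))  # minority class is brought up to (roughly) parity
```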
The performance analysis of ten ML models
Table 3 presents the test accuracies of ten ML methods for HIV classification, comparing performance with a dataset containing 22 features and a reduced dataset with the top 12 selected features.
The RFC achieved the best accuracy, improving slightly from 86.16% with 22 features to 86.49% with 12 features. The DTC experienced a minor decline, decreasing from 80.14 to 79.97% after feature reduction. Both LR and AB showed decreases in accuracy, with LR falling from 65.05 to 62.97% and AB decreasing from 66 to 64%. The KNN model maintained a stable accuracy of 66%, indicating that the reduction in the number of features did not impact its performance. The GB model showed a slight improvement from 60.57 to 60.63%, while LightGBM experienced a minimal decline from 60.50 to 60.41%. XGBoost remained relatively stable, increasing marginally from 59.19 to 59.24%. Notably, the MLP model benefited from feature reduction, improving from 62.98 to 64.60%. Similarly, the NB model saw a significant accuracy boost, rising from 61.36 to 64.58%. Overall, the impact of feature reduction on model performance was mixed across the different models.
The variation in performance after feature reduction can be attributed to differences in how each model processes information. The RFC, as an ensemble learning method, benefits from feature selection through reduced redundancy and noise, leading to a slight improvement in accuracy. The DTC, which builds hierarchical splits, is relatively robust to feature reduction but can experience minor accuracy drops when key discriminative features are removed. GB, LightGBM, and XGBoost remained stable or improved slightly, since they inherently prioritize important features and are therefore less sensitive to feature reduction. The MLP benefited from a smaller feature set, as removing redundant features reduced noise and improved learning efficiency. NB showed a significant boost because feature reduction eliminated redundant attributes, enhancing its probability estimates. Boosting models and neural network models can perform well with both large and small feature sets, while probabilistic models such as NB benefit from fewer, more relevant features. In contrast, models relying on feature-rich representations, such as LR and AB, experience performance declines when critical features are removed: both rely on comparatively simple decision boundaries and tend to perform better with a higher number of relevant features, as reducing features may remove information needed for effective classification. KNN, a distance-based algorithm, is less affected by feature reduction when the removed features are not highly informative, resulting in stable accuracy. Thus, models that prioritize important features, such as the RFC, boosting models, and the MLP, adapt well to feature reduction, while probabilistic models (NB) benefit from removing redundancy; linear models (LR, AB) and the DTC may decline if key features are lost, highlighting the need for careful feature selection.
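An illustrative sketch of the full-versus-reduced feature comparison is given below, using synthetic data as a stand-in; the column indices are placeholders for the actual 22 original and 12 selected clinical variables.

```python
# Compare accuracy with the full feature set versus a reduced subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=22, n_informative=12, random_state=0)
selected = np.arange(12)  # placeholder for the indices of the 12 retained features

def evaluate(feature_idx):
    X_tr, X_te, y_tr, y_te = train_test_split(X[:, feature_idx], y, test_size=0.2,
                                               stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

print("22 features:", evaluate(np.arange(22)))
print("12 features:", evaluate(selected))
```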
Feature importance analysis of HIV classification using tree-based models and linear models
Figure 7 presents a comparative analysis of feature importance for the HIV classification task, as determined by various machine learning algorithms. It showcases the relative importance assigned to different clinical features by the DTC, RFC, GB, XGBoost, LightGBM, and LR models.
A consistent trend across tree-based methods reveals features such as ‘preanti’, ‘cd420’, ‘cd40’, ‘time’, and ‘wtkg’ as notably important, suggesting their significant predictive power in distinguishing HIV status. However, the emphasis and ranking of these features vary somewhat between models; for instance, ‘preanti’ is particularly prominent in Gradient Boosting, while ‘cd40’ and ‘time’ are highly emphasized by LightGBM. Interestingly, LR also highlights ‘cd420’, ‘preanti’, and ‘cd40’ as important, aligning with the tree-based model findings and reinforcing the relevance of these clinical markers in HIV classification. The summary table (See Table 4) consolidates this information, presenting the average feature importance scores numerically.
Table 4 summarizes the average feature importance scores from tree-based models alongside the feature weights from Logistic Regression (LR), providing a clear picture of which factors contribute most to HIV classification. Among the tree-based models, cd40 stands out with the highest importance score (88.2819), followed closely by time (85.8808) and cd420 (81.5046). These features likely play a crucial role in distinguishing HIV status. Other notable features, including cd80 (72.6612), wtkg (69.8646), and cd820 (62.0620), suggest that immune markers and weight significantly impact classification. Preanti (55.0111) and age (49.4478) also rank relatively high, emphasizing the role of prior antiretroviral treatment and age in predicting outcomes. In contrast, trt (20.0318), karnof (6.4203), homo (5.0173), and race (4.6167) show much lower importance, indicating they contribute less to the model’s decision-making. Interestingly, LR highlights similar key features, with cd420 (0.3205), preanti (0.2463), and cd40 (0.2242) receiving the highest weights, aligning well with the tree-based findings. Time (0.1726) and wtkg (0.0776) remain important across both methods, reinforcing their relevance in HIV classification. Unlike tree-based models, LR assigns slightly more weight to trt (0.0903) and race (0.0392), suggesting these factors may still have some predictive value in a linear setting. Features such as cd820 (0.0231), age (0.0155), and cd80 (0.0103) hold minor influence in both approaches. Overall, this comparison highlights the consistency of key predictors, particularly cd40, cd420, preanti, and time, across different modeling techniques, further confirming their significance in HIV classification.
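A sketch of how such scores could be extracted is shown below, assuming scikit-learn; the synthetic data and generic feature names are placeholders for the clinical variables, and the absolute magnitudes will differ from the rescaled values reported in Table 4.

```python
# Extract tree-based importances and Logistic Regression coefficient magnitudes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=3000, n_features=12, n_informative=6, random_state=0)
feature_names = [f"feat_{i}" for i in range(12)]  # placeholders for 'cd40', 'cd420', 'time', ...

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
tree_importance = dict(zip(feature_names, rfc.feature_importances_))

lr = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
lr_weight = dict(zip(feature_names, np.abs(lr.coef_[0])))  # magnitude of standardized coefficients

for name in sorted(tree_importance, key=tree_importance.get, reverse=True):
    print(f"{name}: tree={tree_importance[name]:.3f}  lr={lr_weight[name]:.3f}")
```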
Confusion matrix analysis of top five ML models with different feature sets
The confusion matrices for the top five machine learning models, including RFC, DTC, LR, AB, and KNN, are presented. These models were evaluated based on their classification accuracy when trained and tested on both the full set of 22 features and the reduced set of 12 features. They exhibited the highest performance in classifying HIV, demonstrating their ability to handle both the complete and reduced feature sets effectively. The matrices visually represent each model’s classification behaviour, highlighting true positives, true negatives, false positives, and false negatives.
Figure 8 shows the confusion matrices for RFC, indicating robust performance across both feature sets. Figure 9 illustrates the confusion matrices for DTC, which reveals a slight decline in classification accuracy with the reduced feature set. Figure 10 presents the confusion matrices for LR, highlighting a noticeable decrease in true positives when using 12 features. Figure 11 depicts the confusion matrices for AB, reflecting changes in misclassification patterns between the two feature sets. Finally, Fig. 12 shows the confusion matrices for KNN, which maintains consistent performance regardless of the number of features used.
Proposed classifier’s performance analysis
The performance of the five machine learning models was evaluated based on classification accuracy, which measures the proportion of correctly classified instances. Both the RFC and DTC achieved the highest accuracy scores with the full set of 22 features and with the reduced set of 12 features. While the RFC outperformed the DTC, the latter still showed promising results; the other models demonstrated markedly lower performance. Owing to their strong and consistent ability to distinguish between classes, the RFC and DTC were selected for the voting ensemble and identified as the most effective models for improving overall classification performance. Accordingly, a voting classifier built from the RFC and DTC achieves the highest accuracy for the classification task. The confusion matrices and ROC curves for the proposed classifier are given in Figs. 13 and 14.
Figures 13 and 14 present the confusion matrix and ROC curve of the proposed model for the 22-feature and 12-feature sets, respectively. In the confusion matrix, class 0 represents non-infected patients and class 1 represents infected patients. The Receiver Operating Characteristic (ROC) curve plots the true positive rate (y-axis) against the false positive rate (x-axis) at various threshold settings for a binary classifier, capturing the trade-off between sensitivity (true positive rate) and specificity (1 − false positive rate). Here the curve lies close to the top-left corner, indicating that the classifier achieves an excellent balance of high sensitivity and specificity. The area under the ROC curve (AUC-ROC) is 0.94 in both cases. The 95% confidence interval (0.9357–0.9475) indicates that, were the experiment repeated many times, the AUC would fall within this range 95% of the time, confirming the statistical reliability of the results and suggesting that the model consistently distinguishes between classes with high accuracy. The narrow confidence interval further highlights the model’s stability and minimal performance variability. Notably, the model achieves similar performance with all features and with the selected features, which indicates the effectiveness of the feature selection: dropping the 10 discarded features caused essentially no gain or loss in accuracy, showing that the model’s performance depends very little on them.
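The confidence interval can be obtained, for example, by bootstrap resampling of the test set. The sketch below assumes that continuous class scores are available (e.g., averaged class probabilities, since a purely hard-voting ensemble outputs only labels); the labels and scores generated here are placeholders.

```python
# Bootstrap 95% confidence interval for the AUC, using placeholder labels and scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=2000)                                    # placeholder labels
scores = np.clip(0.6 * y_test + rng.normal(0.2, 0.25, size=2000), 0, 1)   # placeholder scores

aucs = []
for _ in range(1000):                      # bootstrap resamples of the test set
    idx = rng.integers(0, len(y_test), size=len(y_test))
    if len(np.unique(y_test[idx])) < 2:    # skip degenerate resamples with one class
        continue
    aucs.append(roc_auc_score(y_test[idx], scores[idx]))

lower, upper = np.percentile(aucs, [2.5, 97.5])
print(f"AUC = {roc_auc_score(y_test, scores):.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```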
The model’s performance with selected features is further analyzed through statistical analysis. Table 5 shows the statistical evaluations.
The model’s comprehensive performance across various metrics, including Accuracy, Sensitivity, Precision, Specificity, MCC, Recall, NPV, FPR, FDR, and FNR, is detailed in Table 5. These results, derived from the confusion matrix values, provide a thorough evaluation. According to Table 5, the model demonstrates strong predictive capabilities, achieving an overall accuracy of 89%. Sensitivity and Precision are recorded at 87.63% and 90.84%, respectively. The model’s specificity stands at 90.42%, reflecting its ability to accurately identify true negatives. Additionally, it exhibits an FPR of 9.58%, an FDR of 9.16%, an NPV of 87.08%, and an FNR of 12.37%. The low FPR of 9.58% ensures minimal misclassification of negative cases, reducing unnecessary medical interventions, while the FNR of 12.37% remains within an acceptable range, demonstrating strong sensitivity in detecting positive cases. The high MCC of 77.99% confirms the model’s balanced and reliable performance, particularly in handling class imbalance. These metrics collectively highlight the model’s robustness, minimizing errors while ensuring accurate and trustworthy predictions. Finally, the F1 Score, which represents the harmonic mean of Precision and Recall, is reported as 98.21%. These high scores indicate the model’s robustness and balanced performance across the evaluation criteria, effectively capturing the trade-offs between Sensitivity, Precision, and error rates.
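The metrics in Table 5 follow directly from the four confusion-matrix cells. The sketch below shows the standard definitions; the TP/TN/FP/FN counts passed in are illustrative placeholders, not the study’s actual values.

```python
# Derive the standard binary-classification metrics from confusion-matrix counts.
import math

def metrics_from_confusion(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)                    # recall / true positive rate
    specificity = tn / (tn + fp)                    # true negative rate
    precision = tp / (tp + fp)                      # positive predictive value
    npv = tn / (tn + fn)                            # negative predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "sensitivity": sensitivity, "precision": precision,
            "specificity": specificity, "NPV": npv, "FPR": 1 - specificity,
            "FDR": 1 - precision, "FNR": 1 - sensitivity, "MCC": mcc, "F1": f1}

print(metrics_from_confusion(tp=870, tn=910, fp=90, fn=130))  # illustrative counts only
```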
Moreover, in our classification study, the Voting Classifier achieved an F1-score of 98.21%, demonstrating a strong balance between precision and recall. The AUC-ROC of 0.94 indicates excellent discriminative ability in distinguishing HIV-positive and HIV-negative cases. Additionally, the Gini Index of 0.88 reinforces the robustness of the ensemble model’s predictive performance. These results confirm the effectiveness and reliability of our Voting Classifier for the HIV classification task.
Model’s performance on five different datasets
The proposed model was evaluated on five different datasets with varying sizes, ranging from 2139 to 72,139 instances. Importantly, no changes were made to the model’s architecture or hyperparameters across these evaluations. The model consistently demonstrated high performance, with only minor fluctuations in accuracy across dataset sizes. The data used in this study comprised four CSV files sharing the same features but differing in size and values, all obtained from the same healthcare institution and representing different patient cohorts. For model training, we used the largest dataset, consisting of 50,000 instances. To evaluate model performance across varying data volumes, we selected three additional datasets containing 2,139, 5,000, and 15,000 instances. Although sourced from the same institution, these datasets encompass diverse patient demographics and clinical contexts, providing a comprehensive assessment of model robustness. We also generated an additional dataset of 72,139 instances by merging all of the given datasets. The proposed model’s performance on each dataset is shown in Table 6, which presents the accuracy and ROC-AUC values for three scenarios: using all 22 features, using the selected 12 features, and using the features derived from laboratory test data.
For the dataset with 2139 instances, the model achieves an accuracy of 95% with all features and 98% with the selected 12 features, dropping slightly to 96% with the lab test-derived features. In the 5000-instance dataset, the model’s accuracy is 87% with all features and remains the same with the selected 12 features, but it drops to 85% with lab test features. The 15,000-instance dataset also shows a consistent accuracy of 88% with all features and with the selected 12 features, decreasing to 86% with lab test data. For the dataset with 50,000 instances, the model achieves an accuracy of 89% across all feature sets, whereas in the 72,139-instance merged dataset the accuracy remains at 89% for both the full feature set and the selected 12 features but drops to 87% with lab test features. Across all five datasets, the model’s performance with the full set of 22 features or the reduced set of 12 features does not substantially exceed its performance with only the laboratory test data: although the lab test data consist of just four features, the model shows only about a 2% decrease in accuracy compared to the larger feature sets. The model’s performance using only CD4/CD8-related features therefore remains competitive with the full-feature models, as shown in Table 6. The high accuracy (e.g., 96% for 2139 samples) and AUC values (99%) indicate that CD4 and CD8 counts alone provide strong discriminatory power, supporting the clinical significance of these features, which directly relate to immune function and HIV progression. For the 5k dataset, the model with all features and the one using only CD4/CD8 features both achieved 87% accuracy and 94% AUC, suggesting that the additional features provided no significant benefit. With the 15k dataset, performance remained consistent, with 88% accuracy and 93% AUC for both models, further demonstrating that CD4/CD8 features contribute the most critical information for classification. For the 50k dataset, the model using all features achieved 89% accuracy and 94% AUC, while the CD4/CD8-only model performed very similarly (88% accuracy and 93% AUC), again reinforcing that these immune features are sufficient for strong classification. For the largest dataset (72,139 samples, merged data), a small drop was observed when using only CD4/CD8 features (87% accuracy, 92% AUC) compared to the full-feature model (89% accuracy, 94% AUC), suggesting that while CD4/CD8 counts are highly informative, additional features may provide minor refinements for large-scale data. Collectively, these results highlight that CD4/CD8 cell counts and their ratio are among the most significant features for classification; the minimal performance drop across datasets when excluding the other features underscores their predictive value and demonstrates that the immune-system dysregulation captured by these markers is a key determinant of classification accuracy.
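A sketch of this cross-size evaluation loop is shown below; the synthetic datasets are stand-ins for the five CSV files, only their instance counts mirror the study, and the same ensemble configuration is refitted for each size without touching its hyperparameters.

```python
# Evaluate the same ensemble configuration across datasets of increasing size.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def make_ensemble():
    return VotingClassifier([("rfc", RandomForestClassifier(n_estimators=100, random_state=0)),
                             ("dtc", DecisionTreeClassifier(random_state=0))], voting="hard")

for n in (2139, 5000, 15000, 50000, 72139):   # instance counts from Table 6
    X, y = make_classification(n_samples=n, n_features=12, n_informative=8, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    clf = make_ensemble().fit(X_tr, y_tr)     # identical configuration for every size
    print(f"n={n:>6}  accuracy={accuracy_score(y_te, clf.predict(X_te)):.2%}")
```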
The model demonstrated consistent performance across external datasets of varying sizes, as shown in Table 6. No significant challenges were encountered during validation, with the model maintaining high accuracy and AUC scores across all datasets. Its robustness to different dataset sizes suggests effective generalization without requiring adjustments to the architecture or hyperparameters. This stability indicates that the model is well-regularized, ensuring reliable performance without overfitting. To provide a visual representation of the Voting Classifier’s performance, Fig. 15 illustrates the model’s accuracy across varying dataset sizes under different feature configurations.
Scalability of the voting classifier
Figure 16 illustrates the scalability and deployment characteristics of our voting classifier by showing training and inference times against varying dataset sizes (2k, 5k, 15k, 50k, and 72k data points). All experiments were performed on a system equipped with an Intel Core i5-8400 processor, 16GB RAM, and an NVIDIA GeForce GTX 1660 GPU, offering adequate computational capacity for the experiments. The system also used a 256GB SSD, ensuring efficient data storage and retrieval.
As depicted, the training time exhibits a sublinear increase with dataset size, rising from approximately 0.3 s at 2k to 1.6 s at 15k, and further to 6.8 s at 50k, culminating at 8.6 s for 72k. This trend underscores the model’s ability to efficiently handle larger datasets, indicating strong scalability. In contrast, the inference time remains consistently low, fluctuating between 0.02 and 0.4 s across all dataset sizes, demonstrating the model’s rapid prediction capabilities and suitability for real-time applications. This minimal inference time, even with substantial dataset growth, highlights the model’s deployability potential. However, a potential negative impact lies in the increasing training time for very large datasets, which could pose a challenge in scenarios requiring frequent model updates or retraining. Nonetheless, the overall trend of manageable training times and consistently low inference times suggests that our voting classifier model is both scalable and deployable, making it a viable solution for various classification tasks.
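The timing curves could be measured along the following lines; dataset sizes mirror the experiment, while the data itself is synthetic and the absolute timings will depend on the hardware described above.

```python
# Measure training and inference time of the voting ensemble for growing dataset sizes.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

for n in (2000, 5000, 15000, 50000, 72000):
    X, y = make_classification(n_samples=n, n_features=12, random_state=0)
    clf = VotingClassifier([("rfc", RandomForestClassifier(n_estimators=100)),
                            ("dtc", DecisionTreeClassifier())], voting="hard")
    t0 = time.perf_counter(); clf.fit(X, y); train_time = time.perf_counter() - t0
    t0 = time.perf_counter(); clf.predict(X); infer_time = time.perf_counter() - t0
    print(f"n={n:>6}  train={train_time:.2f}s  inference={infer_time:.3f}s")
```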
A distribution of prediction errors is also analyzed to assess the model’s reliability. The results are illustrated in Fig. 17.
Figure 17 depicts the distribution of prediction errors for our voting classifier across different dataset sizes (2k, 5k, 15k, 50k, and 72k). The x-axis represents the error, ranging from −1.0 to 1.0, while the y-axis indicates the frequency of these errors. The error distributions for all dataset sizes are centred around zero, indicating a general tendency for the model to make accurate predictions; however, the spread of the distributions varies considerably. For smaller datasets (2k and 5k), the distributions are broader, with a higher frequency of errors at the extremes (−1.0 and 1.0), suggesting that the model’s predictions are less consistent with limited data. As the dataset size increases (15k, 50k, and 72k), the distributions become narrower and more concentrated around zero, reflecting improved prediction accuracy and consistency. Notably, the 72k dataset exhibits the narrowest distribution, with the highest peak near zero and the lowest frequency of extreme errors, signifying enhanced performance with larger datasets. The reduced error spread indicates that the model benefits from increased data volume and diversity, leading to more stable and reliable predictions, and underscores its effectiveness in handling varying data volumes.
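For a binary label the prediction error y_true − y_pred can only take the values −1, 0, or +1, so the histograms in Fig. 17 effectively count false positives, correct predictions, and false negatives. A small sketch with placeholder predictions:

```python
# Tabulate the prediction-error distribution for a binary classifier (placeholder data).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5000)                          # placeholder labels
y_pred = np.where(rng.random(5000) < 0.9, y_true, 1 - y_true)   # ~10% simulated errors

errors = y_true - y_pred            # -1: false positive, 0: correct, +1: false negative
values, counts = np.unique(errors, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```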
Proposed model’s performance comparison with prior studies
In this section, we compared the performance of our proposed model against existing studies, highlighting its effectiveness in predictive accuracy and robustness. By incorporating a voting classifier that synergizes the strengths of both the RFC and DTC, our approach addresses class imbalance and enhances prediction reliability. This comparative analysis shown in Table 7 sets the stage for demonstrating how our methodology advances the state-of-the-art in HIV-related classification tasks.
Table 7 summarizes key performance metrics, including Accuracy, AUC, and F1 score, from various studies. Notably, while previous research reports accuracies ranging from 82.8 to 87.1% and AUC values mostly in the 0.73–0.94 range, our proposed model achieves a superior accuracy of 89% with an equivalent AUC of 0.94 and an outstanding F1 score of 98.21%. This performance improvement underscores the robustness of our ensemble strategy over traditional models, providing compelling evidence of its enhanced predictive capabilities and overall effectiveness compared to earlier studies.
Discussion and limitations
This study highlights the efficacy of combining data preprocessing, feature selection, and ensemble learning techniques for classifying AIDS. The incorporation of SMOTE to address class imbalance and of advanced feature selection methods contributed notably to improving model performance. In real-world implementations, the proposed Voting Classifier, given its strong performance and stability, can be seamlessly integrated into existing diagnostic workflows in clinical settings. This can be achieved by embedding the model into hospital information systems (HIS) or electronic health record (EHR) platforms, allowing automated pre-screening for HIV using routine lab test results. Additionally, mobile health (mHealth) applications can leverage the model to provide rapid, on-the-go diagnostics, particularly in resource-limited settings where access to specialists is scarce. These applications can facilitate early detection and prompt intervention, significantly improving patient outcomes. From an economic standpoint, the scalability of the model ensures cost-effective implementation, as machine learning-based solutions require minimal infrastructure once deployed. The reduced reliance on extensive, costly laboratory tests, while maintaining diagnostic accuracy, can result in substantial cost savings for healthcare providers, particularly in low-income regions. Furthermore, our study demonstrates that the classifier maintains robust performance across varying dataset sizes, indicating strong generalizability. This adaptability suggests that the model can be fine-tuned for diagnosing other infectious diseases, such as tuberculosis, hepatitis, or COVID-19, by training on the corresponding datasets, thereby enhancing its broader utility. Moreover, the ability to maintain high accuracy with only four key clinical lab features demonstrates its feasibility for use in diverse healthcare settings, including regions with limited diagnostic capabilities. By addressing these aspects, our research underscores the transformative potential of machine learning in enhancing diagnostic precision, reducing healthcare costs, and broadening access to reliable medical classification models.
While this study demonstrates the effectiveness of the Voting Classifier, it is important to acknowledge certain limitations that may affect its broader applicability. As shown in Table 6, our results demonstrate the strong predictive power of CD4 and CD8 cell counts in the classification model; however, relying predominantly on these features carries several implications, limitations, and potential biases. The reliance on CD4 and CD8 cell counts as primary features highlights their predictive potential in assessing immune function, particularly in resource-constrained settings where streamlined, cost-effective diagnostics are critical73. This approach could facilitate the development of rapid point-of-care testing systems, improving accessibility in underserved regions. However, CD4/CD8 counts alone do not fully represent the complex interplay of immune responses, as various immune cells, cytokines, and signaling pathways contribute to disease progression74. The predictive utility of these markers may vary across disease stages, with early or subclinical conditions potentially remaining undetected. Additionally, comorbidities such as autoimmune disorders or immunosuppressive treatments could independently alter CD4/CD8 counts, confounding the model’s performance. Demographic variations in CD4/CD8 reference ranges across age, sex, and ethnicity may introduce biases if not adequately represented in the training data. Furthermore, the model does not account for temporal fluctuations in CD4/CD8 counts or inconsistencies in measurement methods across clinical laboratories, which could impact real-world performance. Addressing these limitations requires validating the model across diverse populations, incorporating longitudinal data, and integrating additional biomarkers to improve robustness and generalizability.
Beyond the CD features, integrating additional clinical variables may further improve the diagnostic accuracy of HIV classification. Viral load plays a critical role in HIV progression, as higher levels correlate with lower CD4 counts and increased immune activation75. Inflammatory markers such as IL-6, TNF-α, and CRP provide insights into immune system activation, which can refine classification models. Expanding the dataset to include CD38, HLA-DR, and β2-microglobulin could enhance predictive performance by capturing immune system turnover and activation. Hematological parameters such as absolute lymphocyte count, neutrophil-to-lymphocyte ratio, and platelet-to-lymphocyte ratio could indicate systemic inflammation74. These features, combined with CD4 and CD8 counts, can provide a more comprehensive view of disease progression. Another key limitation is the reliance on a single data source for model evaluation: although the results are promising, the generalizability of the Voting Classifier to different datasets or demographic groups remains uncertain. It is also important to acknowledge that the dataset used in our study dates from 1996; although old, it remains a valuable reference for understanding foundational patterns and historical trends in HIV diagnosis. Moving forward, our focus will be on utilizing more recent, private datasets collected from hospitals, which will allow us to develop a more reliable detection system and ensure that the model is applicable to current HIV diagnostic practices. Future research should also validate the model on diverse and larger datasets, particularly those representing various geographical regions and patient profiles, to assess its robustness and versatility. Another limitation pertains to the model’s reliance on traditional machine learning algorithms, which may not fully capture complex patterns in medical data; incorporating advanced deep learning techniques, such as CNNs, could potentially improve performance further, especially in detecting subtle patterns. Furthermore, the interpretability of the model could be enhanced by integrating explainable AI approaches, which would facilitate clinical adoption by ensuring transparency in decision-making. Lastly, while feature selection reduced computational costs, the generalizability of the model may still be affected by smaller or imbalanced datasets, which could hinder its performance in real-world applications.
We aim to address these challenges and further enhance the robustness and applicability of our model in future research. One promising direction is the development of hybrid machine learning models that combine deep learning techniques with traditional classifiers; integrating Convolutional Neural Networks (CNNs) with tree-based classifiers can enhance feature extraction and decision-making, particularly in complex medical datasets. Furthermore, incorporating additional ensemble techniques, including boosting methods such as XGBoost and LightGBM alongside the existing Voting Classifier, has the potential to refine model performance and scalability. These hybrid models are expected to improve predictive accuracy and robustness for medical classification tasks. Optimizing computational efficiency through model quantization and knowledge distillation will make the model suitable for real-time deployment in mobile health applications and clinical settings. Additionally, cross-domain validation with datasets from other medical conditions will assess the model’s adaptability and generalizability across diverse populations and healthcare systems. By pursuing these directions, we aim to further improve the accuracy, scalability, and real-world applicability of machine learning models, ultimately advancing diagnostic efficiency and improving patient outcomes in healthcare settings.
Conclusions
This research introduces a robust machine learning methodology for classifying AIDS, utilizing sophisticated techniques in data preprocessing, feature selection, and ensemble modeling. The application of SMOTE to mitigate class imbalance, coupled with the use of Recursive Feature Elimination and Median Absolute Deviation for feature selection, significantly enhanced the model’s predictive accuracy. The RFC and DTC were assessed, with the ensemble Voting Classifier emerging as the most effective model. The main contributions of this study are the application of data balancing and feature selection techniques to enhance the performance of machine learning models, especially in healthcare settings with imbalanced datasets. Additionally, this work highlights the potential of ensemble learning methods, demonstrating the ability of the Voting Classifier to achieve superior classification results, making it a promising tool for medical diagnosis, particularly for conditions like AIDS.
From a practical perspective, the proposed ensemble Voting Classifier offers a scalable solution for automating AIDS diagnosis, particularly in resource-constrained healthcare settings. The model’s robustness to varying dataset sizes makes it suitable for deployment in both small clinics and large hospitals. Furthermore, the integration of feature selection techniques reduces computational costs, making the model highly efficient for real-time applications. These practical benefits suggest that the framework could be a valuable tool in enhancing diagnostic capabilities in healthcare, improving the speed and accuracy of AIDS detection.
However, small or medium-sized datasets may limit the model’s generalization performance. Future studies could overcome this by collecting larger datasets with more samples of infected patients and selecting the most discriminative features to improve the model’s ability to detect AIDS more effectively. Moreover, the model’s performance was evaluated on a single data source, and external validation using diverse datasets from different populations is necessary to ensure its generalizability across various demographic groups. In addition, future research could explore advanced deep learning architectures, such as convolutional neural networks or hybrid models, to further improve predictive performance. Incorporating explainable AI techniques could enhance the model’s interpretability, making it more transparent and reliable for clinical decision-making.
The findings of this study underscore the potential benefits of integrating data balancing methodologies with feature selection and ensemble learning to augment the accuracy and dependability of classification models in medical research, particularly for conditions characterized by imbalanced datasets such as AIDS. Moreover, the proposed framework provides a strong foundation for future research focused on enhancing diagnostic systems and supporting clinical decision-making.
Data availability
The datasets generated and/or analysed during the current study are available in the AIDS Virus Infection Prediction repository at https://www.kaggle.com/datasets/aadarshvelu/aids-virus-infection-prediction/data.
References
Fieggen, J., Smith, E., Arora, L. & Segal, B. The role of machine learning in HIV risk prediction. Front. Reproductive Health 4, (2022).
Tu, W. et al. Machine learning models reveal neurocognitive impairment type and prevalence are associated with distinct variables in HIV/AIDS. J. Neurovirol. 26, 41–51 (2020).
Seboka, B. T., Yehualashet, D. E. & Tesfa, G. A. Artificial intelligence and machine learning based prediction of viral load and CD4 status of people living with HIV (PLWH) on Anti-Retroviral treatment in Gedeo zone public hospitals. Int. J. Gen. Med. 16, 435–451 (2023).
Maman, A., Pacal, I. & Bati, F. Can deep learning effectively diagnose cardiac amyloidosis with 99mTc-PYP scintigraphy? J. Radioanal Nucl. Chem. 334, 1033–1048 (2025).
Bayram, B., Kunduracioglu, I., Ince, S. & Pacal, I. A systematic review of deep learning in MRI-based cerebral vascular occlusion-based brain diseases. Neuroscience 568, 76–94 (2025). https://doi.org/10.1016/j.neuroscience.2025.01.020
Yıkmış, S. et al. Optimization of bioactive compounds and sensory quality in thermosonicated black Carrot juice: A study using response surface methodology, gradient boosting, and fuzzy logic. Food Chem. X 25, (2025).
Ozdemir, B. & Pacal, I. An innovative deep learning framework for skin cancer detection employing ConvNeXtV2 and focal self-attention mechanisms. Results Eng. 25, (2025).
Saha, R., Malviya, L., Jadhav, A. & Dangi, R. Early stage HIV diagnosis using optimized ensemble learning technique. Biomed. Signal. Process. Control 89, (2024).
Li, Y. et al. The predictive accuracy of machine learning for the risk of death in HIV patients: a systematic review and meta-analysis. BMC Infect. Dis. 24 (2024). https://doi.org/10.1186/s12879-024-09368-z
Delaney, K. P. et al. Evaluation of the performance characteristics of 6 rapid HIV antibody tests. Clin. Infect. Dis. 52, 257–263 (2011).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority Over-Sampling technique. J. Artif. Intell. Res. 16 (2002).
Singh, Y., Narsai, N. & Mars, M. Applying machine learning to predict patient-specific current CD 4 cell count in order to determine the progression of human immunodeficiency virus (HIV) infection. Afr. J. Biotechnol. 12, 3724–3730 (2013).
Alehegn, M. Application of machine learning and deep learning for the prediction of HIV/AIDS. HIV AIDS Rev. 21, 17–23 (2022).
Rodriguez, S. D. et al. Machine learning outperformed logistic regression classification even with limit sample size: A model to predict pediatric HIV mortality and clinical progression to AIDS. PLoS One 17, (2022).
Júnior Silva, G. B. et al. Acute kidney injury in AIDS: frequency, RIFLE classification and outcome. Braz. J. Med. Biol. Res. 43, 1102–1108 (2010).
Olatosi, B. et al. Application of machine-learning techniques in classification of HIV medical care status for people living with HIV in South Carolina. AIDS 35, S19–S28 (2021).
Pang, H. et al. Effective attention-based network for syndrome differentiation of AIDS. BMC Med. Inf. Decis. Mak. 20, (2020).
Lim, Y. A. L. et al. First genetic classification of Cryptosporidium and giardia from HIV/AIDS patients in Malaysia. Infect. Genet. Evol. 11, 968–974 (2011).
Chen, Y. S. A comprehensive identification-evidence based alternative for HIV/AIDS treatment with HAART in the healthcare industries. Comput. Methods Programs Biomed. 131, 111–126 (2016).
Rauber, P. E., Falcaõ, A. X. & Telea, A. C. Projections as visual aids for classification system design. Inf. Vis. 17, 282–305 (2018).
Steiner, M. C., Gibson, K. M. & Crandall, K. A. Drug resistance prediction using deep learning techniques on HIV-1 sequence data. Viruses 12, (2020).
Mageshwari, V. & Aroquiaraj, I. L. A novel technique for identification and classification of HIV/AIDS related social media data using LD-KMEANS and DBN-LSTM. Multimed Tools Appl. https://doi.org/10.1007/s11042-024-19283-9 (2024).
Du, K. et al. Deep learning system for screening AIDS-related cytomegalovirus retinitis with ultra-wide-field fundus images. Heliyon 10, (2024).
Zhao, Y. et al. A novel classification method for syndrome differentiation of patients with AIDS. Evid. Based Complement. Alternat. Med. 2015 (2015).
El-kenawy, E. S. M. et al. Greylag Goose optimization: Nature-inspired optimization algorithm. Expert Syst. Appl. 238, (2024).
Abdel-Fattah, Ismail, M. G. A., El-Kenawy, E. S. M. & Moustafa, H. E.-D. An AI-based system for predicting renewable energy power output using advanced optimization algorithms. J. Artif. Intell. Metaheuristics 8, 01–08 (2024).
El-Sayed, E., Eid, M. M. & Abualigah, L. Machine learning in public health forecasting and monitoring the Zika virus. Metaheuristic Optim. Rev. 1, 01–11 (2024).
Mahmoud Ibrahim, A. & Abed Mohammed, M. A comprehensive review on advancements in artificial intelligence approaches and future perspectives for early diagnosis of Parkinson’s disease. Int. J. Math. Stat. Comput. Sci. 2, 173–182 (2024).
Ali, A. M. & Mohammed, M. A. Optimized Cancer subtype classification and clustering using Cat swarm optimization and support vector machine approach for Multi-Omics data. J. Soft Comput. Data Min. 5, 223–244 (2024).
Ali, A. M. & Mohammed, M. A. A comprehensive review of artificial intelligence approaches in omics data processing: evaluating progress and challenges. Int. J. Math. Stat. Comput. Sci. 2, 114–167 (2023).
Hammer, S. & Katzenstein, D. Dataset. https://www.kaggle.com/datasets/aadarshvelu/aids-virus-infection-prediction/data
Govindaraj, S. et al. CD4 + T cells in HIV: A friend or a foe?? Front. Immunol. 14, (2023).
Maciej et al. Recommendations on public health strategies for sexually transmitted infections in China. Uniwersytet Śląski. 7, 343–354 (2024).
Parekh, B., Westerman, L., Vojnov, L. & Yang, C. Human immunodeficiency virus (HIV) infections. In Laboratory and Point-of-Care Diagnostic Testing for Sexually Transmitted Infections (2023).
Gebremedhin, T., Aynalem, M., Adem, M., Geremew, D., Aleka, Y. & Kiflie, A. Dolutegravir-based therapy showed CD4+ T cell count recovery and viral load suppression among ART-naïve people living with HIV/AIDS: a pilot evaluation. Sci. Rep. (2024).
Sykulev, Y. Factors contributing to the potency of CD8+ T cells. Trends Immunol. (2023).
Meintjes, G., Bekker, L. G. & Wood, R. Clinical features and management of HIV/AIDS in adults. Manson’s Trop. Dis. Fourth Ed. 110–133. https://doi.org/10.1016/B978-0-7020-7959-7.00015-4 (2023).
Xiaoyan, X. & Sato, M. K. Quality of life and related factors among people living with HIV in China. J. Nurs. Healthc. Chronic Illn. 3, 513–520 (2011).
Dos Santos Guedes, M. C., Carvalho-Silva, W. H. V., Andrade-Santos, J. L. & Brelaz-de-Castro, M. C. A. Thymic exhaustion and increased immune activation are the main mechanisms involved in impaired immunological recovery of HIV-positive patients. Viruses (2023).
Blagus, R. & Lusa, L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 14, 106 (2013). http://www.biomedcentral.com/1471-2105/14/106
Lee, J. S., Paintsil, E., Gopalakrishnan, V. & Ghebremichael, M. A comparison of machine learning techniques for classification of HIV patients with antiretroviral therapy-induced mitochondrial toxicity from those without mitochondrial toxicity. BMC Med. Res. Methodol. 19, (2019).
Ramadhan, N. G., Adiwijaya, Maharani, W. & Akbar Gozali, A. Chronic diseases prediction using machine learning with data preprocessing handling: a critical review. IEEE Access 12, 80698–80730 (2024).
Wang, S., Nan, B., Rosset, S. & Zhu, J. Random Lasso. Annals Appl. Stat. 5, 468–485 (2011).
Pechenizkiy, M., Tsymbal, A. & Puuronen, S. PCA-Based Feature Transformation for Classification: Issues in Medical Diagnostics. (2004).
Yang, M. et al. Establishing a prediction model of severe acute mountain sickness using machine learning of support vector machine recursive feature elimination. Sci. Rep. 13, (2023).
Johannes, M. et al. Integration of pathway knowledge into a reweighted recursive feature elimination approach for risk stratification of cancer patients. Bioinformatics 26, 2136–2144 (2010).
Gong, X., Shen, L. & Lu, T. Refining training samples using median absolute deviation for supervised classification of remote sensing images. J. Indian Soc. Remote Sens. 47, 647–659 (2019).
Alshanbari, H. M. et al. Prediction and Classification of COVID-19 Admissions to Intensive Care Units (ICU) Using Weighted Radial Kernel SVM Coupled with Recursive Feature Elimination (RFE). Life 12, (2022).
Al-Hadeethi, H., Abdulla, S., Diykh, M., Deo, R. C. & Green, J. H. Adaptive boost LS-SVM classification approach for time-series signal classification in epileptic seizure diagnosis applications. Expert Syst. Appl. 161, 113676 (2020).
Mathew, D. T. E. An Improvised Random Forest Model for Breast Cancer Classification. NeuroQuantology 20, 713–722 (2022).
De, A. & Chowdhury, A. S. DTI based Alzheimer’s disease classification with rank modulated fusion of CNNs and random forest. Expert Syst. Appl. 169, (2021).
Farombi, O., Farombi, O. E., Adetunji, A. B. & Ajala, F. A. Development of RFC, SVC and Gaussian NB classifier models for disease prediction in machine learning. https://www.researchgate.net/publication/382495280
Grethlein, D. & Kandadai, V. Classification of Drivers with HIV-Associated Neurocognitive Disorders Using Virtual Driving Test Performance Data. (2023).
Jiang, L., Zhang, B., Ni, Q., Sun, X. & Dong, P. Prediction of SNP sequences via Gini impurity based gradient boosting method. IEEE Access. 7, 12647–12657 (2019).
Huang, C. Y., Chen, Y. A., Hwang, I. S. & Wu, R. M. Improving Dual-Task control with a Posture-Second strategy in Early-Stage Parkinson disease. Arch. Phys. Med. Rehabil. 99, 1540–1546e2 (2018).
Onah, E. et al. Prediction of HIV-1 protease cleavage site from octapeptide sequence information using selected classifiers and hybrid descriptors. BMC Bioinform. 23, (2022).
Etowa, J., Hannan, J., Etowa, E. B., Babatunde, S. & Phillips, J. C. Determinants of infant feeding practices among black mothers living with HIV: a multinomial logistic regression analysis. BMC Public. Health 21, (2021).
Joshi, R. D. & Dhakal, C. K. Predicting type 2 diabetes using logistic regression and machine learning approaches. Int. J. Environ. Res. Public. Health 18, (2021).
Khandezamin, Z., Naderan, M. & Rashti, M. J. Detection and classification of breast cancer using logistic regression feature selection and GMDH classifier. J. Biomed. Inf. 111, (2020).
Belete, D. M. & Huchaiah, M. D. Performance evaluation of classification models for HIV/AIDS dataset. In Data Management, Analytics and Innovation: Proceedings of ICDMAI 2021, Volume 1, 109–125 (Springer, 2021).
Alotaibi, A. S. Hybrid model based on relieff algorithm and K-Nearest neighbor for Erythemato-Squamous diseases forecasting. Arab. J. Sci. Eng. 47, 1299–1307 (2022).
Pal, O. K. Skin disease classification: a comparative analysis of K-Nearest Neighbors (KNN) and Random Forest algorithm. In International Conference on Electronics, Communications and Information Technology (ICECIT) 1–5 (2021). https://doi.org/10.1109/ICECIT54077.2021.9641120
Fili, M. et al. A classification algorithm based on dynamic ensemble selection to predict mutational patterns of the envelope protein in HIV-infected patients. Algorithms Mol. Biology 18, (2023).
Belete, D. M. & Huchaiah, M. D. Classification of binary class HIV/AIDS test results using ensemble learning models. In International Conference on Decision Aid Sciences and Application (DASA) 1240–1247 (2020). https://doi.org/10.1109/DASA51403.2020.9317082
Osamor, V. C. & Okezie, A. F. Enhancing the weighted voting ensemble algorithm for tuberculosis predictive diagnosis. Sci. Rep. 11, 14806 (2021).
Leiva, R. G., Anta, A. F., Mancuso, V. & Casari, P. A novel hyperparameter-free approach to decision tree construction that avoids overfitting by design. IEEE Access. 7, 99978–99987 (2019).
Raza, A., Rustam, F., Ur Rehman Siddiqui, H., de la Torre Diez, I. & Ashraf, I. Predicting microbe organisms using data of living micro forms of life and hybrid microbes classifier. PLoS One 18, (2023).
Yang, C., Fridgeirsson, E. A., Kors, J. A., Reps, J. M. & Rijnbeek, P. R. Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data. J. Big Data 11, (2024).
Park, H. & Kim, H. AR-ADASYN: angle radius-adaptive synthetic data generation approach for imbalanced learning. Stat. Comput. 34, 166 (2024).
Lin, Y. H. et al. Application of Machine Learning Algorithms in Predicting HIV Infection among Men who Have Sex with Men: Model Development and Validation.
Paul, R. H. et al. Machine-learning classification of neurocognitive performance in children with perinatal HIV initiating de Novo antiretroviral therapy. AIDS 34, (2020).
Li, Y., Tian, Y., Qin, Z. & Yan, A. Classification of HIV-1 protease inhibitors by machine learning methods. ACS Omega. 3, 15837–15849 (2018).
Kagendi, N. & Mwau, M. A. Machine learning approach to predict HIV viral load hotspots in Kenya using Real-World data. Health Data Sci. 3, (2023).
Turk, G. et al. Biomarkers of progression after HIV acute/early infection: Nothing compares to CD4 + T-cell count? Viruses 10, (2018).
Farooq, S. A. et al. Revealing HIV viral load patterns using unsupervised machine learning and cluster summarization. F1000Res 7, 1144 (2018).
Author information
Authors and Affiliations
Contributions
Q.S. (Qian Sui) contributed to the conceptualization and design of the study, conducted data analysis, and drafted the manuscript. G.L. (Gaoxu Li) participated in data collection, performed statistical analysis, and contributed to the revision of the manuscript. Y.P. (Yaqi Peng) assisted in data validation and manuscript editing. J.Z. (Jiasheng Zhang) was responsible for data interpretation and provided critical revisions for important intellectual content. Y.Z. (Yibo Zhang) supervised the study, contributed to project administration, and reviewed the manuscript for final approval. R.Z. (Riyang Zhao) coordinated the revision process, contributed to methodological refinement, and critically reviewed the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sui, Q., Li, G., Peng, Y. et al. Scalable and robust machine learning framework for HIV classification using clinical and laboratory data. Sci Rep 15, 18727 (2025). https://doi.org/10.1038/s41598-025-00085-4