Abstract
Human Immunodeficiency Virus (HIV) remains a significant global health challenge, particularly in developing countries. Early HIV detection supports targeted interventions and can substantially reduce the HIV burden. In many resource-limited settings, early detection of HIV is hindered by stigma, limited access to testing, and low risk awareness. This study aims to enhance HIV screening in resource-limited settings by employing machine learning models to predict HIV risk using demographic and lifestyle variables. We analyzed data from 39,295 individuals in Shiraz, Iran, identifying key predictors, including drug injection, age, having a spouse with a history of HIV, occupation, and prison record. We trained and validated an Extreme Gradient Boosting (XGBoost) model using stratified five-fold cross-validation on the dataset. The XGBoost model achieved high accuracy (0.89; 95% Confidence Interval (CI) [0.88–0.89]), very good discriminatory ability (Area Under the ROC Curve (AUC) = 0.84 [0.83–0.84]), and fair-to-good agreement (Cohen’s Kappa = 0.51 [0.51–0.52]). Moreover, the performance of the proposed method (PREDICT-HIV) was consistent across test folds. Our findings align with previous studies, emphasizing the importance of socio-demographic and behavioral factors in HIV risk prediction. The model’s robustness suggests its potential for practical implementation, aiding early identification and intervention in high-risk groups. Future research should incorporate additional socioeconomic variables and validate the model in diverse populations to enhance global HIV prevention efforts. The web application, implemented using the Django framework, is freely available online for public access. PREDICT-HIV may support earlier identification and intervention in underserved populations, improving the efficiency of HIV screening programs.
Introduction
The Human Immunodeficiency Virus (HIV) remains a significant global health challenge, responsible for millions of deaths worldwide. According to the World Health Organization (WHO), there were approximately 39 million people worldwide living with HIV at the end of 2022¹. The WHO African Region remains most severely affected, with nearly 1 in every 25 adults living with HIV. The burden of disease caused by HIV is profound, impacting not only the health and life expectancy of individuals but also affecting economies and societies at large. The global age-standardized HIV/Acquired Immune Deficiency Syndrome (AIDS) disability-adjusted life years (DALYs) rate was 601.49 (95% UI 536.16–703.92) per 100,000 cases in 2019². While global HIV incidence and AIDS-related deaths declined during this period, both metrics increased in the Middle East and North Africa (MENA) region. This trend contrasted sharply with global improvements³. It aligns with recent data from Iran indicating that only 40.9% of Iranian adults have adequate HIV transmission knowledge, and less than 10% underwent HIV testing in the past year, even though Iran is a Joint United Nations Programme on HIV/AIDS (UNAIDS)-prioritized fast-track country with over 46,000 People Living with HIV (PLWH) in 2023⁴.
The impact of HIV is particularly pronounced in developing countries, where it remains a critical public health concern. Despite global efforts, these regions continue to bear a disproportionate burden of the HIV epidemic. The complexities in these areas, ranging from limited healthcare infrastructure to social stigma, exacerbate the challenge of HIV prevention and management.
Traditional HIV screening methods, often reliant on direct testing in healthcare facilities, face numerous challenges in developing countries. Issues such as social stigma, limited access to healthcare services, and a lack of awareness about HIV risk factors contribute to late diagnosis and treatment. This delay in diagnosis can lead to higher transmission rates and a greater overall disease burden5.
HIV prediction tools can help healthcare services identify high-risk individuals and allocate resources effectively¹⁹. In recent years, machine learning has become increasingly prominent in HIV research. This trend is driven by the ability of machine learning techniques to handle extensive datasets with numerous covariates, manage complex relationships between predictors and outcomes, and achieve high levels of accuracy⁶. Previous approaches to HIV diagnosis prediction have been reported in several studies. A 2023 study in Kenya analyzed over four million HIV test records comprising 68 selected variables using random forest algorithms⁷, with a sensitivity of 70% and a specificity of 83%. In a 2023 study from Pakistan⁸, Nisa et al. analyzed electronic records comprising 47,110 entries with 57 attributes, and the Random Forest model reached an accuracy of 76%. Wang et al.⁹ predicted intervention non-responsiveness for HIV prevention in the Bahamas in 2023 using Boruta feature selection and Random Forest in 2564 students, with a sensitivity of 85% and a specificity of 78%. Burns et al.¹⁰ developed an Electronic Health Record (EHR)-based model to predict HIV diagnosis in the USA in 2022, analyzing a dataset of 998,787 patients from a Southern medical system using Least Absolute Shrinkage and Selection Operator (LASSO) and Extreme Gradient Boosting (XGBoost) techniques, with a sensitivity of 80% and a specificity of 77%. A complete literature review of HIV diagnosis methods is provided in Supplementary Table S1. Although machine learning has been applied to predict HIV risk in clinical and survey settings, few models have been tailored for localized, community-level screening in middle-income countries.
Most of these studies highlighted the significance of not only drug usage or injecting behavior but also socio-demographic factors like marital status and age in health outcome predictions. Additionally, for the entire cohort, the most significant positive predictors were male gender, having a male sexual partner, a history of domestic or sexual abuse, and a history of drug use. On the other hand, the most notable negative predictors included having a female partner, a greater number of positive urine toxicology tests, and being older.
The proposed study (PREDICT-HIV) introduces a novel HIV screening approach that emphasizes the use of demographic and lifestyle variables to identify individuals at higher risk in developing countries. This method is particularly crucial in regions where conventional screening practices face substantial barriers, and demographic and lifestyle data availability can be a game-changer. By focusing on specific risk factors prevalent in these populations, the approach seeks to enable early identification and intervention, which is essential in managing the spread of HIV. Beyond individual screening, this method can potentially influence public health strategies significantly. It offers a foundation for tailoring prevention and treatment programs more effectively based on a comprehensive understanding of local factors contributing to HIV risk. This aspect is especially critical in resource-limited settings, where the efficiency and reach of health interventions must be maximized. The study aims to fill a pivotal gap in HIV prevention and treatment in Iran, a developing country in the Middle East and North Africa (MENA), using innovative screening methods that leverage demographic and lifestyle data. To enhance accessibility, we also developed a user-friendly web application, allowing public and practitioner use of the model to improve early screening. The ultimate goal is to enhance early detection, support targeted interventions, and substantially reduce the HIV burden in these high-impact regions.
Results
Descriptive statistics
The demographic, socioeconomic characteristics, and risk factors of the participants are shown in Table 1, based on the status of the HIV diagnosis. The analysis revealed that HIV-positive individuals were predominantly older, female, and unemployed compared to HIV-negative individuals. They also had higher rates of risky behaviors, such as intravenous drug use and same-gender sexual activity. Marital status differences indicated higher rates of HIV positivity among widowed and divorced individuals. The presence of a serodiscordant partner and a history of blood transfusion were more common in the HIV-positive group. These findings suggest that specific demographic and behavioral risk factors are significantly associated with HIV positivity, highlighting the need for targeted prevention and intervention strategies.
Classification results
In this study, we employed stratified fivefold cross-validation on the dataset, which achieved a cross-validation area under the curve (cv-AUC) of 0.839 [95% CI 0.835–0.842]. Model performance for each fold, specifically for the XGBoost algorithm, was assessed using various metrics detailed in Table 3.
Figure 1 (The SHapley Additive exPlanations (SHAP) plot) illustrates the influence of each predictor on the model’s output. It illustrates the five most influential predictors, ranked in descending order based on their impact: Drug Injection, age, occupation, having a spouse with a history of HIV, and prison record. As confirmed in Tables 1 and 2, these variables exhibited the highest effect size among all predictors.
Fig. 1. Mean absolute SHAP values for key features in HIV classification. This plot highlights the relative importance of each feature in the model’s decision-making process, with higher values indicating greater influence. Understanding these contributions can help identify critical risk factors and improve model interpretability.
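As a minimal sketch of how such a ranking is obtained, the snippet below averages absolute SHAP values per feature and sorts them in descending order. The feature names and the toy SHAP matrix are made-up placeholders for illustration and are not values from this study.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy SHAP matrix (hypothetical): one row per individual, one column per feature.
features = ["drug_injection", "age", "prison_record"]
toy_shap = [
    [0.9, -0.2, 0.1],
    [-0.7, 0.3, 0.0],
    [0.8, -0.1, -0.2],
]

ranking = mean_abs_shap(toy_shap, features)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise appear unimportant because its contributions cancel.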
Despite class imbalance, the model maintained good sensitivity (0.80) and specificity (0.90), indicating reliable identification of positive and negative cases. Additionally, the model’s Matthews Correlation Coefficient (MCC) of 0.54 [0.53–0.54] indicates a moderate correlation between predicted and observed class labels, while a Cohen’s Kappa of 0.51 [0.51–0.52] suggests a fair to good agreement rate beyond chance in class labeling.
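For readers less familiar with these two agreement measures, the following sketch computes MCC and Cohen's Kappa from the four cells of a binary confusion matrix. The counts in the example are hypothetical and chosen only to illustrate an imbalanced setting; they are not the study's confusion matrix.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal totals of predicted and actual labels.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_chance == 1.0:
        return 0.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical counts: 100 positives and 900 negatives.
example = dict(tp=80, fn=20, tn=810, fp=90)
print(f"MCC   = {mcc(**example):.3f}")
print(f"Kappa = {cohen_kappa(**example):.3f}")
```

Both measures account for the class distribution, which is why they are more informative than raw accuracy when positives are rare.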
Discussion
This study demonstrates that a machine learning-based model can effectively identify individuals at elevated risk of HIV using routine demographic and behavioral data, offering a low-cost, scalable screening strategy in underserved communities. The insights gained from this research can inform more targeted and efficient HIV screening and prevention programs, ultimately contributing to the reduction of HIV transmission in high-impact regions. Future studies should aim to refine these models further and explore their application in diverse populations to enhance global HIV prevention efforts. The success of machine learning approaches in previous studies, such as those by Wang et al. (2022) and Kagendi et al. (2023), provides a strong foundation for continued innovation in this field.
Our study identified drug injection, age, the subject’s occupation, having a spouse with a history of HIV, and prison records as the most influential predictors for HIV status. These findings are consistent with previous research, which emphasizes the significant role of socio-demographic and behavioral factors in HIV transmission risk. For instance, Burns et al. (2022) identified sex, drug use, and a history of sexual abuse as strong predictors of HIV risk in a US cohort10. Similarly, Nisa et al. (2022) highlighted the importance of injecting behavior, marital status, and age in predicting HIV outcomes in Pakistan8. The alignment of our findings with these studies underscores the universal relevance of these risk factors across diverse populations.
In particular, drug injection was found to be a highly significant predictor, which is in line with previous studies such as those conducted by Nisa et al.8, which also emphasized the role of injecting behavior in HIV transmission. It highlights the need for targeted interventions focusing on drug users to reduce HIV spread.
Age as a predictor reflects the increased vulnerability of specific age groups, particularly those between 40 and 50, who may have accumulated higher risk exposures over time. It is consistent with Orel et al.'s (2022) findings, which noted age as a significant predictor in their study of African populations11. These results suggest that middle-aged adults may be at elevated risk, highlighting the need for age-specific public health strategies.
Having a spouse with a history of HIV emerged as a significant predictor, underscoring the role of close personal relationships in HIV transmission. It aligns with findings from other studies that highlight the importance of sexual and marital relationships in HIV risk, such as the work by Burns et al.10. It suggests that public health interventions should also focus on the partners of HIV-positive individuals to prevent further spread.
The subject’s occupation and prison record were also significant predictors, indicating that certain occupations and incarceration history are associated with higher HIV risk. It is consistent with findings from Kagendi et al.7, who identified socio-demographic characteristics as essential predictors of HIV outcomes. These results suggest the need for workplace interventions and better health services within the prison system.
The XGBoost classifier demonstrated consistent performance across different validation folds, as outlined in Table 3. The model achieved an average sensitivity of 0.80 [95% CI 0.79–0.80], indicating a strong ability to correctly identify positive cases. Specificity was high at 0.90 [95% CI 0.89–0.90], reflecting the model’s accuracy in recognizing negative cases. The overall accuracy of the model was robust, averaging 0.89 [95% CI 0.88–0.89]. The Positive Predictive Value (PPV) and Negative Predictive Value (NPV) were 0.44 [95% CI 0.44–0.45] and 0.98 [95% CI 0.98–0.98], respectively, showing that negative predictions are highly reliable, while positive predictions call for confirmatory testing. The Matthews correlation coefficient (MCC) and Cohen’s kappa were 0.54 [95% CI 0.53–0.54] and 0.51 [95% CI 0.51–0.52], respectively, indicating moderate agreement and performance consistency across folds. These metrics collectively demonstrate the XGBoost model’s reliability and effectiveness in predicting HIV status. However, the PPV and NPV observed in the study sample depend on its HIV prevalence, so the performance of PREDICT-HIV in a target population should be assessed using the (unbiased) PPV and (unbiased) NPV. The (unbiased) PPV is the probability of having HIV when the output of the system is positive. It can be calculated using Bayes’ theorem as follows:

$$\mathrm{PPV} = \frac{Se \cdot P}{Se \cdot P + (1 - Sp)(1 - P)} \tag{1}$$

where Se, Sp, and P are the sensitivity, the specificity, and the prevalence of the positive class in the population. Similarly, the (unbiased) NPV is the probability of not having HIV when the output of the system is negative:

$$\mathrm{NPV} = \frac{Sp\,(1 - P)}{Sp\,(1 - P) + (1 - Se)\,P} \tag{2}$$

Proofs of the above formulas are provided elsewhere¹²,¹³. These formulas can be used to assess the performance of PREDICT-HIV in a population where the prevalence of HIV (parameter P) is known a priori.
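A short sketch of this prevalence adjustment is given below. It applies Bayes' theorem to the sensitivity (0.80) and specificity (0.90) reported in this section; the function names and the list of prevalence values are illustrative assumptions rather than part of the study.

```python
def population_ppv(se, sp, prevalence):
    """P(HIV positive | model output positive) via Bayes' theorem."""
    true_pos = se * prevalence
    false_pos = (1 - sp) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def population_npv(se, sp, prevalence):
    """P(HIV negative | model output negative) via Bayes' theorem."""
    true_neg = sp * (1 - prevalence)
    false_neg = (1 - se) * prevalence
    return true_neg / (true_neg + false_neg)


SE, SP = 0.80, 0.90  # sensitivity and specificity as reported in the text

# Hypothetical prevalence values, from a low-prevalence general population
# up to the proportion of positives in the study sample (about 9.4%).
for p in (0.001, 0.01, 0.094):
    print(f"P={p:.3f}  PPV={population_ppv(SE, SP, p):.3f}  "
          f"NPV={population_npv(SE, SP, p):.4f}")
```

The loop makes the practical point visible: with fixed sensitivity and specificity, the PPV falls sharply as prevalence decreases, so a positive screening output in a low-prevalence population should be read as a prompt for confirmatory testing.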
The XGBoost classifier achieved a cross-validated AUC of 0.839, indicating a strong discriminatory ability. This performance is consistent across multiple folds, underscoring the robustness of the model. The high NPV (0.98) suggests that the model is particularly effective in identifying individuals who are not at risk, which is crucial for reducing unnecessary testing and focusing resources on high-risk groups.
In comparison, the study by Wang et al.9 on Bahamian adolescents also demonstrated high model performance with an AUROC of 0.93, using Random Forest as the classifier. The superior performance metrics in both studies highlight the efficacy of machine learning models in predicting HIV risk. Our model’s accuracy (0.89) and sensitivity (0.80) are competitive with these results, suggesting that XGBoost is an effective tool for HIV screening in diverse populations.
Furthermore, the consistency in performance across different cross-validation folds in our study indicates that the model is well-generalized and not overfitted to the training data. This robustness is critical for practical implementation in public health settings, where models must reliably predict outcomes across various subgroups within the population.
Identifying high-risk individuals through machine learning models can significantly enhance targeted intervention strategies. By focusing on the key predictors identified in this study, public health initiatives can be more effectively tailored to address the specific needs and behaviors that contribute to HIV transmission in this population. Kagendi et al.7 highlighted the potential of using demographic data to predict HIV viral load hotspots in Kenya, demonstrating the practical application of machine learning in public health. Our study supports this approach, suggesting that early identification and intervention can mitigate the spread of HIV and improve health outcomes in resource-limited settings.
Traditional HIV screening methods often face challenges such as social stigma and limited access to healthcare services, leading to late diagnoses and higher transmission rates. Our approach, leveraging machine learning, offers a more efficient and less intrusive means of identifying at-risk individuals, potentially leading to earlier interventions and better outcomes. It is essential in developing countries, where healthcare infrastructure is often inadequate. For example, Orel et al.11 used machine learning to identify socio-behavioral predictors of HIV status in multiple African countries, demonstrating the benefits of advanced data analytics in overcoming traditional screening barriers. Our study adds to this evidence, showing that machine learning can significantly enhance HIV screening efficiency.
Further research should explore the integration of additional socioeconomic variables and the application of our model to other populations to validate its generalizability. Future studies could benefit from incorporating variables such as education level, household income, and employment status, which are relevant in other studies, such as those by Burns et al.10 and Nisa et al.8.
The created online screening tool provides a convenient and accessible means to assess personal HIV risk using lifestyle, behavioral, and demographic information. Created with an intuitive interface and mobile responsiveness, the tool enables real-time risk evaluation without the need for laboratory analysis or clinical involvement. Its usefulness is both in aiding the early detection of individuals at elevated risk—allowing prompt preventive measures—and in bolstering public health efforts by compiling anonymized data for monitoring and outreach strategies. Moreover, the tool can act as a decision support system for healthcare professionals in environments with scarce resources, where regular HIV testing might not be practical. By converting intricate risk elements into practical information, the web application could improve community involvement, raise awareness, and ultimately help decrease the number of undiagnosed HIV cases.
To enable practical use, the suggested HIV risk prediction model can be incorporated into current health systems in Iran and similar contexts via multiple viable avenues. It can be integrated into national electronic health record (EHR) systems to aid clinical decision-making during regular health visits, particularly in primary care and community health facilities. The model’s streamlined design enables incorporation into current Ministry of Health mobile health (mHealth) apps, broadening access to distant or underserved communities. Moreover, its application can enhance national screening initiatives by focusing on individuals for confirmatory testing according to tailored risk assessments. Due to the model’s dependence on self-reported lifestyle and behavioral aspects, it is especially appropriate for low-resource settings where laboratory testing is scarce. With adequate data management and privacy protections, the model may assist regional epidemiological surveillance and focused educational initiatives to improve prevention efforts.
This study has several limitations. First, the data is from a single city in Iran, which may limit generalizability. Second, self-reported behavioral data may be subject to recall or social desirability bias. Third, the model does not account for dynamic factors such as recent exposures or changes in behavior. Additionally, although machine learning models perform well, they can act as black boxes, limiting interpretability in clinical practice. Also, excluding the “Condom use” variable due to high missing data rates suggests the need for improved data collection methods in future studies.
Educational and behavioral interventions are urgently required to support HIV prevention efforts among the general Iranian population. The risk and social characteristics identified in this study can help health policymakers reduce HIV-related direct and indirect expenditures.
Our machine learning model, trained on community-level demographic and behavioral data, demonstrated strong predictive accuracy and identified key risk factors such as drug injection and having an HIV-positive spouse. The development of a publicly accessible web application further enhances the model’s utility, enabling community health workers and at-risk individuals to estimate HIV risk easily and privately. As a screening approach, it emphasizes the application of demographic and lifestyle variables to identify individuals at higher risk in developing countries. This work highlights the value of AI-assisted tools in advancing public health equity and strengthening community-based HIV prevention strategies.
Material and Methods
Disease diagnosis and surveillance registries
From September 2001 to 2023, a comprehensive study in Shiraz, Southwestern Iran, enlisted 39,295 volunteers of both sexes aged 18 years or older. To recruit participants, we employed two primary strategies. First, we disseminated information publicly about the significance of visiting counseling centers for HIV testing, mainly targeting individuals engaged in high-risk behaviors. Second, we directed these high-risk individuals toward diagnostic centers for HIV screening.
Potential participants provided consent at their initial study visit and underwent pre- and post-test counseling. They were tested for HIV and received a thorough clinical examination. Two criteria were used to confirm HIV-positive cases. First, an individual was considered positive if sequential enzyme-linked immunosorbent assay (ELISA) tests indicated the presence of HIV antibodies and the result was confirmed by a Western blot test. Second, an individual was also identified as positive given a presumptive or definitive diagnosis of a stage 3 or stage 4 condition and/or a CD4 count of less than 350 cells per mm³ of blood in an HIV-infected patient¹⁴.
We gathered data on socio-demographic background, relevant behaviors, and knowledge about HIV/AIDS through an interviewer-administered structured questionnaire. All interviewers, equipped with prior experience in HIV/AIDS-related programs, received training in interview skills, protocol adherence, and questionnaire handling before the study commenced. Key risk predictors identified included socio-demographic variables and types of sexual orientation at the start of monitoring. Data were collected in face-to-face interviews using standardized questionnaires. Those who tested negative were informed about the significance of the window period and advised to retake the test after three months. HIV-positive individuals were then included in the follow-up Surveillance cohort.
A cumulative total of 3761 HIV/AIDS cases were reported to the HIV/AIDS registry since the inception of the study. The registry is routinely updated with the vital status of cases.
This study protocol received approval from the ethics committees of the Shiraz University of Medical Sciences and the Institutional Review Board of the Isfahan University of Medical Sciences (#IR.MUI.DHMT.REC.1403006) and conformed to the Declaration of Helsinki. All participants provided written informed consent before enrollment.
Data description
We included 39,295 subjects from Shiraz, the Non-Communicable Diseases (NCDs) department, and the vice chancellor for health; among them, 3693 (9.3%) were HIV positive. The proportion of missing values per variable ranged from 0.6 to 54%, with the highest value observed for the “Condom use” variable. We excluded the “Condom use” variable from the classification because its missing percentage was higher than 5%¹⁵. Missingness in the dependent and independent variables of the enrolled subjects occurred at random (logistic regression¹⁶; p = 0.846). Multiple Imputation by Chained Equations (MICE) was then used to impute missing values in the other variables. Age was categorized as 18–30, 30–40, 40–50, 50–65, and > 65 years¹⁷. The final dataset included 39,295 unique subjects. The flow diagram of the proposed algorithm (PREDICT-HIV) is provided in Fig. 2.
Fig. 2. Workflow of the PREDICT-HIV algorithm illustrating the various stages of data processing and model evaluation. The process begins with data preprocessing, including categorization, merging sparse categories, and combining variables. Missing values are imputed using MICE after checking that data are missing completely at random (MCAR). Then, stratified fivefold cross-validation is applied, with the model being trained and validated using XGBoost. The model’s performance is evaluated, and the classification results, including accuracy, sensitivity, specificity, and other metrics, are reported based on the mean of the five cross-validation results.
Statistical methods
Descriptive statistics were calculated for the subject’s characteristics. Statistical methods were applied to assess the relationships between various demographic and behavioral factors in HIV-positive and HIV-negative subgroups. The Mann–Whitney U test was employed to analyze ordinal variables. For nominal variables, the Chi-square test was used to evaluate the association between categorical variables. When the Chi-square test assumptions were not satisfied, Fisher’s exact test was applied to provide a more accurate measure of association in small sample sizes.
Furthermore, Odds Ratios (OR) with 95% Confidence Intervals (CI) were calculated to quantify the strength of the association between the variables and HIV status. Effect sizes were also computed to measure the magnitude of these associations, with appropriate interpretation based on the context and scale of the variables. The significance level for all tests was set at p < 0.05. These methods ensure robust statistical analysis and reliable interpretation of the study’s findings. The statistical analysis and calculations were performed using the SPSS statistical package, version 29.0 (IBM Corp. Released 2023. IBM SPSS Statistics for Windows, Version 29.0.2.0 Armonk, NY: IBM Corp). The machine learning procedures were performed using Python (Python Software Foundation. Python Language Reference, version 3.10.5 Available at https://www.python.org/downloads/release/python-3105/).
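To illustrate how an odds ratio and its 95% CI are obtained from a 2×2 table, the sketch below uses the common log-based (Woolf) interval. This is an assumption made for illustration: the analysis in this study was run in SPSS, the exact interval method is not stated, and the counts in the example are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a log-based (Woolf) confidence interval.

    a: exposed and HIV positive      b: exposed and HIV negative
    c: unexposed and HIV positive    d: unexposed and HIV negative
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical 2x2 table, not taken from the study data.
or_value, ci_low, ci_high = odds_ratio_ci(a=20, b=80, c=10, d=90)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

An interval that contains 1, as in this toy table, would not indicate a statistically significant association at the 0.05 level.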
Machine learning methods
The classification was performed using gradient boosting. Gradient boosting is a general and effective machine learning method based on first-order iterative optimization, in which the derivative of the loss function with respect to the predicted values is used. At each iteration, a new learner is fitted to the pseudo-residuals (the negative gradients of the loss) rather than to the ordinary residuals¹⁸. The objective is to minimize the loss function by concentrating each new learner on the regions where the current model has learned least, thereby improving overall performance. This learning ensemble tends to improve accuracy with some small risk of less coverage¹⁹.
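The idea of fitting learners to pseudo-residuals can be sketched in a few lines of pure Python. The example below is plain first-order gradient boosting with decision stumps on a single feature and a logistic loss; it is not XGBoost and not the study's pipeline, and the toy data, number of rounds, and learning rate are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(x, residuals):
    """Least-squares regression stump: returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def gradient_boost(x, y, n_rounds=50, learning_rate=0.3):
    """Fit stumps sequentially to the pseudo-residuals of the log-loss."""
    scores = [0.0] * len(x)  # raw log-odds, initialised at zero
    stumps = []
    for _ in range(n_rounds):
        # Pseudo-residual = negative gradient of log-loss w.r.t. the score = y - p.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        t, lv, rv = fit_stump(x, residuals)
        stumps.append((t, lv, rv))
        scores = [s + learning_rate * (lv if xi <= t else rv)
                  for xi, s in zip(x, scores)]
    return stumps


def predict_proba(stumps, xi, learning_rate=0.3):
    """Predicted probability for one value; use the same learning rate as in training."""
    score = sum(learning_rate * (lv if xi <= t else rv) for t, lv, rv in stumps)
    return sigmoid(score)


# Toy data (hypothetical): larger x values are mostly labelled 1.
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]
stumps = gradient_boost(x_toy, y_toy)
print(predict_proba(stumps, 1), predict_proba(stumps, 8))
```

Each round looks only at where the current model is still wrong (large pseudo-residuals), which is the sense in which boosting focuses on the regions it has learned least.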
Extreme Gradient Boosting (XGBoost) stands out as a popular variant of gradient boosting, known for its speed and performance optimization through a second-order approximation. This machine learning algorithm is particularly effective with small to medium-structured (tabular) data. XGBoost (Fig. 3) incorporates several advanced features, such as a scoring function to assess tree quality, which is vital for model accuracy. It also utilizes regularized learning to balance weights, reducing the risks of overfitting and underfitting. Innovative methods like shrinkage and feature subsampling are employed to mitigate overfitting, with subsampling randomly selecting features for each tree to increase diversity. The algorithm’s capability to handle sparse data through sparsity-aware split finding facilitates efficient processing of missing values.
Fig. 3. The structure of the XGBoost algorithm with a focus on regularization. This diagram illustrates the key components of the XGBoost model, including the objective function and gradient boosting techniques. The objective function comprises a loss function and a regularization term to prevent overfitting. Gradient boosting adds trees sequentially, optimizing the model through gradient descent. The model’s regularization framework further helps improve generalization by penalizing complex models.
XGBoost was selected due to its superior performance in classifying small-to-medium complex (tabular) data, its ability to handle missing data, and its robustness against overfitting and underfitting. Unlike traditional models, XGBoost’s gradient tree boosting is optimized in an additive manner rather than in a Euclidean space, which better suits complex data structures. Its greedy algorithm, which selects the split point with the best gain, enhances its efficiency. The parallel learning feature of XGBoost, where data is split into blocks and processed across multiple cores, accelerates computation and supports out-of-core processing for large datasets. Additionally, XGBoost uses a Weighted Quantile Sketch approach for parallel quantile calculation in each data subset, aiding the efficient approximation of quantile histograms. These features, together with block compression and sharding techniques, improve memory usage and computational speed, making XGBoost a versatile tool for various data-intensive tasks²⁰.
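The greedy split search mentioned above can be sketched as follows for a single feature, using the standard XGBoost gain expression based on summed gradients and Hessians. The toy data, the regularization values `lam` and `gamma`, and the function names are assumptions for illustration; the library itself uses optimized and approximate variants of this search.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a split from summed gradients (g) and Hessians (h) on each side."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def best_split(x, grad, hess, lam=1.0, gamma=0.0):
    """Exact greedy search over one feature; returns (gain, threshold) or None."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    g_total, h_total = sum(grad), sum(hess)
    g_left = h_left = 0.0
    best = None
    for rank, i in enumerate(order[:-1]):
        g_left += grad[i]
        h_left += hess[i]
        nxt = order[rank + 1]
        if x[i] == x[nxt]:
            continue  # no threshold can separate equal feature values
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left,
                          lam, gamma)
        if best is None or gain > best[0]:
            best = (gain, (x[i] + x[nxt]) / 2)
    return best


# Toy example (hypothetical): logistic loss with initial prediction p = 0.5,
# so each gradient is p - y and each Hessian is p * (1 - p).
x_toy = [1, 2, 3, 4]
y_toy = [0, 0, 1, 1]
grad = [0.5 - y for y in y_toy]
hess = [0.25] * len(y_toy)
result = best_split(x_toy, grad, hess)
print(result)
```

The search is greedy in the sense that it picks the single best threshold at the current node without looking ahead to later splits.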
XGBoost distinguishes itself by employing the Newton–Raphson method in the function space, in contrast to traditional Gradient Boosting, which relies on gradient descent. This approach involves using a second-order Taylor approximation in the loss function, establishing a direct link to the Newton–Raphson method, thereby enhancing the algorithm’s efficiency in optimizing complex models.
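The second-order approximation referred to above can be written out as follows. This is the standard XGBoost formulation rather than a derivation specific to this study, and the notation (g_i, h_i, f_t, Ω, γ, λ, T, w_j, I_j) is assumed here for illustration.

```latex
% Second-order Taylor approximation of the objective at boosting round t
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ \ell\big(y_i, \hat{y}_i^{(t-1)}\big)
  + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t),
\qquad
g_i = \frac{\partial \ell\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} \ell\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}

% With the penalty \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2},
% the optimal weight of leaf j (the set of samples I_j) is a Newton-type step:
w_j^{*} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
```

The leaf weight has the form of a Newton–Raphson update (gradient divided by Hessian), damped by the L2 penalty λ, which is where the link between XGBoost and Newton's method comes from.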
The details of a generic regularized XGBoost algorithm are provided in the Supplementary Method S2. We used penalized XGBoost with L2 regularization. In our study, the learning rate was set to 0.03 to control the shrinkage of each tree’s contribution. The maximum depth of each tree was set to six.
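As a hedged illustration, the reported settings might map onto XGBoost's parameter names as shown below. Only the learning rate and the maximum depth come from the text; every other value is a placeholder assumption, since the L2 penalty strength, the number of trees, and the remaining options are not reported here.

```python
# Parameter names follow the XGBoost scikit-learn interface.
xgb_params = {
    "objective": "binary:logistic",  # assumption: binary HIV-positive / HIV-negative target
    "learning_rate": 0.03,           # from the text: shrinkage of each tree's contribution
    "max_depth": 6,                  # from the text: maximum depth of each tree
    "reg_lambda": 1.0,               # assumption: L2 penalty strength is not reported
    "reg_alpha": 0.0,                # assumption: no L1 penalty, as only L2 is mentioned
    "eval_metric": "auc",            # assumption: matches the AUC-based evaluation
}

for name, value in xgb_params.items():
    print(f"{name} = {value}")
```

With the `xgboost` package installed, such a dictionary could be unpacked into `xgboost.XGBClassifier(**xgb_params)`; that step is left out here so that the snippet runs without third-party dependencies.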
Model evaluation
In this study, our objective was to address a binary classification problem. We adhered to the TRIPOD + AI standard²¹ to ensure consistent and high-quality implementation of artificial intelligence techniques throughout our study. To achieve optimal model performance, we used stratified fivefold cross-validation to identify the most effective hyperparameters. In this validation, data were randomly split into five stratified folds to maintain the proportion of HIV-positive cases across subsets. Each fold served once as the test set, with the remaining four folds used for training. This approach ensured a comprehensive and unbiased selection of the best parameters for the model. We evaluated the model’s performance using the Area Under the ROC Curve (AUC). AUC is a widely recognized metric in machine learning for assessing the quality of binary classification models. It summarizes the relationship between sensitivity (recall, the true positive rate) and the false positive rate (1 − specificity) across all classification thresholds, and it equals the probability that a randomly selected positive case receives a higher predicted score than a randomly selected negative case. This metric is particularly valuable when the balance between sensitivity and specificity is crucial to the model’s success²².
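The two ideas in this paragraph, stratified fold assignment and AUC, can be sketched without any third-party library. The code below is an illustrative assumption and not the study's implementation: fold assignment is done by shuffling each class separately, and AUC is computed through its pairwise-ranking interpretation, which is quadratic in the number of samples and therefore only suitable for small examples.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign indices to k folds so that each fold keeps the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (hypothetical): 10% positives.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)

for test_fold, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    n_pos = sum(labels[i] for i in test_idx)
    print(f"fold {test_fold}: train={len(train_idx)} test={len(test_idx)} "
          f"positives in test={n_pos}")

auc_example = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"AUC = {auc_example:.2f}")
```

In practice a model would be fitted on `train_idx` and scored on `test_idx` in each iteration, and the fold-level AUC values would then be averaged.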
In binary classification, a confusion matrix serves as a vital tool for visualizing the performance of a machine-learning algorithm. Structured as a table, it categorizes the predictions made by the algorithm against the actual outcomes. Each row of the matrix corresponds to the actual classes, while each column represents the classes as predicted by the algorithm. Within this framework, a ‘True Positive’ (TP) occurs when the model correctly identifies a positive condition. Conversely, a ‘True Negative’ (TN) is when the model accurately predicts a negative condition. In cases where the model incorrectly predicts a positive condition, it is labeled as a ‘False Positive’ (FP). Lastly, a ‘False Negative’ (FN) denotes an instance where the model wrongly predicts a negative outcome for a case that is actually positive. These four elements—TP, TN, FP, FN—collectively provide a comprehensive overview of the model’s classification accuracy23.
Accuracy is a metric that determines how well the classifier predicts the labels, calculated from the following equation23:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The model’s performance was also evaluated using sensitivity (true positive rate) and specificity (true negative rate). Sensitivity measures the likelihood of correctly identifying positive cases, while specificity assesses the ability to identify negative cases correctly. Equations (4) and (5) define these two metrics24:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{4}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{5}$$
Another important metric is the Positive Predictive Value (PPV), the proportion of positive predictions that are truly positive. The Negative Predictive Value (NPV) is the proportion of negative predictions that are truly negative, that is, that correctly indicate the absence of the disease. Both are calculated from the confusion matrix and indicate how reliable the model’s positive and negative predictions are25:

$$\text{PPV} = \frac{TP}{TP + FP}$$

$$\text{NPV} = \frac{TN}{TN + FN}$$
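Cohen’s Kappa is commonly used to measure inter-rater reliability for categorical items; here it measures the agreement between predicted and actual labels beyond that expected by chance. In the context of binary classification, the formula for Cohen’s Kappa can be expressed as follows26:

$$\kappa = \frac{2\,(TP \cdot TN - FN \cdot FP)}{(TP + FP)(FP + TN) + (TP + FN)(FN + TN)}$$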
Another statistical tool used for model evaluation is the Matthews Correlation Coefficient (MCC), which yields a value between −1 and +1. A coefficient of +1 indicates a perfect prediction, while −1 signifies complete disagreement between prediction and observation. For binary classification, the MCC can be calculated using the following formula26:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
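The metrics described in this section can all be derived from the four confusion-matrix counts. The sketch below uses the standard binary-classification definitions; the function name and the toy counts are hypothetical and are not results from this study. It does not guard against empty rows or columns, which would cause a division by zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    kappa_den = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "kappa": 2 * (tp * tn - fn * fp) / kappa_den,
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


# Toy counts chosen for easy arithmetic (not study results).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these toy counts the observed agreement is 0.85 and the chance agreement is 0.50, so Cohen's Kappa is (0.85 − 0.50) / (1 − 0.50) = 0.70, which matches the confusion-matrix form used in the function.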
Feature importance
The association between the probability of being HIV positive and the selected features was evaluated using SHapley Additive exPlanations (SHAP). SHAP is commonly used to explain how a model’s output depends on its features and their interactions. It attributes each prediction to the individual features, showing how each one influences the model’s final predictions27. A feature’s SHAP value reflects its average marginal contribution to the model output across combinations of the other features.
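The idea behind these attributions can be illustrated with a brute-force computation of exact Shapley values for a single prediction. This is a sketch under stated assumptions, not the SHAP library's algorithm: absent features are replaced by a fixed baseline value, the enumeration is exponential in the number of features, and the toy scoring function, its coefficients, and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition take their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk_score(z):
    # Hypothetical score: two active features with an interaction term;
    # the third feature has no effect.
    return 0.1 + 0.5 * z[0] + 0.2 * z[1] + 0.1 * z[0] * z[1]


x = [1, 1, 0]
baseline = [0, 0, 0]
phi = shapley_values(toy_risk_score, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for x and the prediction for the baseline, and the interaction term is split equally between the two features involved.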
Web application implementation
To implement the HIV Predictor web application, we stored the trained XGBoost model as a JavaScript Object Notation (JSON) file. This file includes the model structure, learning parameters, training parameters, feature names, and version information. The application’s backend was developed using Django (Django Software Foundation. Django. Retrieved from https://djangoproject.com), which loads the model and generates predictions from user inputs. JavaScript handles real-time communication and data transmission between the front end and the back end. The front end, built with HTML and Cascading Style Sheets (CSS), provides a user-friendly interface. This setup allows the application to run efficiently on a server as a tool for predicting HIV status. The web application is freely available at http://hiv.eclinichub.com/.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Code availability
The implemented code is freely available at GitHub: https://doi.org/10.5281/zenodo.15707456
References
World Health Organization. Summary of the Global HIV Epidemic 2023. https://www.who.int/data/gho/data/themes/hiv-aids (accessed 23 Dec 2023).
Tian, X. et al. Global, regional, and national HIV/AIDS disease burden levels and trends in 1990–2019: A systematic analysis for the global burden of disease 2019 study. Front. Public Health. 11, 1068664. https://doi.org/10.3389/fpubh.2023.1068664 (2023).
SeyedAlinaghi, S. et al. HIV in Iran: onset, responses, and future directions. AIDS 35, 529–542. https://doi.org/10.1097/qad.0000000000002757 (2021).
Karimi, K. et al. Assessing HIV transmission knowledge and rapid test history among the general population in Iran. Sci. Rep. 15, 4944. https://doi.org/10.1038/s41598-025-88844-1 (2025).
Mahajan, A. P. et al. Stigma in the HIV/AIDS epidemic: a review of the literature and recommendations for the way forward. AIDS 22(Suppl 2), S67-79. https://doi.org/10.1097/01.aids.0000327438.13291.62 (2008).
Fieggen, J., Smith, E., Arora, L. & Segal, B. The role of machine learning in HIV risk prediction. Front. Reprod. Health. 4, 1062387. https://doi.org/10.3389/frph.2022.1062387 (2022).
Kagendi, N. & Mwau, M. A machine learning approach to predict HIV viral load hotspots in Kenya using real-world data. Health Data Sci. 3, 0019. https://doi.org/10.34133/hds.0019 (2023).
Nisa, S. U., Mahmood, A., Ujager, F. S. & Malik, M. HIV/AIDS predictive model using random forest based on socio-demographical, biological and behavioral data. Egypt. Inform. J. 24, 107–115. https://doi.org/10.1016/j.eij.2022.12.005 (2023).
Wang, B. et al. Predicting adolescent intervention non-responsiveness for precision HIV prevention using machine learning. AIDS Behav. 27, 1392–1402. https://doi.org/10.1007/s10461-022-03874-4 (2023).
Burns, C. M. et al. Development of a human immunodeficiency virus risk prediction model using electronic health record data from an academic health system in the Southern United States. Clin. Infect. Dis. 76, 299–306. https://doi.org/10.1093/cid/ciac775 (2022).
Orel, E. et al. Prediction of HIV status based on socio-behavioural characteristics in East and Southern Africa. PLoS ONE 17, e0264429. https://doi.org/10.1371/journal.pone.0264429 (2022).
Mohebian, M. R., Marateb, H. R., Mansourian, M., Mañanas, M. A. & Mokarian, F. A hybrid computer-aided-diagnosis system for prediction of breast cancer recurrence (HPBCR) using optimized ensemble learning. Comput. Struct. Biotechnol. J. 15, 75–85. https://doi.org/10.1016/j.csbj.2016.11.004 (2017).
Mansourian, M. et al. In Modelling and Analysis of Active Biopotential Signals in Healthcare, vol. 2 17–11–17–24 (IOP Publishing, 2020).
Radfar, S., Tayeri, K. & Namdari Tabar, H. Practical Guidelines on How to Provide Consulting Services in Behavioral Disorders Centers (Ministry of Health and Medical Education, 2009).
Heymans, M. W. & Twisk, J. W. R. Handling missing data in clinical research. J. Clin. Epidemiol. 151, 185–188. https://doi.org/10.1016/j.jclinepi.2022.08.016 (2022).
Jakobsen, J. C., Gluud, C., Wetterslev, J. & Winkel, P. When and how should multiple imputation be used for handling missing data in randomised clinical trials—A practical guide with flowcharts. BMC Med. Res. Methodol. 17, 162. https://doi.org/10.1186/s12874-017-0442-1 (2017).
SeyedAlinaghi, S. et al. HIV in Iran: Onset, responses, and future directions. AIDS. 35 (2021).
Zhang, Z., Zhao, Y., Canes, A., Steinberg, D. & Lyashevska, O. Predictive analytics with gradient boosting in clinical medicine. Ann. Transl. Med. 7, 152. https://doi.org/10.21037/atm.2019.03.29 (2019).
Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 7. https://doi.org/10.3389/fnbot.2013.00021 (2013).
Chen, T. & Guestrin, C. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).
Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378. https://doi.org/10.1136/bmj-2023-078378 (2024).
Pencina, M. J. & D’Agostino, R. B. Sr. Evaluating discrimination of risk prediction models: The C statistic. JAMA 314, 1063–1064. https://doi.org/10.1001/jama.2015.11082 (2015).
Powers, D. M. W. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv:2010.16061 (2020). https://ui.adsabs.harvard.edu/abs/2020arXiv201016061P.
Yerushalmy, J. Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Rep. 1896(62), 1432–1449 (1947).
Fawcett, T. An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874. https://doi.org/10.1016/j.patrec.2005.10.010 (2006).
Chicco, D., Warrens, M. J. & Jurman, G. The Matthews correlation coefficient (MCC) is more informative than Cohen’s Kappa and Brier score in binary classification assessment. IEEE Access. 9, 78368–78381. https://doi.org/10.1109/ACCESS.2021.3084050 (2021).
Lundberg, S. M. & Lee, S.-I. In Advances in Neural Information Processing Systems, vol. 30 (eds I. Guyon et al.) (Curran Associates, Inc., 2017).
Acknowledgements
The authors would like to express their gratitude to the Vice Chancellor for Health at Shiraz University of Medical Sciences for providing data access and approval for this study. Additionally, we appreciate the support and cooperation of the staff in the NCDs department during the course of the study.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
M.N.: Conceptualization, methodology, validation, investigation, resources, data curation, writing—original draft; H.R.M.: Conceptualization, methodology, validation, writing—original draft, funding acquisition; M.A.F.: Methodology, software, validation, data curation, writing—review and editing, visualization; M.Z.R.: Methodology, software, validation, data curation, writing—review and editing, visualization; M.N.: Formal analysis, writing—review and editing, supervision; M.A.M.: Formal analysis, writing—review and editing, supervision, funding acquisition; M.J.T.: Conceptualization, validation, writing—review and editing, supervision, project administration; M.M.: Conceptualization, validation, formal analysis, writing—review and editing, supervision, project administration. All authors read and approved the final manuscript and agreed to be accountable for all aspects of the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Nejat, M., Marateb, H.R., Farahani, M.A. et al. Proactive recognition and early detection in communities through targeted HIV screening. Sci Rep 15, 27275 (2025). https://doi.org/10.1038/s41598-025-11029-3
DOI: https://doi.org/10.1038/s41598-025-11029-3