Proactive recognition and early detection in communities through targeted HIV screening

Nejat, Mehdi; Marateb, Hamid Reza; Farahani, Mehrshad Alirezaei; Rajabi, Mohammad Zakaria; Nasirian, Maryam; Mañanas, Miguel Angel; Tarrahi, Mohammad Javad; Mansourian, Marjan

doi:10.1038/s41598-025-11029-3

Open access
Published: 26 July 2025

Scientific Reports volume 15, Article number: 27275 (2025)

Abstract

Human Immunodeficiency Virus (HIV) remains a significant global health challenge, particularly in developing countries. Early HIV detection supports targeted interventions and substantially reduces the HIV burden. In many resource-limited settings, early detection of HIV is hindered by stigma, limited access to testing, and low risk awareness. This study aims to enhance HIV screening in resource-limited settings by employing machine learning models to predict HIV risk using demographic and lifestyle variables. We analyzed data from 39,295 individuals in Shiraz, Iran, identifying key predictors, including drug injection, age, having a spouse with a history of HIV, occupation, and prison record. We trained and validated an Extreme Gradient Boosting (XGBoost) model using stratified five-fold cross-validation on the dataset. The XGBoost model achieved high accuracy (0.89; 95% Confidence Interval (CI) [0.88–0.89]) and very good discriminatory ability (Area Under the ROC Curve (AUC) = 0.84 [0.83–0.84]), with fair-to-good agreement (Cohen’s Kappa of 0.51 [0.51–0.52]). Moreover, the performance of the proposed method (PREDICT-HIV) was consistent across test folds. Our findings align with previous studies, emphasizing the importance of socio-demographic and behavioral factors in HIV risk prediction. The model’s robustness suggests its potential for practical implementation, aiding early identification and intervention in high-risk groups. Future research should incorporate additional socioeconomic variables and validate the model in diverse populations to enhance global HIV prevention efforts. The web application, implemented using the Django framework, is freely available online for public access. PREDICT-HIV may support earlier identification and intervention in underserved populations, improving the efficiency of HIV screening programs.

Introduction

The Human Immunodeficiency Virus (HIV) remains a significant global health challenge, responsible for millions of deaths worldwide. According to the World Health Organization (WHO), there were approximately 39 million people worldwide living with HIV at the end of 2022¹. The WHO African Region remains most severely affected, with nearly 1 in every 25 adults living with HIV. The burden of disease caused by HIV is profound, impacting not only the health and life expectancy of individuals but also affecting economies and societies at large. The global age-standardized HIV/Acquired Immune Deficiency Syndrome (AIDS) disability-adjusted life years (DALYs) rate was 601.49 (95% UI 536.16–703.92) per 100,000 cases in 2019². While global HIV incidence and AIDS-related deaths declined during this period, both metrics increased in the Middle East and North Africa (MENA) region. This trend contrasted sharply with global improvements³. It aligns with recent data from Iran indicating that only 40.9% of Iranian adults have adequate HIV transmission knowledge, and less than 10% underwent HIV testing in the past year—despite Iran being a Joint United Nations Programme on HIV/AIDS (UNAIDS)-prioritized fast-track country with over 46,000 People Living with HIV (PLWH) in 2023⁴.

The impact of HIV is particularly pronounced in developing countries, where it remains a critical public health concern. Despite global efforts, these regions continue to bear a disproportionate burden of the HIV epidemic. The complexities in these areas, ranging from limited healthcare infrastructure to social stigma, exacerbate the challenge of HIV prevention and management.

Traditional HIV screening methods, often reliant on direct testing in healthcare facilities, face numerous challenges in developing countries. Issues such as social stigma, limited access to healthcare services, and a lack of awareness about HIV risk factors contribute to late diagnosis and treatment. This delay in diagnosis can lead to higher transmission rates and a greater overall disease burden⁵.

HIV prediction tools can help healthcare services identify high-risk individuals and allocate resources effectively¹⁹. In recent years, machine learning has become increasingly prominent in HIV research. This trend is driven by the ability of machine learning techniques to handle extensive datasets with numerous covariates, manage complex relationships between predictors and outcomes, and achieve high levels of accuracy⁶. Several approaches to HIV diagnosis prediction have been reported. A 2023 study in Kenya analyzed over four million HIV test records, comprising 68 selected variables, using random forest algorithms⁷ and reported a sensitivity and specificity of 70% and 83%, respectively. Also in 2023, Nisa et al.⁸ conducted a detailed analysis of electronic records from Pakistan comprising 47,110 entries with 57 attributes, in which Random Forest achieved an accuracy of 76%. Wang et al.⁹ predicted intervention non-responsiveness for HIV prevention in the Bahamas in 2023 using Boruta feature selection and Random Forest in 2564 students, with a sensitivity and specificity of 85% and 78%, respectively. Burns et al.¹⁰ developed an Electronic Health Record (EHR)-based model to predict HIV diagnosis in the USA in 2022, analyzing a substantial dataset of 998,787 patients from a Southern medical system using the least absolute shrinkage and selection operator (LASSO) and Extreme Gradient Boosting (XGBoost) techniques, with a sensitivity and specificity of 80% and 77%, respectively. The complete literature review of the HIV diagnosis methods is provided in Supplementary Table S1. Although machine learning has been applied to predict HIV risk in clinical and survey settings, few models have been tailored for localized, community-level screening in middle-income countries.

Most of these studies highlighted the significance of not only drug usage or injecting behavior but also socio-demographic factors like marital status and age in health outcome predictions. Additionally, for the entire cohort, the most significant positive predictors were male gender, having a male sexual partner, a history of domestic or sexual abuse, and a history of drug use. On the other hand, the most notable negative predictors included having a female partner, a greater number of positive urine toxicology tests, and being older.

The proposed study (PREDICT-HIV) introduces a novel HIV screening approach that emphasizes the use of demographic and lifestyle variables to identify individuals at higher risk in developing countries. This method is particularly crucial in regions where conventional screening practices face substantial barriers, and demographic and lifestyle data availability can be a game-changer. By focusing on specific risk factors prevalent in these populations, the approach seeks to enable early identification and intervention, which is essential in managing the spread of HIV. Beyond individual screening, this method can potentially influence public health strategies significantly. It offers a foundation for tailoring prevention and treatment programs more effectively based on a comprehensive understanding of local factors contributing to HIV risk. This aspect is especially critical in resource-limited settings, where the efficiency and reach of health interventions must be maximized. The study aims to fill a pivotal gap in HIV prevention and treatment in Iran, a developing country in the Middle East and North Africa (MENA), using innovative screening methods that leverage demographic and lifestyle data. To enhance accessibility, we also developed a user-friendly web application, allowing public and practitioner use of the model to improve early screening. The ultimate goal is to enhance early detection, support targeted interventions, and substantially reduce the HIV burden in these high-impact regions.

Results

Descriptive statistics

The demographic, socioeconomic characteristics, and risk factors of the participants are shown in Table 1, based on the status of the HIV diagnosis. The analysis revealed that HIV-positive individuals were predominantly older, female, and unemployed compared to HIV-negative individuals. They also had higher rates of risky behaviors, such as intravenous drug use and same-gender sexual activity. Marital status differences indicated higher rates of HIV positivity among widowed and divorced individuals. The presence of a serodiscordant partner and a history of blood transfusion were more common in the HIV-positive group. These findings suggest that specific demographic and behavioral risk factors are significantly associated with HIV positivity, highlighting the need for targeted prevention and intervention strategies.

Table 1 Demographic characteristics and univariate statistical analysis in HIV positive and negative subgroups.

Classification results

In this study, we employed stratified fivefold cross-validation on the dataset, which achieved a cross-validation area under the curve (cv-AUC) of 0.839 [95% CI 0.835–0.842]. Model performance for each fold, specifically for the XGBoost algorithm, was assessed using various metrics detailed in Table 3.

Figure 1, the SHapley Additive exPlanations (SHAP) plot, illustrates the influence of each predictor on the model’s output. It shows the five most influential predictors, ranked in descending order of impact: drug injection, age, occupation, having a spouse with a history of HIV, and prison record. As confirmed in Tables 1 and 2, these variables exhibited the highest effect sizes among all predictors.

Table 2 Distribution of risk factors and univariate statistical analysis in HIV positive and negative subgroups.

Despite class imbalance, the model maintained good sensitivity (0.80) and specificity (0.90), indicating reliable identification of positive and negative cases. Additionally, the model’s Matthews Correlation Coefficient (MCC) of 0.54 [0.53–0.54] indicates a moderate correlation between predicted and observed class labels, while a Cohen’s Kappa of 0.51 [0.51–0.52] suggests a fair-to-good agreement rate beyond chance in class labeling.
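As an illustration of how MCC and Cohen’s Kappa are obtained from a confusion matrix, the following minimal Python sketch may help. The counts are hypothetical (a population of 10,000 chosen for readability), not the study’s actual confusion matrix, and the function names are ours.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: agreement between predicted and true labels beyond chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of predicted and actual classes.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Hypothetical counts for illustration only (not taken from the study).
TP, FP, FN, TN = 752, 906, 188, 8154
print(f"MCC   = {mcc(TP, FP, FN, TN):.3f}")
print(f"Kappa = {cohen_kappa(TP, FP, FN, TN):.3f}")
```

Both indices account for the class imbalance that plain accuracy ignores, which is why they sit well below the accuracy computed from the same counts.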

Discussion

This study demonstrates that a machine learning-based model can effectively identify individuals at elevated risk of HIV using routine demographic and behavioral data, offering a low-cost, scalable screening strategy in underserved communities. The insights gained from this research can inform more targeted and efficient HIV screening and prevention programs, ultimately contributing to the reduction of HIV transmission in high-impact regions. Future studies should aim to refine these models further and explore their application in diverse populations to enhance global HIV prevention efforts. The success of machine learning approaches in previous studies, such as those by Wang et al. (2022) and Kagendi et al. (2023), provides a strong foundation for continued innovation in this field.

Our study identified drug injection, age, the subject’s occupation, having a spouse with a history of HIV, and prison records as the most influential predictors for HIV status. These findings are consistent with previous research, which emphasizes the significant role of socio-demographic and behavioral factors in HIV transmission risk. For instance, Burns et al. (2022) identified sex, drug use, and a history of sexual abuse as strong predictors of HIV risk in a US cohort¹⁰. Similarly, Nisa et al. (2022) highlighted the importance of injecting behavior, marital status, and age in predicting HIV outcomes in Pakistan⁸. The alignment of our findings with these studies underscores the universal relevance of these risk factors across diverse populations.

In particular, drug injection was found to be a highly significant predictor, which is in line with previous studies such as those conducted by Nisa et al.⁸, which also emphasized the role of injecting behavior in HIV transmission. It highlights the need for targeted interventions focusing on drug users to reduce HIV spread.

Age as a predictor reflects the increased vulnerability of specific age groups, particularly those between 40 and 50, who may have accumulated higher risk exposures over time. It is consistent with Orel et al.'s (2022) findings, which noted age as a significant predictor in their study of African populations¹¹. These results suggest that middle-aged adults may be at elevated risk, highlighting the need for age-specific public health strategies.

Having a spouse with a history of HIV emerged as a significant predictor, underscoring the role of close personal relationships in HIV transmission. It aligns with findings from other studies that highlight the importance of sexual and marital relationships in HIV risk, such as the work by Burns et al.¹⁰. It suggests that public health interventions should also focus on the partners of HIV-positive individuals to prevent further spread.

The subject’s occupation and prison record were also significant predictors, indicating that certain occupations and incarceration history are associated with higher HIV risk. It is consistent with findings from Kagendi et al.⁷, who identified socio-demographic characteristics as essential predictors of HIV outcomes. These results suggest the need for workplace interventions and better health services within the prison system.

The XGBoost classifier demonstrated consistent performance across different validation folds, as outlined in Table 3. The model achieved an average sensitivity of 0.80 [95% CI 0.79–0.80], indicating a strong ability to correctly identify positive cases. Specificity was high at 0.90 [95% CI 0.89–0.90], reflecting the model’s accuracy in recognizing negative cases. The overall accuracy of the model was robust, averaging 0.89 [95% CI 0.88–0.89]. The Positive Predictive Value (PPV) and Negative Predictive Value (NPV) were 0.44 [95% CI 0.44–0.45] and 0.98 [95% CI 0.98–0.98], respectively, highlighting the model’s efficiency in confirming true positive and negative cases. The Matthews correlation coefficient (MCC) and Cohen’s kappa were 0.54 [95% CI 0.53–0.54] and 0.51 [95% CI 0.51–0.52], respectively, indicating moderate agreement and performance consistency across folds. These metrics collectively demonstrate the XGBoost model’s reliability and effectiveness in predicting HIV status. However, because PPV and NPV depend on the prevalence of the positive class, the performance of PREDICT-HIV in a population should be assessed using the (unbiased) PPV and (unbiased) NPV. The (unbiased) PPV is the probability of having HIV when the output of the system is positive. It can be calculated using Bayes’ theorem as follows:

$$\left(unbiased\right) PPV= \frac{Se\times P}{Se\times P+(1-Sp)\times (1-P)}$$

(1)

where Se, Sp, and P are the sensitivity, specificity, and prevalence of the positive class in the population, respectively. Similarly, the (unbiased) NPV gives the probability of not having HIV when the output of the system is negative:

$$\left(unbiased\right) NPV= \frac{Sp\times (1-P)}{Sp\times (1-P)+(1-Se)\times P}$$

(2)

Table 3 The performance of the XGBoost classifier on different folds and the resulting cross-validated indices.

Proofs of the above formulas are provided elsewhere¹²,¹³. The formulas can be used to assess the performance of PREDICT-HIV in a population where the prevalence of HIV (parameter P) is known a priori.
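To make Eqs. (1) and (2) concrete, the sketch below evaluates them in Python for the cross-validated sensitivity (0.80) and specificity (0.90) reported above. The prevalence values are hypothetical inputs chosen for illustration, except 0.094, which approximates the proportion of HIV-positive subjects in the study sample; the function names are ours.

```python
def unbiased_ppv(se, sp, p):
    """Eq. (1): probability of HIV given a positive model output."""
    return se * p / (se * p + (1 - sp) * (1 - p))


def unbiased_npv(se, sp, p):
    """Eq. (2): probability of no HIV given a negative model output."""
    return sp * (1 - p) / (sp * (1 - p) + (1 - se) * p)


SE, SP = 0.80, 0.90  # cross-validated sensitivity and specificity from the text

# Hypothetical prevalences; 0.094 is roughly 3693 / 39,295 (the study sample).
for prevalence in (0.001, 0.01, 0.094):
    ppv = unbiased_ppv(SE, SP, prevalence)
    npv = unbiased_npv(SE, SP, prevalence)
    print(f"P = {prevalence:.3f}  PPV = {ppv:.3f}  NPV = {npv:.3f}")
```

At the sample prevalence the formulas give values close to the reported PPV and NPV, whereas at the much lower prevalence typical of a general population the PPV falls sharply. This is why the prevalence-adjusted indices matter when the tool is deployed outside a high-risk sample.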

The XGBoost classifier achieved a cross-validated AUC of 0.839, indicating a strong discriminatory ability. This performance is consistent across multiple folds, underscoring the robustness of the model. The high NPV (0.98) suggests that the model is particularly effective in identifying individuals who are not at risk, which is crucial for reducing unnecessary testing and focusing resources on high-risk groups.

In comparison, the study by Wang et al.⁹ on Bahamian adolescents also demonstrated high model performance with an AUROC of 0.93, using Random Forest as the classifier. The superior performance metrics in both studies highlight the efficacy of machine learning models in predicting HIV risk. Our model’s accuracy (0.89) and sensitivity (0.80) are competitive with these results, suggesting that XGBoost is an effective tool for HIV screening in diverse populations.

Furthermore, the consistency in performance across different cross-validation folds in our study indicates that the model is well-generalized and not overfitted to the training data. This robustness is critical for practical implementation in public health settings, where models must reliably predict outcomes across various subgroups within the population.

Identifying high-risk individuals through machine learning models can significantly enhance targeted intervention strategies. By focusing on the key predictors identified in this study, public health initiatives can be more effectively tailored to address the specific needs and behaviors that contribute to HIV transmission in this population. Kagendi et al.⁷ highlighted the potential of using demographic data to predict HIV viral load hotspots in Kenya, demonstrating the practical application of machine learning in public health. Our study supports this approach, suggesting that early identification and intervention can mitigate the spread of HIV and improve health outcomes in resource-limited settings.

Traditional HIV screening methods often face challenges such as social stigma and limited access to healthcare services, leading to late diagnoses and higher transmission rates. Our approach, leveraging machine learning, offers a more efficient and less intrusive means of identifying at-risk individuals, potentially leading to earlier interventions and better outcomes. It is essential in developing countries, where healthcare infrastructure is often inadequate. For example, Orel et al.¹¹ used machine learning to identify socio-behavioral predictors of HIV status in multiple African countries, demonstrating the benefits of advanced data analytics in overcoming traditional screening barriers. Our study adds to this evidence, showing that machine learning can significantly enhance HIV screening efficiency.

Further research should explore the integration of additional socioeconomic variables and the application of our model to other populations to validate its generalizability. Future studies could benefit from incorporating variables such as education level, household income, and employment status, which are relevant in other studies, such as those by Burns et al.¹⁰ and Nisa et al.⁸.

The created online screening tool provides a convenient and accessible means to assess personal HIV risk using lifestyle, behavioral, and demographic information. Created with an intuitive interface and mobile responsiveness, the tool enables real-time risk evaluation without the need for laboratory analysis or clinical involvement. Its usefulness is both in aiding the early detection of individuals at elevated risk—allowing prompt preventive measures—and in bolstering public health efforts by compiling anonymized data for monitoring and outreach strategies. Moreover, the tool can act as a decision support system for healthcare professionals in environments with scarce resources, where regular HIV testing might not be practical. By converting intricate risk elements into practical information, the web application could improve community involvement, raise awareness, and ultimately help decrease the number of undiagnosed HIV cases.

To enable practical use, the suggested HIV risk prediction model can be incorporated into current health systems in Iran and similar contexts via multiple viable avenues. It can be integrated into national electronic health record (EHR) systems to aid clinical decision-making during regular health visits, particularly in primary care and community health facilities. The model’s streamlined design enables incorporation into current Ministry of Health mobile health (mHealth) apps, broadening access to distant or underserved communities. Moreover, its application can enhance national screening initiatives by focusing on individuals for confirmatory testing according to tailored risk assessments. Due to the model’s dependence on self-reported lifestyle and behavioral aspects, it is especially appropriate for low-resource settings where laboratory testing is scarce. With adequate data management and privacy protections, the model may assist regional epidemiological surveillance and focused educational initiatives to improve prevention efforts.

This study has several limitations. First, the data is from a single city in Iran, which may limit generalizability. Second, self-reported behavioral data may be subject to recall or social desirability bias. Third, the model does not account for dynamic factors such as recent exposures or changes in behavior. Additionally, although machine learning models perform well, they can act as black boxes, limiting interpretability in clinical practice. Also, excluding the “Condom use” variable due to high missing data rates suggests the need for improved data collection methods in future studies.

Educational and behavioral interventions are urgently required to support HIV prevention efforts among the general Iranian population. Risk and social characteristics identified in this study can help health policymakers reduce HIV-related direct and indirect expenditures.

Our machine learning model, trained on community-level demographic and behavioral data, demonstrated strong predictive accuracy and identified key risk factors such as drug injection and having an HIV-positive spouse. The development of a publicly accessible web application further enhances the model’s utility, enabling community health workers and at-risk individuals to estimate HIV risk easily and privately. As a screening approach, it emphasizes the application of demographic and lifestyle variables to identify individuals at higher risk in developing countries. This work highlights the value of AI-assisted tools in advancing public health equity and strengthening community-based HIV prevention strategies.

Material and Methods

Disease diagnosis and surveillance registries

From September 2001 to 2023, a comprehensive study in Shiraz, Southwestern Iran, enlisted 39,295 volunteers aged 18 years or older from both sexes. To recruit participants, we employed two primary strategies. First, we disseminated information publicly about the significance of visiting counseling centers for HIV testing, mainly targeting individuals engaged in high-risk behaviors. Second, we directed these high-risk individuals toward diagnostic centers for HIV screening.

Potential participants provided consent at their initial study visit and underwent pre- and post-test counseling. They were tested for HIV and received a thorough clinical examination. Two criteria were used to confirm HIV-positive cases: Firstly, an individual is considered positive if they have sequential enzyme-linked immunosorbent assay (ELISA) tests indicating the presence of HIV antibodies, which are then confirmed by a Western blot test. Secondly, an individual is also identified if they have a presumptive or definitive diagnosis of a stage 3 or stage 4 condition and/or a CD4 count of less than 350 cells per mm³ of blood in an HIV-infected patient¹⁴.

We gathered data on socio-demographic background, relevant behaviors, and knowledge about HIV/AIDS through an interviewer-administered structured questionnaire. All interviewers, equipped with prior experience in HIV/AIDS-related programs, received training in interview skills, protocol adherence, and questionnaire handling before the study commenced. Key risk predictors identified included socio-demographic variables and types of sexual orientation at the start of monitoring. Data were collected in face-to-face interviews using standardized questionnaires. Those who tested negative were informed about the significance of the window period and advised to retake the test after three months. HIV-positive individuals were then included in the follow-up Surveillance cohort.

A cumulative total of 3761 HIV/AIDS cases were reported to the HIV/AIDS registry since the inception of the study. The registry is routinely updated with the vital status of cases.

This study protocol received approval from the ethics committees of the Shiraz University of Medical Sciences and the Institutional Review Board of the Isfahan University of Medical Sciences (#IR.MUI.DHMT.REC.1403006) and conformed to the Declaration of Helsinki. All participants provided written informed consent before enrollment.

Data description

We included 39,295 subjects from Shiraz, the Non-Communicable Diseases (NCDs) department, and the vice chancellor for health; among them, 3693 (9.3%) were HIV positive. The percentage of missing values ranged from 0.6 to 54%, with the highest value observed for the “Condom use” variable. We excluded the “Condom use” variable from the classification since its missing percentage was higher than 5%¹⁵. The occurrence of missing data on the dependent and independent variables of the enrolled subjects was random (Logistic regression¹⁶; p = 0.846). Multiple Imputation by Chained Equations (MICE) was then used to impute missing values in other variables. Age was categorized as between 18 and 30, 30 and 40, 40 and 50, 50 and 65, and > 65¹⁷. The final dataset included 39,295 unique subjects. The flow diagram of our proposed algorithm (PREDICT-HIV) used in our study is provided in Fig. 2.
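The two preprocessing rules described above (dropping a variable whose missing percentage exceeds a threshold, and binning age) can be sketched in plain Python as follows. This is a minimal illustration, not the authors’ pipeline: the records and variable names are hypothetical, the bins are assumed to be closed on the right (the text does not say which bin a boundary age falls into), and the MICE imputation step is not reproduced.

```python
from bisect import bisect_left

AGE_EDGES = [30, 40, 50, 65]  # upper bounds of the first four bins
AGE_LABELS = ["18-30", "30-40", "40-50", "50-65", ">65"]


def age_category(age):
    """Map an age to a bin label; bins are assumed closed on the right."""
    if age < 18:
        raise ValueError("participants were aged 18 years or older")
    return AGE_LABELS[bisect_left(AGE_EDGES, age)]


def missing_fraction(records, column):
    """Share of records in which `column` is absent or None."""
    return sum(1 for r in records if r.get(column) is None) / len(records)


def columns_to_keep(records, columns, threshold=0.05):
    """Keep only the columns whose missing fraction is within the threshold."""
    return [c for c in columns if missing_fraction(records, c) <= threshold]


# Hypothetical toy records: "condom_use" is missing in half of them.
records = [
    {"age": 25 + i, "condom_use": None if i % 2 == 0 else "yes"}
    for i in range(20)
]
print(columns_to_keep(records, ["age", "condom_use"]))
print([age_category(a) for a in (18, 30, 31, 65, 66)])
```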

Statistical methods

Descriptive statistics were calculated for the subject’s characteristics. Statistical methods were applied to assess the relationships between various demographic and behavioral factors in HIV-positive and HIV-negative subgroups. The Mann–Whitney U test was employed to analyze ordinal variables. For nominal variables, the Chi-square test was used to evaluate the association between categorical variables. When the Chi-square test assumptions were not satisfied, Fisher’s exact test was applied to provide a more accurate measure of association in small sample sizes.

Furthermore, Odds Ratios (OR) with 95% Confidence Intervals (CI) were calculated to quantify the strength of the association between the variables and HIV status. Effect sizes were also computed to measure the magnitude of these associations, with appropriate interpretation based on the context and scale of the variables. The significance level for all tests was set at p < 0.05. These methods ensure robust statistical analysis and reliable interpretation of the study’s findings. The statistical analysis and calculations were performed using the SPSS statistical package, version 29.0 (IBM Corp. Released 2023. IBM SPSS Statistics for Windows, Version 29.0.2.0 Armonk, NY: IBM Corp). The machine learning procedures were performed using Python (Python Software Foundation. Python Language Reference, version 3.10.5 Available at https://www.python.org/downloads/release/python-3105/).
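For a binary risk factor, the odds ratio, its 95% CI, and the Chi-square test can all be computed from a single 2×2 table. The Python sketch below is illustrative only: the counts are hypothetical, the CI uses the common log-scale (Woolf) approximation, and the Chi-square statistic is Pearson’s without continuity correction. The text does not state which variants SPSS was configured to report.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a log-scale (Woolf) confidence interval.

    a: exposed, HIV+    b: exposed, HIV-
    c: unexposed, HIV+  d: unexposed, HIV-
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail p-value is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for illustration only.
or_, lo, hi = odds_ratio_ci(30, 70, 10, 90)
chi2, p = chi_square_2x2(30, 70, 10, 90)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f}), chi2 = {chi2:.2f}, p = {p:.4f}")
```

A confidence interval that excludes 1 corresponds to an association that is significant at the chosen level, which matches the p < 0.05 criterion used throughout.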

Machine learning methods

The classification was performed using gradient boosting. Gradient boosting is a general and effective machine learning method based on first-order iterative optimization, in which the derivative of the loss function with respect to the predicted values is used. Each new learner is therefore fitted to pseudo-residuals (the negative gradients of the loss) rather than to the ordinary residuals¹⁸. Because the objective is to minimize the loss function, each iteration concentrates on the regions where the current model has learned least and improves performance there. This learning ensemble tends to improve accuracy with some small risk of less coverage¹⁹.

Extreme Gradient Boosting (XGBoost) stands out as a popular variant of gradient boosting, known for its speed and performance optimization through a second-order approximation. This machine learning algorithm is particularly effective with small to medium-structured (tabular) data. XGBoost (Fig. 3) incorporates several advanced features, such as a scoring function to assess tree quality, which is vital for model accuracy. It also utilizes regularized learning to balance weights, reducing the risks of overfitting and underfitting. Innovative methods like shrinkage and feature subsampling are employed to mitigate overfitting, with subsampling randomly selecting features for each tree to increase diversity. The algorithm’s capability to handle sparse data through sparsity-aware split finding facilitates efficient processing of missing values.

XGBoost was selected due to its superior performance in classifying small-to-medium complex (tabular) data, its ability to handle missing data, and its robustness against overfitting and underfitting. Because a tree ensemble cannot be optimized with traditional methods in a Euclidean space, XGBoost’s gradient tree boosting is trained in an additive manner, which better suits complex data structures. Its greedy algorithm approach, which selects the best gain for split points, enhances its efficiency. The parallel learning feature of XGBoost, where data is split into blocks and processed across multiple cores, accelerates computation and supports out-of-core processing for large datasets. Additionally, XGBoost utilizes a Weighted Quantile Sketch approach for parallel quantile calculation in each data subset, aiding in the efficient approximation of quantile histograms. These features, together with block compression and sharding techniques, improve memory usage and computational speed, making XGBoost a versatile tool for various data-intensive tasks²⁰.

XGBoost distinguishes itself by employing the Newton–Raphson method in the function space, in contrast to traditional Gradient Boosting, which relies on gradient descent. This approach involves using a second-order Taylor approximation in the loss function, establishing a direct link to the Newton–Raphson method, thereby enhancing the algorithm’s efficiency in optimizing complex models.

The details of a generic regularized XGBoost algorithm are provided in the Supplementary Method S2. We used penalized XGBoost with L₂ regularization. In our study, the learning rate was set to 0.03 to control the shrinkage of each tree’s contribution. The maximum depth of each tree was set to six.
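The second-order (Newton) update with L₂ regularization and shrinkage can be illustrated with a deliberately simplified Python sketch. It assumes logistic loss and a "tree" with a single leaf, so no split finding or depth limit is involved. The learning rate of 0.03 is taken from the text, while the regularization strength and the toy labels are hypothetical. It is not the XGBoost library itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, raw_scores):
    """Mean logistic loss for labels y and raw (log-odds) scores."""
    total = 0.0
    for yi, f in zip(y, raw_scores):
        p = sigmoid(f)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)


def newton_leaf_weight(y, raw_scores, reg_lambda):
    """Optimal weight of one leaf: w = -G / (H + lambda).

    G and H are the sums of first and second derivatives of the
    logistic loss with respect to the raw score of each sample.
    """
    grads = [sigmoid(f) - yi for yi, f in zip(y, raw_scores)]
    hess = [sigmoid(f) * (1 - sigmoid(f)) for f in raw_scores]
    return -sum(grads) / (sum(hess) + reg_lambda)


LEARNING_RATE = 0.03  # shrinkage, as stated in the text
REG_LAMBDA = 1.0      # hypothetical L2 penalty; the text gives no value

y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # toy labels, about 10% positive
scores = [0.0] * len(y)             # initial raw scores
initial_loss = log_loss(y, scores)

for _ in range(200):  # each round adds one shrunken single-leaf "tree"
    w = newton_leaf_weight(y, scores, REG_LAMBDA)
    scores = [f + LEARNING_RATE * w for f in scores]

final_loss = log_loss(y, scores)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

A first-order method would step along the negative gradient alone. Dividing by the summed second derivatives plus the penalty is what links the update to the Newton–Raphson method, and the small learning rate means many rounds are needed before the scores settle.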

Model evaluation

In this study, our objective was to address a binary classification problem. In our methodology, we adhered to the TRIPOD + AI standard²¹ to ensure consistent and high-quality implementation of artificial intelligence techniques throughout our study. To achieve optimal model performance, we utilized stratified fivefold cross-validation to identify the most effective hyperparameters. In this validation, data were randomly split into five stratified folds to maintain the proportion of HIV-positive cases across subsets. Each fold served once as the test set, with the remaining folds used for training. This approach ensured a comprehensive and unbiased selection of the best parameters for the model. We evaluated the model’s performance using the Area Under the ROC Curve (AUC). AUC is a widely recognized metric in machine learning for assessing the quality of binary classification models. It summarizes the relationship between recall (sensitivity) and the false-positive rate (1 − specificity) across all decision thresholds, reflecting the model’s ability to rank positive cases above negative ones. This metric is particularly valuable when the balance between sensitivity and specificity is crucial to the model’s success²².
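The stratified splitting and the AUC computation can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors’ code: the fold assignment is a simple round-robin within each class, the AUC uses its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as one half), and the labels and scores are toy values.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, preserving the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 10% positives, mirroring an imbalanced outcome.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)
for i, test_idx in enumerate(folds):
    train_idx = [j for f in folds if f is not test_idx for j in f]
    n_pos = sum(labels[j] for j in test_idx)
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)} positives={n_pos}")

print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise AUC shown here is quadratic in the sample size and is meant only to expose the definition. A sort-based implementation would be preferable for a dataset of this study’s size.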

In binary classification, a confusion matrix serves as a vital tool for visualizing the performance of a machine-learning algorithm. Structured as a table, it categorizes the predictions made by the algorithm against the actual outcomes. Each row of the matrix corresponds to the actual classes, while each column represents the classes as predicted by the algorithm. Within this framework, a ‘True Positive’ (TP) occurs when the model correctly identifies a positive condition. Conversely, a ‘True Negative’ (TN) is when the model accurately predicts a negative condition. In cases where the model incorrectly predicts a positive condition, it is labeled as a ‘False Positive’ (FP). Lastly, a ‘False Negative’ (FN) denotes an instance where the model wrongly predicts a negative outcome for a case that is actually positive. These four elements—TP, TN, FP, FN—collectively provide a comprehensive overview of the model’s classification accuracy²³.
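As a minimal sketch of how the four cells are counted, the function below tallies TP, TN, FP, and FN from paired actual and predicted labels. The function name and the toy label vectors are assumptions for illustration, with 1 denoting the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # positive case correctly flagged
        elif actual == 0 and predicted == 0:
            tn += 1  # negative case correctly cleared
        elif actual == 0 and predicted == 1:
            fp += 1  # negative case wrongly flagged
        else:
            fn += 1  # positive case missed
    return tp, tn, fp, fn


# Toy labels (invented): three actual positives, five actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```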

Accuracy is a metric that determines how well the classifier predicts the labels, calculated from the following equation²³:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$

(3)

The model’s performance was evaluated using two key metrics: sensitivity (true positive rate) and specificity (true negative rate). Sensitivity measures the likelihood of correctly identifying positive cases, while specificity assesses the ability to identify negative cases correctly. Equations (4) and (5) quantify the model’s effectiveness in accurately distinguishing true positives and negatives²⁴.

$$Sensitivity=\frac{TP}{TP+FN}$$

(4)

$$Specificity=\frac{TN}{TN+FP}$$

(5)

Another important metric in evaluating a model is the Positive Predictive Value (PPV), which indicates the proportion of positive test results that are correctly diagnosed. On the other hand, the Negative Predictive Value (NPV) represents the proportion of negative test results that accurately identify the absence of the disease. Both PPV and NPV are calculated using specific formulas based on the values from the confusion matrix, providing insights into the reliability of the model’s predictions for both positive and negative diagnoses²⁵:

$$PPV=\frac{TP}{TP+FP}$$

(6)

$$NPV=\frac{TN}{TN+FN}$$

(7)
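Equations (3) to (7) can be evaluated directly from the four confusion-matrix counts. The sketch below uses invented counts, not results from this study, and the function name is an assumption; it also guards against empty denominators by returning NaN.

```python
import math


def ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def basic_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, PPV and NPV from the counts."""
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),  # Eq. (3)
        "sensitivity": ratio(tp, tp + fn),              # Eq. (4)
        "specificity": ratio(tn, tn + fp),              # Eq. (5)
        "ppv": ratio(tp, tp + fp),                      # Eq. (6)
        "npv": ratio(tn, tn + fn),                      # Eq. (7)
    }


# Invented counts for an imbalanced sample of 1000 people (80 positives).
metrics = basic_metrics(tp=30, tn=900, fp=20, fn=50)
```

With these illustrative counts the accuracy is 0.93 while the sensitivity is only 0.375, which shows why accuracy alone can be misleading when positive cases are rare and why the other four metrics are reported alongside it.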

Cohen’s Kappa is commonly used to measure inter-rater reliability for categorical items. In the context of binary classification, the formula for Cohen’s Kappa can be expressed as follows²⁶:

$$\kappa =\frac{2\times \left(TP\times TN-FN\times FP\right)}{\left(TP+FP\right)\times \left(FP+TN\right)+\left(TP+FN\right)\times \left(FN+TN\right)}$$

(8)

Another statistical tool used for model evaluation is the Matthews Correlation Coefficient (MCC), which yields a value between − 1 and + 1. A coefficient of + 1 indicates a perfect prediction, while − 1 signifies complete disagreement between prediction and observation. For binary classification, the MCC can be calculated using the following formula²⁶:

$$MCC= \frac{TP\times TN-FN\times FP}{\sqrt{\left(TP+FP\right)\times \left(TP+FN\right)\times \left(TN+FP\right)\times \left(FN+TN\right)}}$$

(9)
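Both agreement measures can be computed from the same four counts. In the sketch below, Cohen's Kappa is written in its definitional form, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the row and column totals; the MCC follows Eq. (9). The function names and the counts are assumptions for illustration, and returning 0 when a denominator is zero is a common convention rather than something stated in the text.

```python
import math


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's Kappa from observed and chance-expected agreement."""
    n = tp + tn + fp + fn
    p_observed = (tp + tn) / n
    # Chance agreement: both say "positive" plus both say "negative".
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if p_expected == 1.0:
        return 0.0  # convention when only one class is present
    return (p_observed - p_expected) / (1.0 - p_expected)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient as in Eq. (9)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (fn + tn))
    if denominator == 0.0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fn * fp) / denominator


# Invented counts, not results from the study.
kappa = cohen_kappa(tp=30, tn=900, fp=20, fn=50)
mcc = matthews_corrcoef(tp=30, tn=900, fp=20, fn=50)
```

Both statistics equal 1 for a perfect classifier and fall toward 0 when the predictions are no better than chance.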

Feature importance

The association between the probability of being HIV positive and the selected features was evaluated using SHapley Additive exPlanations (SHAP). SHAP is commonly used to explain how a model's output depends on its features and their interactions. It attributes each prediction to the individual features²⁷: a feature's SHAP value is its average marginal contribution to the model's prediction across all possible feature combinations.
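The Shapley value that underlies SHAP can be illustrated by brute-force enumeration on a toy model with three features. The sketch below is not the SHAP library and not the study's model: the value function, its numbers, and the feature keys are invented for illustration, and exhaustive enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(features)
    phi = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                gain = value_fn(coalition | {feature}) - value_fn(coalition)
                total += weight * gain
        phi[feature] = total
    return phi


def toy_value(coalition):
    """Invented risk score: two main effects plus one interaction."""
    value = 0.0
    if "drug_injection" in coalition:
        value += 0.30
    if "prison_record" in coalition:
        value += 0.10
    if {"drug_injection", "prison_record"} <= coalition:
        value += 0.10  # interaction, shared equally by the two features
    return value


features = ["drug_injection", "prison_record", "age"]
phi = shapley_values(toy_value, features)
```

In this toy example the interaction term is split evenly between the two interacting features, the unused feature receives zero, and the attributions sum to the difference between the full model output and the empty-coalition baseline.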

Web application implementation

To implement the HIV Predictor web application, we employed an XGBoost model, which was stored as a JavaScript Object Notation (".json") file. This file includes essential components such as the model structure, learning parameters, training parameters, feature names, and version information. The application's backend was developed using Django (Django Software Foundation. Django. Retrieved from https://djangoproject.com), enabling the model to be loaded and predictions to be made based on user inputs. JavaScript handles real-time communication and data transmission between the front end and back end. The front end, designed with HTML and Cascading Style Sheets (CSS), provides a user-friendly experience. This setup allows the application to run efficiently on a server, providing a valuable tool for predicting HIV status. The web application is freely available at http://hiv.eclinichub.com/.
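The request flow described above (the front end sends the user inputs, the backend scores them and returns a result) can be sketched without any web framework. Everything in the sketch is an assumption for illustration: the feature keys, the response fields, and the stand-in scorer are hypothetical, and in the real application the scoring step would call the XGBoost model loaded from its ".json" file inside a Django view.

```python
import json

# Hypothetical input fields; the application's actual inputs are not listed here.
FEATURE_ORDER = ["age", "drug_injection", "spouse_hiv_history", "prison_record"]


def handle_prediction_request(body, predict_fn):
    """Parse a JSON request body, score it, and return a JSON response string."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "invalid JSON"})
    if not isinstance(payload, dict):
        return json.dumps({"error": "expected a JSON object"})

    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        return json.dumps({"error": "missing fields", "fields": missing})

    try:
        # Arrange inputs in the fixed order the model was trained on.
        row = [float(payload[name]) for name in FEATURE_ORDER]
    except (TypeError, ValueError):
        return json.dumps({"error": "non-numeric field value"})

    return json.dumps({"probability": float(predict_fn(row))})


def stand_in_model(row):
    """Placeholder scorer so the sketch runs without the trained model."""
    return 0.5


request_body = json.dumps(
    {"age": 35, "drug_injection": 1, "spouse_hiv_history": 0, "prison_record": 0}
)
response = handle_prediction_request(request_body, stand_in_model)
```

Keeping the feature order in one place and validating inputs before scoring helps ensure that the values reach the model in the same layout used during training.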

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Code availability

The implemented code is freely available at GitHub: https://doi.org/10.5281/zenodo.15707456

References

World Health Organization. Summary of the Global HIV Epidemic 2023. https://www.who.int/data/gho/data/themes/hiv-aids (accessed 23 Dec 2023).
Tian, X. et al. Global, regional, and national HIV/AIDS disease burden levels and trends in 1990–2019: A systematic analysis for the global burden of disease 2019 study. Front. Public Health. 11, 1068664. https://doi.org/10.3389/fpubh.2023.1068664 (2023).
SeyedAlinaghi, S. et al. HIV in Iran: onset, responses, and future directions. AIDS 35, 529–542. https://doi.org/10.1097/qad.0000000000002757 (2021).
Karimi, K. et al. Assessing HIV transmission knowledge and rapid test history among the general population in Iran. Sci. Rep. 15, 4944. https://doi.org/10.1038/s41598-025-88844-1 (2025).
Mahajan, A. P. et al. Stigma in the HIV/AIDS epidemic: a review of the literature and recommendations for the way forward. AIDS 22(Suppl 2), S67-79. https://doi.org/10.1097/01.aids.0000327438.13291.62 (2008).
Fieggen, J., Smith, E., Arora, L. & Segal, B. The role of machine learning in HIV risk prediction. Front. Reprod. Health. 4, 1062387. https://doi.org/10.3389/frph.2022.1062387 (2022).
Kagendi, N. & Mwau, M. A machine learning approach to predict HIV viral load hotspots in Kenya using real-world data. Health Data Sci. 3, 0019. https://doi.org/10.34133/hds.0019 (2023).
Nisa, S. U., Mahmood, A., Ujager, F. S. & Malik, M. HIV/AIDS predictive model using random forest based on socio-demographical, biological and behavioral data. Egypt. Inform. J. 24, 107–115. https://doi.org/10.1016/j.eij.2022.12.005 (2023).
Wang, B. et al. Predicting adolescent intervention non-responsiveness for precision HIV prevention using machine learning. AIDS Behav. 27, 1392–1402. https://doi.org/10.1007/s10461-022-03874-4 (2023).
Burns, C. M. et al. Development of a human immunodeficiency virus risk prediction model using electronic health record data from an academic health system in the Southern United States. Clin. Infect. Dis. 76, 299–306. https://doi.org/10.1093/cid/ciac775 (2022).
Orel, E. et al. Prediction of HIV status based on socio-behavioural characteristics in East and Southern Africa. PLoS ONE 17, e0264429. https://doi.org/10.1371/journal.pone.0264429 (2022).
Mohebian, M. R., Marateb, H. R., Mansourian, M., Mañanas, M. A. & Mokarian, F. A hybrid computer-aided-diagnosis system for prediction of breast cancer recurrence (HPBCR) using optimized ensemble learning. Comput. Struct. Biotechnol. J. 15, 75–85. https://doi.org/10.1016/j.csbj.2016.11.004 (2017).
Mansourian, M. et al. In Modelling and Analysis of Active Biopotential Signals in Healthcare, vol. 2 17–11–17–24 (IOP Publishing, 2020).
Radfar, S., Tayeri, K. & Namdari Tabar, H. Practical Guidelines on How to Provide Consulting Services in Behavioral Disorders Centers (Ministry of Health and Medical Education, 2009).
Heymans, M. W. & Twisk, J. W. R. Handling missing data in clinical research. J. Clin. Epidemiol. 151, 185–188. https://doi.org/10.1016/j.jclinepi.2022.08.016 (2022).
Jakobsen, J. C., Gluud, C., Wetterslev, J. & Winkel, P. When and how should multiple imputation be used for handling missing data in randomised clinical trials—A practical guide with flowcharts. BMC Med. Res. Methodol. 17, 162. https://doi.org/10.1186/s12874-017-0442-1 (2017).
SeyedAlinaghi, S. et al. HIV in Iran: Onset, responses, and future directions. AIDS. 35 (2021).
Zhang, Z., Zhao, Y., Canes, A., Steinberg, D. & Lyashevska, O. Predictive analytics with gradient boosting in clinical medicine. Ann. Transl. Med. 7, 152. https://doi.org/10.21037/atm.2019.03.29 (2019).
Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 7. https://doi.org/10.3389/fnbot.2013.00021 (2013).
Chen, T. & Guestrin, C. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).
Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378. https://doi.org/10.1136/bmj-2023-078378 (2024).
Pencina, M. J. & D’Agostino, R. B. Sr. Evaluating discrimination of risk prediction models: The C statistic. JAMA 314, 1063–1064. https://doi.org/10.1001/jama.2015.11082 (2015).
Powers, D. M. W. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv:2010.16061 (2020). https://ui.adsabs.harvard.edu/abs/2020arXiv201016061P.
Yerushalmy, J. Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Rep. 1896(62), 1432–1449 (1947).
Fawcett, T. An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874. https://doi.org/10.1016/j.patrec.2005.10.010 (2006).
Chicco, D., Warrens, M. J. & Jurman, G. The Matthews correlation coefficient (MCC) is more informative than Cohen’s Kappa and Brier score in binary classification assessment. IEEE Access. 9, 78368–78381. https://doi.org/10.1109/ACCESS.2021.3084050 (2021).
Lundberg, S. M. & Lee, S.-I. In Advances in Neural Information Processing Systems, vol. 30 (eds I. Guyon et al.) (Curran Associates, Inc., 2017).

Acknowledgements

The authors would like to express their gratitude to the Vice Chancellor for Health at Shiraz University of Medical Sciences for providing data access and approval for this study. Additionally, we appreciate the support and cooperation of the staff in the NCDs department during the course of the study.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Mehdi Nejat, Hamid Reza Marateb, Mehrshad Alirezaei Farahani and Mohammad Zakaria Rajabi contributed equally to this work.

Authors and Affiliations

Department of Biostatistics and Epidemiology, School of Health, and Student Research Committee, School of Health, Isfahan University of Medical Sciences, Isfahan, Iran
Mehdi Nejat
Biomedical Engineering Department, Engineering Faculty, University of Isfahan, Isfahan, Iran
Hamid Reza Marateb, Mehrshad Alirezaei Farahani & Mohammad Zakaria Rajabi
Institute for Research and Innovation in Health (IRIS), Automatic Control (ESAII), Universitat Politècnica de Catalunya-BarcelonaTech (UPC), Building H, Floor 4, Av. Diagonal 647, 08028, Barcelona, Spain
Hamid Reza Marateb, Miguel Angel Mañanas & Marjan Mansourian
Department of Epidemiology and Biostatistics, School of Health, Isfahan University of Medical Sciences, Hezar Jerib St., Azadi Ave., Isfahan, 8174673461, Iran
Maryam Nasirian, Mohammad Javad Tarrahi & Marjan Mansourian
CIBER of Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), Madrid, Spain
Miguel Angel Mañanas

Contributions

M.N.: Conceptualization, methodology, validation, investigation, resources, data curation, writing—original draft; H.R.M.: Conceptualization, methodology, validation, writing—original draft, funding acquisition; M.A.F.: Methodology, software, validation, data curation, writing—review and editing, visualization; M.Z.R.: Methodology, software, validation, data curation, writing—review and editing, visualization; M.N.: Formal analysis, writing—review and editing, supervision; M.A.M.: Formal analysis, writing—review and editing, supervision, funding acquisition; M.J.T.: Conceptualization, validation, writing—review and editing, supervision, project administration; M.M.: Conceptualization, validation, formal analysis, writing—review and editing, supervision, project administration. All authors read and approved the final manuscript and agreed to be accountable for all aspects of the work.

Corresponding authors

Correspondence to Mohammad Javad Tarrahi or Marjan Mansourian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Nejat, M., Marateb, H.R., Farahani, M.A. et al. Proactive recognition and early detection in communities through targeted HIV screening. Sci Rep 15, 27275 (2025). https://doi.org/10.1038/s41598-025-11029-3

Received: 24 August 2024
Accepted: 07 July 2025
Published: 26 July 2025
DOI: https://doi.org/10.1038/s41598-025-11029-3
