Introduction

Bacterial vaginosis (BV) is a common vaginal syndrome among women of reproductive age, affecting millions of women globally. Mediated by a non-optimal vaginal microbiome, BV is associated with several adverse obstetric and gynecological outcomes. Among these are an increased risk of sexually transmitted infections (STIs), human immunodeficiency virus (HIV) infection, and pre-term birth, as well as a positive correlation with cervical cancer1,2,3,4,5. BV can be diagnosed clinically using Amsel’s criteria, which require at least three of the following four signs: malodor, pH above 4.5, the presence of clue cells, and vaginal discharge6. Alternatively, BV can be diagnosed using Nugent scoring, a microbiological method that quantitatively measures the amount of aerobic and anaerobic bacteria within the vagina. A Nugent score is assigned by using the morphology of gram-positive and gram-variable rods7, with a designated score for BV negative (0–3), indeterminate (4–6), and BV positive (7–10). Traditionally, a positive Nugent score with accompanying clinical symptoms is diagnosed as symptomatic BV8,9.

Sequencing technologies using 16S ribosomal RNA allow for identification of the relative abundances of individual bacterial species in a microbiome, making it possible to characterize the vaginal microbiota in healthy and diseased states. The vaginal microbiota can be further grouped by dominant bacteria into community state types (CSTs)10. CST I is dominated by L. crispatus, CST II by L. gasseri, CST III by L. iners, and CST V by L. jensenii11. CSTs I, II, III, and V are low diversity state types that are lactobacilli dominated. Within CST IV there is usually not a singularly dominant bacterial species, and several different species make significant contributions to the microbiota10. Generally, low diversity lactobacilli-dominated vaginal microbiomes (CSTs I, II, and V) are considered optimal12,13. Other work has shown that microbial compositions vary within the healthy vaginal microbiome, with some ethnic groups, such as Black and Hispanic women, trending towards a larger range of microbiota10,14. However, a diverse vaginal microbiota has traditionally been associated with a positive BV outcome15.

Artificial intelligence (AI) and machine learning (ML) provide an opportunity for the development of predictive models using sequencing data. The expansion of advanced computational methods using AI shows promise in highlighting the differences in important vaginal microbiota that may vary across individuals. Metagenomic community state types (mgCSTs) are one example. This expansion of CSTs uses ML to cluster metagenomic sequencing data and categorizes vaginal microbiomes based on both composition and functional potential16. Further, feature selection methods highlight understudied bacterial combinations important to consider for accurate diagnosis17,18.

Several studies have used ML and sequencing data to predict BV in women. Baker et al. use various ML models to predict BV using 16S rRNA sequencing data from 25 women generated over 10 weeks19. Beck and Foster use both Logistic Regression and Random Forest models to predict BV using 16S rRNA sequencing data from datasets generated by Ravel and Srinivasan20. In a subsequent study, Beck and Foster evaluated important features determined by their models17. These studies use ML to diagnose BV with high accuracy, but few have investigated the effect of variations across race and ethnicity on ML performance.

It is important to understand whether these variations in the vaginal microbiome lead to misdiagnosis when using ML. Existing health disparities are often the byproduct of inadequate access to health care, cultural biases, socioeconomic status, and, at times, discriminatory medical practices, whether intentional or not21,22,23,24,25. These disparities can be further exacerbated by the application of AI and machine learning due to biased data and/or algorithms and inadequate evaluation and auditing practices26,27,28,29,30,31,32. For example, recent work showed ethnic disparity in BV diagnosis in an asymptomatic BV cohort and highlighted bacteria, such as Lactobacillus crispatus, Lactobacillus iners, Gardnerella, and Prevotella, that were significant to accurate BV diagnosis between ethnic groups18.

Within this work, we assess accuracy in BV prediction within a group of 220 women with symptomatic BV from multiple ethnicities. We evaluate four AI/ML algorithms (Random Forest, Logistic Regression, Support Vector Machine, and Multi-layer Perceptron) in predicting BV using vaginal microbiome sequencing data from women of diverse ethnicity. We identify disparities in BV diagnosis in this symptomatic group and utilize paired ethnicity datasets and statistical feature selection methods to reduce disparities in model performance. Through feature selection, we identify unique bacterial communities important for accurate prediction that vary between ethnic groups.

Results

The data used in this work were produced by Srinivasan et al. 33 and comprise samples from 220 women with and without bacterial vaginosis (BV). BV was diagnosed by Nugent scoring of Gram-stained vaginal smears. Patients with a Nugent score of seven or greater were identified as BV positive, and those with a score below seven as BV negative. To predict BV, we used four machine learning (ML) models: Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Multi-layer Perceptron (MLP). The hyperparameters used to optimize each classifier are provided in Supplementary Table 1. Four metrics were used to evaluate the performance of the ML models in predicting BV: balanced accuracy (BACC), area under the precision-recall curve (AUPRC), false positive rate (FPR), and false negative rate (FNR).

Descriptive statistics

Within the dataset, there were 220 women, of which 97 (44%) were White, 75 (34%) were Black, and 48 (22%) were of other ethnicities (i.e., Asian, Native Hawaiian/Pacific Islander, American Indian/Alaska Native, mixed, chose not to disclose their ethnicity, or did not know their race). All ethnic categories were self-described. Figure 1 displays the percentage of BV diagnosis based on Nugent scoring, including by ethnicity. 53% of the women had a positive BV diagnosis with Black women and women of Other ethnicities having a higher prevalence of BV compared to White women (Fig. 1). Upon conducting a chi-square test, we found that there is a significant association between ethnicity and BV outcome (p = 0.0001 < 0.05). In this work, we examine the impact of this association between ethnicity and BV outcome on machine learning performance for predicting BV outcomes across ethnic groups.
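The chi-square test of association described above can be sketched as follows. This is only an illustration: the per-group BV counts are hypothetical (the row totals match the 97, 75, and 48 women reported, but the positive/negative split is assumed), and the use of `scipy.stats.chi2_contingency` is our assumption, since the text does not name the function used.

```python
import numpy as np
from scipy.stats import chi2_contingency

# HYPOTHETICAL counts for illustration only.
# Rows: White, Black, Other. Columns: BV negative, BV positive.
# Row totals match the cohort (97, 75, 48); the split itself is assumed.
contingency = np.array([
    [60, 37],
    [25, 50],
    [19, 29],
])

# Chi-square test of independence between ethnicity and BV outcome.
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

A p-value below the chosen significance level (0.05 in the text) would indicate an association between ethnicity and BV outcome in the table being tested.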

Fig. 1: Descriptive statistics of the dataset used in terms of BV diagnosis for all subjects and subjects by ethnicity.

BV diagnosis was based on Nugent score. Patients with a Nugent score of 7 or above are diagnosed as BV positive; patients with a Nugent score below 7 are diagnosed as BV negative.

Figure 2a displays a two-dimensional t-distributed stochastic neighbor embedding (t-SNE) projection of the operational taxonomic unit (OTU) variables mapped to BV diagnosis based on Nugent scoring. In the t-SNE projection, most of the data are separable by BV diagnosis. However, some samples are not well separated, which may pose challenges for diagnosis using AI/ML models. To further explore the impact of dominant bacterial species on BV diagnosis, a t-SNE projection mapped to community state type (CST) classification is shown (Fig. 2b). The plot separates well by CST, with the majority of CST I in the BV negative cluster, and the majority of CST IV in the BV positive cluster. The cluster of mixed BV diagnosis is composed largely of CST III samples, indicating that L. iners-dominant microbiomes have mixed diagnoses.
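A two-dimensional t-SNE embedding of this kind could be produced roughly as in the sketch below. The data are synthetic stand-ins for the relative-abundance matrix, and the settings (`perplexity=15`, `init="pca"`, `random_state=0`) are assumptions, since the t-SNE configuration is not stated in the text.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the abundance matrix: 60 samples x 20 taxa,
# each row summing to 1 like a relative-abundance profile.
X = rng.dirichlet(np.ones(20), size=60)
# Synthetic BV labels, used only to color the points in a plot.
bv_label = rng.integers(0, 2, size=60)

# Project to two dimensions; all hyperparameters here are assumed.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)
```

The two columns of `embedding` can then be scattered and colored by BV diagnosis or by CST to obtain plots analogous to the two panels described above.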

Fig. 2: Visualization of sequencing data in two-dimensional space.

t-SNE plot of 16S rRNA bacterial variables by (a) BV diagnosis based on Nugent scoring and by (b) community state type.

Figure 3 displays the percentage and count of women in each CST across ethnicity. CST IV is the predominant CST for Black (56%) and Other (50%) women. CST III, which is L. iners dominated, is the second most common state type for women in these two groups (34.7% of Black women and 25% of Other women). CST I, the L. crispatus dominated microbiome, is the third most common CST among Black women (8%) and women labeled as Other (22.9%). In contrast, CST III is the most common state type among White women (39.2%) in this cohort, followed by CST IV (33%) and CST I (26.8%). Each of the three ethnic groups had only one patient with CST V (L. jensenii dominated). No group had patients categorized as CST II (L. gasseri dominated).

Fig. 3: Community state type distribution across ethnic groups.

Community state type (CST) distribution within the (a) White, (b) Black, and (c) Other ethnicities. CST I is dominated by L. crispatus, CST II by L. gasseri, CST III by L. iners, and CST V by L. jensenii. CST IV consists of diverse bacteria with no Lactobacillus dominance.

Model performance varies by ethnicity in BV diagnosis

Table 1 shows the average balanced accuracy (BACC), area under the precision recall curve (AUPRC), false positive rate (FPR), and false negative rate (FNR) of the four ML models in predicting BV. Overall, the ML models performed well (BACC: 0.90–0.92; AUPRC: 0.93–0.96; FPR: 0.07–0.10; FNR: 0.10–0.10). Random Forest (RF) and Logistic Regression (LR) had higher BV predictive performances compared to other models, depending on the metric; however, there were no statistically significant differences in performance metrics (Table 1).

Table 1 Overall model performance for RF, LR, SVM and MLP models in terms of balanced accuracy (BACC), AUPRC, false positive rate (FPR), and false negative rate (FNR) with 95% confidence intervals

Upon examining the performance of ML models by ethnic group, we found differences in predictive outcomes (Fig. 4, Supplementary Table 2). Overall, Black women had the lowest balanced accuracy (BACC) (Fig. 4a) and highest FPRs (Fig. 4c) across all models. In contrast, FNR tended to be lower for White women, except when using the Multi-layer Perceptron (MLP) model (Fig. 4d).

Fig. 4: Model performance by ML architecture type after 10 stratified train-test runs (with nested grid search cross validation in each run).

Boxplots showing the median, upper quartile, lower quartile, and outliers of (a) balanced accuracy, (b) area under the precision-recall curve (AUPRC), (c) false positive rate (FPR), and (d) false negative rate (FNR). Asterisk (*) indicates group pairs with statistically significant differences in model performance.

In summary, most models, except for MLP, tended to perform worse for Black women compared to White women and women of Other ethnicities. However, MLP tended to perform the most comparably across all ethnic groups.

Using paired-ethnicity training to improve model performance

In this subsequent analysis, we sought to determine whether training and testing using data of the same ethnicity (i.e., paired-ethnicity training) would reduce ethnic disparities in model performance. We only show results for Logistic Regression (LR) since it had the highest overall balanced accuracy (Table 1).

Paired-ethnicity training (Fig. 5, Supplementary Table 3) for White and Black women resulted in either increased or comparable performance relative to training on samples from all ethnic groups. However, these improvements were not statistically significant. In contrast, for women of Other ethnicities, all performance measures except FNR degraded with statistical significance (balanced accuracy: p = 0.002; AUPRC: p = 0.037; FPR: p = 0.004).

Fig. 5: Model performance by ethnicity with and without ethnicity-specific training (i.e., paired-ethnicity and cross-training) with LR model.

Results are from 10 stratified train-test runs (with nested grid search cross validation in each run). Boxplots showing the median, upper quartile, lower quartile, and outliers of (a) balanced accuracy, (b) area under the precision-recall curve (AUPRC), (c) false positive rate (FPR), and (d) false negative rate (FNR). Asterisk (*) indicates group pairs with statistically significant differences in model performance.

We also examined whether these models could be generalizable to ethnic groups not used in the training process (i.e., cross-training). Overall, cross-training tended to only result in improved predictive performance for women of Other ethnicities (Fig. 5, Supplementary Table 3), particularly regarding balanced accuracy (White: p = 0.048), FPR (White: p = 0.005; Black: p = 0.012), and FNR (White: p = 0.046; Black: p = 0.039). In contrast, we found that paired-ethnicity training tended to result in better predictive outcomes for Black women compared to cross-training with data of women of Other ethnicities (BACC: p = 0.003; FPR: p = 0.004; FNR: p = 0.01). Similarly, paired-ethnicity training frequently resulted in higher predictive performance for White women than cross-training with data of women of Other ethnic groups (balanced accuracy: p = 0.006; AUPRC: p = 0.006; FPR: p = 0.006).

Bacterial taxa highlighted as significant for predicting BV

Using feature selection methods, we identified bacterial taxa that contributed to accurate BV diagnosis. The following feature selection methods were used to extract significant bacterial taxa: Gini Index, T-test, F-test, and Point Biserial (PB) Correlation. Both the p-value (PBsig) and correlation coefficient (PBcorr) of the PB Correlation were used to determine important features from this method. Results are only shown for the LR classifier since it was the best-performing model overall (Table 1).

Overall, the Gini Index method performed the best compared to other feature selection methods (Table 2). When examining model performance by ethnicity (Fig. 6, Supplementary Table 4), improvement in model performance varied for each ethnic group. For White and Black women, feature selection improved most predictive measures, although not statistically significant for most methods. In contrast, performance measures, specifically balanced accuracy and FNR, tended to degrade for women of Other ethnicities across all feature selection approaches. Overall, the PBcorr method tended to degrade most performance measures for each ethnic group.

Fig. 6: Model performance of LR classifier by ethnicity with and without feature selection.

Results are from 10 stratified train-test runs (with nested grid search cross validation in each run). (a–d) Boxplots showing the median, upper quartile, lower quartile, and outliers of balanced accuracy, area under the precision-recall curve (AUPRC), false positive rate (FPR), and false negative rate (FNR).

Table 2 Overall model performance with and without feature selection for LR model in terms of balanced accuracy (BACC), AUPRC, false positive rate (FPR), and false negative rate (FNR) with 95% confidence intervals

To further investigate how to improve equity in model performance, features identified as significant for BV diagnosis for each ethnic group were used to train ML models, independently, using the Gini Index method. Unique bacterial taxa were found in each ethnicity-specific subset for BV diagnosis (Fig. 7). Eggerthella sp. type 1 and Atopobium vaginae (Fannyhessea vaginae) were identified as most significant for BV diagnosis for White women in this cohort, corresponding to significant bacterial taxa identified for the entire cohort. In contrast, Gardnerella vaginalis and L. crispatus were found to be important predictors of BV for women of Other ethnicities. Dialister sp. Type 2 and Gardnerella vaginalis were highlighted as important bacterial taxa for BV diagnosis among Black women. Upon training the Logistic Regression (LR) model with feature sets corresponding to each ethnic group, we found that model performance degraded, except for FNR (Supplementary Fig. 1 and Supplementary Table 5). Using paired-ethnicity training or simplistic training (i.e., no feature selection nor ethnicity-specific training) tended to perform better for each ethnic group.

Fig. 7: Identification of significant bacterial taxa.

The shared, top bacterial taxa indicative of BV identified using the Gini Index are provided overall and across each ethnic group.

Discussion

Accurate diagnosis of bacterial vaginosis (BV) involves an understanding of the complex interplay of microbial communities that exist in a dysbiotic state. Variations in the vaginal microbiome can complicate diagnostic accuracy; however, machine learning and AI-based tools offer innovative methodologies to analyze bacterial compositions in healthy and diseased states. Incorporating population-specific data enables the identification of weaknesses in the AI/ML pipeline (e.g., data generation, model development, model evaluation) by assessing predictive outcomes and key predictors for BV across varying demographic groups. Identification of understudied vaginal microbial communities, which have been shown to vary by race and ethnicity, provides a framework to gain mechanistic insights to answer these questions.

In this study, we evaluate four machine learning classifiers for BV prediction in a cohort of women tested for symptomatic BV. Model performance was comparable to prior analyses that used ML to diagnose women with asymptomatic BV using 16S rRNA sequencing data18,20,34,35. A recent study found disparity in accurate BV classification across ethnic groups within an asymptomatic cohort with predictive accuracy being the highest for White women and the lowest for Asian women18. Differences in model performances were attributed in part to the large variety of community state types seen across Asian women in the dataset10. In this cohort, we found that model performance was highest for White women and lowest for Black women (Fig. 4). This could be a result of the data not being separable by BV diagnosis for these groups or high CST III and IV types, resulting in model difficulty in discriminating between positive and negative samples.

Within this dataset, CST IV, which has high microbial diversity, is present in 56% of Black women and 33% of White women (Fig. 3). There was also a marked difference in Lactobacillus species profiles by subpopulation group. For instance, among Black women, L. crispatus-dominant microbiota account for 8%, L. iners-dominant for 34.7%, and L. jensenii-dominant for 1.3%. Among White women, L. crispatus-dominant microbiota account for 26.8%, L. iners-dominant for 39.2%, and L. jensenii-dominant for 1%. The relative scarcity of Lactobacillus-dominant microbiomes among Black women in this cohort and the dominance of CST IV could both contribute to lower algorithm performance.

Generally, women with L. crispatus-dominated microbiotas were predicted to be BV negative, while L. iners-dominant (CST III) microbiotas had mixed BV outcomes33. Within the context of BV diagnoses, L. iners can be dominant in both healthy and BV states, which can complicate BV diagnosis using ML36. This can be seen in Fig. 2b, where the BV negative cluster is dominated by CST I (L. crispatus) samples, the BV positive cluster is dominated by CST IV (high microbe diversity) samples, and the mixed cluster by CST III (L. iners dominant) samples. These results highlight the need for advanced development of ML algorithms to accurately diagnose BV even when presented with a complexity of microbiome profiles.

To address the ethnic disparity found in model performance, paired-ethnicity training, cross-training, and feature selection were implemented. With paired-ethnicity training, model performance tended to improve or remain comparable for White and Black women, although not with statistical significance, while cross-training improved predictive accuracy of BV diagnosis for women of Other ethnicities. The improved accuracy with cross-training for this group could be due to the difference in sample size between ethnicities, as women of Other ethnicities had a smaller sample size than the Black and White groups. Feature selection tended not to lead to significant improvement in predictive accuracy of BV across ethnic groups (Fig. 6 and Supplementary Fig. 1, Supplementary Table 4, and Supplementary Table 5). These results emphasize the need for further research in developing ML models for improving the predictive accuracy of BV across all ethnic groups.

Important bacterial features vary between ethnic groups and offer insight into key mediators of BV that can be considered on a population level. The features shared between White women and the entire training cohort were Eggerthella sp. type 1 and Atopobium vaginae (Fannyhessea vaginae). For Black women within this cohort, Dialister sp. Type 2 and G. vaginalis were indicated as important predictors of BV, also found to be associated with positive BV outcomes by Srinivasan et al. 33. Among the training samples for grouped ethnicities (i.e., Asian, Native Hawaiian/Pacific Islander, American Indian/Alaska Native, mixed, chose not to disclose their ethnicity, or did not know their race), G. vaginalis and L. crispatus were indicated as the most significant features for predicting BV. G. vaginalis was the one important bacterial feature shared across all ethnic groups and has been widely studied as an etiological agent of BV37,38,39,40,41.

A limitation of our work is that women whose ethnicity was neither Black nor White were grouped, as the sample sizes of those ethnicities were too small to evaluate individually. This limited our ability to assess model performance on sub-populations (Asian, Native Hawaiian/Pacific Islander, American Indian/Alaska Native) within this study. Furthermore, several factors that contribute to variance in results by ethnicity, including environmental and sociocultural factors, were not included in this study. Although the results of this work cannot be generalized given the low number of women represented in this cohort, they are consistent with prior findings of ethnic disparity in BV predictive analysis18.

Overall, these results highlight the need for the development of improved methods for addressing ethnic disparity, including larger and more diverse training sets. Future studies would include developing Fair AI models for accurate BV diagnosis and building larger cohorts for accurate training across groups.

Methods

Data source

The data used in this work were produced by Srinivasan et al. 33 and comprise samples from 220 women with and without bacterial vaginosis (BV). BV was diagnosed by Nugent scoring of Gram-stained vaginal smears7. Participants with BV were treated with intravaginal metronidazole gel used each night for 5 days. The women were recruited from the Public Health-Seattle and King County Sexually Transmitted Diseases Clinic. The protocol was approved by the Institutional Review Board at the Fred Hutchinson Cancer Research Center (IORG0000017) and complied with all relevant ethical regulations, including the Declaration of Helsinki. All study participants provided written informed consent. Women were categorized into the following racial groups: Black, White, Asian, Native Hawaiian/Pacific Islander, American Indian/Alaska Native, mixed, or selected Don’t Know/Does not wish to answer.

ML/AI algorithms

Due to the high dimensionality of the data, we used the following supervised machine learning algorithms to conduct these experiments: Support Vector Machine (SVM), Random Forest (RF), and Multi-layer Perceptron (MLP). Logistic Regression (LR) was also chosen given its use in prior research for predicting BV diagnosis19,20 and examining ethnic disparities in BV predictions18. All classifiers were implemented with the scikit-learn Python library.

The outcome predicted was BV diagnosis based on Nugent scoring. A Nugent score of seven or higher was indicated as BV positive, while a Nugent score below seven was indicated as BV negative.

BV diagnosis was predicted using the 155 bacterial taxa sequenced using 16S rRNA from the V3-V4 hypervariable regions. Srinivasan et al. provide a list of the bacterial taxa for this dataset33.

Evaluation measures

The models were evaluated using the following metrics: balanced accuracy, area under the precision-recall curve (AUPRC), false positive rate (FPR), and false negative rate (FNR). Although the class label (i.e., BV positive and negative) used in this work was balanced overall, it was imbalanced when examined by ethnicity (Fig. 1). Balanced accuracy was therefore chosen to measure the accuracy of the models, since it is robust to imbalanced datasets and provides a fair representation of model performance. For the same reason, the area under the precision-recall curve (AUPRC) was used instead of the area under the receiver operating characteristic curve (AUROC) to better capture model performance across ethnicities. AUPRC, as defined below, corresponds to average precision. Balanced accuracy and AUPRC were computed using the metric functions found in the Python sklearn.metrics package. False positive rate (FPR) and false negative rate (FNR) were measured to examine differences in model performance by ethnicity. Below are the equations for calculating balanced accuracy, AUPRC, FPR, and FNR, where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.

$$Balanced\,Accuracy=\frac{1}{2}\left(\frac{{TP}}{{TP}+{FN}}+\frac{{TN}}{{TN}+{FP}}\right)$$
(1)
$${Precision}=\frac{{TP}}{{TP}+{FP}}$$
(2)
$${Recall}=\frac{{TP}}{{TP}+{FN}}$$
(3)
$${AUPRC}=\sum _{n}\left({{Recall}}_{n}-{{Recall}}_{n-1}\right){{Precision}}_{n}$$
(4)
$${FPR}=\frac{{FP}}{{FP}+{TN}}$$
(5)
$${FNR}=\frac{{FN}}{{FN}+{TP}}$$
(6)
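As a small worked example of the equations above, the sketch below computes the four metrics on a toy set of ten predictions. The labels and scores are made up for illustration, and the 0.5 decision threshold is an assumption. `average_precision_score` implements the summation form of AUPRC given above; whether it is the exact function used in this work is not stated.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    balanced_accuracy_score,
    confusion_matrix,
)

# Toy labels (1 = BV positive) and predicted probabilities; illustrative only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.1, 0.3, 0.05, 0.95])
y_pred = (y_score >= 0.5).astype(int)  # assumed decision threshold

# For binary labels, ravel() yields counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

bacc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))  # balanced accuracy
fpr = fp / (fp + tn)                            # false positive rate
fnr = fn / (fn + tp)                            # false negative rate
auprc = average_precision_score(y_true, y_score)

print(bacc, fpr, fnr, round(auprc, 4))
```

Here the hand-computed balanced accuracy agrees with `balanced_accuracy_score(y_true, y_pred)`, which is a quick consistency check on the confusion-matrix bookkeeping.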

Statistical testing

For statistical analysis, a one-tailed paired t-test (scipy.stats.ttest_rel) or a one-tailed Wilcoxon signed-rank test (scipy.stats.wilcoxon) was used, depending on the normality of the data. Normality was assessed with the Shapiro-Wilk test (scipy.stats.shapiro). For comparing model performance between ethnic groups, pairwise comparisons using a one-tailed Mann-Whitney U rank test were performed.
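The test-selection logic described above might look like the following sketch. Several details are assumptions rather than the authors' stated procedure: normality is checked on the paired differences, the significance level for the Shapiro-Wilk test is 0.05, and the per-run scores are synthetic. The helper name `paired_one_tailed_test` is hypothetical.

```python
import numpy as np
from scipy import stats

def paired_one_tailed_test(a, b, alpha=0.05):
    """Test whether paired scores `a` tend to exceed `b` (hypothetical helper)."""
    a, b = np.asarray(a), np.asarray(b)
    # Assumption: normality is judged on the paired differences.
    _, p_normal = stats.shapiro(a - b)
    if p_normal > alpha:
        result = stats.ttest_rel(a, b, alternative="greater")
        return "ttest_rel", float(result.pvalue)
    result = stats.wilcoxon(a, b, alternative="greater")
    return "wilcoxon", float(result.pvalue)

rng = np.random.default_rng(0)
# Synthetic balanced accuracies over 10 runs for two training schemes.
bacc_paired = rng.normal(0.90, 0.03, size=10)
bacc_all = rng.normal(0.88, 0.03, size=10)
test_name, p_schemes = paired_one_tailed_test(bacc_paired, bacc_all)

# Between-group comparison: one-tailed Mann-Whitney U on synthetic scores.
bacc_black = rng.normal(0.86, 0.03, size=10)
bacc_white = rng.normal(0.92, 0.03, size=10)
u_stat, p_groups = stats.mannwhitneyu(bacc_black, bacc_white, alternative="less")

print(test_name, p_schemes, p_groups)
```

The direction of each one-tailed alternative depends on the metric: "greater" suits balanced accuracy or AUPRC when testing for improvement, whereas "less" would be the natural choice for FPR or FNR.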

General preprocessing procedures

For all experiments, only the 16S rRNA sequence data (i.e., bacterial taxa) from the dataset were used as predictor variables, and the Nugent score was used to indicate BV diagnosis (score >= 7 is BV positive, < 7 is BV negative) as the target variable. Given the bacterial composition of the vaginal microbiome for each subject was presented in the form of relative abundance, the data was normalized by dividing by 100 to ensure all predictor variables were between 0 and 1. The ethnicity of women was specified as White, Black, or Other. The Other category consisted of women who identified as Asian, Native Hawaiian/Pacific Islander, American Indian/Alaska Native, mixed, or selected Don’t Know/Does not wish to answer.
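These preprocessing steps can be expressed compactly. The sketch below assumes the abundances arrive as percentages in a NumPy-compatible array and that ethnicity is a list of strings; the function name `preprocess` and the toy inputs are hypothetical.

```python
import numpy as np

def preprocess(abundance_percent, nugent_scores, ethnicity_labels):
    """Scale abundances to [0, 1], binarize Nugent score, group ethnicity."""
    # Relative abundance in percent -> fraction between 0 and 1.
    X = np.asarray(abundance_percent, dtype=float) / 100.0
    # Nugent score >= 7 is BV positive (1); below 7 is BV negative (0).
    y = (np.asarray(nugent_scores) >= 7).astype(int)
    # Any category other than White or Black is mapped to "Other".
    groups = np.array(
        [e if e in ("White", "Black") else "Other" for e in ethnicity_labels]
    )
    return X, y, groups

# Toy example with three subjects and three taxa (illustrative values).
X, y, groups = preprocess(
    [[90.0, 10.0, 0.0], [5.0, 60.0, 35.0], [20.0, 30.0, 50.0]],
    [2, 8, 7],
    ["White", "Asian", "Black"],
)
print(X, y, groups)
```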

ML performance for predicting BV: overall and by ethnicity

The general preprocessing procedures, indicated previously, were followed. No other preprocessing procedures were needed to complete this experiment.

For each model, an 80–20 train-test split stratified by BV outcome and ethnicity was performed. The sklearn.model_selection GridSearchCV function was employed to perform a grid search on the training data to select the optimal hyperparameters for each classifier (LR, SVM, RF, and MLP). Supplementary Table 1 provides the hyperparameter search space for each classifier. The optimized model was then applied to the test set. The model predictions were evaluated by overall performance as well as by ethnicity in the test set. The process was performed 10 times by iterating the train-test split random state from 10 to 19.
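The repeated split, grid search, and per-ethnicity evaluation can be sketched as below for the LR classifier. The data are synthetic, and the search space (`C` only), the 5-fold inner cross-validation, and the use of balanced accuracy as the selection score are assumptions; the actual search space is in the supplementary table cited above. Joint stratification is done here by combining ethnicity and BV label into a single stratum label, which is one common way to do it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
# Synthetic cohort: 120 subjects, 10 features, three equally sized groups.
groups = np.repeat(["White", "Black", "Other"], 40)
y = np.tile([0, 1], 60)
X = rng.random((120, 10))
X[:, 0] += 0.8 * y  # make one synthetic feature informative

# One stratum per (ethnicity, BV outcome) combination.
strata = np.array([f"{g}_{label}" for g, label in zip(groups, y)])
param_grid = {"C": [0.1, 1.0, 10.0]}  # hypothetical search space

results = []
for seed in range(10, 20):  # random states 10 to 19
    X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
        X, y, groups, test_size=0.2, stratify=strata, random_state=seed
    )
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid,
        scoring="balanced_accuracy",
        cv=5,
    )
    search.fit(X_tr, y_tr)  # tuning uses the training data only
    y_hat = search.predict(X_te)

    run = {"overall": balanced_accuracy_score(y_te, y_hat)}
    for g in np.unique(g_te):
        mask = g_te == g
        run[str(g)] = balanced_accuracy_score(y_te[mask], y_hat[mask])
    results.append(run)

mean_bacc = {k: float(np.mean([r[k] for r in results])) for k in results[0]}
print(mean_bacc)
```

The same loop applies to SVM, RF, and MLP by swapping the estimator and its parameter grid.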

ML performance with ethnicity-specific cohorts

First, all general preprocessing procedures indicated previously were followed. Data was grouped into three independent subsets based on ethnicity (i.e., White, Black, Other) to generate three ethnicity-specific datasets.

For each ethnicity-specific dataset, an 80–20 train-test split stratified by BV outcome was performed. For all three training sets, the sklearn.model_selection GridSearchCV function was employed to perform a grid search on the training data to select the optimal hyperparameters for the Logistic Regression (LR) classifier. Supplementary Table 1 gives the hyperparameter search space of the LR classifier. The optimized model was used to make predictions on the test set of the same ethnicity (i.e., paired-ethnicity training) and was externally validated with the other two ethnicity-specific datasets to assess model transferability (i.e., cross-training).
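A minimal sketch of paired-ethnicity training and cross-training follows. It uses synthetic data for the three subsets and an assumed search space and fold count; a single split (random state 10) is shown for brevity. Cross-training is evaluated here on the full datasets of the other two groups, which is our reading of "externally validated".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)

# Synthetic ethnicity-specific datasets (50 subjects, 8 features each).
datasets = {}
for name in ("White", "Black", "Other"):
    y_g = np.tile([0, 1], 25)
    X_g = rng.random((50, 8))
    X_g[:, 0] += 0.8 * y_g  # one informative synthetic feature
    datasets[name] = (X_g, y_g)

scores = {}  # keyed by (training group, evaluation group)
for train_name, (X_g, y_g) in datasets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_g, y_g, test_size=0.2, stratify=y_g, random_state=10
    )
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},  # hypothetical search space
        scoring="balanced_accuracy",
        cv=5,
    )
    search.fit(X_tr, y_tr)

    # Paired-ethnicity: evaluate on the held-out test set of the same group.
    scores[(train_name, train_name)] = balanced_accuracy_score(
        y_te, search.predict(X_te)
    )
    # Cross-training: evaluate on the other two groups' datasets.
    for test_name, (X_o, y_o) in datasets.items():
        if test_name != train_name:
            scores[(train_name, test_name)] = balanced_accuracy_score(
                y_o, search.predict(X_o)
            )

for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```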

ML performance with feature selection

General preprocessing procedures were followed to address whether feature selection improves overall model performance. Data was grouped into three independent subsets based on ethnicity (i.e., White, Black, Other) to generate three ethnicity-specific datasets.

The F-test, T-test, Gini Index, and Point Biserial test were the feature selection methods used to identify significant bacterial taxa. For the Point Biserial test, two feature sets were obtained, one using the p-value (p < 0.2) and one using the correlation value (correlation value > 0.5). The F-test and T-test were implemented through the stats Python library. For the Gini Index, the DecisionTreeClassifier function in the scikit-learn Python package sklearn.tree was used to find optimal features. Significant features were found for the entire dataset and for each ethnicity subset.
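The four feature selection approaches could be sketched as follows on synthetic data. Several choices are assumptions: the "stats" library is taken to be `scipy.stats`, the t-test and F-test use a 0.05 cutoff (no threshold is given in the text), and Gini-selected features are those with non-zero impurity-based importance in a fitted decision tree. The Point Biserial thresholds follow the text.

```python
import numpy as np
from scipy.stats import f_oneway, pointbiserialr, ttest_ind
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic data: 100 subjects, 6 features, balanced binary outcome.
y = np.tile([0, 1], 50)
X = rng.random((100, 6))
X[:, 0] += 0.8 * y  # strongly informative synthetic feature
X[:, 1] += 0.3 * y  # weakly informative synthetic feature
n_features = X.shape[1]

# Gini Index: features the tree actually used to reduce Gini impurity.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
gini_selected = np.flatnonzero(tree.feature_importances_ > 0)

# T-test and F-test per feature between BV-positive and BV-negative groups.
t_p = np.array(
    [ttest_ind(X[y == 1, j], X[y == 0, j]).pvalue for j in range(n_features)]
)
f_p = np.array(
    [f_oneway(X[y == 1, j], X[y == 0, j]).pvalue for j in range(n_features)]
)
t_selected = np.flatnonzero(t_p < 0.05)  # assumed cutoff
f_selected = np.flatnonzero(f_p < 0.05)  # assumed cutoff

# Point Biserial correlation between the binary outcome and each feature.
pb_corr = np.empty(n_features)
pb_p = np.empty(n_features)
for j in range(n_features):
    pb_corr[j], pb_p[j] = pointbiserialr(y, X[:, j])
pb_sig_selected = np.flatnonzero(pb_p < 0.2)      # p-value criterion
pb_corr_selected = np.flatnonzero(pb_corr > 0.5)  # correlation criterion

print(gini_selected, t_selected, f_selected, pb_sig_selected, pb_corr_selected)
```

With only two classes, the pooled-variance t-test and the one-way F-test give the same p-values, so the two methods select identical features at a common cutoff.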

An 80–20 train-test split, stratified by BV outcome and ethnicity, was performed on the dataset using selected features. A grid search was performed using the sklearn.model_selection.GridSearchCV function to select the optimal model for the Logistic Regression (LR) classifier. Supplementary Table 1 provides the hyperparameter search space of the LR classifier. Model predictions were evaluated by overall performance and by ethnicity. The process was performed 10 times by iterating the train-test split random state from 10 to 19. These steps were executed for each feature selection method. The ML model and corresponding feature selection method that resulted in the highest overall balanced accuracy were selected as the optimal pair. The best-performing feature selection method was used to select features for each ethnicity-specific dataset. The same steps indicated previously were followed to assess the effect of feature selection on model performance for each ethnic group.