Introduction

Clostridioides difficile (C. difficile) is a Gram-positive, anaerobic bacterium capable of forming spores and producing toxins1. C. difficile infection (CDI) is a major cause of antibiotic-associated diarrhea and colitis. It is globally recognized as one of the most significant hospital-acquired infections2,3. In the United States, there were more than 450,000 cases of CDI reported annually from 2011 to 2017. Approximately 29,000 deaths related to CDI were documented4,5. The economic impact of CDI on healthcare costs is substantial in the United States, amounting to 6.3 billion USD. Annual CDI hospital management required nearly 2.4 million days of inpatient stay6. Between 2011 and 2013 in Europe, despite variations among countries, CDI occurred at a rate of 7.0 cases per 10,000 patient days7. The annual cost of CDI in Europe was estimated to be €3 billion8. In Korea, CDI occurred at a rate of 5.06 cases per 100,000 patients, with an economic cost reported to be 15.8 million USD in 2011 (ref. 9).

The pathogenesis of CDI is microbial dysbiosis caused by factors such as antibiotics10. Gut dysbiosis refers to the disruption of the balance of microorganisms in the gastrointestinal tract, leading to the domination and colonization of C. difficile in the large intestine1. The virulence of C. difficile is mostly attributed to enzymes and toxins that can induce the breakdown of gut barrier integrity and loss of functionality11,12. C. difficile produces two important toxins, Toxin A and Toxin B, in its pathogenesis. Traditionally, Toxin A is known as enterotoxin A and Toxin B is known as cytotoxin B.

Risk factors for CDI are mainly known to be antibiotic exposure, old age, and hospitalization. Gastric acid suppression, inflammatory bowel disease, gastrointestinal surgeries, malignancy, transplantations, chronic kidney diseases, and immunosuppressant use are also risk factors for CDI1,13. Severe CDI can lead to the development of pseudomembranous colitis, toxic megacolon, sepsis, and death14. Spores of C. difficile are transmitted by the fecal-oral route. Bacterial infection affecting the colon is spread through both direct and indirect contact. Nevertheless, CDI is occasionally overlooked in clinical settings even when symptoms are present. Thus, it is necessary to monitor high CDI risk groups, particularly those undergoing antibiotic treatment, to prevent complications and spread.

Several studies have employed machine learning algorithms to predict CDI in hospitalized patients. However, none of those studies reported an area under the receiver operating characteristic curve (AUROC) above 0.82 (refs. 15–20). Although these performances cannot be compared directly because of differences in data and cohort criteria, the predictive accuracy needs to be improved. To the best of our knowledge, prior studies have not used deep learning techniques, such as recurrent neural networks (RNNs) and Transformers, which have performed well in several tasks involving time-series data, to predict CDI caused by antibiotics. Furthermore, prior studies have not identified important features for CDI prediction with a clinical rationale.

This study aimed to predict the occurrence of CDI within 28 days after starting antibiotic treatment using longitudinal electronic health record (EHR) data, including vital signs, laboratory tests, and patient information such as demographics, comorbidities, and medications. We trained several machine learning and deep learning models to predict CDI and compared the performances of those models. A timeline of vital signs and laboratory tests with a 35-day monitoring period and a patient information vector consisting of age, sex, comorbidities, and medications were constructed for each patient. The data were collected from two locally separated tertiary hospitals. All trained CDI prediction models were externally validated. Important features were deduced from the trained model and CDI risk variation over time was compared between CDI and non-CDI groups.

Results

Study cohorts and dataset construction

This study included 594,759 patients at Seoul National University Hospital (SNUH) between January 2001 and December 2022 and 520,041 patients at Seoul National University Bundang Hospital (SNUBH) between January 2004 and December 2021, all of whom were aged 18 years or older and received antibiotics. Following our cohort criteria, 529,049 patients at SNUH and 487,803 patients at SNUBH were excluded. The detailed population flowcharts are presented in Fig. 1. Finally, the numbers of patients in the CDI and non-CDI groups were 466 and 65,244 at SNUH and 642 and 31,596 at SNUBH, respectively. Data from SNUH were used for model development and internal validation, while data from SNUBH were used for external validation.

Fig. 1: Population flowcharts.

Data from SNUH were randomly split into development (70% for training and 15% for validation) and internal validation (15%) datasets. Data from SNUBH were used for external validation.

Table 1 shows the baseline characteristics of the included patients. The baseline characteristics were summarized at the initial monitoring point. The incidence of CDI was approximately 0.71% at SNUH and 2.00% at SNUBH. In both hospitals, the CDI group had a higher average age than the non-CDI group. SNUH had a higher proportion of males in the CDI group, while SNUBH showed the opposite trend. Both hospitals showed similar trends in vital signs and laboratory tests except for total bilirubin and ALT levels. Regarding comorbidities, the CDI group in both hospitals had a significantly higher prevalence of most diseases than the non-CDI group. Furthermore, the CDI group received a larger number of antibiotics, while antacid usage was more frequent in the non-CDI group in both hospitals.

Table 1 Baseline characteristics of the included patients

CDI prediction performance

The performances of CDI prediction models are presented in Table 2. We fitted tree-based machine learning models, including random forest and gradient boosting machine (GBM)21, and RNN and attention-based deep learning models, including simple RNN, long short-term memory (LSTM)22, gated recurrent unit (GRU)23, Transformer24, and RETAIN25. The models used in internal and external validation were selected by grid search cross-validation during the development process. The development results of RNN-based models are presented in Supplementary Fig. 1. The development results of tree-based models and attention-based models were omitted due to their poor performances. Consequently, GRU with two layers and 64 nodes was selected as the best model. RNN-based models consistently outperformed both tree-based models and attention-based models. Simple RNN exhibited the best prediction performance in internal validation, with an AUROC of 0.968 (0.957–0.979), while GRU demonstrated the best performance with an AUROC of 0.972 (0.968–0.975) in external validation. For both internal and external validation, GRU showed the highest areas under the precision-recall curve (AUPRC) of 0.250 (0.229–0.270) and 0.535 (0.531–0.539), respectively. We calculated all sensitivities, specificities, precisions, and F1-scores at the threshold given by Youden’s index26. In internal validation, GRU obtained the highest sensitivity while LSTM achieved the highest precision. In external validation, GRU outperformed the other models in all metrics. Meanwhile, since the choice of sensitivity might vary based on specific objectives27, the results with fixed sensitivities of 0.9 and 0.95 are presented in Supplementary Tables 1 and 2. The differences in AUPRC, precision, and F1-score between hospitals were primarily due to the higher CDI incidence observed in the external validation dataset. The receiver operating characteristic (ROC) and precision-recall curves of all experiments are shown in Fig. 2.

Table 2 Performance for detecting Clostridioides difficile infection
Fig. 2: Receiver operating characteristic and precision–recall curves of Clostridioides difficile infection prediction models.

Curves on the right side (external validation) are smoother than those on the left side (internal validation) because the amount of data for external validation was much larger than for internal validation. Note that all data from SNUBH were used for external validation, while only split hold-out data from SNUH (15% of the total) were used for internal validation.
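
The threshold selection by Youden's index mentioned above can be illustrated with a minimal sketch. It assumes binary labels and one risk score per patient; the function name `youden_threshold` and the toy data are hypothetical and are not taken from the study's code. The index is J = sensitivity + specificity − 1, and the threshold with the largest J is chosen.

```python
def youden_threshold(y_true, y_score):
    """Return (threshold, sensitivity, specificity) that maximizes Youden's J.

    A patient is called positive when the score is >= threshold.
    Ties in J are resolved in favor of the lowest threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    best = None
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    _, threshold, sens, spec = best
    return threshold, sens, spec


# Toy data (illustrative only): 3 positive and 5 negative patients.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.90]
threshold, sensitivity, specificity = youden_threshold(labels, scores)
```

With strongly imbalanced outcomes such as CDI, a threshold chosen this way can still yield low precision, which is consistent with the fixed-sensitivity results being reported separately.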

Considering that it is practically challenging to collect all features, we validated the GRU-based model with subsets of the features. We categorized the features including vital signs and laboratory tests into four groups: vital signs (SBP, DBP, heart rate, respiratory rate, body temperature), complete blood count (CBC) test (WBC, hemoglobin, platelet, neutrophil, ANC)28, liver function (LF) test (albumin, total protein, total bilirubin, AST, ALP, ALT)29,30, and renal function (RF) test (BUN, creatinine, sodium, potassium, chloride, total CO2)31,32. The items in each group are usually measured together. CRP was excluded from these categories because CRP is an independent test widely used to detect bacterial infection33. We validated our model with data in which only the selected feature groups were used and the rest were masked. In addition, considering potential missingness, we randomly masked 20% of the used features and validated the model. When only vital signs and CBC tests were used, performance dropped slightly, with AUROC decreasing from 0.952 to 0.933 in internal validation and from 0.972 to 0.947 in external validation. Even when the random masking strategy was applied, the AUROC remained higher than 0.9 in most cases. These results are summarized in Supplementary Table 3.
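
The feature-subset validation described above can be sketched as follows. The text does not state how masked features were encoded, so this sketch assumes, purely for illustration, that a masked feature is replaced by 0.0 (the mean after standardization). The names `FEATURE_GROUPS` and `mask_timeline` are hypothetical.

```python
import random

# Feature groups as listed in the text; CRP is kept outside the four groups.
FEATURE_GROUPS = {
    "vital": ["SBP", "DBP", "heart_rate", "respiratory_rate", "body_temperature"],
    "CBC": ["WBC", "hemoglobin", "platelet", "neutrophil", "ANC"],
    "LF": ["albumin", "total_protein", "total_bilirubin", "AST", "ALP", "ALT"],
    "RF": ["BUN", "creatinine", "sodium", "potassium", "chloride", "total_CO2"],
    "CRP": ["CRP"],
}


def mask_timeline(timeline, keep_groups, random_mask_rate=0.0, seed=0, fill=0.0):
    """Return a copy of `timeline` (feature -> daily values) with masking applied.

    Features outside `keep_groups` are replaced by `fill`. In addition, a
    random share (`random_mask_rate`) of the kept features is masked.
    ASSUMPTION: masking with a constant is an illustrative choice only.
    """
    kept = [f for g in keep_groups for f in FEATURE_GROUPS[g]]
    rng = random.Random(seed)
    n_drop = round(len(kept) * random_mask_rate)
    dropped = set(rng.sample(kept, n_drop))
    visible = set(kept) - dropped
    return {
        f: (list(values) if f in visible else [fill] * len(values))
        for f, values in timeline.items()
    }


# Toy 35-day timeline in which every feature is constant at 1.0.
all_features = [f for group in FEATURE_GROUPS.values() for f in group]
toy_timeline = {f: [1.0] * 35 for f in all_features}
masked = mask_timeline(toy_timeline, ["vital", "CBC"], random_mask_rate=0.2)
```

In this toy call, 10 features are kept (vital signs and CBC), 2 of them are masked at random, and the remaining 13 features are masked because their groups were not selected.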

Feature importance analysis

We used Deep SHAP34 to identify important features for deep learning-based CDI prediction. The SHAP values of vital signs, laboratory tests, and patient information in both hospitals exhibited similar patterns, as shown in Fig. 3. This process used GRU as a reference model because it was selected as the best model in the development process. Body temperature and platelet count emerged as the two most influential variables, followed by ANC, BUN, neutrophil percentage, potassium, sodium, and CRP. Malignant tumors showed a relatively high SHAP value in internal validation. However, most comorbidities had a minimal impact on results. Notably, the number of antibiotics used and antacid usage exhibited the highest SHAP values among patient information in both hospitals.

Fig. 3: SHAP values of vital signs, laboratory tests, and patient information for patients with Clostridioides difficile infection.

a SHAP values of vital signs and laboratory tests. b SHAP values of patient information. We separated the results of timeline data (vital signs and laboratory tests) and patient information data because those were fed into different layers (RNN or attention layer for timeline and fully connected layer for patient information) and the dimensions of timeline and patient information data were different. Direct comparison of SHAP values of those two data with the same scale was considered to be inappropriate.

Risk variation over time

To assess temporal differences in risk variation between CDI and non-CDI groups, we calculated continuous risk scores by sequentially entering timelines ranging from two days to 35 days to the trained model. The risk score was the output value of the model. GRU served as a reference model in this process. Risk score variations over time are shown in Fig. 4. Across the timeline, the risk score in the CDI group exhibited a consistent increase, while the risk score in the non-CDI group either maintained its initial value or decreased. The CDI group had a higher risk score than the non-CDI group initially. This difference in risk score became bigger as time progressed.

Fig. 4: Risk score variation over time.

The shaded part indicates the standard deviation.
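
The sequential scoring used for the risk-variation analysis can be sketched as below, assuming the model accepts timelines of variable length. `risk_trajectory` and `toy_model` are hypothetical names; `toy_model` is only a stand-in for the trained network and is not the study's model.

```python
def risk_trajectory(model, timeline, min_days=2):
    """Score growing prefixes of a daily timeline.

    timeline: list of per-day feature vectors, oldest day first.
    model: any callable that maps a timeline prefix to a risk score.
    Returns a list of (number_of_days, risk_score) pairs.
    """
    return [(n, model(timeline[:n])) for n in range(min_days, len(timeline) + 1)]


def toy_model(prefix):
    # Hypothetical stand-in for the trained network: it uses only the most
    # recent value of the first feature, clipped to [0, 1].
    return min(1.0, max(0.0, prefix[-1][0]))


# Toy 35-day timeline with a single feature that rises steadily.
timeline = [[day / 35] for day in range(1, 36)]
trajectory = risk_trajectory(toy_model, timeline)
```

Averaging such trajectories within the CDI and non-CDI groups, day by day, would give curves of the kind summarized in Fig. 4.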

Discussion

In this study, we developed and validated several machine learning and deep learning-based CDI prediction models using longitudinal EHR data, including a total of 97,948 patients. For internal and external validation, we used large multicenter datasets from two locally separate tertiary hospitals, SNUH and SNUBH. The model trained with GRU exhibited the best prediction performance with an AUROC of 0.952 for internal validation and 0.972 for external validation. In addition, we identified influential features for CDI prediction through Deep SHAP and assessed temporal differences in risk variation between CDI and non-CDI groups.

The CDI prediction model developed in this study demonstrated superior performance compared to previous studies. Panchavati et al. reported CDI detection performance with an AUROC of 0.815 using six hours of inpatient data and XGBoost (ref. 15). Oh et al. formulated a CDI prediction model using inpatient data and logistic regression, yielding an AUROC of 0.820 (ref. 16). Marra et al. performed a cross-sectional study with EHR to predict CDI occurrences three days in advance with an AUROC of 0.604 (ref. 17). Wiens et al. trained a support vector machine and a hidden Markov model for CDI prediction, and achieved AUROCs of 0.79 and 0.75, respectively19,20. Regarding CDI caused by antibiotics, Werkhoven et al. performed multiple logistic regression and detected CDI in patients receiving antibiotic therapy with an AUROC of 0.81 (ref. 18).

The CDI prediction model has a potential advantage in reducing CDI transmission and preventing complications in clinical settings by reducing underdiagnosis of CDI using patient trajectory. CDI is usually diagnosed by stool examinations such as nucleic acid amplification testing (NAAT), glutamate dehydrogenase (GDH), and enzyme immunoassay (EIA). NAAT and GDH are known to have high sensitivity and yield rapid results. Although EIA exhibits variations in sensitivity, it maintains a high specificity. The diagnostic process for CDI employs a multistep procedure that includes performing EIA if NAAT or GDH is positive. It is typically initiated when new-onset unformed stools occur more than three times within 24 h (ref. 14). However, the rate of CDI underdiagnosis remains significant despite the current diagnostic strategy35,36,37. Lack of suspicion is one of the important reasons for underdiagnosis38. Improvement in the diagnostic process through monitoring systems for CDI in high-risk patients is necessary. Our prediction model could be used to identify patients at risk of developing symptoms. Among the patients who are assigned high risk scores by the prediction model, symptomatic patients should be isolated and undergo stool testing, while asymptomatic patients should be closely observed for symptom development. This approach could help reduce the underdiagnosis of CDI, thereby decreasing transmission and preventing complications. Furthermore, our model showed that the difference in risk score between CDI and non-CDI groups gradually widened over time. This trend implies that clinicians can consider the potential risk of CDI despite the lack of symptoms if the risk score remains high or increases after antibiotic treatment.

Platelet count and body temperature emerged as the two most important features among vital signs and laboratory tests, while the number of antibiotics used and antacid usage stood out as key attributes within patient information. Elevated body temperature is recognized as one of the main symptoms of CDI1. Platelet level also plays a significant role in CDI, as evidenced by a study indicating associations between abnormal platelet levels and CDI outcomes39. Thrombocytosis is associated with inflammation given that platelets are considered to be acute phase reactants40. On the other hand, thrombocytopenia is associated with underlying diseases such as malignancies, hepatic diseases, and immunosuppression, all of which are risk factors for CDI13. The number of antibiotics used during the observation period correlates with the duration of antibiotic use. In cases where a patient remains unresponsive to antibiotic treatment, there is a suspicion of antibiotic-resistant bacteria, prompting a switch to a broad-spectrum antibiotic. However, the risk of CDI tends to increase as antibiotic treatment continues1, as the rising trend of risk score was exhibited in the CDI group in our study. It has been reported that gastric acid suppression might have an influence on CDI development41,42. However, there is still controversy as several studies exhibited conflicting results43,44. A notable finding was the significantly lower use of antacids among patients with CDI. The reduced antacid usage in this group might be attributed to considerations of potential drug interactions, as the CDI group presented with a higher prevalence of underlying diseases 45.

In this study, Transformer outperformed tree-based models in CDI prediction. However, RNN-based models performed considerably better. Although Transformers are gaining widespread use in various fields such as natural language processing (NLP), they were found to be less suitable than RNN-based models for training on the short numeric timelines (maximum 35 days) in our dataset. This contrast might be attributed to the different ways in which RNNs and Transformers handle input data: an RNN processes data sequentially, while a Transformer takes in data all at once with positional encoding and learns relationships between variables with less susceptibility to temporal and sequential dependencies. Several studies dealing with numeric time-series data have employed RNNs for capturing sequential changes46,47,48. On the other hand, RETAIN, an interpretable attention-based neural network model for temporal EHR data, was initially developed for binary EHR variables. Consequently, it did not perform as well in our dataset, which contained numerous continuous numeric variables in timelines.

Regarding cohort definition, we did not include patients with missing values in our dataset, although eliminating patient records with missing features might introduce a potential bias in models because many patients were excluded by the criterion. Considering the potential bias, we conducted an experiment without eliminating any patients with missing features. Instead, we only excluded patients with any missing vital signs. In SNUH, there were 1092 patients in the CDI group and 298,531 in the non-CDI group. In SNUBH, there were 1563 patients in the CDI group and 301,984 in the non-CDI group. We then imputed missing values using multivariate imputation by chained equations (MICE)49. In this case, the performance of the random forest in internal validation had an AUROC of 1.0. Respiratory rate and CRP were the two most important variables in the random forest. Even when we trained another random forest with only those two variables and patient information, the AUROC was still 1.0. This implies that the model learned the imputation pattern of missing values reflecting the test pattern, which was relatively easy to learn, rather than the temporal variation of patient trajectory. The GRU-based model also exhibited an AUROC of 1.0 in internal validation. However, a significant decrease in performance was seen in external validation with an AUROC of 0.84. Test items routinely measured may vary between hospitals according to the policy of each hospital or regionality, and the model that learned test patterns did not perform well in the external cohort. Nevertheless, the original missingness criterion might be difficult to satisfy in clinical practice. Thus, we validated our model by applying various mitigated missing conditions. Using only vital signs and CBC test, the model achieved an AUROC of 0.933 in internal validation and 0.947 in external validation. Even with a random masking strategy, the model maintained a high performance with an AUROC of 0.929 in internal validation and 0.904 in external validation.

This study has several limitations. First, it was a retrospective study with potential selection bias. However, it is noteworthy that our predictive model maintained good performance in both cohorts, which had different baseline characteristics and numbers of events. Second, we monitored only four weeks from the index date. A previous study has reported that the risk of CDI is the highest in the first month after antibiotics use and that it persists until three months50. Accordingly, we excluded patients in the non-CDI group who developed CDI within 12 weeks after antibiotics use. However, we monitored them for four weeks, considering the lack of enough vital signs and laboratory test data. Further studies with prolonged monitoring periods are needed. Third, the prevalence of CDI was low in both cohorts. While we used Focal loss51 to address class imbalance, deep learning models often suffer from such extreme class imbalances. However, the incidence of CDI in the general inpatient population has been reported to be under 2%, and this confirms that our datasets reflected real-world data52. In addition, the results of internal and external validation showed similar trends, lending reliability to our findings. Fourth, records of gastrointestinal symptoms were not included in this study. Our model was based on EHR, which included measurements, drugs, and diagnoses records. Symptoms such as diarrhea are usually documented in nursing records. Unfortunately, we could not obtain nursing records owing to internal circumstances. However, if nursing records are accessible and models are trained with symptom history, the appropriate time to use the model can be specified based on symptoms. In addition, prediction performance might be improved with more sophisticated data. Fifth, we did not include patients with missing values in this study. This exclusion was to prevent the model from learning test patterns, which might vary between hospitals, rather than the temporal variation of patient trajectory. However, further investigation on handling missing values for longitudinal EHR-based time-series data is required for expanding the study population and generalizing the prediction model. Sixth, we have provided several pieces of clinical evidence that support our feature importance analysis and have conducted a sub-analysis with subsets of the features, but prospective studies utilizing our model are needed to completely evaluate the practical suitability of our model.

In conclusion, we developed a high-performing deep learning-based CDI prediction model from patients with antibiotic treatment. The model was internally and externally validated using data from two locally separate tertiary hospitals. The CDI prediction model can reduce underdiagnosis of CDI and can contribute to the goal of decreasing transmission and preventing complications. This study had limitations in data acquisition, monitoring period, class imbalance, and data missingness. For future works, prospective studies on additional data of gastrointestinal symptoms are needed to specify an appropriate time to use the prediction model, to discover better models with expanded monitoring periods, and to further investigate how to handle missing values.

Methods

Data curation

This study used data from the SNUH between January 2001 and December 2022 and the SNUBH between January 2004 and December 2021. All data were collected from the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM)53. OMOP is a public-private partnership established in the United States to inform the appropriate use of observational healthcare databases and study the effects of medical products. OMOP CDM provides standard-based data analysis solutions that support EHR from different sources into a standard data structure, which enables large-scale data analysis. Data from SNUH were randomly split into development (70% for training and 15% for validation) and internal validation (15%) datasets. Data from SNUBH were used for external validation.
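
A random 70/15/15 split of the kind described above can be sketched as follows. The text does not say whether the split was stratified by outcome, so this sketch assumes a plain patient-level shuffle with a fixed seed. `split_patients` is a hypothetical helper, and integer percentages are used to avoid floating-point rounding of the partition sizes.

```python
import random


def split_patients(patient_ids, seed=0, percentages=(70, 15, 15)):
    """Shuffle patient IDs and split them into training, validation, and hold-out sets.

    ASSUMPTION: an unstratified patient-level split; the last partition
    receives any remainder left by integer division.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * percentages[0] // 100
    n_val = len(ids) * percentages[1] // 100
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# 65,710 = 466 CDI + 65,244 non-CDI patients at SNUH.
train_ids, val_ids, holdout_ids = split_patients(range(65_710))
```

With so few positive cases (466), an unstratified split can leave the validation and hold-out sets with noticeably different numbers of CDI patients, which is one reason a stratified split is often preferred in practice.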

The Institutional Review Boards (IRB) at SNUH (IRB No. 2308-101-1459) and SNUBH (IRB No. X-2308-846-906) granted a waiver of approval and informed consent, considering that the data used in this study were de-identified and based on observational electronic medical records from the OMOP CDM. This retrospective, multicenter study was conducted in agreement with the Declaration of Helsinki, the Korean Bioethics and Safety Act (Law No. 16372), and the Human Research Protection Program–Standard Operating Procedure of Seoul National University Hospital.

Cohort definition and main outcomes

Patients with antibiotic prescriptions aged over 18 years were identified and divided into two groups according to whether the first C. difficile test was positive or not. We used a C. difficile toxin test with EIA for the diagnosis of CDI. Exclusion criteria for the CDI group were as follows: We first excluded patients who had no antibiotic prescriptions within 28 days before C. difficile test. Then, patients with CDI that occurred within two days (washout period) after the index date of antibiotics (start of antibiotics) were also excluded, assuming that CDI did not occur due to antibiotics but for other reasons. To remove the potential redundant effect of antibiotics, we excluded the patients with past antibiotic prescriptions within 28 days before the index date. The non-CDI group included patients who had CDI after the first C. difficile test and patients who had never experienced CDI. For the former case, those who were diagnosed with CDI within 12 weeks after antibiotic treatment were considered to be at potential risk for CDI, and thus, they were excluded. For both positive and negative groups, patients with any missing vital signs and laboratory tests within both seven days before and 28 days after the index date and patients with previous colectomy procedures were excluded. A brief illustration of the index date definition is shown in Fig. 5.

Fig. 5: Index date definition.

We monitored patient records between seven days before and 28 days after the index date, with the maximum length of the monitoring period of 35 days.
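
The monitoring window can be made concrete with a short sketch. The boundary convention is not specified in the text, so this sketch assumes the window runs from seven days before the index date up to, but not including, day 28 after it, which gives the stated maximum of 35 daily columns. `monitoring_days` is a hypothetical name.

```python
from datetime import date, timedelta


def monitoring_days(index_date, days_before=7, days_after=28):
    """List the calendar days of the monitoring window around the index date.

    ASSUMPTION: the window is [index - days_before, index + days_after),
    so the default arguments yield 35 days.
    """
    start = index_date - timedelta(days=days_before)
    return [start + timedelta(days=k) for k in range(days_before + days_after)]


# Illustrative index date (start of antibiotics); not taken from the data.
days = monitoring_days(date(2020, 3, 10))
```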

Data preprocessing

We used vital signs, laboratory tests, and patient information (including age, sex, comorbidity records, the number of antibiotics used, and antacids usage). OMOP CDM concept IDs of antibiotics, antacids, vital signs, laboratory tests, and colectomy procedures used in this study are shown in Supplementary Tables 4–7. For each patient, we constructed a timeline of vital signs and laboratory tests with patient information vectors. To construct a timeline, we first generated a table with 35 columns, corresponding to the maximum monitoring period. Each column represented a sequential date. The last column was set as the last day. For each vital sign and laboratory item, values were filled in on each date. Gaps between tests were linearly interpolated. The parts of the timeline before the first test and after the last test were padded with the first and last measured values, respectively. A patient information vector consisted of age, sex, comorbidity records, the number of antibiotics used, and antacids usage. Age and the number of antibiotics used were numeric, while the remaining variables were binary. All numeric variables of patient information, vital signs, and laboratory tests data were standardized before training.
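
The interpolation and padding steps can be sketched for a single vital sign or laboratory item as below. Day indices are assumed to run from 0 to 34, and `build_row` is a hypothetical helper; the example uses a 12-day row to keep the output readable.

```python
def build_row(measurements, n_days=35):
    """Build one timeline row from sparse measurements.

    measurements: dict mapping day index (0 .. n_days - 1) to a measured value.
    Interior gaps are linearly interpolated; days before the first and after
    the last measurement are padded with the first and last values.
    """
    if not measurements:
        raise ValueError("at least one measurement is required")
    days = sorted(measurements)
    row = [None] * n_days
    for d in days:
        row[d] = float(measurements[d])
    # Pad the front and back of the row.
    for d in range(0, days[0]):
        row[d] = row[days[0]]
    for d in range(days[-1] + 1, n_days):
        row[d] = row[days[-1]]
    # Linearly interpolate between consecutive measurements.
    for left, right in zip(days, days[1:]):
        span = right - left
        for d in range(left + 1, right):
            weight = (d - left) / span
            row[d] = row[left] + weight * (row[right] - row[left])
    return row


# Body temperature measured on days 2, 6, and 10 of a 12-day toy window.
temperature = build_row({2: 36.5, 6: 38.5, 10: 37.5}, n_days=12)
```

Because every gap is filled, the resulting rows contain no explicit missingness indicator; this fits the cohort criterion that excluded patients with any missing vital sign or laboratory test.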

Model development

We trained three kinds of models: tree-based model (including random forest and GBM as a baseline), RNN-based model (including simple RNN, LSTM, and GRU), and attention-based model (including Transformer and RETAIN). As tree-based models were trained with one-dimensional vectors, we concatenated the first and last columns of the timeline and patient information vector of each patient. This implies that tree-based models also used vital signs and laboratory tests before and after the index date as other deep learning models. In the case of RNN and attention-based models, a timeline was fed into the RNN or attention layer and a patient information vector was inputted into the fully connected (dense) layer. The outputs of those layers were then merged, and entered another dense layer to classify CDI or normal cases. A brief illustration of the process of CDI prediction using deep learning models is shown in Fig. 6.

Fig. 6: The process of Clostridioides difficile infection prediction using tree-based and deep learning models.

Timeline variables (vital signs and laboratory tests) were fed into the RNN or attention layer, while patient information variables were fed into the fully connected (dense) layer. The output vectors of those two layers were concatenated and fed into another fully connected layer to predict CDI. For tree-based models, we transformed timeline and patient information data into a one-dimensional vector by concatenating the first and last columns of the timeline and patient information vector.
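
The one-dimensional input for tree-based models can be sketched as follows. The order of concatenation (first column, last column, patient information) is an assumption for illustration, and `flatten_for_tree` is a hypothetical name.

```python
def flatten_for_tree(timeline, patient_info):
    """Concatenate the first and last timeline columns with patient information.

    timeline: list of rows, one per vital sign or laboratory item, each a
    list of daily values. patient_info: list of patient-level variables.
    ASSUMPTION: the concatenation order is illustrative only.
    """
    first_column = [row[0] for row in timeline]
    last_column = [row[-1] for row in timeline]
    return first_column + last_column + list(patient_info)


# Toy example: two items over three days, plus age and sex.
toy_timeline = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
tree_input = flatten_for_tree(toy_timeline, [70, 1])
```

This representation discards everything between the first and last days, so the tree-based baselines see only the start and end points of each trajectory.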

We used grid search cross-validation to find the best-performing model. For tree-based models, the number of trees (from 20 to 200 in steps of 20), the maximum depth of the tree (one to ten and unlimited), and the maximum number of features to consider when looking for the best split (one to ten and the number of features) were used as hyperparameters. For RNN and attention-based models, the number of RNN and attention layers (one to five) and the number of nodes in all hidden layers (32, 64, 128, 256, and 512) were used as hyperparameters. In the case of Transformer, the number of heads for multi-head attention was set to eight. The batch size was set to 256. We used Focal loss51 to address the extreme class imbalance of the dataset and Adam optimizer54 with a learning rate of 0.0001. The model was trained for a maximum of 100 epochs. Early stopping was set with a patience of 20 on the performance measured using the AUROC. All experiments during the development process were performed across five random seeds. The models with the best mean performance were selected and used for internal and external validation. Scikit-learn (version 1.0.2) and PyTorch (version 1.12.0) in Python (version 3.8.10) were used for tree-based and deep learning-based models, respectively.
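
For reference, the binary focal loss for a single prediction has the form FL = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below is a plain-Python illustration; the values of γ and α used in this study are not reported, so the defaults here (γ = 2, α = 0.25, commonly used in the focal loss literature) are assumptions.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class; y: true label (0 or 1).
    ASSUMPTION: gamma and alpha defaults are illustrative, not the study's values.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than a confident miss.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

Setting `gamma=0.0` removes the modulating factor and recovers an α-weighted cross-entropy, which makes the down-weighting of easy examples easy to inspect.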

Identifying important features

To identify important features for deep learning-based CDI prediction, we used Deep SHAP, an enhanced version of the DeepLIFT algorithm55. Deep SHAP could compute attribution scores of all nodes and approximate Shapley values implying feature importance scores. Regarding the timeline for each patient, we computed the total absolute SHAP values across all days and then averaged these summed SHAP values across all patients to identify significant vital signs and laboratory items. As for patient information vectors, we similarly averaged the absolute SHAP values across all patients to discern the crucial patient information. SHAP (version 0.42.1) package in Python (version 3.8.10) was used for SHAP value calculation.
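
The aggregation of SHAP values described above can be sketched as below, assuming the per-patient SHAP values for the timeline are available as nested lists indexed by patient, day, and feature. `timeline_importance` is a hypothetical helper and does not call the SHAP package.

```python
def timeline_importance(shap_values):
    """Aggregate SHAP values into one importance score per feature.

    shap_values[patient][day][feature] holds a signed SHAP value.
    Absolute values are summed over days for each patient and then
    averaged over patients.
    """
    n_patients = len(shap_values)
    n_features = len(shap_values[0][0])
    totals = [0.0] * n_features
    for patient in shap_values:
        for day in patient:
            for j, value in enumerate(day):
                totals[j] += abs(value)
    return [total / n_patients for total in totals]


# Toy example: two patients, two days, two features.
toy_shap = [
    [[0.5, -1.0], [-0.5, 0.0]],
    [[1.0, 0.0], [0.0, -2.0]],
]
importance = timeline_importance(toy_shap)
```

Taking absolute values before summing means that positive and negative contributions on different days do not cancel, so the score reflects magnitude of influence rather than direction.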

Statistical analysis

Characteristics such as age, sex, vital signs, laboratory tests, comorbidities, and drug usage (the number of antibiotics used and antacids usage) between CDI and non-CDI groups and between hospitals were compared by calculating P-values using the Student’s t-test for continuous variables and the Fisher’s exact test for categorical variables. To measure and compare the performances of the models, we used AUROC and AUPRC. Confidence intervals (CIs) of AUROC and AUPRC were calculated using DeLong’s method56, while those of sensitivity, specificity, precision, and F1-score were calculated using Wilson’s method57. Statistical significance was set at α = 0.05. All statistical analyses were performed using scikit-learn (version 1.0.2) in Python (version 3.8.10).
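
As an illustration of Wilson's method for a proportion-type metric such as sensitivity or specificity, the following sketch computes a 95% score interval. `wilson_interval` is a hypothetical helper, and the counts in the example are illustrative rather than taken from the study.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default: 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denominator = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return center - half_width, center + half_width


# Example: 90 of 100 positive cases detected (sensitivity 0.90).
low, high = wilson_interval(90, 100)
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and behaves reasonably when the proportion is close to 0 or 1, which matters when the number of CDI cases is small.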