Introduction

Colorectal cancer (CRC) is one of the most common malignant tumors, its morbidity and mortality ranked the third and second place respectively among all kinds of malignant tumors, which has resulted in a huge burden to health systems around the world1. Fortunately, the incidence of CRC among adults older than 50 has decreased obviously over the years2,3, this is partly due to the popularity of CRC screening recommended by guidelines. However, the global incidence of young-onset colorectal cancer (YOCRC, younger than 50 years) sustained increase since 1990s4,5, the causes for this phenomenon remain elusive. Effectively mitigating the burden of CRC among younger adults has become a pressing public health issue that needs to be addressed.

The prognosis of CRC is closely related to the stage at diagnosis, 5-year relative survival rate of stage I is as high as 90%, while the stage IV CRC with distant metastasis is only 14%6, YOCRC patients tend to present with advanced stage at initial diagnosis7. Therefore, early identification of YOCRC is vitally important. Since the development of CRC mostly follows the sequence of “adenoma-cancer”, and it takes 5–10 years from precancerous lesions to cancer, this provides a time window for early identification of YOCRC8. Growing evidence indicated the clinical features, underlying pathogenesis and drivers of YOCRC may differ from the late-onset CRC (LOCRC, elder than 50 years)9,10, another study also revealed the community-level risk factors associated with YOCRC and LOCRC differed meaningfully11. Furthermore, age and family history of CRC in first-degree relatives are still the cornerstone of CRC risk stratification, most YOCRC will be classified as average risk by the current widely used risk stratification models/scores for merely a minority of YOCRC patients have first-degree relatives with CRC12, a recently published study demonstrated the Asia-Pacific Colorectal Screening (APCS) score might undervalue the possibility of advanced colorectal neoplasia in patients younger than 50 years13. Thus, the risk stratification models/scores used for LOCRC may not be suitable for younger adults. However, there is limited research concerning the development of YOCRC risk stratification model, and a generally accepted YOCRC risk stratification approach is currently lacking. Thus, creating an effective CRC risk stratification model specifically for younger adults has great clinical significance.

Application of artificial intelligence (AI) in medical field is rapidly developing, and machine learning (ML) is a major branch of AI14. Numerous previous studies have demonstrated ML prediction models are superior to traditional cancer risk stratification models15,16. Moreover, the widespread adoption of Electronic Medical Records (EMR) in hospitals has led to a vast amount of clinical data being stored in the medical information system. This provides a good platform for deploying ML algorithms and guiding individualized cancer risk stratification decision-making. Therefore, it is possible to construct YOCRC risk stratification models by utilizing EMR and AI. In this study, we aimed to develop, internally and temporally validate the YOCRC risk stratification models based on real-world clinical data and ML algorithms to identify individuals with high-risk of YOCRC that require further colonoscopy. Flowchart of building the AI models for YOCRC risk stratification was shown in Supplementary Fig. 1.

Methods

Data source and study population

The data for this retrospective study were obtained from the EMR of Renmin Hospital of Wuhan University by using Structured Query Language (SQL) tool. This study received approval from the institutional review board of Renmin Hospital of Wuhan University (approve number: WDRY2022-K260). Informed consent was waived since this is a retrospective study and the patients’ personal information was anonymized to protect their privacy. This study was registered with ClinicalTrials.gov (NCT06342622) on March 15, 2024. This study was also conducted in compliance with ethical principles of the Declaration of Helsinki. We analyzed the data of individuals aged 18–49 who underwent colonoscopy from January 2013 to December 2022. The inclusions of YOCRC group were as follows: (1) Newly diagnosed with CRC confirmed by colonoscopy or pathology; (2) Age at 18–49 when diagnosis; (3) Never received any CRC-related treatment including surgery, chemotherapy, radiotherapy and others at another medical institution. The YNCRC (young individuals without CRC diagnosis) group were selected by the following criteria: (1) No CRC confirmed by colonoscopy or pathology; (2) Age at 18–49. The exclusions for both groups were: (1) Hospital stay less than 24 h or without Complete Blood Count data; (2) Patients with inflammatory bowel disease (IBD) or hereditary CRC syndromes such as familial adenomatous polyposis (FAP) or hereditary non-polyposis colon cancer (HNPCC, or named Lynch syndrome); (3) History of other types of primary malignant tumor, and other reasons that made them unsuitable for enrollment in this study, such as the patients with garbled data.

Data extraction

Routine clinical data extracted in this study were obtained from the Department of Gastroenterology, the Department of Oncology, and the Physical Examination Center, etc. There were two major sources for the study participants: some individuals included in our study had relevant symptoms (such as chronic abdominal pain, altered bowel habits, unexplained weight loss, and hematochezia), and they underwent colonoscopy examination at the doctor’s recommendation, while others came to the hospital for a comprehensive physical examination (the physical examination included colonoscopy). Clinical data that were routinely assessed were extracted from the EMR of each patient. After consulting published literatures, clinical expertise and practical considerations, we collected sociodemographic characteristics (such as sex, age, height, weight, body mass index (BMI)), personal history (including smoking history, history of alcohol intake, medical history), history of CRC among first-degree relatives, comorbid conditions (such as diabetes, hyperlipidemia, etc.), symptoms (such as chronic abdominal pain, altered bowel habit, unexplained weight loss, hematochezia, etc.), laboratory test indexes (including routine blood test, biochemical test, tumor marker and fecal occult blood), and results of endoscopy and pathological examination. For patients with multiple hospital visits, we only selected the clinical data from their first visit.

Data preprocessing

Due to the missing values, outlier and extreme values, imbalanced data and other conditions that need to be handled in the raw data, data preprocessing become a critical part in constructing ML models for YOCRC risk stratification. To handle missing values, we filtered out features that had more than 65% missing data. Then, we used a combination of simple imputation technique and nonparametric Random Forest (RF) algorithm to fill in the remaining missing values. As for the handling of outlier and extreme values, values greater than 1.5 times the IQR (Interquartile Range) of the upper quartile or lower than 1.5 times the IQR of the lower quartile were considered as the outlier and extreme values17, we further replaced those below the 1% quantile with the 1% quantile value and replaced those above the 99% quantile with the 99% quantile value. This approach preserves the information contained in the outliers while minimizing their adverse impact on ML model performance. Categorical features like sex, smoking history, history of alcohol intake, history of CRC among first-degree relatives were binarized through one-hot encoding. To ensure comparability between different features and eliminate dimensionality effects, continuous numerical features like age, routine blood test, and biochemical test were normalized using the Min-max normalization method: new value = (original value - Xmin)/(Xmax-Xmin). The data from 2013 to 2021 were randomly split into a training dataset (50%) for ML model development, and an internal validation dataset (50%) for internal validation, while the data from 2022 were used as temporal validation cohort for temporal validation. The vast majority of participants were not diagnosed with YOCRC, resulting in imbalanced data, this may lead to the YOCRC risk stratification models had strong bias towards the negative category, that is, YNCRC individuals. Resampling was a popular and effective method to alleviate the influence from imbalanced distribution, here we introduced the random downsampling method18 on the training dataset to deal with the imbalanced data, which randomly deleted cases from the majority class, until the two classes were balanced.

Feature selection

Each feature had different value in development of the YOCRC risk stratification models, some features were critical, while certain features might be redundant or even irrelevant for the label. Appropriate feature selection could reduce the complexity of model creation and enhance its efficiency. The available computational resources and experimental outcomes determined the final feature selection pipeline. We started by removing features with a single category ratio higher than 90% and those with a low coefficient of variance (<0.1). Then the spearman correlation analysis was conducted to assess the correlation between features. An absolute value of the correlation coefficient >0.8 indicated a strong correlation between features, and only one of the features was retained. A heatmap was used to visualize the correlations between features. Subsequently, we applied the Boruta feature selection method to further exclude irrelevant features. Boruta can handle nonlinear relationships and interactions between features, and it performed better than penalized linear regression methods (e.g., L1 and L2 regularization) in our experiments. This algorithm calculated the Z-score of each real feature obtained from RF model, and compared them with the “shadow features” in each iteration, if the Z-score was greater than the maximal Z-score of shadow features, the real feature was judged as “important” and retained; otherwise, the real feature was phased out19. Here, we conducted Boruta feature selection method to select the key features in YOCRC risk stratification models.

YOCRC risk stratification model development

The training dataset was used to develop ML models for YOCRC risk stratification. In this study, we selected eight mature ML algorithms including Logistic Regression (LR), RF, k-Nearest Neighbor (KNN), Support Vector Classification (SVC), Decision Tree (DT), eXtreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), Stacking ensemble classifier. LR model is a generalized linear regression ML model commonly used for classification tasks20. KNN model is one of the most widely used and simplest classifiers, this algorithm is realized by determining the similarity or distance between the tested cases and the training cases21. SVC is a well-known algorithm, this model completes the classification task by establishing the decision boundaries that get the margin maximization22. DT is one of the oldest and most common ML algorithms, where each internal node in the DT represents a judgment on an attribute, each branch represents the output of a judgment result, and finally the leaf nodes represent the classification results23. RF, XGBoost and AdaBoost are all ensemble ML algorithms consisting of multiple decision trees, those algorithms were intensively explored in medical field. The Stacking ensemble learning algorithm is realized based on layered learning, usually a two-layer learner. In the first layer, several base learners are developed. The outputs of these first-layer learners serve as input features for the second-layer learner, known as the meta-learner. The final label is predicted through the training of the meta-learner24. We selected models showing the top three areas under the receiver operating characteristic (ROC) curve (AUC) values in the internal validation cohort as the first-layer learners and base classifiers for the Stacking algorithm, while the LR model was chosen as the meta-learner for Stacking algorithm. Supplementary Fig. 2 showed the detailed information for the ML models used in this study.

As the cancer risk stratification tool, the consequences of the missed diagnosis of CRC are more severe than misdiagnosis. Therefore, we applied Grid search with cross validation to search the best hyper-parameters for each ML model, aiming to achieve the highest ‘Recall’ value (we set the hyperparameter scoring = ‘recall’). During the Grid search, tenfold cross validation was used to divide the training set into 10 parts. For each corresponding ML model, we trained all the hyperparameter combinations using nine parts, while the remaining one part was used to obtain the predicted ‘Recall’ value. After 10 iterations, 10 predicted recall values were generated, then the average of the 10 groups predicted “Recall” values were further determined, all the ML models repeated the above processes. From the hyperparameter pool, we ultimately selected the best hyperparameter combination with the highest predicted “Recall” value for each ML model development (Supplementary Fig. 3).

YOCRC risk stratification model evaluation

The performance of YOCRC models were assessed in both the internal validation dataset and temporal validation dataset. Accuracy (\(\frac{{TP}+{TN}}{{TP}+{FP}+{TN}+{FN}}\)), Sensitivity (Recall) \((\frac{{TP}}{{TP}+{FN}})\), Specificity \((\frac{{TN}}{{TN}+{FP}})\), Negative predictive value (NPV) \((\frac{{TN}}{{TN}+{FN}})\), Precision or Positive predictive value (PPV)\(\,(\frac{{TP}}{{TP}+{FP}})\), and F1 scores (\(\frac{2* {Precision}* {Recall}}{{Precision}+{Recall}}\)) were calculated respectively. Here, TP, FP, TN, FN represent the number of true positives, false positives, true negatives and false negatives. Additionally, we used ROC curve and AUC to evaluate the discrimination of the YOCRC risk stratification models. The Brier score was used to evaluate both discrimination and calibration quantitatively, Brier score (Y, P) = 1/n * \({\sum }_{i=1}^{n}{({P}_{i}-{Y}_{i})}^{2}\,\)(P: Probability of model prediction; Y: Actual probability of occurrence; n: number of predicted events). The lower the Brier score, the better the discrimination and calibration of the ML model. The calibration plot was used to assess the consistency between the values predicted by the ML model and the actual labels.

Feature importance and YOCRC risk stratification model interpretations

ML models are often regarded as black boxes due to the complexity and opaque of the algorithms. In order to enhance the clarity and understanding of the reasoning mechanism behind ML models, we introduced Shapley Additive exPlanation (SHAP) method to improve the interpretability for the ML model with best performance. SHAP is a novel approach designed to interpret various ML models proposed by Lundberg and Lee25, it has been widely verified to be an approach with excellent interpretability performance26. We conducted the SHAP summary analysis to provide the global interpretability of the ML model, and the Local Interpretable Model-agnostic Explanations (LIME) were also implemented to explain how the ML model make the classification decision in a given individual (local interpretability). SHAP dependence analyses were visualized for the four features with the highest SHAP values to show how a single feature affects the output of the ML model.

Software and statistical analysis

Data preprocessing, statistical analysis and ML model development were conducted by using SPSS 20.0 software, R (v.4.3.0, R Foundation for Statistical Computing), and Python software (v.3.9.12, Python Software Foundation), the python packages used in this study were ‘Pandas’, ‘NumPy’, ‘SciPy’, ‘Matplotlib’ and ‘Scikit-learn’, etc. Continuous variables were expressed as mean ± standard deviation (Mean ± SD) or medians (interquartile range), while categorical features were expressed as percentages.

Results

Patient characteristics

Figure 1 displayed the enrollment flow chart of patients. We initially extracted the data of 13,653 patients from EMR, after applying exclusion criteria, 1665 patients were removed due to the hospital stay less than 24 h or without Complete Blood Count data, 523 Patients with IBD or hereditary CRC syndromes were excluded, another 591 patients with history of other types of primary malignant tumor or with garbled data were also not included. Finally, up to 10,874 individuals qualified for this study, 477 individuals with YOCRC and 10,397 without YOCRC, of which, 7134 (65.6%) individuals had no CRC related symptoms, more specifically, and there were 317 (66.5%) YOCRC cases and 6817 (65.6%) YNCRC cases without CRC related symptoms, respectively. 8,820 patients in derivation cohort (4410 in training dataset and 4410 in internal validation dataset, respectively), while 2,054 patients in temporal validation cohort. Baseline characteristics of the enrolled individuals from two cohorts were depicted in Table 1, there were 399 (4.5%) and 78 (3.8%) YOCRC cases in derivation cohort and temporal validation cohort, respectively. The median age for the derivation cohort was 43 (36–48) years and 56.4% were men, for the temporal validation cohort, the median age was 43 (35–47) years and 53.6% were men. Overall, the subjects from both cohorts appeared to be similar in terms of most features.

Fig. 1: Flowchart of the patient selection.
figure 1

We initially extracted the data of 13,653 patients from EMR. After applying exclusion criteria, 1665 patients were not included because their hospital stay was less than 24 h or without Complete Blood Count data; 523 patients with IBD or hereditary CRC syndromes were excluded, and another 591 patients with a history of other types of primary malignant tumor or with garbled data were also not included. Finally, 10,874 individuals were included in the study, of which 8820 were in the derivation cohort (training dataset and internal validation dataset), and 2054 were in the temporal validation cohort. EMR Electronic Medical Record, CRC Colorectal cancer, CBC Complete Blood Count, YOCRC Young-onset colorectal cancer, YNCRC Young individuals without CRC diagnosis.

Table 1 Baseline characteristics of the enrolled individuals in the derivation cohort and temporal validation cohort dataset

Feature selection

There were up to 65 features in the original feature set (Supplementary Table 1). The distribution of missing data for each feature was illustrated in Supplementary Fig. 4, overall, the data of “Sex”, “Age”, “Personal history”, “Comorbid conditions”, “CRC related symptoms”, laboratory test indexes like “Complete Blood Count” and “Biochemical test” were basically complete, while the remaining features showed missing values to varying degrees, after removing 11 features with more than 65% missing values (Height, BMI, AFP, CA199, CA724, CA125, CA153, NSE, SCC, Hcg, TPSA), a total of 54 features were retained. We further excluded 20 categorical features with a single category ratio higher than 90% and three continuous features with the variances lower than 0.1. We also explored the relationship between features by conducting spearman correlation analysis, the heatmap was shown to visualize the degree of correlation between different features (Supplementary Fig. 5), five groups of features exhibited a high correlation: (1) “Neutrophils (Neu)” was highly correlated with “White blood cell (WBC).” (2) “Alanine aminotransferase (ALT)” was highly correlated with “Aspartate aminotransferase (AST).” (3) “Direct Bilirubin (DBil)” was highly correlated with “Total Bilirubin (TBil).” (4) “High-density lipoprotein (HDL)” was highly correlated with “Apolipoprotein a (Apoa).” (5) “Low-density lipoprotein (LDL),” “apolipoprotein b (Apob),” and “total cholesterol (TC) were highly correlated with each other. Thus, we retained five features (WBC, ALT, TBil, HDL, and TC) in the five groups and removed six features (Neu, AST, DBil, LDL, Apoa, and Apob). Then the Boruta feature selection method was applied to select the key features, 10 features marked as unimportant were phased out. Finally, 16 important features (Fecal immunochemical test (FIT)), Carcinoembryonic antigen (CEA), Lipoprotein a (Lipa), HDL, Lactate dehydrogenase (LDH), Alkaline phosphatase (ALP), Albumin/globulin ratio (A/G), ALT, Red blood cell distribution width (RDW), C-reactive protein (CRP), Plasma hemoglobin (Hb), Platelet (PLT), Monocyte (Mono), Lymphocyte (Lym), WBC, and TC were used in the subsequent development and evaluation of ML models in the training, internal validation and temporal validation (Fig. 2).

Fig. 2: The feature importance based on Boruta feature selection method.
figure 2

The horizontal axis presented the name of features, and the vertical axis showed the Z-score of features. The box plot displayed the Z-score of features based on RF algorithm. The green boxes indicated the features confirmed as important, the blue boxes showed the tentative features, and the red boxes represented the features judged as unimportant. Eosinophi Eosinophiles, GGT Gamma glutamyl transpeptidase, Glu Blood glucose, TBil Total bilirubin, TC Total cholesterol, TG Triglycerides, ALP Alkaline phosphatase, PLT Platelets, WBC White blood cells, Mono Monocytes, Lipa Lipoprotein a, RDW Red cell distribution width, A/G Albumin/globulin ratio, ALT Alanine aminotransferase, LDH Lactate dehydrogenase, HDL high-density lipoprotein, Lym Lymphocytes, CEA Carcinoembryonic antigen, Hb Hemoglobin, FIT Fecal immunochemical test, CRP C-reactive protein.

Comparison of the machine learning models’ performance for YOCRC risk stratification

We developed eight ML models for YOCRC risk stratification in the training dataset. Figures 3A, 4A, and Table 2 described the performance of YOCRC risk stratification models in internal validation dataset. Among these models, RF ML model provided the good performance with AUC of 0.859 and the highest Recall of 0.840, outperforming other models (AUC: LR, 0.768; KNN, 0.692; SVC, 0.777; DT, 0.732; XGBoost, 0.871; AdaBoost, 0.843; Stacking, 0.821; Recall: LR, 0.650; KNN, 0.665; SVC, 0.720; DT, 0.625; XGBoost, 0.775; AdaBoost, 0.745; Stacking, 0.800). We also used the confusion matrix to evaluate the ML models in the internal validation dataset (Supplementary Fig. 6), each comprising four boxes, with the boxes at the top left, bottom left, top right, and bottom right representing TN, FN, FP, and TP, respectively. The darker the color, the larger the corresponding value. The TN, FN, FP, and TP for the RF model were 3126, 32, 1084, and 168, respectively (Supplementary Fig. 6); thus, 168 YOCRC cases were properly detected among 200 patients with YOCRC (84%) using the RF model, while 3126 cases were properly classified under YNCRC among the 4210 individuals with YNCRC (74.25%). In the temporal validation dataset, the RF ML model also had good classification performance with AUC of 0.888, and showed the best Recall of 0.872 compared to other models, a high Recall indicated a lower YOCRC missed diagnosis rate (AUC: LR, 0.799; KNN, 0.726; SVC, 0.827; DT, 0.779; XGBoost, 0.892; AdaBoost, 0.887; Stacking, 0.849; Recall: LR, 0.577; KNN, 0.679; SVC, 0.744; DT, 0.705; XGBoost, 0.808; AdaBoost, 0.782; Stacking, 0.821) (Figs. 3B, 4B, Table 3 and Supplementary Fig. 7). It is worth noting that RF, XGBoost, and AdaBoost were all ensemble ML algorithms that comprised multiple weak decision trees. These algorithms exhibited the best performance in temporal validation, with all AUC values exceeding 0.850. The confusion matrix for eight ML models in the temporal validation dataset was shown in Supplementary Fig. 7. TN, FN, FP, and TP for the RF model were 1532, 10, 444, and 68, respectively, so 68 patients with YOCRC were correctly identified among 78 patients (87.18%), and 1532 cases were correctly classified as YNCRC among 1976 individuals (77.53%).

Fig. 3: The Bubble chart used to visualize various performance indicators of ML models in both validation datasets.
figure 3

A The Bubble chart used to visualize various performance indicators of ML models in the internal validation. This bubble chart was generated through a visual online platform (https://www.chiplot.online/). ML Machine learning, LR Logistic regression, RF Random Forest, KNN k-Nearest Neighbor, SVC Support Vector Classification, DT Decision Tree, XGBoost eXtreme Gradient Boosting, AdaBoost Adaptive Boosting, AUC Area under the receiver operating characteristic (ROC) curve, NPV Negative predictive value, PPV Positive predictive value. B The Bubble chart used to visualize various performance indicators of ML models in the temporal validation. This bubble chart was generated through a visual online platform (https://www.chiplot.online/). ML machine learning, LR logistic regression, RF random forest, KNN k-Nearest Neighbor, SVC Support Vector Classification, DT Decision Tree, XGBoost eXtreme Gradient Boosting, AdaBoost Adaptive Boosting, AUC area under the receiver operating characteristic (ROC) curve, NPV negative predictive value, PPV positive predictive value.

Fig. 4: ROC curve of YOCRC risk stratification models in both validation datasets.
figure 4

A The AUC of the ROC curve for YOCRC risk stratification models with different ML algorithms in the internal validation. AUC Area under the ROC curve, ROC Receiver operating characteristic, YOCRC Young-onset colorectal cancer. LR Logistic regression, RF Random Forest, KNN k-Nearest Neighbor, SVC Support Vector Classification, DT Decision Tree, XGBoost eXtreme Gradient Boosting, AdaBoost Adaptive Boosting. B The AUC of the ROC curve for YOCRC risk stratification models with different ML algorithms in the temporal validation. AUC area under the ROC curve, ROC receiver operating characteristic, YOCRC young-onset colorectal cancer, LR logistic regression, RF random forest, KNN k-Nearest Neighbor, SVC Support Vector Classification, DT decision tree, XGBoost eXtreme Gradient Boosting, AdaBoost Adaptive Boosting.

Table 2 Comparison the performance of ML models for YOCRC risk stratification in the internal validation dataset
Table 3 Comparison the performance of ML models for YOCRC risk stratification in the temporal validation dataset

We also observed a relatively low Precision (PPV) (ranged from 0.077 to 0.156 in internal validation, 0.071 to 0.159 in temporal validation) and F1 score (ranged from 0.137 to 0.251 in internal validation, 0.128 to 0.260 in temporal validation) in all ML models, which mainly ascribed to the unbalanced class in the internal validation dataset and temporal validation dataset with over 20-fold discrepancy between YOCRC and YNCRC cases (Supplementary Fig. 8, Supplementary Fig. 9, Table 2, and Table 3). However, the NPV values of the ML models were consistently higher than 0.950 in both internal validation and temporal validation. This indicated that the models developed in this study could serve as a valuable technique for physicians to exclude the possibility of YOCRC when an individual is classified as low risk. Supplementary Figs. 10, 11 showed the generated calibration curve in the lower right corner of the ideal calibration line in both internal validation and temporal validation; thus, the overall YOCRC risk was overestimated to some degree, which may be related to the setting of the hyperparameter “Recall” and consistent with the requirement of cancer risk stratification model, that is, the highest possible “Recall” value was needed to reduce the missed diagnosis rate. Notably, as the abscissa increases, the probability of prediction increases and the generated curve converges to the ideal line (especially for the RF model). This suggests that calibration is better when the model provides a prediction with a high probability of disease. Moreover, all ML models showed good discrimination and calibration with a low Brier score (most ML models had values less than 0.250 for both internal and temporal validations) (Table 2 and Table 3).

Sub-analysis of the optimal RF model for very early onset colorectal cancer risk stratification

A recently published YOCRC clinical practice guidelines27 defined very early onset CRC as CRC diagnoses at 35 years of age or younger, based on definitions used in previous studies28,29,30,31. This field has been attracting growing interest among researchers and clinicians. Former epidemiological data have revealed that the greatest change in incidence of YOCRC is among younger adults age at 20–3932,33. Therefore, we also tested the performance of RF model in very early onset CRC risk stratification. And this analysis also showed the good discriminative performance of RF in very early onset CRC risk stratification, with the AUC of 0.845, Accuracy of 0.733, Recall of 0.784, Specificity of 0.732, NPV of 0.989, PPV of 0.099, F1 score of 0.175 and Brier score of 0.183 in internal validation, and AUC of 0.867, Accuracy of 0.781, Recall of 0.813, Specificity of 0.780, NPV of 0.992, PPV of 0.104, F1 score of 0.184 and Brier score of 0.167 in temporal validation (Supplementary Fig. 12).

Feature importance and model interpretations with SHAP

The RF model with the highest Recall value in both internal validation and temporal validation was selected for SHAP analysis. The SHAP feature importance of the RF model was displayed in Fig. 5A with the average absolute SHAP values as the importance of each feature; the selected features were arranged from the highest to lowest average absolute SHAP values for ranking the impact of each feature on the prediction. Figure 5A showed the top five informative features for YOCRC risk stratification were CRP, FIT, Lym, CEA, and LDH. Figure 5B showed the density scatterplot of the SHAP values in YOCRC-negative and YOCRC-positive cases. On the left, the features were arranged in order of their importance (the same as in Fig. 5A). The color on the right defines the feature value: blue was related to a lower value, whereas red signified a higher value. As presented in Fig. 5B, CRP had the widest distribution of SHAP values, and many SHAP values (positive or negative values) of CRP were far from 0 compared with other features, suggesting higher average absolute SHAP values (highest importance), consistent with the results shown in Fig. 5A. The blue dots of CRP were mainly located on the left, and the red dots were mainly on the right; therefore, a lower CRP value contributed to a lower prediction of YOCRC, while a higher CRP value was associated with a higher possibility of YOCRC. Similarly, higher levels of CEA and LDH and positive FIT contributed to a higher possibility of YOCRC, whereas a higher Lym count was associated with a lower possibility. We conducted the SHAP dependence analysis and drew the partial dependency plots for the relationship between the feature value and the SHAP value for the top four most important features; the analysis of partial dependence plots provided a visual interpretation of the distribution of each feature and its global relationship with the risk of YOCRC. Figure 6A indicated how CRP influenced the risk of YOCRC; as CRP gradually increased in the x-axis value, the likelihood of YOCRC also had an increased tendency (increased in the y-axis value); similar patterns could also be observed with FIT (Fig. 6B) and CEA (Fig. 6D). However, an increased Lym value was associated with a lower likelihood of YOCRC (Fig. 6C), consistent with the results shown in Fig. 5B.

Fig. 5: Feature importance and global interpretations based on SHAP method.
figure 5

A, B Global summary of the SHAP value distribution for all enrolled features. The top five informative features for YOCRC risk stratification were CRP, FIT, Lym, CEA, and LDH. SHAP SHapley Additive explanation, YOCRC Young-onset colorectal cancer, CRP C-reactive protein, FIT Fecal immunochemical test, Lym lymphocytes, CEA carcinoembryonic antigen, LDH lactate dehydrogenase, Lipa Lipoprotein a, Hb Hemoglobin, Mono Monocyte, A/G Albumin/globulin ratio, HDL high density lipoprotein, ALT alanine aminotransferase, ALP alkaline phosphatase, TC total cholesterol, PLT platelet, RDW red blood cell distribution width, WBC white blood cell.

Fig. 6: SHAP dependence analysis for the top four influencing features (CRP, FIT, Lym, CEA).
figure 6

Higher levels of CRP, CEA, and LDH, along with the positive FIT all contributed a higher possibility of YOCRC, while the higher Lym count was associated with the lower possibility. SHAP SHapley Additive explanation, CRP C-reactive protein, FIT Fecal immunochemical test, Lym Lymphocytes, CEA carcinoembryonic antigen.

We presented two additional cases, one with YOCRC and the other without YOCRC, which were correctly classified by the RF model using the LIME method to demonstrate the clinical utility and translational impact of the RF model in the real clinical setting. Figure 7A showed a case without YOCRC, which was correctly classified as YNCRC (the possibility of YNCRC was 0.84), mainly based on the following considerations: a negative FIT, low levels of CRP, CEA, and LDH, and a higher Lym count. We randomly chose a patient with YOCRC. The LIME method also correctly indicated a high possibility of YOCRC (the possibility of YOCRC was 0.87), mainly determined based on the following conditions: higher levels of CRP, CEA, and Mono count, the lower level of Hb, and Lym count, while the negative FIT and higher level of HDL had an incorrect effect on the final classification results (Fig. 7B). In clinical practice, conflicting indicators make making definitive decisions challenging. Therefore, local interpretation using the LIME algorithm has significant clinical value for aiding clinicians in assessing the risk of YOCRC.

Fig. 7: Local interpretations based on LIME method.
figure 7

A, B Local Interpretable Model-agnostic Explanations (LIME) for the case without YOCRC and the case with YOCRC. Figure (A) displayed the case without YOCRC, this example is correctly classified as YNCRC mainly through the following conditions: a negative FIT, low levels of CRP, CEA, LDH, along with the higher Lym count. Figure (B) showed an individual with YOCRC, LIME method also correctly indicates high possibility of YOCRC, this was mainly determined by the following conditions: a higher levels of CRP, CEA, and Mono count, the lower level of Hb, and Lym count, while the negative FIT and higher level of HDL had an incorrect effect on the final classification results.

Discussion

The incidence of YOCRC is increasing in most countries, but data on risk stratification approaches for YOCRC are scarce, and the use of ML algorithms in detecting high risk young individuals age at 18–49 years is still lacking. In this large population-based sample study, we developed, internally and temporally validated a set of YOCRC risk stratification ML models. We found that the ML models based on routine clinical data could be used to YOCRC risk stratification among younger adults, of which, RF model presented a better overall performance in YOCRC risk stratification with the lowest missed diagnosis rate, thus, this model is potentially useful as a tool to help physicians identify young individuals with high risk of YOCRC that require further colonoscopy.

The increasing prevalence of YOCRC has prompted the American Cancer Society (ACS) to update the CRC screening starting age from 50 to 45 years for average-risk individuals in 201834. Besides, according to the current CRC guidelines from UK and USA35,36,37, young individuals with hereditary syndrome or family history of CRC should undergo CRC screening at an earlier age. However, more than half of YOCRC patients are younger than 4538, and some guidelines still recommend 50 years as the starting age for CRC screening among average-risk individuals36,39, also the majority of YOCRC are sporadic without hereditary syndrome or family history of CRC12, “one-size-fits-all” standard for screening starting age may cause the delay diagnosis of many YOCRC patients40. What’s more, clinical features, pathological profiles, molecular characteristics and risk factors of YOCRC patients appeared to represent a different disease process compared with LOCRC9,10,11, the majority of data about CRC risk stratification is based on evidence from LOCRC cohorts, this could explain the reason why the current CRC risk stratification approaches like APCS may not be suitable for younger adults. The only currently available international management guideline specially designed for YOCRC was published in 2023, recommending that younger adults with high-risk symptoms (hematochezia, unexplained iron deficiency anemia, or unexplained weight loss) or positive FIT undergo diagnostic colonoscopy. In contrast, for individuals with low-risk symptoms (i.e., abdominal pain, or changes in bowel habits), FIT may be an option to assess the risk of YOCRC27. Nevertheless, owing to the lack of relevant research evidence, some important YOCRC risk stratification contents, including but not limited to the detailed risk stratification process such as risk assessment frequency and other risk stratification methods (such as CRC risk stratification score) were not mentioned in this guideline. Therefore, the globally recognized YOCRC risk stratification method is yet to be established.

Nowadays, several screening methods for CRC exist. Colonoscopy could directly and clearly observe the colorectal mucosal lesions, but its widespread implementation was challenging in developing countries/regions because of limited medical resources and the invasive nature of the procedure. As an alternative, FIT is a widely used and noninvasive tool for CRC risk stratification. Multiple guidelines proposed FIT as the routine and primary CRC risk stratification examination41,42,43, however, the sensitivity of FIT was comparatively low in identifying stage I CRC (53%), suggesting that many cases of CRC will remain undetected by FIT alone44, besides, the sensitivity of FIT was strongly associated with the sampling times, studies found the sensitivity of FIT was lower than 0.7 with one sampling45,46 (lower than the sensitivity of ML models developed in our study), this suggests that over 30% of CRC cases will be missed with only one FIT sampling, thus, to reduce the risk of missed diagnosis, multiple examinations for FIT might be required to improve sensitivity, increasing the time cost. A large meta-analysis found the low PPV (4.9%) for FIT in CRC risk stratification47, which was also lowered than our study (PPV = 10%). Fecal DNA/RNA testing and computed tomographic colonography are two additional methods of CRC risk stratification that have been under development for several years. However, they are not yet widely available, particularly in the developing regions48, and the risk stratification performance of fecal DNA/RNA testing and computed tomographic colonography for YOCRC requires further research. Therefore, there is an urgent need to develop a user-friendly and accurate CRC risk stratification model specifically tailored for young populations. Archambault49 et al. constructed the YOCRC risk prediction models that incorporated an environmental risk score (ERS) and a polygenic risk score (PRS), this study found the higher values of ERS or PRS were associated with an increased risk of YOCRC, with both models contributing to case-control discrimination with an AUC of 0.631 (95% CI: 0.615 - 0.647). In another multicenter prospective study, Zhao50 et al. developed the National Colorectal Polyp Care (NCPC) score by conventional LR model, which consisted of eight predicting factors and categorizes individuals into low risk (LR), intermediate risk (IR), and high risk (HR) based on different score ranges, individuals with IR or HR or with a positive FIT were recommended to receive colonoscopy for further diagnosis. Similar to the results of our study (Recall value of LR model was 0.725 and 0.718 in internal validation and temporal validation, respectively), the risk-stratification model combining the NCPC score and FIT was able to detect 72.7% cases of YOCRC patients.

ML algorithms have been widely recognized for their ability to explore nonlinear relationships between features and labels, making them a promising approach for individualized risk prediction, while there have been limited studies on ML algorithms for YOCRC risk stratification. Douglas F Redd51 et al. used four machine learning and statistical models (LR, SVC, RF, and deep neural network) to identify CRC in US military Veterans aged 35–49, and found the ML models performed better than conventional LR model, among three ML models, deep neural network had the best accuracy achieved 0.985, however, this study only enrolled CRC patients in US military Veterans, individuals younger than 35 were also excluded. Another single center study included 3116 patients aged 35–50, four ML models (regularized discriminant analysis, RF, neural network, and stochastic gradient boosting) were constructed and internally validated. Similar to the above study, this study also found ML models showed higher discriminative ability compared to the conventional LR model, stochastic gradient boosting had the highest AUC of 0.76 (0.46–1.0) for CRC prediction, however, this study had a limited number of CRC cases (only 16, accounting for 0.05% of the sample), which may influence the results of the study52. It is worth mentioning that previous epidemiological data have indicated that the greatest change in incidence of YOCRC occurs among younger adults aged 20–3932,33, whereas both studies only included individuals aged 35–49. In contrast to previous studies, we firstly used ML algorithm to establish the YOCRC risk stratification models for individuals aged 18–49 based on large samples, we also evaluated the performance of ML models in very early onset CRC risk stratification for the first time. Additionally, temporal validation was conducted to further assess the performance of the ML models. In clinical practice, the consequences of missed diagnosis of YOCRC are more serious than misdiagnosis. In this study, we prioritized “Recall” over “Precision” and optimized the risk stratification models by setting the hyperparameter scoring = ‘recall’. Our findings demonstrated that the RF ML model achieved a recall of 0.840 and 0.872 in the internal validation and temporal datasets, respectively, surpassing the results reported in previous studies.

The prediction indicators for YOCRC risk stratification are diverse. In this study, we trained the risk stratification models using basic clinical information and routine laboratory testing items, which are easily accessible. The SHAP analysis of this study identified CRP, FIT, Lym, CEA and LDH as the top five key features in the RF based YOCRC risk stratification model. The underlying mechanism may be corroborated by previous findings, CRP is commonly used to assess the systemic inflammation, when the body shows signs of non-specific inflammation in the tumor microenvironment, the tumor cells release proinflammatory factors, which could stimulate hepatocytes to secrete CRP into the bloodstream53. Growing evidence has showed the CRP concentration is significantly increased in cancer patient plasma54, suggesting its potential function as a blood biomarker for cancer risk stratification, this has also been confirmed by multiple recently published studies assessing the value of CRP in CRC detection55,56,57. FIT is commonly used in population screening to identify individuals who have CRC, and it is strongly recommend by various CRC guidelines34,58. Lymphocyte count is an indicator of the responsiveness of the host’s immune system, it plays the crucial role in inhibiting the proliferation and spread of cancer59. Lymphopenia is often observed among advanced cancer cases. This could lead to a weakened and insufficient immune response. Several studies have associated lymphopenia with an unfavorable prognosis in oncological patients60. An 11-year follow-up study also indicated that medium and high cytotoxic activity of peripheral-blood lymphocytes is associated with reduced cancer risk, whereas low activity is associated with an increased risk61. LDH is an enzyme that plays a crucial role in cellular metabolism and energy production. Pre-treatment LDH levels have demonstrated its utility as a biomarker for risk stratification in patients with CRC62,63, here we also found the LDH has the potential value in YOCRC risk stratification. CEA is an oncofetal antigen produced by epithelial tumor cells in the digestive tract derived from endoderm, it exhibits high specificity but low sensitivity in CRC risk stratification, thus, similar to the LDH, CEA is commonly tested in patients with known CRC to monitor their response to treatment or screen for recurrence after treatment64. Of note, this study also discovered that indicators related to lipid metabolism, such as HDL and Lipa, were important features in the YOCRC risk stratification model. These findings align with previous research indicating that hyperlipidemia is one of the risk factors for YOCRC65. This suggests that YOCRC is closely related to metabolic processes and emphasizes the need to pay more attention to these metabolic indicators for YOCRC prevention. In contrast to the features in our study, a previous study showed that the best CRC risk stratification ML model comprised the four most important features: age, albumin, hematocrit and percent Lym, with the mean age of CRC patients higher than 60 years old66. Similarly, another study also found the age was the most important features in CRC risk stratification (mean age of CRC patients were higher than 50), followed by various hemoglobin-related parameters67. Goshen68 et al. developed a CRC risk stratification model based on standard laboratory records (mean age of CRC patients was 62.2), and found the most important features felled into three principal categories: iron deficiency (ferritin, etc.), inflammation and thrombosis (Hb, Mean corpuscular volume, Neutrophils, PLT, etc.), and liver function (alanine aminotransferase, etc.). On the whole, the features for YOCRC and general population (or older adults) CRC risk stratification are not exactly the same, there are the same and different features, for the same features such as Lym, Hb and PLT, the importance ranking in the model is also different, so the CRC risk stratification model for the general population (or older adults) may not be the optimal model for young population, making this study clinically valuable.

This study has some limitations. Firstly, our ML models are based on retrospective EMR data that may cause potential inherent bias, to reduce the influences of such bias, we enrolled large number of samples with up to 10,874 individuals included, and the performance of the models were also internally and temporally validated. Secondly, previous researches4,69,70 demonstrated gene mutations, westernized diet, and low physical activity increased the risk of YOCRC, so another limitation is that our ML models did not include the information about PRS, dietary habits and physical activity conditions because those data were lacking in the EMR. Thirdly, although we conducted internal and temporal validation to assess the stability of our model’s application, further evaluation is needed to determine the risk and benefits of deploying the optimal model in real-world scenarios. This would require the design of clinical trials for comprehensive evaluation. Fourthly, the precision values of current YOCRC risk stratification models are relatively low due to the imbalanced distribution of YOCRC and YNCRC individuals. In order to improve the precision of the YOCRC risk stratification model, it could be beneficial to incorporate additional important features such as PRS, dietary habits and physical activity condition in the future, which might improve its precision. Lastly, the individuals enrolled in our study were solely from Chinese populations in one big medical center, also the model was developed and validated restricted to individuals with the hospital stay of more than 24 h and nearly one-third of enrolled individuals had CRC related symptoms. Thus, its performance in other populations might be uncertain. The generalizability and applicability of the current method needs to be assessed through additional external validation datasets, including those of other ethnic groups. Nevertheless, these limitations do not alter the fact that the RF ML model proposed in this study is worthy of attention in YOCRC risk stratification, healthy institutions could consider to run this model to detect individuals with high-risk of YOCRC.

In conclusion, we have developed, internally and temporally validated the novel YOCRC risk stratification models based on routine clinical data and ML algorithms, the RF ML algorithm model presented the lowest missed diagnosis rate in YOCRC risk stratification among younger individuals aged at 18–49, this model has the potential to assist physicians in identifying individuals at high risk of YOCRC, and those classified as high risk should undergo colonoscopy for further diagnosis. In the future, a prospective study, as well as the other populations are required to further validate the generalizability and practicality of this YOCRC risk stratification model in clinical practice.