Development and validation of machine learning models for young-onset colorectal cancer risk stratification

Zhen, Junhai; Li, Jiao; Liao, Fei; Zhang, Jixiang; Liu, Chuan; Xie, Huabing; Tan, Cheng; Dong, Weiguo

doi:10.1038/s41698-024-00719-2

Download PDF

Article
Open access
Published: 22 October 2024

Development and validation of machine learning models for young-onset colorectal cancer risk stratification

Junhai Zhen¹^na1,
Jiao Li²^na1,
Fei Liao²,
Jixiang Zhang²,
Chuan Liu²,
Huabing Xie¹,
Cheng Tan² &
…
Weiguo Dong²

npj Precision Oncology volume 8, Article number: 239 (2024) Cite this article

3771 Accesses
6 Citations
Metrics details

Subjects

Abstract

Incidence of young-onset colorectal cancer (YOCRC, younger than 50) has significantly increased worldwide. The performance of fecal immunochemical test in detecting YOCRC is unsatisfactory. Using routine clinical data, we aimed to develop machine learning (ML) models to identify individuals with high-risk YOCRC who require further colonoscopy. We retrospectively extracted data of 10,874 young individuals. Multiple supervised ML techniques were devised to distinguish individuals with and without CRC, classifiers were trained, internally validated and temporally validated. In internal validation cohort, Random Forest (RF) ML model demonstrated good performance with AUC of 0.859 and highest recall of 0.840. In temporal validation cohort, the RF ML model also exhibited good classification performance, achieving AUC of 0.888 and highest recall of 0.872. RF algorithm-based approach is effective and feasible in YOCRC risk stratification. This could be valuable in assessing the risk of YOCRC so that clinical management, including further colonoscopy, can be subsequently made. (Registration: This study was registered with ClinicalTrials.gov (NCT06342622) on March 15, 2024.).

Diagnostic yield of a risk model versus faecal immunochemical test only: a randomised controlled trial in a colorectal cancer screening programme

Article 19 July 2023

Development and validation of survival prediction tools in early and late onset colorectal cancer patients

Article Open access 15 April 2025

Construction of a prognostic prediction model for colorectal cancer based on 5-year clinical follow-up data

Article Open access 21 January 2025

Introduction

Colorectal cancer (CRC) is one of the most common malignant tumors, its morbidity and mortality ranked the third and second place respectively among all kinds of malignant tumors, which has resulted in a huge burden to health systems around the world¹. Fortunately, the incidence of CRC among adults older than 50 has decreased obviously over the years^2,3, this is partly due to the popularity of CRC screening recommended by guidelines. However, the global incidence of young-onset colorectal cancer (YOCRC, younger than 50 years) sustained increase since 1990s^4,5, the causes for this phenomenon remain elusive. Effectively mitigating the burden of CRC among younger adults has become a pressing public health issue that needs to be addressed.

The prognosis of CRC is closely related to the stage at diagnosis, 5-year relative survival rate of stage I is as high as 90%, while the stage IV CRC with distant metastasis is only 14%⁶, YOCRC patients tend to present with advanced stage at initial diagnosis⁷. Therefore, early identification of YOCRC is vitally important. Since the development of CRC mostly follows the sequence of “adenoma-cancer”, and it takes 5–10 years from precancerous lesions to cancer, this provides a time window for early identification of YOCRC⁸. Growing evidence indicated the clinical features, underlying pathogenesis and drivers of YOCRC may differ from the late-onset CRC (LOCRC, elder than 50 years)^9,10, another study also revealed the community-level risk factors associated with YOCRC and LOCRC differed meaningfully¹¹. Furthermore, age and family history of CRC in first-degree relatives are still the cornerstone of CRC risk stratification, most YOCRC will be classified as average risk by the current widely used risk stratification models/scores for merely a minority of YOCRC patients have first-degree relatives with CRC¹², a recently published study demonstrated the Asia-Pacific Colorectal Screening (APCS) score might undervalue the possibility of advanced colorectal neoplasia in patients younger than 50 years¹³. Thus, the risk stratification models/scores used for LOCRC may not be suitable for younger adults. However, there is limited research concerning the development of YOCRC risk stratification model, and a generally accepted YOCRC risk stratification approach is currently lacking. Thus, creating an effective CRC risk stratification model specifically for younger adults has great clinical significance.

Application of artificial intelligence (AI) in medical field is rapidly developing, and machine learning (ML) is a major branch of AI¹⁴. Numerous previous studies have demonstrated ML prediction models are superior to traditional cancer risk stratification models^15,16. Moreover, the widespread adoption of Electronic Medical Records (EMR) in hospitals has led to a vast amount of clinical data being stored in the medical information system. This provides a good platform for deploying ML algorithms and guiding individualized cancer risk stratification decision-making. Therefore, it is possible to construct YOCRC risk stratification models by utilizing EMR and AI. In this study, we aimed to develop, internally and temporally validate the YOCRC risk stratification models based on real-world clinical data and ML algorithms to identify individuals with high-risk of YOCRC that require further colonoscopy. Flowchart of building the AI models for YOCRC risk stratification was shown in Supplementary Fig. 1.

Methods

Data source and study population

The data for this retrospective study were obtained from the EMR of Renmin Hospital of Wuhan University by using Structured Query Language (SQL) tool. This study received approval from the institutional review board of Renmin Hospital of Wuhan University (approve number: WDRY2022-K260). Informed consent was waived since this is a retrospective study and the patients’ personal information was anonymized to protect their privacy. This study was registered with ClinicalTrials.gov (NCT06342622) on March 15, 2024. This study was also conducted in compliance with ethical principles of the Declaration of Helsinki. We analyzed the data of individuals aged 18–49 who underwent colonoscopy from January 2013 to December 2022. The inclusions of YOCRC group were as follows: (1) Newly diagnosed with CRC confirmed by colonoscopy or pathology; (2) Age at 18–49 when diagnosis; (3) Never received any CRC-related treatment including surgery, chemotherapy, radiotherapy and others at another medical institution. The YNCRC (young individuals without CRC diagnosis) group were selected by the following criteria: (1) No CRC confirmed by colonoscopy or pathology; (2) Age at 18–49. The exclusions for both groups were: (1) Hospital stay less than 24 h or without Complete Blood Count data; (2) Patients with inflammatory bowel disease (IBD) or hereditary CRC syndromes such as familial adenomatous polyposis (FAP) or hereditary non-polyposis colon cancer (HNPCC, or named Lynch syndrome); (3) History of other types of primary malignant tumor, and other reasons that made them unsuitable for enrollment in this study, such as the patients with garbled data.

Data extraction

Routine clinical data extracted in this study were obtained from the Department of Gastroenterology, the Department of Oncology, and the Physical Examination Center, etc. There were two major sources for the study participants: some individuals included in our study had relevant symptoms (such as chronic abdominal pain, altered bowel habits, unexplained weight loss, and hematochezia), and they underwent colonoscopy examination at the doctor’s recommendation, while others came to the hospital for a comprehensive physical examination (the physical examination included colonoscopy). Clinical data that were routinely assessed were extracted from the EMR of each patient. After consulting published literatures, clinical expertise and practical considerations, we collected sociodemographic characteristics (such as sex, age, height, weight, body mass index (BMI)), personal history (including smoking history, history of alcohol intake, medical history), history of CRC among first-degree relatives, comorbid conditions (such as diabetes, hyperlipidemia, etc.), symptoms (such as chronic abdominal pain, altered bowel habit, unexplained weight loss, hematochezia, etc.), laboratory test indexes (including routine blood test, biochemical test, tumor marker and fecal occult blood), and results of endoscopy and pathological examination. For patients with multiple hospital visits, we only selected the clinical data from their first visit.

Data preprocessing

Due to the missing values, outlier and extreme values, imbalanced data and other conditions that need to be handled in the raw data, data preprocessing become a critical part in constructing ML models for YOCRC risk stratification. To handle missing values, we filtered out features that had more than 65% missing data. Then, we used a combination of simple imputation technique and nonparametric Random Forest (RF) algorithm to fill in the remaining missing values. As for the handling of outlier and extreme values, values greater than 1.5 times the IQR (Interquartile Range) of the upper quartile or lower than 1.5 times the IQR of the lower quartile were considered as the outlier and extreme values¹⁷, we further replaced those below the 1% quantile with the 1% quantile value and replaced those above the 99% quantile with the 99% quantile value. This approach preserves the information contained in the outliers while minimizing their adverse impact on ML model performance. Categorical features like sex, smoking history, history of alcohol intake, history of CRC among first-degree relatives were binarized through one-hot encoding. To ensure comparability between different features and eliminate dimensionality effects, continuous numerical features like age, routine blood test, and biochemical test were normalized using the Min-max normalization method: new value = (original value - X_min)/(X_max-X_min). The data from 2013 to 2021 were randomly split into a training dataset (50%) for ML model development, and an internal validation dataset (50%) for internal validation, while the data from 2022 were used as temporal validation cohort for temporal validation. The vast majority of participants were not diagnosed with YOCRC, resulting in imbalanced data, this may lead to the YOCRC risk stratification models had strong bias towards the negative category, that is, YNCRC individuals. Resampling was a popular and effective method to alleviate the influence from imbalanced distribution, here we introduced the random downsampling method¹⁸ on the training dataset to deal with the imbalanced data, which randomly deleted cases from the majority class, until the two classes were balanced.

Feature selection

Each feature had different value in development of the YOCRC risk stratification models, some features were critical, while certain features might be redundant or even irrelevant for the label. Appropriate feature selection could reduce the complexity of model creation and enhance its efficiency. The available computational resources and experimental outcomes determined the final feature selection pipeline. We started by removing features with a single category ratio higher than 90% and those with a low coefficient of variance (<0.1). Then the spearman correlation analysis was conducted to assess the correlation between features. An absolute value of the correlation coefficient >0.8 indicated a strong correlation between features, and only one of the features was retained. A heatmap was used to visualize the correlations between features. Subsequently, we applied the Boruta feature selection method to further exclude irrelevant features. Boruta can handle nonlinear relationships and interactions between features, and it performed better than penalized linear regression methods (e.g., L1 and L2 regularization) in our experiments. This algorithm calculated the Z-score of each real feature obtained from RF model, and compared them with the “shadow features” in each iteration, if the Z-score was greater than the maximal Z-score of shadow features, the real feature was judged as “important” and retained; otherwise, the real feature was phased out¹⁹. Here, we conducted Boruta feature selection method to select the key features in YOCRC risk stratification models.

YOCRC risk stratification model development

The training dataset was used to develop ML models for YOCRC risk stratification. In this study, we selected eight mature ML algorithms including Logistic Regression (LR), RF, k-Nearest Neighbor (KNN), Support Vector Classification (SVC), Decision Tree (DT), eXtreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), Stacking ensemble classifier. LR model is a generalized linear regression ML model commonly used for classification tasks²⁰. KNN model is one of the most widely used and simplest classifiers, this algorithm is realized by determining the similarity or distance between the tested cases and the training cases²¹. SVC is a well-known algorithm, this model completes the classification task by establishing the decision boundaries that get the margin maximization²². DT is one of the oldest and most common ML algorithms, where each internal node in the DT represents a judgment on an attribute, each branch represents the output of a judgment result, and finally the leaf nodes represent the classification results²³. RF, XGBoost and AdaBoost are all ensemble ML algorithms consisting of multiple decision trees, those algorithms were intensively explored in medical field. The Stacking ensemble learning algorithm is realized based on layered learning, usually a two-layer learner. In the first layer, several base learners are developed. The outputs of these first-layer learners serve as input features for the second-layer learner, known as the meta-learner. The final label is predicted through the training of the meta-learner²⁴. We selected models showing the top three areas under the receiver operating characteristic (ROC) curve (AUC) values in the internal validation cohort as the first-layer learners and base classifiers for the Stacking algorithm, while the LR model was chosen as the meta-learner for Stacking algorithm. Supplementary Fig. 2 showed the detailed information for the ML models used in this study.

As the cancer risk stratification tool, the consequences of the missed diagnosis of CRC are more severe than misdiagnosis. Therefore, we applied Grid search with cross validation to search the best hyper-parameters for each ML model, aiming to achieve the highest ‘Recall’ value (we set the hyperparameter scoring = ‘recall’). During the Grid search, tenfold cross validation was used to divide the training set into 10 parts. For each corresponding ML model, we trained all the hyperparameter combinations using nine parts, while the remaining one part was used to obtain the predicted ‘Recall’ value. After 10 iterations, 10 predicted recall values were generated, then the average of the 10 groups predicted “Recall” values were further determined, all the ML models repeated the above processes. From the hyperparameter pool, we ultimately selected the best hyperparameter combination with the highest predicted “Recall” value for each ML model development (Supplementary Fig. 3).

YOCRC risk stratification model evaluation

The performance of YOCRC models were assessed in both the internal validation dataset and temporal validation dataset. Accuracy (\(\frac{{TP}+{TN}}{{TP}+{FP}+{TN}+{FN}}\)), Sensitivity (Recall) \((\frac{{TP}}{{TP}+{FN}})\), Specificity \((\frac{{TN}}{{TN}+{FP}})\), Negative predictive value (NPV) \((\frac{{TN}}{{TN}+{FN}})\), Precision or Positive predictive value (PPV)\(\,(\frac{{TP}}{{TP}+{FP}})\), and F1 scores (\(\frac{2* {Precision}* {Recall}}{{Precision}+{Recall}}\)) were calculated respectively. Here, TP, FP, TN, FN represent the number of true positives, false positives, true negatives and false negatives. Additionally, we used ROC curve and AUC to evaluate the discrimination of the YOCRC risk stratification models. The Brier score was used to evaluate both discrimination and calibration quantitatively, Brier score (Y, P) = 1/n * \({\sum }_{i=1}^{n}{({P}_{i}-{Y}_{i})}^{2}\,\)(P: Probability of model prediction; Y: Actual probability of occurrence; n: number of predicted events). The lower the Brier score, the better the discrimination and calibration of the ML model. The calibration plot was used to assess the consistency between the values predicted by the ML model and the actual labels.

Feature importance and YOCRC risk stratification model interpretations

ML models are often regarded as black boxes due to the complexity and opaque of the algorithms. In order to enhance the clarity and understanding of the reasoning mechanism behind ML models, we introduced Shapley Additive exPlanation (SHAP) method to improve the interpretability for the ML model with best performance. SHAP is a novel approach designed to interpret various ML models proposed by Lundberg and Lee²⁵, it has been widely verified to be an approach with excellent interpretability performance²⁶. We conducted the SHAP summary analysis to provide the global interpretability of the ML model, and the Local Interpretable Model-agnostic Explanations (LIME) were also implemented to explain how the ML model make the classification decision in a given individual (local interpretability). SHAP dependence analyses were visualized for the four features with the highest SHAP values to show how a single feature affects the output of the ML model.

Software and statistical analysis

Data preprocessing, statistical analysis and ML model development were conducted by using SPSS 20.0 software, R (v.4.3.0, R Foundation for Statistical Computing), and Python software (v.3.9.12, Python Software Foundation), the python packages used in this study were ‘Pandas’, ‘NumPy’, ‘SciPy’, ‘Matplotlib’ and ‘Scikit-learn’, etc. Continuous variables were expressed as mean ± standard deviation (Mean ± SD) or medians (interquartile range), while categorical features were expressed as percentages.

Results

Patient characteristics

Figure 1 displayed the enrollment flow chart of patients. We initially extracted the data of 13,653 patients from EMR, after applying exclusion criteria, 1665 patients were removed due to the hospital stay less than 24 h or without Complete Blood Count data, 523 Patients with IBD or hereditary CRC syndromes were excluded, another 591 patients with history of other types of primary malignant tumor or with garbled data were also not included. Finally, up to 10,874 individuals qualified for this study, 477 individuals with YOCRC and 10,397 without YOCRC, of which, 7134 (65.6%) individuals had no CRC related symptoms, more specifically, and there were 317 (66.5%) YOCRC cases and 6817 (65.6%) YNCRC cases without CRC related symptoms, respectively. 8,820 patients in derivation cohort (4410 in training dataset and 4410 in internal validation dataset, respectively), while 2,054 patients in temporal validation cohort. Baseline characteristics of the enrolled individuals from two cohorts were depicted in Table 1, there were 399 (4.5%) and 78 (3.8%) YOCRC cases in derivation cohort and temporal validation cohort, respectively. The median age for the derivation cohort was 43 (36–48) years and 56.4% were men, for the temporal validation cohort, the median age was 43 (35–47) years and 53.6% were men. Overall, the subjects from both cohorts appeared to be similar in terms of most features.

**Fig. 1: Flowchart of the patient selection.**

Table 1 Baseline characteristics of the enrolled individuals in the derivation cohort and temporal validation cohort dataset

Full size table

Feature selection

There were up to 65 features in the original feature set (Supplementary Table 1). The distribution of missing data for each feature was illustrated in Supplementary Fig. 4, overall, the data of “Sex”, “Age”, “Personal history”, “Comorbid conditions”, “CRC related symptoms”, laboratory test indexes like “Complete Blood Count” and “Biochemical test” were basically complete, while the remaining features showed missing values to varying degrees, after removing 11 features with more than 65% missing values (Height, BMI, AFP, CA199, CA724, CA125, CA153, NSE, SCC, Hcg, TPSA), a total of 54 features were retained. We further excluded 20 categorical features with a single category ratio higher than 90% and three continuous features with the variances lower than 0.1. We also explored the relationship between features by conducting spearman correlation analysis, the heatmap was shown to visualize the degree of correlation between different features (Supplementary Fig. 5), five groups of features exhibited a high correlation: (1) “Neutrophils (Neu)” was highly correlated with “White blood cell (WBC).” (2) “Alanine aminotransferase (ALT)” was highly correlated with “Aspartate aminotransferase (AST).” (3) “Direct Bilirubin (DBil)” was highly correlated with “Total Bilirubin (TBil).” (4) “High-density lipoprotein (HDL)” was highly correlated with “Apolipoprotein a (Apoa).” (5) “Low-density lipoprotein (LDL),” “apolipoprotein b (Apob),” and “total cholesterol (TC) were highly correlated with each other. Thus, we retained five features (WBC, ALT, TBil, HDL, and TC) in the five groups and removed six features (Neu, AST, DBil, LDL, Apoa, and Apob). Then the Boruta feature selection method was applied to select the key features, 10 features marked as unimportant were phased out. Finally, 16 important features (Fecal immunochemical test (FIT)), Carcinoembryonic antigen (CEA), Lipoprotein a (Lipa), HDL, Lactate dehydrogenase (LDH), Alkaline phosphatase (ALP), Albumin/globulin ratio (A/G), ALT, Red blood cell distribution width (RDW), C-reactive protein (CRP), Plasma hemoglobin (Hb), Platelet (PLT), Monocyte (Mono), Lymphocyte (Lym), WBC, and TC were used in the subsequent development and evaluation of ML models in the training, internal validation and temporal validation (Fig. 2).

Comparison of the machine learning models’ performance for YOCRC risk stratification

We developed eight ML models for YOCRC risk stratification in the training dataset. Figures 3A, 4A, and Table 2 described the performance of YOCRC risk stratification models in internal validation dataset. Among these models, RF ML model provided the good performance with AUC of 0.859 and the highest Recall of 0.840, outperforming other models (AUC: LR, 0.768; KNN, 0.692; SVC, 0.777; DT, 0.732; XGBoost, 0.871; AdaBoost, 0.843; Stacking, 0.821; Recall: LR, 0.650; KNN, 0.665; SVC, 0.720; DT, 0.625; XGBoost, 0.775; AdaBoost, 0.745; Stacking, 0.800). We also used the confusion matrix to evaluate the ML models in the internal validation dataset (Supplementary Fig. 6), each comprising four boxes, with the boxes at the top left, bottom left, top right, and bottom right representing TN, FN, FP, and TP, respectively. The darker the color, the larger the corresponding value. The TN, FN, FP, and TP for the RF model were 3126, 32, 1084, and 168, respectively (Supplementary Fig. 6); thus, 168 YOCRC cases were properly detected among 200 patients with YOCRC (84%) using the RF model, while 3126 cases were properly classified under YNCRC among the 4210 individuals with YNCRC (74.25%). In the temporal validation dataset, the RF ML model also had good classification performance with AUC of 0.888, and showed the best Recall of 0.872 compared to other models, a high Recall indicated a lower YOCRC missed diagnosis rate (AUC: LR, 0.799; KNN, 0.726; SVC, 0.827; DT, 0.779; XGBoost, 0.892; AdaBoost, 0.887; Stacking, 0.849; Recall: LR, 0.577; KNN, 0.679; SVC, 0.744; DT, 0.705; XGBoost, 0.808; AdaBoost, 0.782; Stacking, 0.821) (Figs. 3B, 4B, Table 3 and Supplementary Fig. 7). It is worth noting that RF, XGBoost, and AdaBoost were all ensemble ML algorithms that comprised multiple weak decision trees. These algorithms exhibited the best performance in temporal validation, with all AUC values exceeding 0.850. The confusion matrix for eight ML models in the temporal validation dataset was shown in Supplementary Fig. 7. TN, FN, FP, and TP for the RF model were 1532, 10, 444, and 68, respectively, so 68 patients with YOCRC were correctly identified among 78 patients (87.18%), and 1532 cases were correctly classified as YNCRC among 1976 individuals (77.53%).

**Fig. 3: The Bubble chart used to visualize various performance indicators of ML models in both validation datasets.**

**Fig. 4: ROC curve of YOCRC risk stratification models in both validation datasets.**

Table 2 Comparison the performance of ML models for YOCRC risk stratification in the internal validation dataset

Full size table

Table 3 Comparison the performance of ML models for YOCRC risk stratification in the temporal validation dataset

Full size table

We also observed a relatively low Precision (PPV) (ranged from 0.077 to 0.156 in internal validation, 0.071 to 0.159 in temporal validation) and F1 score (ranged from 0.137 to 0.251 in internal validation, 0.128 to 0.260 in temporal validation) in all ML models, which mainly ascribed to the unbalanced class in the internal validation dataset and temporal validation dataset with over 20-fold discrepancy between YOCRC and YNCRC cases (Supplementary Fig. 8, Supplementary Fig. 9, Table 2, and Table 3). However, the NPV values of the ML models were consistently higher than 0.950 in both internal validation and temporal validation. This indicated that the models developed in this study could serve as a valuable technique for physicians to exclude the possibility of YOCRC when an individual is classified as low risk. Supplementary Figs. 10, 11 showed the generated calibration curve in the lower right corner of the ideal calibration line in both internal validation and temporal validation; thus, the overall YOCRC risk was overestimated to some degree, which may be related to the setting of the hyperparameter “Recall” and consistent with the requirement of cancer risk stratification model, that is, the highest possible “Recall” value was needed to reduce the missed diagnosis rate. Notably, as the abscissa increases, the probability of prediction increases and the generated curve converges to the ideal line (especially for the RF model). This suggests that calibration is better when the model provides a prediction with a high probability of disease. Moreover, all ML models showed good discrimination and calibration with a low Brier score (most ML models had values less than 0.250 for both internal and temporal validations) (Table 2 and Table 3).

Sub-analysis of the optimal RF model for very early onset colorectal cancer risk stratification

A recently published YOCRC clinical practice guidelines²⁷ defined very early onset CRC as CRC diagnoses at 35 years of age or younger, based on definitions used in previous studies^28,29,30,31. This field has been attracting growing interest among researchers and clinicians. Former epidemiological data have revealed that the greatest change in incidence of YOCRC is among younger adults age at 20–39^32,33. Therefore, we also tested the performance of RF model in very early onset CRC risk stratification. And this analysis also showed the good discriminative performance of RF in very early onset CRC risk stratification, with the AUC of 0.845, Accuracy of 0.733, Recall of 0.784, Specificity of 0.732, NPV of 0.989, PPV of 0.099, F1 score of 0.175 and Brier score of 0.183 in internal validation, and AUC of 0.867, Accuracy of 0.781, Recall of 0.813, Specificity of 0.780, NPV of 0.992, PPV of 0.104, F1 score of 0.184 and Brier score of 0.167 in temporal validation (Supplementary Fig. 12).

Feature importance and model interpretations with SHAP

The RF model with the highest Recall value in both internal validation and temporal validation was selected for SHAP analysis. The SHAP feature importance of the RF model was displayed in Fig. 5A with the average absolute SHAP values as the importance of each feature; the selected features were arranged from the highest to lowest average absolute SHAP values for ranking the impact of each feature on the prediction. Figure 5A showed the top five informative features for YOCRC risk stratification were CRP, FIT, Lym, CEA, and LDH. Figure 5B showed the density scatterplot of the SHAP values in YOCRC-negative and YOCRC-positive cases. On the left, the features were arranged in order of their importance (the same as in Fig. 5A). The color on the right defines the feature value: blue was related to a lower value, whereas red signified a higher value. As presented in Fig. 5B, CRP had the widest distribution of SHAP values, and many SHAP values (positive or negative values) of CRP were far from 0 compared with other features, suggesting higher average absolute SHAP values (highest importance), consistent with the results shown in Fig. 5A. The blue dots of CRP were mainly located on the left, and the red dots were mainly on the right; therefore, a lower CRP value contributed to a lower prediction of YOCRC, while a higher CRP value was associated with a higher possibility of YOCRC. Similarly, higher levels of CEA and LDH and positive FIT contributed to a higher possibility of YOCRC, whereas a higher Lym count was associated with a lower possibility. We conducted the SHAP dependence analysis and drew the partial dependency plots for the relationship between the feature value and the SHAP value for the top four most important features; the analysis of partial dependence plots provided a visual interpretation of the distribution of each feature and its global relationship with the risk of YOCRC. Figure 6A indicated how CRP influenced the risk of YOCRC; as CRP gradually increased in the x-axis value, the likelihood of YOCRC also had an increased tendency (increased in the y-axis value); similar patterns could also be observed with FIT (Fig. 6B) and CEA (Fig. 6D). However, an increased Lym value was associated with a lower likelihood of YOCRC (Fig. 6C), consistent with the results shown in Fig. 5B.

**Fig. 5: Feature importance and global interpretations based on SHAP method.**

**Fig. 6: SHAP dependence analysis for the top four influencing features (CRP, FIT, Lym, CEA).**

We presented two additional cases, one with YOCRC and the other without YOCRC, which were correctly classified by the RF model using the LIME method to demonstrate the clinical utility and translational impact of the RF model in the real clinical setting. Figure 7A showed a case without YOCRC, which was correctly classified as YNCRC (the possibility of YNCRC was 0.84), mainly based on the following considerations: a negative FIT, low levels of CRP, CEA, and LDH, and a higher Lym count. We randomly chose a patient with YOCRC. The LIME method also correctly indicated a high possibility of YOCRC (the possibility of YOCRC was 0.87), mainly determined based on the following conditions: higher levels of CRP, CEA, and Mono count, the lower level of Hb, and Lym count, while the negative FIT and higher level of HDL had an incorrect effect on the final classification results (Fig. 7B). In clinical practice, conflicting indicators make making definitive decisions challenging. Therefore, local interpretation using the LIME algorithm has significant clinical value for aiding clinicians in assessing the risk of YOCRC.

**Fig. 7: Local interpretations based on LIME method.**

Discussion

The incidence of YOCRC is increasing in most countries, but data on risk stratification approaches for YOCRC are scarce, and the use of ML algorithms in detecting high risk young individuals age at 18–49 years is still lacking. In this large population-based sample study, we developed, internally and temporally validated a set of YOCRC risk stratification ML models. We found that the ML models based on routine clinical data could be used to YOCRC risk stratification among younger adults, of which, RF model presented a better overall performance in YOCRC risk stratification with the lowest missed diagnosis rate, thus, this model is potentially useful as a tool to help physicians identify young individuals with high risk of YOCRC that require further colonoscopy.

The increasing prevalence of YOCRC has prompted the American Cancer Society (ACS) to update the CRC screening starting age from 50 to 45 years for average-risk individuals in 2018³⁴. Besides, according to the current CRC guidelines from UK and USA^35,36,37, young individuals with hereditary syndrome or family history of CRC should undergo CRC screening at an earlier age. However, more than half of YOCRC patients are younger than 45³⁸, and some guidelines still recommend 50 years as the starting age for CRC screening among average-risk individuals^36,39, also the majority of YOCRC are sporadic without hereditary syndrome or family history of CRC¹², “one-size-fits-all” standard for screening starting age may cause the delay diagnosis of many YOCRC patients⁴⁰. What’s more, clinical features, pathological profiles, molecular characteristics and risk factors of YOCRC patients appeared to represent a different disease process compared with LOCRC^9,10,11, the majority of data about CRC risk stratification is based on evidence from LOCRC cohorts, this could explain the reason why the current CRC risk stratification approaches like APCS may not be suitable for younger adults. The only currently available international management guideline specially designed for YOCRC was published in 2023, recommending that younger adults with high-risk symptoms (hematochezia, unexplained iron deficiency anemia, or unexplained weight loss) or positive FIT undergo diagnostic colonoscopy. In contrast, for individuals with low-risk symptoms (i.e., abdominal pain, or changes in bowel habits), FIT may be an option to assess the risk of YOCRC²⁷. Nevertheless, owing to the lack of relevant research evidence, some important YOCRC risk stratification contents, including but not limited to the detailed risk stratification process such as risk assessment frequency and other risk stratification methods (such as CRC risk stratification score) were not mentioned in this guideline. Therefore, the globally recognized YOCRC risk stratification method is yet to be established.

Nowadays, several screening methods for CRC exist. Colonoscopy could directly and clearly observe the colorectal mucosal lesions, but its widespread implementation was challenging in developing countries/regions because of limited medical resources and the invasive nature of the procedure. As an alternative, FIT is a widely used and noninvasive tool for CRC risk stratification. Multiple guidelines proposed FIT as the routine and primary CRC risk stratification examination^41,42,43, however, the sensitivity of FIT was comparatively low in identifying stage I CRC (53%), suggesting that many cases of CRC will remain undetected by FIT alone⁴⁴, besides, the sensitivity of FIT was strongly associated with the sampling times, studies found the sensitivity of FIT was lower than 0.7 with one sampling^45,46 (lower than the sensitivity of ML models developed in our study), this suggests that over 30% of CRC cases will be missed with only one FIT sampling, thus, to reduce the risk of missed diagnosis, multiple examinations for FIT might be required to improve sensitivity, increasing the time cost. A large meta-analysis found the low PPV (4.9%) for FIT in CRC risk stratification⁴⁷, which was also lowered than our study (PPV = 10%). Fecal DNA/RNA testing and computed tomographic colonography are two additional methods of CRC risk stratification that have been under development for several years. However, they are not yet widely available, particularly in the developing regions⁴⁸, and the risk stratification performance of fecal DNA/RNA testing and computed tomographic colonography for YOCRC requires further research. Therefore, there is an urgent need to develop a user-friendly and accurate CRC risk stratification model specifically tailored for young populations. Archambault⁴⁹ et al. constructed the YOCRC risk prediction models that incorporated an environmental risk score (ERS) and a polygenic risk score (PRS), this study found the higher values of ERS or PRS were associated with an increased risk of YOCRC, with both models contributing to case-control discrimination with an AUC of 0.631 (95% CI: 0.615 - 0.647). In another multicenter prospective study, Zhao⁵⁰ et al. developed the National Colorectal Polyp Care (NCPC) score by conventional LR model, which consisted of eight predicting factors and categorizes individuals into low risk (LR), intermediate risk (IR), and high risk (HR) based on different score ranges, individuals with IR or HR or with a positive FIT were recommended to receive colonoscopy for further diagnosis. Similar to the results of our study (Recall value of LR model was 0.725 and 0.718 in internal validation and temporal validation, respectively), the risk-stratification model combining the NCPC score and FIT was able to detect 72.7% cases of YOCRC patients.

ML algorithms have been widely recognized for their ability to explore nonlinear relationships between features and labels, making them a promising approach for individualized risk prediction, while there have been limited studies on ML algorithms for YOCRC risk stratification. Douglas F Redd⁵¹ et al. used four machine learning and statistical models (LR, SVC, RF, and deep neural network) to identify CRC in US military Veterans aged 35–49, and found the ML models performed better than conventional LR model, among three ML models, deep neural network had the best accuracy achieved 0.985, however, this study only enrolled CRC patients in US military Veterans, individuals younger than 35 were also excluded. Another single center study included 3116 patients aged 35–50, four ML models (regularized discriminant analysis, RF, neural network, and stochastic gradient boosting) were constructed and internally validated. Similar to the above study, this study also found ML models showed higher discriminative ability compared to the conventional LR model, stochastic gradient boosting had the highest AUC of 0.76 (0.46–1.0) for CRC prediction, however, this study had a limited number of CRC cases (only 16, accounting for 0.05% of the sample), which may influence the results of the study⁵². It is worth mentioning that previous epidemiological data have indicated that the greatest change in incidence of YOCRC occurs among younger adults aged 20–39^32,33, whereas both studies only included individuals aged 35–49. In contrast to previous studies, we firstly used ML algorithm to establish the YOCRC risk stratification models for individuals aged 18–49 based on large samples, we also evaluated the performance of ML models in very early onset CRC risk stratification for the first time. Additionally, temporal validation was conducted to further assess the performance of the ML models. In clinical practice, the consequences of missed diagnosis of YOCRC are more serious than misdiagnosis. In this study, we prioritized “Recall” over “Precision” and optimized the risk stratification models by setting the hyperparameter scoring = ‘recall’. Our findings demonstrated that the RF ML model achieved a recall of 0.840 and 0.872 in the internal validation and temporal datasets, respectively, surpassing the results reported in previous studies.

The prediction indicators for YOCRC risk stratification are diverse. In this study, we trained the risk stratification models using basic clinical information and routine laboratory testing items, which are easily accessible. The SHAP analysis of this study identified CRP, FIT, Lym, CEA and LDH as the top five key features in the RF based YOCRC risk stratification model. The underlying mechanism may be corroborated by previous findings, CRP is commonly used to assess the systemic inflammation, when the body shows signs of non-specific inflammation in the tumor microenvironment, the tumor cells release proinflammatory factors, which could stimulate hepatocytes to secrete CRP into the bloodstream⁵³. Growing evidence has showed the CRP concentration is significantly increased in cancer patient plasma⁵⁴, suggesting its potential function as a blood biomarker for cancer risk stratification, this has also been confirmed by multiple recently published studies assessing the value of CRP in CRC detection^55,56,57. FIT is commonly used in population screening to identify individuals who have CRC, and it is strongly recommend by various CRC guidelines^34,58. Lymphocyte count is an indicator of the responsiveness of the host’s immune system, it plays the crucial role in inhibiting the proliferation and spread of cancer⁵⁹. Lymphopenia is often observed among advanced cancer cases. This could lead to a weakened and insufficient immune response. Several studies have associated lymphopenia with an unfavorable prognosis in oncological patients⁶⁰. An 11-year follow-up study also indicated that medium and high cytotoxic activity of peripheral-blood lymphocytes is associated with reduced cancer risk, whereas low activity is associated with an increased risk⁶¹. LDH is an enzyme that plays a crucial role in cellular metabolism and energy production. Pre-treatment LDH levels have demonstrated its utility as a biomarker for risk stratification in patients with CRC^62,63, here we also found the LDH has the potential value in YOCRC risk stratification. CEA is an oncofetal antigen produced by epithelial tumor cells in the digestive tract derived from endoderm, it exhibits high specificity but low sensitivity in CRC risk stratification, thus, similar to the LDH, CEA is commonly tested in patients with known CRC to monitor their response to treatment or screen for recurrence after treatment⁶⁴. Of note, this study also discovered that indicators related to lipid metabolism, such as HDL and Lipa, were important features in the YOCRC risk stratification model. These findings align with previous research indicating that hyperlipidemia is one of the risk factors for YOCRC⁶⁵. This suggests that YOCRC is closely related to metabolic processes and emphasizes the need to pay more attention to these metabolic indicators for YOCRC prevention. In contrast to the features in our study, a previous study showed that the best CRC risk stratification ML model comprised the four most important features: age, albumin, hematocrit and percent Lym, with the mean age of CRC patients higher than 60 years old⁶⁶. Similarly, another study also found the age was the most important features in CRC risk stratification (mean age of CRC patients were higher than 50), followed by various hemoglobin-related parameters⁶⁷. Goshen⁶⁸ et al. developed a CRC risk stratification model based on standard laboratory records (mean age of CRC patients was 62.2), and found the most important features felled into three principal categories: iron deficiency (ferritin, etc.), inflammation and thrombosis (Hb, Mean corpuscular volume, Neutrophils, PLT, etc.), and liver function (alanine aminotransferase, etc.). On the whole, the features for YOCRC and general population (or older adults) CRC risk stratification are not exactly the same, there are the same and different features, for the same features such as Lym, Hb and PLT, the importance ranking in the model is also different, so the CRC risk stratification model for the general population (or older adults) may not be the optimal model for young population, making this study clinically valuable.

This study has some limitations. Firstly, our ML models are based on retrospective EMR data that may cause potential inherent bias, to reduce the influences of such bias, we enrolled large number of samples with up to 10,874 individuals included, and the performance of the models were also internally and temporally validated. Secondly, previous researches^4,69,70 demonstrated gene mutations, westernized diet, and low physical activity increased the risk of YOCRC, so another limitation is that our ML models did not include the information about PRS, dietary habits and physical activity conditions because those data were lacking in the EMR. Thirdly, although we conducted internal and temporal validation to assess the stability of our model’s application, further evaluation is needed to determine the risk and benefits of deploying the optimal model in real-world scenarios. This would require the design of clinical trials for comprehensive evaluation. Fourthly, the precision values of current YOCRC risk stratification models are relatively low due to the imbalanced distribution of YOCRC and YNCRC individuals. In order to improve the precision of the YOCRC risk stratification model, it could be beneficial to incorporate additional important features such as PRS, dietary habits and physical activity condition in the future, which might improve its precision. Lastly, the individuals enrolled in our study were solely from Chinese populations in one big medical center, also the model was developed and validated restricted to individuals with the hospital stay of more than 24 h and nearly one-third of enrolled individuals had CRC related symptoms. Thus, its performance in other populations might be uncertain. The generalizability and applicability of the current method needs to be assessed through additional external validation datasets, including those of other ethnic groups. Nevertheless, these limitations do not alter the fact that the RF ML model proposed in this study is worthy of attention in YOCRC risk stratification, healthy institutions could consider to run this model to detect individuals with high-risk of YOCRC.

In conclusion, we have developed, internally and temporally validated the novel YOCRC risk stratification models based on routine clinical data and ML algorithms, the RF ML algorithm model presented the lowest missed diagnosis rate in YOCRC risk stratification among younger individuals aged at 18–49, this model has the potential to assist physicians in identifying individuals at high risk of YOCRC, and those classified as high risk should undergo colonoscopy for further diagnosis. In the future, a prospective study, as well as the other populations are required to further validate the generalizability and practicality of this YOCRC risk stratification model in clinical practice.

Data availability

The data and code underlying this article will be shared on reasonable request to the corresponding author.

References

Sung, H. et al. Global Cancer Statistics2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in185 countries. CA: A Cancer J. Clin. 71, 209–249 (2021).
Google Scholar
Sonnenberg, A. & Byrd-Clark, D. D. U.S. hospitalizations for colorectal cancer 1970–2010. Digestive Dis. Sci. 59, 282–286 (2014).
Article Google Scholar
Murphy, C. C. et al. Decrease in incidence of colorectal cancer among individuals 50 years or older after recommendations for population-based screening. Clin. Gastroenterol. Hepatol. 15, 903–9.e6 (2017).
Article PubMed Google Scholar
Gu, W. J. et al. The burden of early-onset colorectal cancer and its risk factors from 1990 to 2019: A systematic analysis for the global burden of disease study 2019. Cancers 14, 1–17 (2022).
Article Google Scholar
GBD 2019 Colorectal Cancer Collaborators. Global, regional, and national burden of colorectal cancer and its risk factors, 1990-2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet Gastroenterol. Hepatol. 7, 627–647 (2022).
Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2020. CA: A Cancer J. Clin. 70, 7–30 (2020).
Chen, F. W. et al. Advanced-stage colorectal cancer in persons younger than 50 years not associated with longer duration of symptoms or time to diagnosis. Clin. Gastroenterol. Hepatol. : Off. Clin. Pract. J. Am. Gastroenterol. Assoc. 15, 728–37.e3 (2017).
Article Google Scholar
Dekker, E. et al. Colorectal cancer. Lancet 394, 1467–1480 (2019).
Article PubMed Google Scholar
Patel, S. G. et al. The rising tide of early-onset colorectal cancer: a comprehensive review of epidemiology, clinical features, biology, risk factors, prevention, and early detection. Lancet Gastroenterol. Hepatol. 7, 262–274 (2022).
Article PubMed Google Scholar
Zaborowski, A. M. et al. Characteristics of early-onset vs late-onset colorectal cancer: A review. JAMA Surg. 156, 865–874 (2021).
Article PubMed Google Scholar
Dong, W. et al. Geographic variation and risk factor association of early versus late onset colorectal cancer. Cancers 15 (2023).
Stoffel, E. M. et al. Germline genetic features of young individuals with colorectal cancer. Gastroenterology 154, 897–905.e1 (2018).
Article CAS PubMed Google Scholar
Takada, K. et al. Favorable survival after screening for young-onset colorectal cancer: Benefits of screening in young adults. Dis. colon rectum 65, 996–1004 (2022).
Article PubMed Google Scholar
Ahsan, M. M., Lunan, S. A. & Siddique, Z. Machine-learning-based disease diagnosis: A comprehensive review. Healthcare (Basel, Switzerland) 10 (2022).
Pan, Y. et al. Screening and diagnosis of colorectal cancer and advanced adenoma by Bionic Glycome method and machine learning. Am. J. Cancer Res 11, 3002–3020 (2021).
CAS PubMed Google Scholar
Li, T. et al. Predictive models based on machine learning for bone metastasis in patients with diagnosed colorectal cancer. Front Public Health 10, 984750 (2022).
Article PubMed Google Scholar
Ali, M. M. et al. Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Comput. Biol. Med. 136, 104672 (2021).
Article PubMed Google Scholar
Wadekar, A. S. Understanding opioid use disorder (OUD) using tree-based classifiers. Drug Alcohol Depend. 208, 107839 (2020).
Article CAS PubMed Google Scholar
Statistics, L B. & Breiman L. Random forests. Mach. Learn.:5-32 (2001).
Ray, S. A Quick review of machine learning algorithms; proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), F, 2019.
Abu Alfeilat, H. A. et al. Effects of distance measure choice on k-nearest neighbor classifier performance: A review. Big Data 7, 221–248 (2019).
Article PubMed Google Scholar
Gupta, A. & Kahali, B. Machine learning-basedcognitive impairment classification with optimal combination ofneuropsychological tests. Alzheimers Dement (N Y) 6, e12049 (2020).
Quinlan, J. R. Induction of decision trees. Mach. Learn. (1986).
Wolpert, D. H. Stacked generalization. Neural Netw. 5, 241–259 (1992).
Article Google Scholar
Lundberg S. & Lee S. I. A Unified Approach to Interpreting Model Predictions; proceedings of the Nips, F, (2017).
Athanasiou, M. et al. An explainable XGBoost–based approach towards assessing the risk of cardiovascular disease in patients with Type 2 Diabetes Mellitus. IEEE, (2020).
Cavestro, G. M. et al. Delphi initiative for early-onset colorectal cancer (DIRECt) international management guidelines. Clin. Gastroenterol. Hepatol. 21, 581–603.e33 (2023).
Article PubMed Google Scholar
Jasperson, K. W. et al. Evaluating Lynch syndrome in very early onset colorectal cancer probands without apparent polyposis. Fam. Cancer 9, 99–107 (2010).
Article PubMed PubMed Central Google Scholar
Durno, C. et al. Family history and molecular features of children, adolescents, and young adults with colorectal carcinoma. Gut 54, 1146–1150 (2005).
Article CAS PubMed PubMed Central Google Scholar
Terdiman, J. P. et al. Hereditary nonpolyposis colorectal cancer in young colorectal cancer patients: high-risk clinic versus population-based registry. Gastroenterology 122, 940–947 (2002).
Article CAS PubMed Google Scholar
Liu, B. et al. Genetic instability occurs in the majority of young patients with colorectal cancer. Nat. Med. 1, 348–352 (1995).
Article CAS PubMed Google Scholar
Peterse, E. F. P. et al. The impact of the rising colorectal cancer incidence in young adults on the optimal age to start screening: Microsimulation analysis I to inform the American Cancer Society colorectal cancer screening guideline. Cancer 124, 2964–2973 (2018).
Article PubMed Google Scholar
Chambers, A. C. et al. Demographic trends in the incidence of young-onset colorectal cancer: a population-based study. Br. J. Surg. 107, 595–605 (2020).
Article CAS PubMed Google Scholar
Wolf, A. M. D. et al. Colorectal cancer screening for average-risk adults: 2018 guideline update from the American Cancer Society. CA Cancer J. Clin. 68, 250–281 (2018).
Article PubMed Google Scholar
Rex, D. K. et al. Colorectal cancer screening: Recommendations for physicians and patients from the U.S. multi-society task force on colorectal cancer. Am. J. Gastroenterol. 112, 1016–1030 (2017).
Article PubMed Google Scholar
Provenzale, D. et al. NCCN guidelines insights: colorectal cancer screening, version 2.2020. J. Natl Compr. Cancer Netw. : JNCCN 18, 1312–1320 (2020).
Article PubMed Google Scholar
Cairns, S. R. et al. Guidelines for colorectal cancer screening and surveillance in moderate and high risk groups (update from 2002). Gut 59, 666–689 (2010).
Article PubMed Google Scholar
Stoffel, E. M. & Murphy, C. C. Epidemiology and mechanisms of the increasing incidence of colon and rectal cancers in young adults. Gastroenterology 158, 341–353 (2020).
Article PubMed Google Scholar
Parkin, C. J., Bell, S. W. & Mirbagheri, N. Colorectal cancer screening in Australia: An update. Aust. J. Gen. Pract. 47, 859–863 (2018).
Article PubMed Google Scholar
Wang, K. et al. Long-term colorectal cancer incidence and mortality after colonoscopy screening according to individuals’ risk profiles. J. Natl Cancer Inst. 113, 1177–1185 (2021).
Article PubMed Google Scholar
Shaukat, A. et al. ACG clinical guidelines: Colorectal cancer screening 2021. Am. J. Gastroenterol. 116, 458–479 (2021).
Article CAS PubMed Google Scholar
Monahan, K. J. et al. Faecal immunochemical testing (FIT) in patients with signs or symptoms of suspected colorectal cancer (CRC): a joint guideline from the Association of Coloproctology of Great Britain and Ireland (ACPGBI) and the British Society of Gastroenterology (BSG). Gut 71, 1939–1962 (2022).
Article PubMed Google Scholar
Cubiella, J. et al. Clinical practice guideline. Diagnosis and prevention of colorectal cancer 2018 update.Gastroenterologia y. Hepatologia 41, 585–596 (2018).
Article PubMed Google Scholar
Morikawa, T. et al. A comparison of the immunochemical fecal occult blood test and total colonoscopy in the asymptomatic population. Gastroenterology 129, 422–428 (2005).
Article PubMed Google Scholar
Levi, Z. et al. A quantitative immunochemical fecal occult blood test for colorectal neoplasia. Ann. Intern Med. 146, 244–255 (2007).
Article PubMed Google Scholar
Park, D. I. et al. Comparison of guaiac-based and quantitative immunochemical fecal occult blood testing in a population at average risk undergoing colorectal cancer screening. Am. J. Gastroenterol. 105, 2017–2025 (2010).
Article PubMed Google Scholar
Montalvan-Sanchez, E. E. et al. Colorectal cancer screening programs in Latin America: A systematic review and meta-analysis. JAMA Netw. Open 7, e2354256 (2024).
Article PubMed Google Scholar
Levin, T. R. et al. Organized colorectal cancer screening in integrated health care systems. Epidemiol. Rev. 33, 101–110 (2011).
Article PubMed Google Scholar
Archambault, A. N. et al. Risk stratification for early-onset colorectal cancer using a combination of genetic and environmental risk scores: An international multi-center study. J. Natl. Cancer Inst. 114, 528–539 (2022).
PubMed Google Scholar
Zhao, S. et al. FIT-based risk-stratification model effectively screens colorectal neoplasia and early-onset colorectal cancer in Chinese population: a nationwide multicenter prospective study. J. Hematol. Oncol. 15, 162 (2022).
Article CAS PubMed Google Scholar
Redd, D. F. et al. Identification of colorectal cancer using structured and free text clinical data. Health Inform. J. 28, 14604582221134406 (2022).
Article Google Scholar
Hussan, H. et al. Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records. PloS one 17, e0265209 (2022).
Article CAS PubMed PubMed Central Google Scholar
Coussens, L. M. & Werb, Z. Inflammation and cancer. Nature 420, 860–867 (2002).
Article CAS PubMed PubMed Central Google Scholar
Juan, H., Qijun, W. & Junlan, L. Serum CRP protein as a differential marker in cancer. Cell Biochem Biophys. 64, 89–93 (2012).
Article PubMed Google Scholar
Zajkowska, M. & Mroczko, B. A novel approach to staging and detection of colorectal cancer in early stages. J. Clin. Med. 12, 1–11 (2023).
Article Google Scholar
Bu, F. et al. Evaluation of C-reactive protein and fibrinogen in comparison to CEA and CA72-4 as diagnostic biomarkers for colorectal cancer. Heliyon 9, 1–11 (2023).
Article Google Scholar
Zhu, M. et al. C-reactive protein and cancer risk: a pan-cancer study of prospective cohort and Mendelian randomization analysis. BMC Med. 20, 301 (2022).
Article CAS PubMed Google Scholar
Cardoso, R. et al. Colorectal cancer incidence, mortality, and stage distribution in European countries in the colorectal cancer screening era: an international population-based study. Lancet Oncol. 22, 1002–1013 (2021).
Article PubMed Google Scholar
Terzić, J. et al. Inflammation and colon cancer. Gastroenterology 138, 2101–14.e5 (2010).
Article PubMed Google Scholar
Ray-Coquard, I. et al. Lymphopenia as a prognostic factor for overall survival in advanced carcinomas, sarcomas, and lymphomas. Cancer Res 69, 5383–5391 (2009).
Article CAS PubMed PubMed Central Google Scholar
Imai, K. et al. Natural cytotoxic activity of peripheral-blood lymphocytes and cancer incidence: an 11-year follow-up study of a general population. Lancet 356, 1795–1799 (2000).
Article CAS PubMed Google Scholar
Basile, D. et al. The MIMIC study: Prognostic role and cutoff definition of monocyte-to-lymphocyte ratio and lactate dehydrogenase levels in metastatic colorectal cancer. Oncologist 25, 661–668 (2020).
Article CAS PubMed PubMed Central Google Scholar
Ding, H. et al. Evaluating prognostic value of dynamics of circulating lactate dehydrogenase in colorectal cancer using modeling and machine learning. Clin. Pharmacol Ther. (2023).
Nicholson, B. D. et al. Blood CEA levels for detecting recurrent colorectal cancer. Cochrane Database Syst. Rev. 12, 1–203 (2015).
Google Scholar
O’Sullivan, D. E. et al. Risk factors for early-onset colorectal cancer: A systematic review and meta-analysis. Clin. Gastroenterol. Hepatol. 20, 1229–1240.e5 (2022).
Article PubMed Google Scholar
Shi, Q. et al. An enrichment model using regular health examination data for early detection of colorectal cancer. Chin. J. Cancer Res 31, 686–698 (2019).
Article CAS PubMed PubMed Central Google Scholar
Kinar, Y. et al. Development and validation of a predictive model for detection of colorectal cancer in primary care by analysis of complete blood counts: a binational retrospective study. J. Am. Med. Inf. Assoc. 23, 879–890 (2016).
Article Google Scholar
Goshen, R. et al. Predicting the presence of colon cancer in members of a health maintenance organisation by evaluating analytes from standard laboratory records. Br. J. Cancer 116, 944–950 (2017).
Article PubMed PubMed Central Google Scholar
Rosato, V. et al. Risk factors for young-onset colorectal cancer. Cancer Causes Control 24, 335–341 (2013).
Pearlman, R. et al. Prevalence and spectrum of germline cancer susceptibility gene mutations among patients with early-onset colorectal cancer. JAMA Oncol. 3, 464–471 (2017).
Article PubMed Google Scholar

Download references

Acknowledgements

No funding was granted for this study. We would like to thank editors and the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

Author information

These authors contributed equally: Junhai Zhen, Jiao Li.

Authors and Affiliations

Department of General Practice, Renmin Hospital of Wuhan University, Wuhan, 430060, Hubei Province, China
Junhai Zhen & Huabing Xie
Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, 430060, Hubei Province, China
Jiao Li, Fei Liao, Jixiang Zhang, Chuan Liu, Cheng Tan & Weiguo Dong

Authors

Junhai Zhen
View author publications
Search author on:PubMed Google Scholar
Jiao Li
View author publications
Search author on:PubMed Google Scholar
Fei Liao
View author publications
Search author on:PubMed Google Scholar
Jixiang Zhang
View author publications
Search author on:PubMed Google Scholar
Chuan Liu
View author publications
Search author on:PubMed Google Scholar
Huabing Xie
View author publications
Search author on:PubMed Google Scholar
Cheng Tan
View author publications
Search author on:PubMed Google Scholar
Weiguo Dong
View author publications
Search author on:PubMed Google Scholar

Contributions

(I) Conception and design: J.H.Z., C.L., J.L.; (II) Administrative support: W.G.D.; (III) Provision of study materials or patients: F.L., J.X.Z.; (IV) Collection and assembly of data: J.H.Z., H.B.X.; (V) Data analysis and interpretation: C.T., C.L., and J.L.; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Corresponding author

Correspondence to Weiguo Dong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics

This study received approval from the institutional review board of Renmin Hospital of Wuhan University (approve number: WDRY2022-K260). Informed consent was waived since this is a retrospective study and the patients’ personal information was anonymized to protect their privacy.

Consent for publication

All authors have approved of the consents of this manuscript and consent for publication.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zhen, J., Li, J., Liao, F. et al. Development and validation of machine learning models for young-onset colorectal cancer risk stratification. npj Precis. Onc. 8, 239 (2024). https://doi.org/10.1038/s41698-024-00719-2

Download citation

Received: 26 January 2024
Accepted: 19 September 2024
Published: 22 October 2024
DOI: https://doi.org/10.1038/s41698-024-00719-2

This article is cited by

Constructing multicancer risk cohorts using national data from medical helplines and secondary care
- Hadi Modarres
- Dimitris Pipinis
- Emma Hirst-Williams
npj Digital Medicine (2025)

Subjects

Abstract

Similar content being viewed by others

Diagnostic yield of a risk model versus faecal immunochemical test only: a randomised controlled trial in a colorectal cancer screening programme

Development and validation of survival prediction tools in early and late onset colorectal cancer patients

Construction of a prognostic prediction model for colorectal cancer based on 5-year clinical follow-up data

Introduction

Methods

Data source and study population

Data extraction

Data preprocessing

Feature selection

YOCRC risk stratification model development

YOCRC risk stratification model evaluation

Feature importance and YOCRC risk stratification model interpretations

Software and statistical analysis

Results

Patient characteristics

Feature selection

Comparison of the machine learning models’ performance for YOCRC risk stratification

Sub-analysis of the optimal RF model for very early onset colorectal cancer risk stratification

Feature importance and model interpretations with SHAP

Discussion

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Ethics

Consent for publication

Additional information

Supplementary information

Supplementary Information file

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Constructing multicancer risk cohorts using national data from medical helplines and secondary care

Search

Quick links