Introduction

Crohn’s disease (CD), one of the major types of inflammatory bowel disease (IBD), is a chronic, refractory bowel disease of unknown etiology1,2,3. With its rising global incidence (up to 29.3 per 100,000), CD has emerged as a significant public health challenge, predominantly affecting young adults and severely impairing their career aspirations and quality of life4,5,6. Due to the absence of consistent signs and symptoms in the preclinical phase, diagnosis is often delayed for months or even years. Furthermore, no definitive treatment for CD is currently available7. Therefore, establishing approaches to identify whether a person will develop CD in the future has become a public health imperative, which is critical for early diagnosis and timely intervention in at-risk populations. However, such reliable tools for early identification are still lacking.

The onset of CD is preceded by a key preclinical phase characterized by changes in the intestinal immune system, composition of the intestinal microbiome, intestinal permeability, and clinical parameters3,8. Understanding this phase is crucial for predicting the disease. However, although several studies using preclinical samples have identified plasma antibodies, gut microbiome compositions, or hematological and biochemical parameters associated with future CD onset, predictive models based on these markers have shown low predictive performance9,10,11. Recently, the American PREDICTS study, using a nested case-control design, identified a panel of 51 protein biomarkers that can predict CD within 5 years. Despite its high accuracy, that study could not identify proteins associated with CD more than 5 years in advance because it lacked long-term data. It was also restricted by a small sample size and a predominantly male, active-duty military population12. Taken together, the association of proteins with the risk of future CD, and their long-term predictive value, remain largely unknown.

Here, we employed a proteomic approach in a large prospective cohort with up to 16 years of follow-up to establish and validate proteomics-based models for the noninvasive prediction of future CD. Using 2736 Olink plasma protein measurements in 52,896 individuals in the UK Biobank (UKB), we first comprehensively assessed the association between plasma proteins and CD to discover a panel of candidate proteins. Next, we developed machine learning (ML) models with these proteins, alone or in combination with clinical predictor data, and evaluated their predictive value for the onset of CD in the general population. Lastly, we analyzed how these proteins stratify the risk of CD onset.

Results

Study population

After excluding participants diagnosed with CD at baseline and those with missing proteomic data, a total of 52,896 individuals from the UKB were finally included (Fig. 1). The cohort comprised 46% females, with a mean age of 56.8 years. Of these, 39,634 participants were assigned to the UKB training cohort, while 13,262 participants from geographically distinct recruitment centers were assigned to the UKB testing cohort for validation. Table 1 presents the baseline characteristics of the training (n = 39,634) and testing (n = 13,262) cohorts. We observed significant differences in key population characteristics, including age, ethnicity, Townsend deprivation index (TDI), BMI, alcohol consumption, and dietary habits (P < 0.05), highlighting the demographic and lifestyle heterogeneity between training and testing cohorts. Over a median follow-up of 13.6 years (interquartile range [IQR]: 12.9–14.4; maximum: 16.6 years), 139 (0.26%) incident CD cases were identified (Supplementary Table 1). The follow-up duration from baseline to CD diagnosis ranged from 0.12 to 14.88 years (median: 7.84 years; IQR: 4.37–10.74; mean: 7.47 years). When stratified by cohort, follow-up duration among CD cases ranged from 0.12 to 13.78 years in the training set (median: 8.69 years; IQR: 4.75–11.08; mean: 7.87 years), and from 0.38 to 14.88 years in the testing set (median: 6.42 years; IQR: 2.70–8.60; mean: 6.36 years).

Fig. 1: Study overview.

First, we extracted data from 52,896 UK Biobank (UKB) participants with a median follow-up time of 13.6 years, including the Crohn’s disease (CD) endpoint defined by ICD-10 codes, 2736 plasma proteins, and 41 clinical predictors spanning demographics, lifestyle, comorbidity and medication history, serum assays, and the polygenic risk score (PRS) for CD. Next, Cox proportional hazards models and machine learning-based feature importance ranking were used for feature selection, followed by model development with 10 × 5-fold internal cross-validation in the UKB training cohort. The performance of the protein model was then evaluated in a geographically distinct UKB cohort, the external EPIC-Norfolk study, and the cross-sectional Southern China cohort. Finally, we investigated the predictive performance of the protein model and its risk stratification for CD onset. Created in BioRender. Chen, H. (2025) https://BioRender.com/fc0f2j0. Abbreviations: LGBM Light Gradient Boosting Machine, XGBoost eXtreme Gradient Boosting, RF Random Forest, PRS polygenic risk score, ROC receiver operating characteristic, AUC area under the curve.

Table 1 Baseline characteristics of UK Biobank participants in the training and testing cohorts

In the European Prospective Investigation into Cancer (EPIC)-Norfolk study (n = 2944), 16 incident CD cases were identified within 16 years of follow-up. The cohort included 45% males, with a mean age of 61.0 years, and no significant differences in baseline characteristics were observed between cases and controls. In the cross-sectional Southern China cohort, 37 of 74 participants were diagnosed with CD, 64% were male, and the mean age was 44 years. Participants with CD were significantly younger than those without CD (P < 0.001). Baseline characteristics of the two cohorts are provided in Supplementary Tables 2 and 3.

Proteins associated with incident CD

Among the 2736 proteomic biomarkers examined, 44 proteins were significantly associated with incident CD after adjusting for age, sex, and ethnicity in model 1 (Fig. 2a and Supplementary Data 1). We then performed a sensitivity analysis using model 2, which additionally adjusted for TDI, BMI, alcohol intake frequency, smoking status, dietary habits, physical activity, comorbidities (depression and anxiety), and history of antibiotic and nonsteroidal anti-inflammatory drug (NSAID) use, and which confirmed 35 of the 44 significant associations found in model 1 (Fig. 2a). Thirty-two proteins (GDF15, IL6, CHI3L1, TNFSF13B, CXCL11, CD274, CSF1, ASGR1, TNF, CXCL9, TXNDC15, WFDC2, REG1B, TNFSF13, ICAM1, IL15, PGLYRP1, TIMP1, PRSS8, VWA1, IL18BP, DSC2, IL10RB, LGALS9, TNFRSF10A, PLAUR, TNFRSF1A, VSIG2, DEFA1_DEFA1B, TNFRSF1B, TNFRSF14, REG3A) were positively associated with the risk of CD onset, while three proteins (GSN, ITGA11, ITGAV) were negatively associated. Notably, GDF15 (HR 2.16, P = 8.86 × 10⁻⁹) and IL6 (HR 1.51, P = 4.43 × 10⁻⁷) had the most significant associations with CD after Bonferroni correction.
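
As a worked illustration of the Bonferroni criterion used here, the short Python sketch below computes the per-protein significance threshold for 2736 tests at a family-wise level of 0.05 and checks the two reported P values against it. The function names are hypothetical and are not taken from the study's analysis code.

```python
import math

N_PROTEINS = 2736   # number of proteins tested (from the text)
ALPHA = 0.05        # family-wise significance level (from the text)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test P value threshold under Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value: float, alpha: float = ALPHA,
                   n_tests: int = N_PROTEINS) -> bool:
    """True if the raw P value passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


threshold = bonferroni_threshold(ALPHA, N_PROTEINS)
# The -log10 value is where a horizontal significance line would sit
# on a volcano plot of -log10(P).
print(f"threshold = {threshold:.3g}, -log10 = {-math.log10(threshold):.2f}")
print("GDF15 significant:", is_significant(8.86e-9))
print("IL6 significant:", is_significant(4.43e-7))
```

Both reported P values fall well below the corrected threshold of roughly 1.8 × 10⁻⁵.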

Fig. 2: Associations of plasma proteins with incident Crohn’s disease.

a Volcano plots showing the HR (x axis) and −log10(P value) (y axis) for the global associations of 2736 proteins with incident Crohn’s disease in the training set. All results for both Cox proportional hazards regression models 1 and 2 are shown here. Model 1 was adjusted for age, sex, and ethnicity. Model 2 was additionally adjusted for BMI, Townsend deprivation index, smoking, alcohol intake frequency, physical activity, dietary habits, anxiety, depression, and medication history of antibiotics and NSAIDs. P values were calculated using two-sided Wald tests, and the plotted values are not adjusted for multiple comparisons. Proteins above the horizontal dotted black line were significantly associated with incident Crohn’s disease after Bonferroni correction (P < 0.05), taking into account the number of proteins tested (n = 2736). b Enrichment for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Proteins that were significant after Bonferroni correction in the Cox proportional hazards regressions of model 1 or model 2 were submitted to the DAVID website (https://david-d.ncifcrf.gov) for enrichment analysis. P values were calculated using two-sided tests, and statistical significance was defined as a false discovery rate-corrected P < 0.05 (dotted horizontal line). The number above each bar is the number of observed proteins in each pathway. Detailed results are shown in Supplementary Data 2. Abbreviations: BP biological process, CC cellular component, MF molecular function, TNF tumour necrosis factor.

Biologic pathway analyses

Enrichment analyses of the CD-related proteins identified through the two Cox proportional hazards models revealed several biological pathways, including immune and inflammatory response, extracellular space, cytokine-cytokine receptor interaction, and the TNF signaling pathway (Fig. 2b and Supplementary Data 2).

Protein importance ranking

For the proteins associated with CD in both models 1 and 2, we further ranked them according to their importance in predicting CD. As illustrated in the bar chart (Fig. 3a), CD274, CHI3L1, and REG1B were ranked as the top three in protein importance ordering. After using the sequential forward selection scheme, we ultimately selected the top nine proteins (CD274, CHI3L1, REG1B, ITGAV, PRSS8, ITGA11, GDF15, DEFA1_DEFA1B, and IL6) for CD prediction in subsequent analyses (Supplementary Fig. 1).
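
The selection step can be sketched as follows. This is a minimal illustration that assumes the scheme adds proteins one at a time in importance order and keeps the smallest panel whose score reaches the best observed score; the actual stopping rule is given in Supplementary Fig. 1 and may differ. The function `select_top_k`, the toy feature names, and the toy scoring function are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def select_top_k(ranked: Sequence[str],
                 score: Callable[[List[str]], float],
                 tolerance: float = 0.0) -> Tuple[List[str], List[float]]:
    """Sequential forward selection over an importance-ranked feature list.

    Evaluates the top-1, top-2, ... panels and returns the smallest panel
    whose score is within `tolerance` of the best score, plus all scores.
    """
    if not ranked:
        raise ValueError("ranked feature list is empty")
    scores = [score(list(ranked[:k])) for k in range(1, len(ranked) + 1)]
    best = max(scores)
    for k, s in enumerate(scores, start=1):
        if s >= best - tolerance:
            return list(ranked[:k]), scores
    return list(ranked), scores  # unreachable: the best score always qualifies


# Toy example: a made-up score that stops improving after two features.
# In practice `score` would be a cross-validated AUC from a fitted model.
toy_gain = {"f1": 0.10, "f2": 0.05, "f3": 0.0, "f4": 0.0}


def toy_score(features: List[str]) -> float:
    return 0.5 + sum(toy_gain[f] for f in features)


panel, panel_scores = select_top_k(["f1", "f2", "f3", "f4"], toy_score)
print(panel)
```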

Fig. 3: Protein importance ranking and SHAP visualization of modeling on all-time incident Crohn’s disease populations.

a The top 20 proteins according to the average absolute SHAP value. The bar chart indicates the importance of the sorted proteins based on their contributions to the prediction of future Crohn’s disease. b SHAP visualization plot of selected proteins. The width of the range of the horizontal bars can be understood as the extent of the contribution to the prediction of Crohn’s disease; the wider their range, the greater the contribution. The color of the horizontal bars denotes the magnitude of plasma proteins, which was coded in a gradient from blue (low) to red (high), shown as the color bar on the right-hand side. The direction on the x axis indicates the likelihood of developing Crohn’s disease (right) or being healthy (left). Source data are provided as a Source Data file. Abbreviations: SHAP Shapley Additive Explanations.

A Shapley Additive Explanations (SHAP) summary plot was used to illustrate the influence of the selected proteins on CD risk prediction (Fig. 3b). The color of each point encodes the protein level, and its position on the horizontal axis indicates the direction of its effect on the prediction. For example, participants with elevated levels (colored in red) of CD274 exhibited a higher likelihood of developing CD (right side), whereas those with lower levels (blue) were more likely to remain healthy (left). In contrast, for ITGA11 and ITGAV, lower values contributed to higher predictions, whereas higher values decreased them. The remaining proteins can be interpreted in the same way.

Predictive accuracy of proteomics-based models

We evaluated the predictive performance of the 9-protein panel for future CD onset using four machine learning algorithms: Light Gradient Boosting Machine (LGBM), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Extra Trees (Fig. 4, Supplementary Tables 4, 5; Supplementary Figs. 2, 3 and Supplementary Data 3, 4). The 9-protein model demonstrated good predictive performance across all four algorithms for all-time incident CD. In the UKB testing set (geographically distinct), the areas under the curve (AUCs) ranged from 0.71 to 0.73 for a 75/25 training/testing split and from 0.71 to 0.77 for an 80/20 split. The model was further externally validated in the EPIC-Norfolk study, with AUCs ranging from 0.70 to 0.76. In the cross-sectional Southern China cohort, AUCs ranged from 0.76 to 0.79, demonstrating its ability to distinguish CD patients from controls. Notably, the XGBoost model obtained the highest AUC in the Southern China cohort (0.79, 95% CI 0.77–0.81), good performance in EPIC-Norfolk (0.73, 95% CI 0.71–0.75), and the second-best AUC in the UKB testing set (0.72 for a 75/25 split, 95% CI 0.71–0.73; 0.76 for an 80/20 split, 95% CI 0.74–0.77), indicating robust generalizability across independent cohorts. Therefore, XGBoost was chosen as the optimal model. In the UKB testing set, the 9-protein model demonstrated superior predictive performance across all four algorithms compared to clinical risk models based on demographics, serum markers, and polygenic risk score (PRS) for CD (highest AUC among clinical risk models for XGBoost: 0.67 for the 75/25 split and 0.60 for the 80/20 split, Mann–Whitney U test: P < 0.001). For XGBoost, combining the PRS for CD with the protein panel significantly improved the AUC to 0.74 (95% CI 0.73–0.75), and adding all clinical risk factors further increased the AUC to 0.78 (95% CI 0.77–0.79) for a 75/25 split.
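
For readers unfamiliar with the metric, the following Python sketch shows one way an AUC and a bootstrap interval can be computed from predicted risks. It is a simplified, standard-library illustration: the percentile interval and the function names `roc_auc` and `bootstrap_auc` are assumptions for this example and do not reproduce the study's exact procedure of 10 repeated runs of 1000 resamplings.

```python
import random
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outscores a random control.

    Ties count as 0.5. This pairwise form is O(cases x controls), which is
    fine for small examples but slow for biobank-scale data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels: Sequence[int], scores: Sequence[float],
                  n_resamples: int = 1000,
                  seed: int = 0) -> Tuple[float, float, float]:
    """Mean AUC and a 95% percentile interval over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    aucs: List[float] = []
    while len(aucs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack cases or controls
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    mean = sum(aucs) / len(aucs)
    lower = aucs[int(0.025 * len(aucs))]
    upper = aucs[int(0.975 * len(aucs)) - 1]
    return mean, lower, upper


# Toy data: 1 = incident case, 0 = control; scores are predicted risks.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 3 of 4 case-control pairs ranked correctly
```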

Fig. 4: Predictive accuracy of plasma proteins panel, alone or in combination with clinical variables.

Bar charts and dot plots show the area under the receiver operating characteristic curve (AUC) of different predictive models for all-time incident Crohn’s disease (CD) in the geographically distinct UK Biobank testing cohort comprising 25% (a, n = 13,262) and 20% (b, n = 10,632) of participants, and the performance of the protein model in the external EPIC-Norfolk study (c, n = 2944) and in the cross-sectional Southern China cohort (d, n = 74). Each dot represents one of 10 repeated bootstrap analyses, with model performance estimated as the mean AUC from 1000 resamplings. Bars show the mean AUC across the 10 runs, and error bars indicate 95% confidence intervals (CIs). Model performance was evaluated using four machine learning algorithms: Light Gradient Boosting Machine (LGBM), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Extra Trees. Demographic variables included age, while serum markers consisted of 11 indicators detailed in Supplementary Data 9. PRS represents the polygenic risk score for CD. The combined model included proteins, demographics, serum markers, and the PRS for CD. Source data are provided as a Source data file.

We then deployed the six optimal proteins identified in the PREDICTS study12 into the UKB and compared the performance with our models. The results showed that the predictive performance of the PREDICTS model (AUC range: 0.64–0.69, XGBoost AUC: 0.68) was significantly inferior to our protein model (AUC range: 0.71–0.73, XGBoost AUC: 0.72) in predicting all-time incident CD whether or not it was combined with clinical risk factors (P < 0.05) (Supplementary Tables 6, 7 and Supplementary Fig. 4).

Results of replication validation analyses

Our findings remained consistent in the sensitivity analyses for XGBoost. At different time points, the protein model still showed higher predictive value (AUC 0.69–0.72) than clinical risk models (AUC 0.53–0.67) (Supplementary Data 5 and Supplementary Figs. 5, 6). When randomly subsampling a control group matched 1:1 with CD patients by age, sex, and race, the 9-protein model achieved an AUC of 0.71 (95% CI 0.68–0.73) in predicting all-time incident CD, increasing to 0.76 (95% CI 0.74–0.78) with the addition of all clinical risk factors (Supplementary Table 8; Supplementary Fig. 7 and Supplementary Data 6). The results remained consistent after excluding individuals who developed CD within the first 2 years of follow-up (Supplementary Table 9; Supplementary Fig. 8 and Supplementary Data 7). In addition, the associations between the nine proteins and incident CD remained significant after further adjustment for major chronic inflammatory comorbidities (Supplementary Table 10).

Risk stratification for CD onset

To further assess how the 9-protein model stratifies the risk of CD onset, participants were classified into high- and low-risk subgroups using the optimal probability cutoff (0.484 for the XGBoost model), determined by maximizing the Youden index in the training cohort. Kaplan–Meier survival curves illustrated distinct cumulative risk patterns between the stratified subgroups (Fig. 5a). Participants in the high-risk subgroup exhibited a significantly higher risk of developing CD than those in the low-risk subgroup, both in the training set (HR 11.6, P = 1.44 × 10⁻²²) and in the geographically distinct UKB testing set (HR 4.23, P = 3.26 × 10⁻⁵), after adjusting for age, sex, and ethnicity in model 1 (Supplementary Table 11). Similarly, Cox regression analyses for individual plasma proteins also demonstrated significant associations with CD onset (Supplementary Table 12 and Supplementary Fig. 9).
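
The Youden index is J = sensitivity + specificity − 1, and the optimal cutoff is the threshold that maximizes J. The sketch below illustrates the calculation on toy data. The function name, the convention that a probability at or above the cutoff counts as high-risk, and the example values are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def youden_cutoff(labels: Sequence[int],
                  probs: Sequence[float]) -> Tuple[float, float]:
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A participant is classed as high-risk when prob >= cutoff. Candidate
    cutoffs are the observed probabilities; ties keep the lowest cutoff.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one control")
    best_cut, best_j = 0.0, float("-inf")
    for cut in sorted(set(probs)):
        true_pos = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= cut)
        true_neg = sum(1 for y, p in zip(labels, probs) if y == 0 and p < cut)
        j = true_pos / n_pos + true_neg / n_neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Toy data: 1 = incident case, 0 = control.
toy_labels = [0, 0, 0, 1, 1]
toy_probs = [0.1, 0.2, 0.6, 0.5, 0.9]
cutoff, j_stat = youden_cutoff(toy_labels, toy_probs)
print(cutoff, round(j_stat, 3))
```

In the toy data a cutoff of 0.5 captures both cases and two of three controls, giving J = 2/3.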

Fig. 5: Risk stratification and clinical associations of the 9-protein model with Crohn’s disease (CD) and CD-related phenotypes.

a Protein model stratifies the risk of CD onset. Unadjusted Kaplan–Meier curves illustrate distinct cumulative risk trajectories for incident CD between the stratified subgroups (high-risk subgroup: red line; low-risk subgroup: blue line). The optimal cutoff (0.484) was determined using the Youden index in the UK Biobank (UKB) training cohort and applied to both the training and testing sets. Associations between the protein model and disease risk were assessed using Cox proportional hazards models, adjusted for age, sex, and ethnicity, with P values calculated using two-sided Wald tests and without adjustment for multiple comparisons. Hazard ratios (HRs) and P values are presented. Kaplan–Meier estimated cumulative incidence, with shaded areas representing 95% confidence intervals (CIs). b Clinical association between protein levels and CD-related phenotypes. Heatmap showing the associations between nine proteins in the protein model and modifiable risk factors for CD, including obesity, physical inactivity, smoking, alcohol consumption (≥3 times per week), poor diet, depression, and anxiety. Linear regression models were used, adjusted for age, sex, and ethnicity. Protein levels were treated as outcome variables, while CD-related phenotypes served as explanatory variables. The β coefficient represents the effect size, with positive associations in red and negative associations in blue. Statistical significance was determined using the false discovery rate (FDR) correction (*P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001). Source data are provided as a Source data file.

Clinical association between protein levels and CD-related phenotypes

To further explore whether the association between CD and proteins is influenced by modifiable risk factors for CD, we examined the relationships between the nine proteins in the protein model and CD-associated phenotypes. Obesity, physical inactivity, smoking, poor diet, and depression were significantly associated with all nine proteins after false discovery rate (FDR) correction (Fig. 5b and Supplementary Data 8). Specifically, CD274, CHI3L1, REG1B, PRSS8, GDF15, DEFA1_DEFA1B, and IL6 exhibited positive associations with these modifiable risk factors for CD, whereas ITGAV and ITGA11 showed negative associations. Notably, these associations further support the potential of these proteins as early biomarkers for CD, influenced by modifiable risk factors, which may offer opportunities for preventive strategies.
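
The text does not name the FDR procedure used. Assuming the widely used Benjamini–Hochberg method, adjusted P values can be obtained as in the sketch below. The function name and the example P values are illustrative and are not results from this study.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order.

    Each P value is scaled by m / rank, then a running minimum is taken
    from the largest P value downwards so the adjusted values stay monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # start from the largest P value
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example_p = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(example_p))
```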

Discussion

In this study of over 52,000 participants, we applied a large-scale proteomic analysis using four machine learning algorithms to generate a 9-protein model to non-invasively predict future CD onset. The proteomic model (AUC 0.76) significantly outperformed clinical risk models based on demographics, serum biomarkers, and genetics (highest AUC 0.67) in the UKB testing set (geographically distinct). It was further externally validated in EPIC-Norfolk (AUC 0.73) and exhibited high discriminatory capacity for CD in the cross-sectional Southern China cohort (AUC 0.79). Combining proteins with clinical risk data enabled better predictions up to 16 years pre-diagnosis (AUC 0.78). Individuals identified as high-risk by the protein model were 4.23 times more likely to develop CD. CD-associated proteins were enriched in pathways including inflammatory and immune response, cytokine-cytokine receptor interaction, and TNF signaling pathway, indicating the activation of these biological processes long before the clinical diagnosis of CD.

Two recent studies investigated the role of the serum proteome in the preclinical state of CD. The first, the PREDICTS study, used the SOMAscan assay to assess 1129 proteins in 200 CD patients and 200 healthy controls, identifying 51 proteins predictive of CD, with an AUC of 0.76 within 5 years and 0.87 within 1 year12. However, the study’s models, though highly predictive, faced limitations in broader application due to short-term data (maximally 5 years for PREDICTS vs. 16 years for our model), a small sample size, a restricted number of protein tests, dependency on blood predictors, and lack of external validation. Moreover, when the optimal proteins identified were applied to the UKB population, the PREDICTS model (AUC 0.68) underperformed our protein model in predicting all-time incident CD. It should be noted, though, that the PREDICTS study utilized the SOMAscan assay, whereas the UKB data were generated using the Olink Explore panel, and the limited overlap in the proteins assessed may partly account for the observed differences in model performance. The second study analyzed serum proteomics from the GEM cohort, including 71 healthy first-degree relatives (FDRs) who later developed CD and 284 FDRs who remained healthy, using the Olink PEA assay (446 proteins)13. A total of 25 proteins were found to be associated with the risk of developing CD. Among them, CXCL9, CXCL11, CSF1, PGLYRP1, and MMP10 were also identified in our study as associated with future CD incidence. Notably, the time window before diagnosis in the GEM cohort was relatively short (IQR 1.0–3.5 years), whereas ours was much longer (IQR 4.4–10.7 years). This difference should be considered alongside the variations in diagnosis definitions (physician-confirmed vs. ICD code-based) and the distinct populations (at-risk first-degree relatives vs. an elderly general population) between the two studies.
In addition, a recent study developed a logistic regression model using six hematological and biochemical indicators to predict CD up to 10 years before diagnosis9. Despite conducting long-term prediction, the model showed limited predictive performance, with AUC values of 0.615 for 5 years and 0.503 for 9 years prior to diagnosis, and it did not include follow-up data beyond 10 years.

Our results identified nine proteins (CD274, CHI3L1, REG1B, ITGAV, PRSS8, ITGA11, GDF15, DEFA1_DEFA1B, and IL6) that show the greatest importance in predicting future CD onset. Among these, CD274 and CHI3L1 emerged as the most critical predictors. PD-L1, encoded by CD274, is an immune checkpoint protein crucial for maintaining mucosal tolerance by inhibiting the proliferation of activated CD4+ and CD8+ effector T cells14. Dysregulation of PD-L1 expression has been reported in IBD, with elevated levels observed in the intestinal mucosa of patients, suggesting a potential role in disease pathogenesis15,16,17,18,19,20. CHI3L1, a 40-kDa heparin- and chitin-binding glycoprotein, is implicated in inflammation, tissue remodeling, and angiogenesis21,22. Its expression is significantly upregulated in IBD and has been associated with intestinal strictures and increased endoscopic disease activity in CD23,24,25,26. With respect to IL6, olamkicept, a selective inhibitor of interleukin-6 trans-signaling, has shown promising results in Phase II clinical trials for IBD27,28,29,30,31,32,33. Previous studies have shown that REG1B, ITGAV, PRSS8, ITGA11, GDF15, and DEFA1_DEFA1B are also associated with CD or IBD34,35,36,37,38,39,40,41,42. These proteins are thought to influence CD development through mechanisms involving inflammation, immune responses, and tissue repair34,35,36,37,38,39,40,41,42. However, limited research has explored the predictive roles of these proteins in CD, and our findings support their predictive value in CD.

By incorporating key proteins identified through association analysis and importance ranking, we developed proteomics-based models to predict future CD onset. A simple model based on the nine most important proteins achieved good long-term predictive performance (AUC 0.72 for a 75/25 training/testing split; 0.76 for an 80/20 split) in the UKB testing set (geographically distinct), outperforming models based on demographic, serologic, and PRS data (highest AUC 0.67). The model was further externally validated in EPIC-Norfolk (AUC 0.73) and demonstrated high discriminatory capacity in the Southern China cohort (AUC 0.79). When the protein panel was combined with clinical risk factors, the model’s predictive performance was further enhanced, with an AUC of 0.78 in predicting all-time incident CD. Notably, our model performed well for long-term prediction, up to 16 years before diagnosis, a time horizon that, to our knowledge, had not previously been explored.

To enhance the clinical relevance of our model, we examined the associations between the nine proteins and modifiable CD-related phenotypes, such as obesity, smoking, physical inactivity, poor diet, and depression, and identified significant associations. These findings suggest that the identified protein signature may not only serve as an early biomarker for CD but also as a reflection of modifiable risk factors. This may provide opportunities for early intervention or prevention strategies in high-risk populations through lifestyle modifications.

The major strengths of our study lie in the long-term follow-up and large-scale, high-throughput proteomic analysis of a sizable community-based cohort, which allowed us to identify plasma biomarkers and establish proteomics-based models for CD prediction up to 16 years before diagnosis. Importantly, the predictive performance of our protein model was confirmed by external validation in an independent cohort. However, certain limitations should be considered when interpreting these results. First, although the UKB provides a broad assessment of circulating proteins, it does not encompass the entire human proteome, and there may be biases in the selection of secreted proteins for measurement. Second, several clinically recognized markers, such as serum antimicrobial antibodies or gut microbiome, were not available within the UKB and therefore could not be compared with the proteomics data. However, previous models involving these biomarkers underperformed, with AUC values below 0.7 (refs. 10,11). Third, our model was based on an older CD population, with a mean age at diagnosis of 66 years. This predominantly elderly onset may limit the generalizability of our findings to a broader age range of incident CD cases. Fourth, CD diagnosis was based on ICD-10 codes, which may lag behind the actual clinical diagnosis, leading to potential misclassification, especially for cases with shorter follow-up durations. Nevertheless, results of the sensitivity analyses excluding these cases remained largely consistent with the main findings, supporting the robustness of our model. Fifth, the performance of our protein model was validated in EPIC-Norfolk and demonstrated high discriminatory capacity for CD in the Southern China cohort. However, given the limited number of incident CD cases in EPIC-Norfolk and the cross-sectional design of the Southern China cohort, further validation in large-scale, new-onset inception cohorts with pre-diagnostic samples is warranted.
A final limitation lies in our modeling strategy. To improve rare case detection, we applied class weighting, which led to some overestimation of risk in the unmatched population. This reflects a trade-off between calibration and sensitivity, aligned with our aim to develop a screening-oriented model rather than a diagnostic tool.
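
To give a sense of scale for the class imbalance, the sketch below computes an inverse-frequency weight for the positive class. The weighting scheme is not specified in the text, so the ratio of controls to cases (the quantity commonly passed to XGBoost's `scale_pos_weight` parameter) is assumed here, and the overall cohort counts are used purely for illustration rather than the training-set counts.

```python
def positive_class_weight(n_cases: int, n_controls: int) -> float:
    """Inverse-frequency weight for the rare class: controls per case."""
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    return n_controls / n_cases


N_TOTAL = 52_896   # participants in the full UKB analysis cohort (from the text)
N_CASES = 139      # incident CD cases in the full cohort (from the text)

weight = positive_class_weight(N_CASES, N_TOTAL - N_CASES)
print(f"each case would be weighted about {weight:.1f} times a control")
```

A weight of this size makes missed cases far more costly than false alarms during training, which is consistent with the calibration trade-off described above.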

In summary, using one of the largest long-term prospective community cohorts, we have identified key plasma biomarkers for predicting future CD and established proteomics-based ML models to accurately and non-invasively predict CD up to 16 years before diagnosis. These findings offer significant potential to enhance the early screening process for individuals at elevated risk of CD and to support the implementation of interventions.

Methods

Study population

The UKB is a large biomedical resource that enrolled approximately 500,000 individuals aged 40–70 years between 2006 and 2010, collecting various types of data, including blood and urine biomarkers, lifestyle factors, and medical health records. Our access to the UKB data was granted in compliance with their ethical guidelines and access protocols, under project code 83339. For this study, participants with CD at baseline and those lacking proteomic data were excluded, resulting in a final analysis cohort of 52,896 individuals without CD. The 52,896 participants were divided into training and testing cohorts by randomly allocating UKB recruitment centers at a ratio of approximately 75%:25%. The UKB training cohort included participants from Cardiff, Glasgow, Stoke, Reading, Bury, Newcastle, Leeds, Nottingham, Sheffield, Liverpool, Middlesbrough, Hounslow, and Croydon. The UKB testing cohort included participants from Stockport (pilot), Manchester, Oxford, Edinburgh, Bristol, Barts, Birmingham, Swansea, and Wrexham (Supplementary Table 13).
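
The center-based split can be expressed as a simple lookup, sketched below in Python. The center names are taken from the text, but their exact spelling in the UKB data field, the function name, and the "train"/"test" labels are assumptions for illustration.

```python
TRAIN_CENTERS = {
    "Cardiff", "Glasgow", "Stoke", "Reading", "Bury", "Newcastle", "Leeds",
    "Nottingham", "Sheffield", "Liverpool", "Middlesbrough", "Hounslow",
    "Croydon",
}
TEST_CENTERS = {
    "Stockport (pilot)", "Manchester", "Oxford", "Edinburgh", "Bristol",
    "Barts", "Birmingham", "Swansea", "Wrexham",
}


def assign_cohort(center: str) -> str:
    """Map a recruitment center to the training or testing cohort."""
    if center in TRAIN_CENTERS:
        return "train"
    if center in TEST_CENTERS:
        return "test"
    raise ValueError(f"unknown recruitment center: {center!r}")


# Splitting by center keeps every participant from one site in one cohort,
# so the testing cohort is geographically distinct from the training cohort.
N_TRAIN, N_TEST = 39_634, 13_262   # cohort sizes reported in the Results
train_fraction = N_TRAIN / (N_TRAIN + N_TEST)
print(assign_cohort("Leeds"), assign_cohort("Oxford"), f"{train_fraction:.1%}")
```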

Model performance was evaluated in two additional cohorts. The EPIC-Norfolk study43 is a population-based prospective cohort that recruited participants aged 40–79 years from Norfolk, UK, between 1993 and 1997. At baseline, participants completed general health questionnaires and underwent a panel of measurements. Following the same inclusion criteria as in the UKB cohort, this study included 2944 individuals, serving as the external validation cohort.

The Southern China cohort comprised 37 patients with CD and 37 non-IBD controls recruited from Guangdong Provincial People’s Hospital between 2024 and 2025. Eligibility criteria for CD cases included individuals aged ≥16 years with a prior or new diagnosis of CD, confirmed by two experienced clinicians based on standard clinical, radiological, endoscopic, and histopathologic criteria44. Controls were individuals without IBD and with no clinical or endoscopic evidence of intestinal inflammation. Blood samples were collected at enrollment and centrifuged. Ultimately, 74 participants were enrolled in the study. This cross-sectional cohort was used to assess the protein panel’s ability to distinguish CD patients from controls.

Plasma proteomics

The UK Biobank Pharma Proteomics Project (UKB-PPP) consortium has provided extensive proteomic data derived from blood samples. Most of these samples were collected during participants’ initial visits to UK assessment centers between 2007 and 2010, with additional samples gathered from consortium members and participants involved in the COVID-19 repeat-imaging study. Blood specimens were obtained using EDTA tubes, followed by centrifugation at 4 °C for 10 min to obtain plasma, which was then promptly stored at −80 °C for preservation45. Samples were shipped on dry ice to the Olink Analysis Service in Sweden for proteomic profiling using the Olink™ Explore 3072 Proximity Extension Assay (PEA)46. After rigorous quality control measures (https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/PPP_Phase_1_QC_dataset_companion_doc.pdf), the study quantified 2736 unique proteins from 54,219 participants, distributed across eight protein panels: cardiometabolic, cardiometabolic II, inflammation, inflammation II, neurology, neurology II, oncology, and oncology II47. Protein levels were expressed as Normalized Protein eXpression (NPX) values on a log2 scale.

For the EPIC-Norfolk study, serum samples were collected at baseline between 1993 and 1997 and immediately stored in liquid nitrogen to preserve protein integrity. Proteomic profiling was performed using the Olink Explore 1536 and Olink Explore Expansion panels, targeting 2923 unique proteins via 2941 assays. Participants with failed proteomic quality control or missing data on age, sex, BMI, or smoking status were excluded.

In the Southern China cohort, enzyme-linked immunosorbent assay (ELISA) analyses were conducted to evaluate the reproducibility of the protein model developed in the training set. Blood samples were collected using EDTA tubes, and centrifuged at 1000 × g for 15 min, and the plasma supernatants were aliquoted and stored at −80 °C. Plasma samples of the case and control groups were used for ELISA analyses within 3 months after collection. ELISA assays were performed per the manufacturer’s protocol. Briefly, plasma samples (100 µL) and standards were added to pre-coated 96-well plates and incubated at 37 °C for 80 min. After washing three times with 350 µL washing buffer, 100 µL of biotin-labeled antibody working solution was added and incubated at 37 °C for 50 min, followed by horseradish peroxidase (HRP)-conjugated reagent incubation under the same conditions. Plates were washed again before adding 90 µL of 3,3′,5,5′-tetramethylbenzidine (TMB) substrate for color development at 37 °C for 20 min in the dark. The reaction was terminated by adding 50 µL stop solution, and absorbance was measured at 450 nm using a CMax Plus microplate reader (Molecular Devices, USA). The resulting protein data were log2-transformed and standardized. A list of the ELISA kits used is provided in Supplementary Table 14.
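The final preprocessing step for the ELISA data (log2 transformation followed by standardization) can be sketched as below. This assumes that standardization means a z-score computed per protein with the population standard deviation and that all concentrations are strictly positive; neither detail is stated in the text, and the function name is hypothetical.

```python
import math
from statistics import mean, pstdev


def log2_standardize(concentrations):
    """Log2-transform positive concentrations, then z-score them.

    Assumption: standardization is (x - mean) / SD, computed over the
    values of one protein across all samples.
    """
    logged = [math.log2(c) for c in concentrations]
    mu = mean(logged)
    sd = pstdev(logged)
    return [(x - mu) / sd for x in logged]


z_scores = log2_standardize([1.0, 2.0, 4.0, 8.0])
print(z_scores)
```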

Clinical predictors

In the UKB, to compare the predictive efficacy of proteins with other phenotypic indicators and to assess their combined predictive power, we included a range of clinical predictor data: sociodemographic characteristics (age, sex, and ethnicity), socioeconomic status (TDI and BMI), lifestyle factors (alcohol intake frequency, smoking status, dietary habits and physical activity), comorbidities (anxiety and depression), medication history (antibiotics and NSAIDs), serum biomarkers (27 measures) and PRS for CD. Sex was determined based on self-report during recruitment. The serum biomarkers were derived from baseline hematological assessments conducted during the initial recruitment of UKB individuals. Detailed information on the serum predictors used in this analysis is provided in Supplementary Data 9. The PRS was generated by UKB using a Bayesian method based on summary statistics from external genome-wide association study (GWAS) meta-analyses that did not overlap with the UKB population, as described online (https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=5202).

Definition of CD

In the UKB, individuals with a CD diagnosis were identified through the International Classification of Diseases-10 (ICD-10) under code K50. The diagnosis was based on hospital admission data. The follow-up period commenced from the initial visit to the UKB assessment centers, coinciding with the collection of blood samples and other clinical data. Individuals were followed up until the earliest recorded date of either a CD diagnosis, death, or the censoring date (31 October 2022), whichever came first. Those diagnosed with CD prior to baseline were excluded from the analysis.
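The follow-up rule described above (earliest of CD diagnosis, death, or the censoring date) can be sketched as follows. The function and argument names are hypothetical. Expressing follow-up in years of 365.25 days and counting a diagnosis recorded on the same day as death as an event are assumptions.

```python
from datetime import date
from typing import Optional, Tuple

CENSOR_DATE = date(2022, 10, 31)  # censoring date stated in the text


def follow_up(
    baseline: date,
    cd_date: Optional[date] = None,
    death_date: Optional[date] = None,
    censor_date: date = CENSOR_DATE,
) -> Tuple[float, int]:
    """Return (follow-up in years, event indicator) for one participant."""
    if cd_date is not None and cd_date < baseline:
        # Prevalent cases are excluded from the analysis.
        raise ValueError("CD diagnosed before baseline: participant is excluded")
    candidates = [(censor_date, 0)]
    if death_date is not None:
        candidates.append((death_date, 0))
    if cd_date is not None:
        candidates.append((cd_date, 1))
    # Earliest date wins; on a tie, a CD diagnosis is kept as the event (assumption).
    end, event = min(candidates, key=lambda pair: (pair[0], -pair[1]))
    years = (end - baseline).days / 365.25
    return years, event


print(follow_up(date(2008, 1, 1), cd_date=date(2012, 1, 1)))
```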

In the EPIC-Norfolk study43, mortality and hospitalization data were obtained through linkage to the National Health Service digital database using participants’ NHS numbers. These records were coded by trained nosologists according to ICD-10. To ensure comparability with the 16-year observation window in UKB, cases occurring after 16 years of follow-up were treated as censored when evaluating predictive performance. For the Southern China cohort, all patients with CD were confirmed by two experienced physicians.
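Truncating follow-up at a fixed horizon, as done for EPIC-Norfolk to match the 16-year UKB window, can be sketched as below. The function name is hypothetical, and recording the follow-up time of a censored case as exactly the horizon is an assumption.

```python
def censor_at_horizon(years: float, event: int, horizon: float = 16.0):
    """Treat events that occur after `horizon` years as censored at the horizon."""
    if years > horizon:
        return horizon, 0
    return years, event


print(censor_at_horizon(18.2, 1))  # a diagnosis after 16 years becomes censored
print(censor_at_horizon(7.5, 1))   # an earlier diagnosis is unchanged
```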

Statistical analysis

We fitted Cox proportional hazards models to estimate the association between each protein and incident CD in the training set, reporting hazard ratios (HRs) with 95% confidence intervals (CIs). Model 1 was adjusted for age, sex and ethnicity. Model 2 was additionally adjusted for BMI, TDI, alcohol intake frequency, smoking status, dietary habits, physical activity, anxiety, depression, and medication history of antibiotics and NSAIDs. Given the limited number of incident cases within each sex, sex-stratified analyses were not performed, but sex was included as a covariate in all models. All covariates were obtained from baseline data, and the proportion of missing values was below 20%. Missing covariate data were imputed using multiple imputation by chained equations. Bonferroni correction was applied to assess the significance of the associations, accounting for the total number of proteins analyzed (n = 2736).
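The Bonferroni step can be illustrated with a short sketch. It assumes the per-protein threshold is the nominal two-sided level (0.05) divided by the 2736 proteins tested, and it retains proteins that pass in both adjustment models. The function names and the toy p-values are hypothetical.

```python
N_PROTEINS = 2736  # number of proteins tested, from the text
ALPHA = 0.05       # nominal two-sided significance level


def bonferroni_threshold(alpha: float = ALPHA, n_tests: int = N_PROTEINS) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def significant_in_both(p_model1: dict, p_model2: dict, threshold=None) -> list:
    """Names of proteins whose p-value is below the threshold in both models."""
    if threshold is None:
        threshold = bonferroni_threshold()
    return sorted(
        name
        for name in p_model1
        if name in p_model2
        and p_model1[name] < threshold
        and p_model2[name] < threshold
    )


print(bonferroni_threshold())  # roughly 1.8e-05
print(significant_in_both({"A": 1e-6, "B": 1e-6}, {"A": 1e-7, "B": 1e-3}))
```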

Enrichment analyses, including Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), were performed to gain deeper insight into the biological functions and pathways underlying the significant proteins identified by the two Cox proportional hazards models. Statistical significance was determined using the hypergeometric test with FDR adjustment for multiple testing.
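For a single GO term or KEGG pathway, the hypergeometric enrichment test reduces to an upper-tail probability. The sketch below is a generic over-representation test with hypothetical argument names. The choice of background universe is an assumption, since the text does not specify it.

```python
from math import comb


def hypergeom_enrichment_p(n_universe: int, n_in_term: int,
                           n_selected: int, n_overlap: int) -> float:
    """P(X >= n_overlap) for X ~ Hypergeometric(n_universe, n_in_term, n_selected).

    n_universe: background proteins; n_in_term: background proteins annotated
    to the term; n_selected: significant proteins; n_overlap: significant
    proteins annotated to the term.
    """
    upper = min(n_in_term, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background proteins, 4 in the term, 3 selected, 2 overlapping.
print(hypergeom_enrichment_p(10, 4, 3, 2))
```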

The important proteins were then identified using machine learning algorithms. Proteins that remained significant after Bonferroni corrections in both model 1 and model 2 were input into an FLAML AutoML framework, wherein LGBM was applied using 5-fold cross-validation with 10 repetitions on the training set. The significant proteins were subsequently ranked according to their mean SHAP values—a measure of feature importance assessing their contribution to model performance. Based on the SHAP ranking, a sequential forward selection approach was employed, where the top 10 proteins were sequentially added, and the protein panel yielding the highest AUC in the training set was ultimately selected, which was then visualized using SHAP plots.
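The panel-selection step can be sketched as a loop over nested panels built from the SHAP ranking, keeping the panel with the highest score. In this sketch `score_panel` is a hypothetical callable standing in for the cross-validated training AUC of a model fitted on the given proteins. The protein names and scores in the usage example are invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    ranked: Sequence[str],
    score_panel: Callable[[List[str]], float],
    max_features: int = 10,
) -> Tuple[List[str], float]:
    """Add features in ranked order and return the best-scoring nested panel."""
    best_panel: List[str] = []
    best_score = float("-inf")
    for k in range(1, min(max_features, len(ranked)) + 1):
        panel = list(ranked[:k])
        score = score_panel(panel)
        if score > best_score:  # ties keep the smaller panel
            best_panel, best_score = panel, score
    return best_panel, best_score


# Toy scorer: AUC peaks at two proteins, then declines slightly.
toy_auc = {1: 0.70, 2: 0.80, 3: 0.79}
print(forward_select(["P1", "P2", "P3"], lambda panel: toy_auc[len(panel)]))
```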

Based on the optimally selected protein panel, we developed a protein predictive model using four machine-learning algorithms (LGBM, XGBoost, RF, and Extra Trees). To compare the predictive efficacy of proteins with other clinical indicators and to assess their combined predictive power, we further filtered the clinical predictor data and derived three sets of variables: demographic information (age), serological indicators (11 measures), and PRS for CD. These variables were selected in the training set using t-tests or chi-squared tests for demographic factors and Cox proportional hazards models for serological indicators. More details are shown in Supplementary Table 15 and Supplementary Data 9. Finally, we established 11 machine-learning models based on the protein panel, demographics, serological indicators, and PRS for CD, alone or in combination. All models were constructed using the same AutoML framework described above. We split the participants into training and testing sets according to recruitment center. To further reduce overfitting, internal 5-fold cross-validation (10 repeats) was performed on the training set, and the maximum number of iterations was capped at 100 with early stopping. Hyperparameter tuning was carried out using the built-in AutoML search to identify the hyperparameters that maximized performance. Class imbalance was addressed using automated balanced class weighting (scikit-learn’s ‘balanced’ mode), with the computed weights passed to AutoML training via the sample_weight parameter. The area under the receiver operating characteristic (ROC) curve (AUC) was calculated to assess model performance for predicting CD in the geographically distinct UKB testing set and the EPIC-Norfolk study, and for distinguishing CD patients from controls in the cross-sectional Southern China cohort. The training and testing performance of each model was averaged over 10 repeated runs, each using the bootstrap method (1000 iterations per run). Mann–Whitney U tests were used to determine whether AUC values differed significantly between models. Furthermore, we compared our predictive model with that developed by the PREDICTS study.
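Two ingredients of this evaluation can be illustrated in plain Python: balanced sample weights and a bootstrap estimate of the AUC. The weight formula, n_samples / (n_classes × class count), is the one scikit-learn uses for its 'balanced' mode. The AUC and bootstrap functions are simplified stand-ins with hypothetical names, not the study's code, and labels are assumed to be coded 0/1.

```python
import random
from collections import Counter


def balanced_sample_weights(labels):
    """Per-sample weights: n_samples / (n_classes * count of the sample's class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]


def roc_auc(labels, scores):
    """AUC as the probability that a case outranks a control (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_iter=1000, seed=0):
    """Mean AUC over bootstrap resamples that contain both classes."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples with a single class
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    return sum(aucs) / len(aucs)


print(balanced_sample_weights([0, 0, 0, 1]))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise AUC is quadratic in sample size and is meant only to make the definition explicit; a rank-based implementation would be preferable for cohort-scale data.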

To further evaluate the generalizability of the above models and the robustness of their predictive accuracy, we conducted five sensitivity analyses: (1) randomly splitting the study population into an 80% training set (42,264 samples) and a 20% testing set (10,632 samples) according to the recruitment centers (Supplementary Table 16); (2) conducting the analyses after matching the incident CD and control populations by age, sex and ethnicity in a 1:1 ratio; (3) repeating the analyses across different incidence time groups, ranging from 10 to 16 years; (4) repeating the analyses after excluding individuals who developed CD in the first 2 years of follow-up; (5) repeating the association analyses between the nine selected proteins and CD using Cox models with additional adjustment for key chronic inflammatory comorbidities48,49, including rheumatoid arthritis, diabetes mellitus, asthma, psoriasis, ankylosing spondylitis, multiple sclerosis, and ischemic heart disease, to assess potential confounding by comorbid inflammatory conditions.
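Sensitivity analysis (2) relies on 1:1 matching by age, sex and ethnicity. The text does not state the matching algorithm, so the sketch below assumes exact matching without replacement and a random choice among eligible controls. The function name and the record fields are hypothetical.

```python
import random
from collections import defaultdict


def match_one_to_one(cases, controls, keys=("age", "sex", "ethnicity"), seed=0):
    """Pair each case with one unused control sharing the same key values."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for control in controls:
        pool[tuple(control[k] for k in keys)].append(control)
    for stratum in pool.values():
        rng.shuffle(stratum)  # random choice among eligible controls
    pairs = []
    for case in cases:
        stratum = pool.get(tuple(case[k] for k in keys))
        if stratum:  # cases without an eligible control stay unmatched
            pairs.append((case, stratum.pop()))
    return pairs


cases = [{"id": 1, "age": 50, "sex": "F", "ethnicity": "White"}]
controls = [
    {"id": 10, "age": 50, "sex": "F", "ethnicity": "White"},
    {"id": 11, "age": 61, "sex": "M", "ethnicity": "White"},
]
print(match_one_to_one(cases, controls))
```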

We also assessed model calibration using calibration slope and calibration curves50,51 (Supplementary Table 17 and Supplementary Fig. 10). In the 1:1 matched setting, calibration plots showed good agreement between predicted and observed outcomes, with slopes close to 1. In the unmatched population, models trained with class weighting tended to overestimate absolute risk. This is likely because class weighting increases the penalty for false negatives during training, encouraging the model to assign higher predicted probabilities to rare positive cases and potentially leading to calibration bias. Models trained without class weighting showed improved calibration but assigned lower predicted risks to true cases, which may limit sensitivity. Notably, we prioritized class weighting over perfect calibration to enhance sensitivity, consistent with our goal of building a screening-oriented model rather than a diagnostic tool. Subsequently, Kaplan–Meier survival curves were generated to visualize how the protein model stratifies the risk of CD onset. The model probability cutoff value for the protein model was determined by maximizing the Youden index using the training dataset. The same cutoff was applied to the testing dataset. Cox proportional hazard models were then applied to assess the associations between the protein model and CD risk in both the training and testing sets. Additionally, participants were categorized into low or high-protein level groups based on the median score of individual proteins within the model to further evaluate their role in the clinical progression of CD over time.
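The cutoff selection described above can be sketched as a search for the threshold that maximizes the Youden index J = sensitivity + specificity − 1 on the training predictions. The function name is hypothetical. Classifying a participant as high risk when the predicted probability is at or above the cutoff is an assumption.

```python
def youden_cutoff(labels, scores):
    """Return (cutoff, J) maximizing sensitivity + specificity - 1.

    A sample is classified as high risk when score >= cutoff.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, float("-inf")
    for cut in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cut)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cut)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # ties keep the lowest cutoff
            best_cut, best_j = cut, j
    return best_cut, best_j


cutoff, j_stat = youden_cutoff([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(cutoff, j_stat)
```

The cutoff is derived once from the training set and then reused unchanged on the testing set, for example `high_risk = [s >= cutoff for s in test_scores]`, where `test_scores` is a hypothetical list of testing-set predictions.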

Lastly, we performed linear regression analyses to assess the cross-sectional associations between the levels of the nine proteins in the protein model and CD-related phenotypes. Protein levels were treated as the outcome variables, while phenotypes such as obesity, physical activity, smoking status, alcohol intake frequency, dietary habits, depression, and anxiety served as explanatory variables. All models were adjusted for age, sex, and ethnicity. The FDR method was applied to correct for multiple testing across all exposures and outcomes.
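The text refers only to "the FDR method". Assuming this is the Benjamini–Hochberg procedure, the adjustment across all protein–phenotype tests can be sketched as follows; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```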

A two-sided significance level of P < 0.05 was used. Data analyses were implemented with FLAML (v.2.1.1), LightGBM (v.4.3.0), XGBoost (v.1.6.2), scikit-learn (v.1.1.3), SHAP (v.0.45.0), and SciPy (v.1.10.1) under Python (v.3.10.12), and with survival (v.3.7-0), dplyr (v.1.1.4), and tableone (v.0.13.2) under R (v.4.2.1).

Ethics approval

Ethical approval for this study was obtained from the North West Multi-Center Research Ethics Committee for the UK Biobank (Approval numbers: 11/NW/0382, 16/NW/0274, and 21/NW/0157; Project number: 83339), the Norfolk Research Ethics Committee for the EPIC-Norfolk study (Approval number: 05/Q0101/191), and the Ethics Review Committee of Guangdong Provincial People’s Hospital (Approval number: KY-Z-2021-021-04). Written informed consent was obtained from all participants. No participant compensation was provided.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.