Introduction

Ulcerative colitis (UC) is a chronic inflammatory bowel disease primarily characterized by abdominal pain, diarrhea, and colorectal bleeding1. It often requires lifelong treatment and is associated with various complications during disease progression, significantly affecting patients’ quality of life and imposing a substantial economic burden2. In recent years, the global incidence of UC has steadily increased3,4. To date, the exact etiology of UC remains unclear, but it is widely believed to be closely related to genetic factors, gut microbiota, and immune responses5. Early diagnosis is crucial for timely symptom management and prevention of complications. However, current diagnostic methods are inadequate for meeting early detection needs6. Gastrointestinal endoscopy and histopathological biopsy are the primary diagnostic tools for UC, but they often fail to provide a definitive diagnosis in cases with atypical endoscopic or pathological features, leading to delayed treatment7. Therefore, investigating potential biomarkers for UC holds significant value for early diagnosis and personalized treatment.

Autophagy is generally considered a self-protective cell mechanism that enables the clearance of damaged proteins, organelles, and invading pathogens, which is critical for maintaining intracellular homeostasis8. The mechanisms of autophagy are primarily classified into three types: chaperone-mediated autophagy, microautophagy, and macroautophagy9. Autophagy is closely associated with the development and progression of neurodegenerative diseases10, metabolic disorders11, and cancers12. Autophagy can regulate the apoptosis of intestinal epithelial cells and maintain the function of the intestinal epithelial barrier13. Dysregulation of autophagy may lead to intestinal homeostasis imbalance, gut microbiota dysbiosis, and exacerbation of intestinal inflammation14. Given that autophagy plays a vital role in regulating intestinal homeostasis, we hypothesized that autophagy may participate in the pathogenesis and progression of UC. We employed bioinformatics techniques to investigate key genes associated with autophagy and their relevance to UC. Our objective was to elucidate the relationship between autophagy-related genes (ARGs) and the development of UC and provide novel targets and insights to facilitate early diagnosis and personalized treatment.

Materials and methods

Data acquisition

Data were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). GSE87466 (ref. 15) and GSE92415 (ref. 16) were selected as the training set: GSE87466 contained 87 UC samples and 21 normal samples, and GSE92415 contained 162 UC samples and 21 normal samples. GSE38713 (ref. 17) and GSE75214 (ref. 18) were selected as the validation set, with the former containing 30 UC samples and 13 normal samples, and the latter containing 74 UC samples and 11 normal samples (data from Crohn’s disease samples and inactive patient samples were excluded; Table 1). The ARG set was obtained from the Human Autophagy Database (HADb; http://www.autophagy.lu/index.html).

Table 1 Details of datasets.

Data processing

The probe IDs of each gene in the dataset downloaded from the GEO database were mapped to the gene symbols. When multiple probe IDs corresponded to the same gene, the average expression value of the corresponding probe IDs was taken as the expression value of the gene. If there were missing data in the dataset, the entry was deleted. In order to reduce interference caused by technical reasons, the limma package was used to normalize the data. After merging the GSE87466 and GSE92415 datasets, the removeBatchEffect function was used to remove the batch effect of both datasets and construct a unified gene dataset. The same data processing method was used to process the GSE38713 and GSE75214 datasets for subsequent validation of the results.
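The probe-collapsing step described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the authors' R pipeline; the function name `collapse_probes`, the dictionary-based data layout, and the toy probe and gene identifiers are all assumptions introduced here.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(expr, probe_to_gene):
    """Average probe-level rows into one row per gene symbol.

    expr: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene symbol.
    Probes with a missing value (None) or without a gene symbol are dropped.
    """
    grouped = defaultdict(list)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None or any(v is None for v in values):
            continue  # delete entries with missing data or no annotation
        grouped[gene].append(values)
    # Average sample by sample across all probes that map to the same gene.
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Toy data (hypothetical identifiers): two probes map to GENE_A.
expr = {
    "p1": [2.0, 4.0],
    "p2": [4.0, 6.0],
    "p3": [1.0, 1.0],
    "p4": [None, 3.0],  # dropped because of the missing value
}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B", "p4": "GENE_C"}
collapsed = collapse_probes(expr, probe_to_gene)
print(collapsed)
```

Normalization and batch-effect removal are not reproduced in this sketch.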

Screening and functional enrichment of UC and autophagy-related differentially expressed genes

The limma package was used to identify differentially expressed genes (DEGs) between UC samples and normal samples in the combined dataset, with P < 0.05 and |logFC| > 0.5 used as the criteria for DEG selection. We used the ggplot2 and pheatmap packages to visualize the DEGs as volcano plots and clustered heat maps. The intersect function was used to identify the overlap between the selected DEGs and 222 ARGs, yielding the genes associated with both UC and autophagy. The results were visualized using a Venn diagram. To explore the biological functions of the DEGs and their associated pathways, gene ontology (GO)19 and Kyoto Encyclopedia of Genes and Genomes (KEGG)20 enrichment analyses were performed using the clusterProfiler package. Enriched pathways were selected based on a threshold of P < 0.05, and the results were visualized. Gene sets were analyzed using the GSVA package21 to investigate the enrichment of sample genes in hallmark gene sets, which were obtained from the MSigDB database. The normalized enrichment score was used to assess the magnitude of enrichment, with a value greater than 0 indicating a positive association with the pathway, and a value less than 0 indicating a negative association.
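The DEG thresholds and the intersection with the ARG set can be sketched as follows. This is an illustrative Python sketch rather than the limma-based analysis itself; the function `select_degs`, the gene labels, and the statistics in `stats` are hypothetical values chosen only to show how the P < 0.05 and |logFC| > 0.5 criteria act.

```python
def select_degs(stats, p_cutoff=0.05, lfc_cutoff=0.5):
    """Split genes into up- and downregulated DEGs.

    stats: dict mapping gene -> (logFC, P value), e.g. taken from a
    differential expression table. A gene is a DEG when P < p_cutoff
    and |logFC| > lfc_cutoff.
    """
    up = {g for g, (lfc, p) in stats.items() if p < p_cutoff and lfc > lfc_cutoff}
    down = {g for g, (lfc, p) in stats.items() if p < p_cutoff and lfc < -lfc_cutoff}
    return up, down


# Hypothetical statistics: gene -> (logFC, P value).
stats = {
    "G1": (1.2, 0.001),   # upregulated DEG
    "G2": (-0.8, 0.01),   # downregulated DEG
    "G3": (0.3, 0.0001),  # significant, but |logFC| too small
    "G4": (2.0, 0.2),     # large change, but not significant
    "G5": (0.9, 0.03),    # upregulated DEG, not autophagy-related
}
arg_set = {"G1", "G2", "G4", "G9"}  # hypothetical autophagy-related genes

up, down = select_degs(stats)
de_args = (up | down) & arg_set  # differentially expressed ARGs
print(sorted(up), sorted(down), sorted(de_args))
```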

Machine learning model construction and identification of target genes

To identify potential biomarkers for UC, three machine learning models were employed to further screen the differentially expressed ARGs (DE-ARGs): least absolute shrinkage and selection operator (LASSO) regression, random forest (RF) regression, and support vector machine-recursive feature elimination (SVM-RFE). The LASSO algorithm was used to find the best model by introducing λ (the shrinkage parameter), and the target genes were screened by 10-fold cross-validation. The LASSO algorithm was run using the glmnet package in R. The RF algorithm was constructed using the randomForest package, which is designed to produce smaller fluctuations in DEG prediction. The RF algorithm was constructed with 500 decision trees, employing leave-group-out cross-validation with classification accuracy as the evaluation metric. The SVM-RFE model was constructed using the e1071 and sigFeature packages. The SVM-RFE model scores and ranks each gene, removes the gene with the smallest score, and then refits the model using the remaining genes; this procedure is repeated until an optimal model is obtained. The SVM-RFE model was implemented using a radial basis function kernel, with 5-fold cross-validation used to identify the optimal DEG set. The genes identified by each algorithm were intersected to obtain the characteristic genes associated with UC and autophagy. The final target genes were screened based on the importance score of each gene in the model.
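The recursive elimination loop and the final intersection of the three gene lists can be sketched in a few lines. This Python sketch is a simplified stand-in, not the e1071/sigFeature implementation: `score_fn` is a hypothetical placeholder for refitting a model on the remaining genes and returning their importance scores, and the gene labels and the LASSO and RF selections are invented for illustration.

```python
def recursive_feature_elimination(features, score_fn):
    """Rank features by repeatedly dropping the lowest-scoring one.

    score_fn(remaining) must return a dict feature -> importance score,
    recomputed on the features that are still in the model.
    Returns the features ordered from most to least important.
    """
    remaining = list(features)
    eliminated = []
    while remaining:
        scores = score_fn(remaining)
        worst = min(remaining, key=lambda f: scores[f])
        remaining.remove(worst)
        eliminated.append(worst)
    return eliminated[::-1]  # last eliminated = most important


# Assumption: fixed toy weights stand in for a refitted model's scores.
toy_weights = {"A": 0.9, "B": 0.1, "C": 0.5}


def toy_score_fn(remaining):
    return {f: toy_weights[f] for f in remaining}


ranking = recursive_feature_elimination(["A", "B", "C"], toy_score_fn)

# Intersect the selections of the three methods (hypothetical gene sets).
lasso_genes = {"A", "C"}
rf_genes = {"A", "B", "C"}
svm_genes = set(ranking[:2])  # keep the two top-ranked features
consensus = lasso_genes & rf_genes & svm_genes
print(ranking, sorted(consensus))
```

In a real SVM-RFE run the scores change after every refit, which is why the scoring function is called inside the loop rather than once up front.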

Evaluation and validation of targeted gene diagnostic capabilities

The pROC package was used to plot ROC curves for individual target genes, and the area under the curve (AUC) was calculated to evaluate the ability of the target genes to distinguish between UC samples and normal samples. Similarly, the discriminative performance of the target genes was validated using the GSE38713 and GSE75214 datasets.
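For a single gene, the AUC equals the probability that a randomly chosen UC sample has a higher expression value than a randomly chosen normal sample, with ties counted as one half. The following Python sketch computes the AUC directly from that definition; it is an illustration under that assumption, not the pROC implementation, and the expression values are made up.

```python
def auc(case_values, control_values):
    """AUC of a single marker via pairwise comparison.

    Counts the fraction of (case, control) pairs in which the case value
    is higher; ties contribute one half.
    """
    wins = 0.0
    for c in case_values:
        for n in control_values:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical expression values of one gene.
uc_samples = [5.1, 6.0, 4.2]
normal_samples = [3.9, 4.5]
gene_auc = auc(uc_samples, normal_samples)
print(round(gene_auc, 3))
```

With this definition, a gene that is lower in UC samples gives an AUC below 0.5, and its discriminative ability is then read as 1 minus that value.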

Animal model establishment and treatment procedures

C57BL/6J mice (male, 6–8 weeks) were supplied by Zhejiang Vital River Laboratory Animal Technology Co., Ltd. (Zhejiang, China) and kept in a specific-pathogen-free room with a 12/12 h light/dark cycle (animal usage license number: SYXK(Su)2022-0070). All animal experiments were conducted in accordance with the ARRIVE guidelines (https://arriveguidelines.org) and were approved by the Institutional Animal Care and Use Committee of Nanjing University of Chinese Medicine (application number: 2023DW-082-01). After 7 days of acclimatization, 36 C57BL/6J mice were randomly divided into six groups: the normal (Ctrl) group (n = 6), and the dextran sulfate sodium (DSS) treatment group (n = 6). To establish the UC model, 3% DSS was administered in the drinking water for 7 days. On the morning of day 15, mice were anesthetized using an isoflurane vaporizer and maintained in a surgical plane of anesthesia, as confirmed by a negative pedal withdrawal reflex test. After confirming adequate anesthesia, mice were humanely euthanized via cervical dislocation. Colonic tissues were immediately collected, rinsed with cold phosphate-buffered saline (PBS), and prepared for subsequent histological and molecular analyses.

Hematoxylin and Eosin (H&E) staining

The colon was fixed with paraformaldehyde and embedded in paraffin, after which 4-μm-thick sections were cut and deparaffinized with xylene and ethanol. After the sections were rinsed with tap water, hematoxylin and eosin (H&E) staining was performed using a standard protocol; sections were soaked in hematoxylin for 5 min and then in eosin for another 5 min. The sections were dehydrated in ethanol and finally cleared with xylene.

Real-time PCR

Total RNA was extracted and used to synthesize cDNA with a cDNA Synthesis Kit. The cycle threshold values were determined using the LightCycler® 96 system (Roche) and a qRT-PCR SYBR Green Kit (Vazyme). β-actin served as the reference gene. All primers were designed and synthesized by Generay Biotech (Shanghai), and their sequences are listed in Table 2.

Table 2 Primer sequences.

Protein-protein interaction network construction

Protein–protein interaction (PPI) networks were constructed to analyze interactions between proteins encoded by DE-ARGs, using the STRING online database (https://cn.string-db.org/). A medium confidence level of 0.4 was set.

Immune infiltration analysis

Immune infiltration analysis of the dataset was performed by using the GSVA package to evaluate the enrichment levels of immune cells in different groups, and boxplots were generated. To assess the correlation between the target genes and immune infiltration levels, a correlation analysis was conducted, and a correlation heatmap was created using the pheatmap package. The immunity-related gene dataset was obtained from http://cis.hku.hk/TISIDB/data/download/CellReports.txt.

Statistical analysis

All data processing, statistical analysis, and plotting were performed using R software (version: 4.3.2). Differences between the two groups of samples were analyzed using the Wilcoxon rank sum test. Pearson correlation analysis was used to determine the correlation between immune cells and marker genes. The level of statistical significance was set at P < 0.05.
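The Pearson correlation used to relate target gene expression to immune cell levels can be written out explicitly. The sketch below is plain Python for illustration only, since the analysis itself was run in R; the function name `pearson` and the paired values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance numerator and the two standard-deviation terms.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical paired values: gene expression vs. an immune cell score.
gene_expression = [1.0, 2.0, 3.0, 4.0]
immune_score = [2.0, 4.0, 6.0, 8.0]
r = pearson(gene_expression, immune_score)
print(r)
```

A value near 1 indicates that the gene and the immune cell score rise together across samples, and a value near -1 indicates an inverse relationship.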

Results

Data processing

The GSE87466 and GSE92415 datasets were downloaded from the GEO database as the training set, and GSE38713 and GSE75214 as the validation set. Each dataset was checked with a function to determine whether log2 transformation was needed; no log2 transformation was applied. To eliminate variation in expression due to experimental techniques and other factors, we used the limma package to normalize the data across all datasets and used boxplots to inspect baseline levels (Fig. 1a). The data from GSE87466 and GSE92415 were merged, and batch effects were removed using the limma package (Fig. 1b, Supplementary Fig. 1).

Screening for differentially expressed genes

The limma package was used to construct the comparison matrix, and the eBayes function within the package was applied to fit a linear model for identifying DEGs. The criteria for DEG selection were P < 0.05 and |logFC| > 0.5. In UC patients, 1,607 and 1,262 genes were upregulated and downregulated, respectively, compared with their expression levels in the normal population. The ggplot2 package was employed to create a volcano plot of the DEGs (Fig. 1c). The top 10 most significant DEGs among the upregulated and downregulated genes were selected, and the pheatmap package was used to generate a heatmap illustrating the clustering of these genes (Fig. 1d). As shown, the expression levels of the top 20 genes differed distinctly between UC samples and normal samples. To explore the relationship between UC and autophagy, the identified DEGs were intersected with ARGs, resulting in a total of 37 DE-ARGs. Among these, 28 genes were upregulated and 9 were downregulated in UC patient samples (Fig. 1e).

Fig. 1

Data preprocessing and differentially expressed gene visualization. (a) Boxplot after data preprocessing. (b) Boxplot after data merging and batch effect removal. (c) Volcano plot of differentially expressed genes between UC patients and normal population, with red representing upregulated genes and blue representing downregulated genes. (d) Heat map illustrating clustering of the top 20 differentially expressed genes. The color gradient from red to blue represents the change from high to low expression. (e) Venn diagram of the intersection between UC differentially expressed genes and autophagy-related genes. Represented are differentially expressed genes (DEGs, red), autophagy-related genes (ARGs, green), and their intersection (overlap).

Functional enrichment analysis

To explore the biological properties of the DEGs, GO enrichment analysis was performed on upregulated and downregulated genes. The graph shows the top six pathways with the most significant differences in biological process (BP), cellular component (CC), and molecular function (MF) terms, at a significance level of P < 0.05 (Figs. 2a, b). Based on the enrichment analysis graph, upregulated DEGs were mainly involved in the cellular response to organic substance, response to organic substance, apoptotic process, programmed cell death, cell death, and response to chemicals, and they were mainly localized in CCs such as the aggresome, endoplasmic reticulum lumen, inclusion body, and canonical inflammasome complex. Enriched MFs included cytokine activity, molecular function regulator activity, and misfolded protein binding. The downregulated genes were mainly associated with protein autophosphorylation, vacuole organization, autophagy, process utilizing autophagic mechanism, macroautophagy, and autophagosome assembly and were mainly localized in the autophagosome and vacuole. They were also associated with protein kinase activity, phosphotransferase activity (alcohol group as acceptor), kinase activity, and transferase activity transferring.

Functional features and related pathways of the DEGs were analyzed using KEGG pathway enrichment analysis, and the top 10 entries were visualized. The upregulated DEGs were associated with PD-L1 expression and the PD-1 checkpoint pathway in cancer, the NOD-like receptor signaling pathway, and viral protein interaction with cytokines and cytokine receptors. The downregulated DEGs were enriched in the autophagy-animal pathway (Figs. 2c, d).

Gene set variation analysis (GSVA) differs from the GO and KEGG pathway enrichment analyses. Whereas GO and KEGG analyze individual DEGs, GSVA analyzes the entire set of genes and assesses the enrichment differences between two sets of samples across hallmark gene sets. GSVA preserves all the genes, maps the gene expression data to a predefined set of genes, calculates enrichment scores for each sample in each gene set, and sorts the genes based on the enrichment score (Fig. 2e). UC-related genes were enriched in TNFA signaling via NFKB, epithelial mesenchymal transition, and inflammatory response pathways and were negatively correlated with bile acid metabolism, xenobiotic metabolism, and oxidative phosphorylation pathways.

Fig. 2

Gene ontology (GO) enrichment analysis, Kyoto Encyclopedia of Genes and Genomes20,22,23 (KEGG) pathway enrichment analysis, and gene set variation analysis (GSVA). (a) GO enrichment analysis of upregulated differentially expressed genes (DEGs). The color gradient from blue to red indicates increasing significance of the differences. (b) GO enrichment analysis of downregulated DEGs. The color gradient from blue to red indicates increasing significance of the differences. (c) KEGG pathway enrichment analysis of upregulated DEGs. The color gradient from blue to red indicates increasing significance of the differences. (d) KEGG pathway enrichment analysis of downregulated DEGs. Enrichment of downregulated genes was significant only in the autophagy pathway. (e) GSVA of DEGs. The heatmap visualizes t-values derived from GSVA scores, representing the activity levels of various biological pathways. Positive t-values (right side, red color scale) indicate upregulated pathways in ulcerative colitis, while negative values (left side, blue color scale) denote downregulated pathways.

Machine learning to identify target genes

To identify genes with the most significant predictive power, we used LASSO regression analysis, RF regression, and SVM-RFE. In the LASSO regression analysis, after 10-fold cross-validation, we identified 0.0018 as the optimal λ value, and 14 characteristic genes were screened from 37 DE-ARGs (Fig. 3a). In the RF algorithm, 19 characteristic genes were screened, and importance ranking histograms were plotted by constructing 1000 decision trees (Fig. 3b). When using SVM-RFE to screen DE-ARGs, we performed 5-fold cross-validation and repeated the model construction 5 times to produce characteristic gene selection schemes incorporating between 1 and 37 genes. Error was minimized when the model incorporated all the genes. A ranked histogram (Fig. 3c) was plotted based on the importance of each gene in the model. The feature genes obtained using the three algorithms were intersected, and 11 feature genes (PEA15, HSPA5, SERPINA1, CASP4, CASP1, CCL2, CX3CL1, BAG3, ULK3, TP53INP2, and DAPK2) were ultimately obtained (Fig. 3d). By combining the three algorithms to rank the importance of the DEGs, we found that SERPINA1, PEA15, CASP4, and CASP1 had the greatest importance in all three algorithms. Thus, we used these four genes as target genes for subsequent analysis and exploration.

Fig. 3

Differentially expressed gene (DEG) selection using LASSO regression analysis, random forest (RF) regression, and support vector machine-recursive feature elimination (SVM-RFE). (a) DEG selection using LASSO regression analysis. (b) DEG selection using RF regression. The right-hand figure shows the ranking of DEG importance determined by RF regression. (c) DEG selection using SVM-RFE. The figure below shows the ranking of DEG importance determined by SVM-RFE. (d) Venn diagram of the intersection of the DEGs selected by the three machine learning methods.

Target gene expression and diagnostic performance evaluation

Using boxplots, we evaluated the differential expression of the four target genes between UC patients and a healthy population (Fig. 4a). The boxplots revealed significant differential expression for all four genes between the two groups. To assess the discriminative ability of the target genes, we constructed ROC curves (Fig. 4d). The AUC values for all four genes exceeded 0.9, demonstrating their strong discriminative power.

Diagnostic performance validation of target genes

To validate the discriminative performance of the four target genes, analyses were conducted using the GSE38713 and GSE75214 datasets. Boxplot analysis revealed that in both external validation sets (Fig. 4b, c), the expression trends of the four target genes were consistent with those observed in the training set and differed significantly between UC samples and normal samples (P < 0.05). ROC curve analysis showed that the AUCs for the four target genes exceeded 0.8, demonstrating their strong discriminative performance (Fig. 4e).

Fig. 4

Boxplots and receiver operating characteristic (ROC) curve of differentially expressed autophagy-related genes (DE-ARGs) from the training set (the merged dataset of GSE87466 and GSE92415) and test sets (GSE38713 and GSE75214). (a) Boxplot of DE-ARG (PEA15, SERPINA1, CASP1, and CASP4) expression levels in the training set, comparing the UC group (blue) and the normal group (red). (b) Boxplot of DE-ARG (PEA15, SERPINA1, CASP1, and CASP4) expression levels in the validation set GSE38713. (c) Boxplot of DE-ARG (PEA15, SERPINA1, CASP1, and CASP4) expression levels in the validation set GSE75214. (d) ROC curve illustrating the classification performance of the four DE-ARGs in the training set. (e) ROC curve illustrating the classification performance of the four DE-ARGs in the validation set GSE38713 and GSE75214, where a larger area under the curve (AUC) indicates better predictive performance and discriminative ability of the gene.

Animal experiment validation

To validate the findings of the machine learning analysis, we established a DSS-induced UC mouse model and assessed disease severity using indicators such as colon length and body weight changes. The DSS intervention significantly shortened colon length, induced colonic congestion and edema, and markedly reduced body weight (Fig. 5a, b). Histological analysis via H&E staining revealed notable pathological features in the DSS group, including crypt loss, inflammatory cell infiltration, and severe mucosal damage (Fig. 5c). Furthermore, immunofluorescence staining showed a significant reduction in the expression of MUC2, a key intestinal mucosal barrier protein, in the colonic tissue of DSS-treated mice (Fig. 5d), indicating compromised intestinal barrier integrity. To further explore molecular mechanisms, we performed qPCR analysis to examine the mRNA expression levels of Serpine1, Pea15a, Caspase-4, and Caspase-1 in colonic tissues (Fig. 5e). The results revealed that the expression levels of these genes were significantly upregulated in the DSS group compared with those in the control group, suggesting their potential involvement in the progression of UC and highlighting them as potential therapeutic targets.

Fig. 5

Experimental validation of differential gene expression in mice. (a) Typical colon anatomy of mice in each group. (b) Changes in the weight of mice throughout the experiment (two-way ANOVA). (c) H&E staining of colon tissue; scale bar, 100 μm. (d) Immunofluorescence staining was used to show MUC2 expression in the colon. (e) The relative expression of Serpine1, Pea15a, Casp4, and Casp1 in colon tissue was detected by real-time PCR (n ≥ 6). *P < 0.05 vs. DSS group, **P < 0.01 vs. DSS group, ***P < 0.001 vs. DSS group, ****P < 0.0001 vs. DSS group.

PPI network construction

We used the STRING online database (https://cn.string-db.org/) to construct PPI networks of the DE-ARGs and the 11 target genes (Fig. 6a, b) and to identify the interrelationships among the proteins encoded by the genes. In the PPI network of DE-ARGs, we removed genes that were not linked to any other genes, set a confidence level of 0.4, and clustered the genes. SERPINA1, CASP4, and CASP1 were all associated with HSPA5 in the network map of target genes. We hypothesized that HSPA5 is involved in a key process in the pathogenesis of ulcerative colitis, which requires further investigation.

Immune infiltration analysis

To investigate the differences in the enrichment levels of immune cells in different samples, we performed immune infiltration analysis using the GSVA package and drew a clustered heatmap to show the ssGSEA results (Fig. 6c). According to the clustered heatmap, the levels of immune cells differed significantly between UC samples and normal samples. We plotted a grouped boxplot of the degree of immune cell infiltration (Fig. 6d) and found that, except for memory B cell levels, the levels of the other 27 immune cells differed significantly between UC and normal samples, and the levels of most immune cells were higher in UC samples. Finally, we constructed a correlation heatmap to assess correlations between the expression of the four target genes and the levels of immune cells (Fig. 6e). The expression of the four target genes was positively correlated with the levels of most immune cells (except for CD56dim natural killer [NK] cells), with a high degree of correlation.

Fig. 6

Protein-protein interaction (PPI) network and immune infiltration analysis. (a) PPI network diagram illustrating the interactions among differentially expressed autophagy-related genes (DE-ARGs). The nodes represent genes or proteins, and differently colored nodes indicate distinct functional modules or clusters. (b) PPI Network of 11 target genes. (c) Clustering heatmap presenting immune cell infiltration levels across different samples. The color gradient from blue to red represents infiltration levels from low to high. (d) Boxplot comparing the expression levels of different immune cells between the normal group (blue) and the ulcerative colitis (UC) group (red). (e) Heatmap illustrating the correlation between the target genes (PEA15, CASP4, CASP1, and SERPINA1) and the infiltration levels of different immune cells. The color gradient from blue to red represents the degree of correlation from negative to positive, with deeper colors indicating stronger correlations.

Discussion

The incidence of UC has been rising steadily4, establishing it as a globally prevalent disease that poses a significant threat to health. However, current clinical treatments for UC, such as sulfasalazine, are associated with severe adverse effects24, further compounding the treatment burden for patients. Thus, identifying novel targets for UC diagnosis and treatment is crucial for improving disease management and enhancing patients’ quality of life.

A stable autophagic process is crucial for maintaining intestinal epithelial homeostasis, and the imbalance of intestinal epithelial homeostasis is considered a key factor in the development of UC. Therefore, the stability of the autophagic process may play an important role in the prevention and treatment of UC. Autophagy may undergo dynamic regulation in UC. For instance, one study showed that DSS-induced inflammation enhances autophagy in acute colitis, while curcumin and resveratrol exert protective effects on the intestinal mucosa by reducing autophagy25. Conversely, another study demonstrated that resveratrol alleviates intestinal mucosal barrier dysfunction in mice with DSS-induced chronic colitis by enhancing autophagy26. Further investigation into the bidirectional regulatory mechanisms of autophagy may not only provide deeper insights into the pathogenesis of UC but also offer novel molecular targets for its diagnosis and treatment.

In this study, genetic data of UC cases were obtained from the GEO public database. Using the GSE87466 and GSE92415 datasets, DEGs were identified based on the criteria P < 0.05 and |logFC| > 0.5. These DEGs were then intersected with ARGs, resulting in 37 DE-ARGs. Further screening using three machine learning methods yielded 11 DE-ARGs, and based on importance ranking, 4 top-ranked genes (PEA15, SERPINA1, CASP4, and CASP1) were selected as target genes for subsequent analysis. Finally, immune infiltration analysis was performed on the four target genes to investigate their immune-related associations.

GO and KEGG pathway enrichment analyses were performed separately for the upregulated and downregulated DE-ARGs. Pathways enriched for the upregulated genes were predominantly associated with cell death and protein structural alterations, processes highly involved in immunity and cancer. It is well established that prolonged mucosal inflammation in UC patients increases their risk of developing colorectal cancer27. Therefore, further analysis of these upregulated DE-ARGs may help uncover the mechanisms underlying the transition from inflammation to cancer and provide strategies for early intervention to reduce the incidence of colitis-associated colorectal cancer. The downregulated genes were found to primarily function in protein kinase and transferase activities, which phosphorylate amino acid residues, altering protein structure and interactions. Inhibition of protein phosphorylation has been shown to reduce inflammation in UC. For example, the FPR agonist Cmpd43 suppresses phosphorylation of cAMP-responsive element-binding proteins, reducing pro-inflammatory mediator secretion28. Similarly, EFHD2 inhibits Cofilin phosphorylation, preventing TNF-induced epithelial apoptosis and protecting the intestines29. As kinase pathways regulate inflammation and influence cancer progression30, targeting these genes may offer novel therapeutic strategies for UC.

In addition, we performed GSVA of the entire gene set. The occurrence of UC was positively correlated with TNFα signaling via the NF-κB pathway, epithelial-mesenchymal transition, and inflammatory response pathways, which were highly enriched in the UC group. These findings suggest that the activation of these pathways may play a significant role in the progression of UC. In contrast, compared with the UC group, the normal group showed enrichment in bile acid metabolism, xenobiotic metabolism, and oxidative phosphorylation pathways. The low expression of genes associated with these pathways in the UC group may be an important factor contributing to the pathogenesis of UC.

We applied three machine learning algorithms to develop a feature selection model for gene screening. The feature genes identified by the three algorithms were intersected to determine common candidates. The identification of the four target genes PEA15, SERPINA1, CASP4, and CASP1 was primarily based on two key criteria: (1) their top-ranking positions in our analytical results and (2) existing literature evidence supporting the relevance of three of these genes in related studies, which further corroborates the reliability of our findings. The inclusion of PEA15 was particularly warranted by limited existing research, prompting thorough investigation and discussion to potentially uncover novel biological significance.

To further validate the discriminatory performance of the target genes, we used the GSE38713 and GSE75214 datasets as external validation sets. The boxplots demonstrated that, in both validation datasets, the expression trends of the four target genes were consistent with those observed in the training set and showed significant differential expression between UC and normal samples (P < 0.05). ROC curve analysis further demonstrated that the AUC values for the target genes exceeded 0.8 in both validation datasets, confirming their robust discriminatory performance.

In addition, a DSS-induced UC mouse model was established to further validate the expression differences in the four identified genes. qPCR analysis revealed that Pea15, Casp1, Serpina1, and Casp4 were significantly upregulated in the colonic tissues of model mice compared with those in the control group (P < 0.05). These findings not only corroborate the results obtained from prior analyses but also highlight the potential involvement of these genes in the pathophysiology of UC. Given their elevated expression levels in the disease state, these four genes may represent promising molecular targets for therapeutic intervention in UC. Further studies are warranted to elucidate their specific roles and underlying mechanisms in UC pathogenesis.

The four target genes analyzed in this study all participate in the occurrence and progression of the disease by influencing inflammatory responses. Søndergaard et al. found that levels of alpha-1 antitrypsin, encoded by SERPINA1, can be used to differentiate between mild, moderate, and severe ulcers31. SERPINA1 is also linked to cancer progression, with high expression levels in colon cancer associated with poor prognosis due to enhanced STAT3 signaling32. We hypothesized that SERPINA1 overexpression may exacerbate UC progression by enhancing the STAT3 signaling pathway, thereby promoting chronic inflammation and epithelial barrier dysfunction. Its elevated expression in colorectal cancer suggests a potential role in inflammation-driven carcinogenesis associated with long-standing UC. It has been identified as a potential early diagnostic marker for colorectal cancer33 and is also implicated in thyroid, gastric, and breast carcinomas34,35,36,37.

CASP1 and CASP4, which belong to the cysteine-aspartic protease (caspase) family, are involved in inflammatory and immune responses. CASP1, upon activation via the NLRP3 inflammasome, promotes the maturation of cytokines such as IL-1β and IL-18 (refs. 38, 39). Inhibiting CASP1 activity with disulfiram and Cu²⁺ has been shown to reduce inflammation in UC40. CASP4 plays a key role in intestinal inflammation, is highly expressed in colorectal cancer, and may serve as a therapeutic target41. CASP1 and CASP4 trigger pyroptosis, a form of programmed cell death42, by cleaving GSDMD, thereby activating the pro-inflammatory cytokines IL-1β/IL-18, exacerbating inflammatory responses, and further aggravating intestinal barrier damage43. Another study confirms that CASP1 serves as a key biomarker in pyroptosis of intestinal epithelial cells in inflammatory bowel disease44. As both genes are involved in cellular pyroptosis, the role of this process in UC requires further study.

PEA15 (phosphoprotein enriched in astrocytes 15) is a death effector domain-containing protein implicated in multiple cancers. It has been integrated into a prognostic prediction model for gastric cancer, where it demonstrates strong predictive power for prognosis and treatment response45. Additionally, PEA15 has been identified as a potential biomarker for renal cell carcinoma and hepatocellular carcinoma46,47 and holds promise as a therapeutic target for various diseases48. According to our findings, PEA15 consistently ranked highly in importance scores across all three algorithms. However, its specific role in the development of UC remains unclear. Previous studies suggest that PEA15 may be associated with the ERK1/2-related pathway48. Therefore, we hypothesize that PEA15 may modulate mucosal inflammation by altering the ERK1/2 signaling pathway in UC. PEA15 merits further investigation and may represent a novel therapeutic target for UC.

These four genes likely contribute to UC through distinct yet interconnected pathways. Further functional studies are needed to validate their roles and therapeutic potential.

Finally, we performed immune infiltration analysis to evaluate differences in immune cell enrichment across different samples. Notably, CD56dim NK cell infiltration levels were lower in UC samples than in normal samples, whereas the levels of the other immune cells were higher in UC samples. The correlation heatmap further illustrated the relationships and strengths of association between the target genes and immune cells. Interestingly, the correlation between the target genes and CD56dim NK cells differed from those of the other 27 immune cell types, as the expression of the target genes was negatively correlated with the level of CD56dim NK cells. We hypothesize that in UC, PEA15, CASP1, CASP4, and SERPINA1 may collectively contribute to an immunosuppressive microenvironment, impairing the immune surveillance function of CD56dim NK cells and thereby exacerbating disease progression.
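To make the gene-immune cell association concrete, the following is a minimal sketch of how a rank correlation between one gene's expression and one immune cell's infiltration score could be computed. It assumes Spearman correlation, which this section does not confirm as the method used, and every value in it is made up purely for illustration. The names `gene_expr` and `nk_score` are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative (invented) values for five samples: expression of one
# target gene and the CD56dim NK cell infiltration score per sample.
gene_expr = [2.1, 3.4, 5.0, 6.2, 7.9]
nk_score = [0.9, 0.7, 0.6, 0.4, 0.1]

rho = spearman(gene_expr, nk_score)
print(f"Spearman rho = {rho:.3f}")
```

In this toy input the infiltration score falls as expression rises, so the coefficient is negative, which is the direction of association described above for CD56dim NK cells. Repeating the calculation for every gene-cell pair would give the matrix that a correlation heatmap displays.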

In summary, we identified PEA15 as a potential diagnostic marker for UC, representing a novel finding distinct from those of previous studies. This discovery provides new targets and insights for understanding the mechanisms of UC and developing treatments. Additionally, we identified three other target genes (SERPINA1, CASP4, and CASP1) using three machine learning methods, which reconfirmed the diagnostic value of these genes in UC. In the future, diagnostic models incorporating these markers could be developed to facilitate early diagnosis of UC and advance precision medicine.

Our study had certain limitations. First, the parameters used in this study were limited to genetic data, neglecting the potential influence of other clinical correlates on UC. The model could be improved by incorporating these additional factors. Second, the sample size of this study was relatively small, increasing the potential for error. Expanding the sample size in future studies will be necessary to validate the model’s performance more accurately. Third, our study primarily analyzed gene expression at the mRNA level, and validation at the protein level was not performed. Further verification studies should be performed to confirm whether the DEGs exhibit corresponding changes at the protein level and verify the biological significance of the findings. Finally, as our data were derived from genetic datasets in the GEO database, the diagnostic potential of the identified markers and constructed model must be validated through in vitro experiments and clinical trials to confirm their applicability in clinical settings.