Introduction

Intervertebral disc degeneration (IDD) is a common chronic degenerative disease and one of the main causes of low back pain1. Epidemiological studies have shown that the global prevalence of IDD ranges from 50% to 90%, significantly affecting patients’ quality of life2. Although current evidence-based medicine has identified IDD as a result of multiple factors, including genetics, trauma, inflammation, lifestyle, and aging, the pathogenic processes involved in IDD development remain unclear3. In recent years, the development of genomics and bioinformatics technologies has provided new ideas for in-depth exploration of the molecular mechanisms of IDD4.

Glycosylation, an important post-translational modification of proteins that also occurs on lipids, is vital for many biological functions, including cell adhesion and signal transduction5,6. Previous studies have indicated that the occurrence of multiple degenerative diseases, such as osteoarthritis and Alzheimer’s disease, is closely related to abnormal glycosylation7,8. However, the impact of glycosylation-related genes (GRGs) on the progression of IDD is still not well understood. Therefore, this study aims to screen and validate IDD-related glycosylation genes through bioinformatic methods, providing new ideas for understanding the molecular mechanisms of IDD and exploring potential therapeutic targets.

Currently, multiple studies9,10,11,12 have reported biological markers and pathogenic mechanisms related to IDD, but most are limited to the protein level, and the regulatory mechanisms at the gene level have not been thoroughly explored. In recent years, with the development of omics technologies, researchers have begun to focus on the screening and functional analysis of IDD-related genes13,14. However, these studies mainly concentrate on the identification of differentially expressed genes (DEGs), with relatively less research on their functions and regulatory mechanisms. Moreover, IDD is a complex disease involving multiple pathological processes, and single omics analysis may not fully reveal its pathogenesis.

Zhu et al. demonstrated the significant contribution of genes related to mitochondrial dysfunction to the advancement of IDD through an extensive bioinformatics investigation15. Another study utilized genome-wide analysis of DNA methylation profiles to identify differentially methylated sites associated with human IDD16. However, there are few reports on the use of multi-omics methods to study IDD-related glycosylation genes.

This study intends to integrate multiple IDD gene expression profile datasets, taking GRGs as a starting point, and comprehensively apply various bioinformatic methods such as differential analysis, Weighted Gene Co-Expression Network Analysis (WGCNA), and Gene Set Enrichment Analysis (GSEA) to analyze the molecular mechanisms of IDD development from the perspectives of gene co-expression networks and pathway enrichment, screening potential markers and therapeutic targets. At the same time, a diagnostic model for IDD based on Least Absolute Shrinkage and Selection Operator (LASSO) regression was developed and tested for reliability and accuracy. The study also investigated the correlation between key genes and immune infiltration using CIBERSORT and single sample gene set enrichment analysis (ssGSEA). By categorizing IDD samples into immune subtypes, the research highlighted the important role of immune microenvironment changes in IDD heterogeneity. Additionally, the study examined the regulatory networks of transcription factors, miRNAs, RNA-binding proteins, and drugs acting on key genes. This research is intended to help elucidate IDD pathophysiology and guide the improvement of clinical diagnosis and treatment strategies. Furthermore, it may serve as a reference for exploring the mechanisms of other degenerative diseases.

Materials and methods

Data collection and downloading

Figure 1 shows the workflow of the present study. The IDD datasets GSE3409517, GSE7036218, and GSE14738319 were downloaded from the GEO database20 using the R package GEOquery21. The samples in these datasets were all human, derived from intervertebral disc nucleus pulposus tissue. The datasets GSE34095, GSE70362, and GSE147383 used the chip platforms GPL96, GPL17810, and GPL570, respectively. GSE34095 included 3 IDD samples and 3 control samples; GSE70362 included 16 IDD samples and 8 control samples; GSE147383 included 2 IDD samples and 2 control samples. GRGs were collected using the GeneCards22 and MSigDB23 databases and published literature24, obtaining a total of 625 unique genes (Supplementary Table S1). The R package sva24 was used for batch effect removal, and the integrated dataset contained 21 IDD samples and 13 control samples (see Table 1 for details). The R package limma24 was used for data normalization and batch effect removal, and principal component analysis (PCA)25 was performed to verify the batch effect removal (Fig. 2).

Fig. 1

Technology roadmap. IDD, Intervertebral Disc Degeneration; DEG, Differentially Expressed Genes; GRGs, Glycosylation-Related Genes; GRDEGs, Glycosylation Related Differentially Expressed Genes; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; GSEA, Gene Set Enrichment Analysis; ssGSEA, Single-Sample Gene-Set Enrichment Analysis; SVM, Support Vector Machine; LASSO, Least Absolute Shrinkage and Selection Operator; PPI Network, Protein-Protein Interaction Network; TF, Transcription Factor; RBP, RNA-Binding Protein.

Table 1 GEO microarray chip Information.
Fig. 2

Batch effect removal in the combined datasets. (A) Boxplots of the combined GEO datasets before batch effect removal. (B) Boxplots of the combined GEO datasets after batch effect removal. (C) PCA plot of the combined GEO datasets before batch effect removal. (D) PCA plot of the combined GEO datasets after batch effect removal. IDD, Intervertebral Disc Degeneration; PCA, Principal Component Analysis. The IDD dataset GSE34095 is blue, the IDD dataset GSE70362 is red, and the IDD dataset GSE147383 is yellow.

IDD-related glycosylation-related differentially expressed genes

In the integrated GEO dataset, samples were categorized into the IDD group and the Control group. Differential gene analysis was performed using the R package limma, with the threshold set at |logFC| > 0.3 and p-value < 0.05. Genes with logFC > 0.3 were considered upregulated, while those with logFC < – 0.3 were deemed downregulated. The results of the analysis were visualized using a volcano plot generated with the R package ggplot2. Intersection of all DEGs with GRGs yielded IDD-related glycosylation-related differentially expressed genes (GRDEGs). The expression levels of GRDEGs were compared between groups using the Mann-Whitney U test, and a heatmap was constructed using the R package pheatmap. Chromosomal locations of GRDEGs were visualized using the R package RCircos26.
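The screening logic described above can be illustrated with a minimal Python sketch. It is not the limma workflow itself: it assumes a table of logFC and p-values has already been produced, it assumes a p-value cutoff of 0.05 (the significance level given in the Statistical analysis section), and the function names and toy gene symbols are hypothetical.

```python
def classify_deg(log_fc, p_value, lfc_cutoff=0.3, p_cutoff=0.05):
    """Label one gene as 'up', 'down', or 'ns' (not significant)."""
    if p_value >= p_cutoff:
        return "ns"
    if log_fc > lfc_cutoff:
        return "up"
    if log_fc < -lfc_cutoff:
        return "down"
    return "ns"


def glycosylation_degs(de_table, grg_symbols):
    """Intersect the DEGs with the glycosylation-related gene list (GRDEGs)."""
    degs = {
        gene
        for gene, (log_fc, p_value) in de_table.items()
        if classify_deg(log_fc, p_value) != "ns"
    }
    return sorted(degs & set(grg_symbols))


# Toy values for illustration only; these are not results from the study.
toy_table = {
    "GENE_A": (0.45, 0.01),   # up, and in the GRG list
    "GENE_B": (-0.50, 0.03),  # down, but not a GRG
    "GENE_C": (0.10, 0.001),  # |logFC| too small
    "GENE_D": (0.80, 0.20),   # p-value too large
}
toy_grgs = ["GENE_A", "GENE_C", "GENE_D"]
print(glycosylation_degs(toy_table, toy_grgs))  # ['GENE_A']
```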

Gene ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) enrichment analysis

GO analysis27 is a common technique for functional enrichment studies across biological processes (BP), cellular components (CC), and molecular functions (MF). KEGG is a widely used database of information on genomes, biological pathways, diseases, and drugs28,29,30. The R package clusterProfiler was used for the GO and KEGG enrichment analysis of GRDEGs28. The Benjamini-Hochberg method was employed for p-value correction, with screening criteria set at adj. p < 0.05 and FDR (q value) < 0.25.
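For readers unfamiliar with the Benjamini-Hochberg correction, the following Python sketch shows the standard step-up adjustment. It is an illustration only, not the clusterProfiler code, and the function name and toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for illustration only.
toy_p = [0.01, 0.04, 0.03, 0.005]
toy_adj = benjamini_hochberg(toy_p)
significant = [p < 0.05 for p in toy_adj]
print(toy_adj, significant)
```

Each raw p-value is multiplied by m/rank, and the running minimum ensures that an adjusted value never exceeds the adjusted value of a larger raw p-value.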

GSEA between IDD and control groups

GSEA31 was conducted to assess whether predefined gene sets are concentrated at the top or bottom of a ranked gene list and how they relate to the phenotype. The genes from the integrated GEO dataset were ranked by logFC values, and GSEA was performed using the R package clusterProfiler. The analysis was carried out with specific parameters, including a seed of 2022, 1000 permutations, and gene set sizes ranging from 10 to 500 genes. The gene set c2.all.v2022.1.Hs.symbols.GMT [Curated/Pathway] (6449) from the MSigDB32 database was utilized for the analysis. The Benjamini-Hochberg correction method was applied with an adjusted p-value < 0.05 and false discovery rate (FDR) < 0.25 for significance.
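The running-sum statistic at the core of GSEA can be sketched in a few lines of Python. This is a simplified illustration of the classic weighted enrichment score, not the clusterProfiler implementation, and it omits the permutation step and the gene set size limits used in this study. The function name and toy gene list are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set, weight=1.0):
    """Maximum deviation from zero of the GSEA running sum.

    ranked_genes: list of (gene, score) pairs sorted by score, descending.
    gene_set: set of gene symbols to test.
    """
    hit_weights = [
        abs(score) ** weight if gene in gene_set else 0.0
        for gene, score in ranked_genes
    ]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for gene, _ in ranked_genes if gene not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")

    running, best = 0.0, 0.0
    for (gene, _), hit in zip(ranked_genes, hit_weights):
        if gene in gene_set:
            running += hit / total_hit   # step up, weighted by |score|
        else:
            running -= 1.0 / n_miss      # step down by a constant
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (by logFC) for illustration only.
toy_ranked = [("A", 2.0), ("B", 1.0), ("C", -0.5), ("D", -1.0)]
es_top = enrichment_score(toy_ranked, {"A", "B"})
es_bottom = enrichment_score(toy_ranked, {"C", "D"})
print(es_top, es_bottom)
```

A positive score indicates that the gene set is concentrated among the genes with the highest logFC, and a negative score indicates concentration among the lowest.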

Weighted gene co-expression network analysis

Weighted gene co-expression network analysis was performed using the R package WGCNA33,34. Pairwise correlation coefficients were first computed between all genes, and weighted correlation values were applied to establish a scale-free network topology for gene connectivity. A hierarchical clustering tree was then constructed based on inter-gene correlations, where distinct branches represented gene modules (color-coded), followed by module significance assessment. For the integrated GEO dataset, the variance was calculated across all genes to select the top 3000 high-variance genes, with parameters set to minimum module size = 80 and optimal soft-thresholding power = 5. Module-trait correlations with the IDD and Control groups were measured, with each module summarized by its module eigengene. Modules exhibiting |r| > 0.40 were screened, and their constituent genes were intersected with the glycosylation-related differentially expressed genes (GRDEGs) to generate Venn diagrams, with all intersecting genes from qualified modules designated hub genes. Finally, Spearman correlation analysis was performed on hub gene expression profiles within the integrated GEO dataset. The resulting correlation matrices were visualized using the R packages igraph and ggraph with defined strength thresholds: |r| < 0.3 (weak or uncorrelated), 0.3 ≤ |r| < 0.5 (weak), 0.5 ≤ |r| < 0.8 (moderate), and |r| ≥ 0.8 (strong).
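Two of the simpler steps in this procedure, selecting high-variance genes and labeling correlation strength, can be sketched in Python as follows. This does not reproduce WGCNA itself; the function names and toy expression values are hypothetical, and sample variance is assumed as the variance measure.

```python
from statistics import variance


def top_variance_genes(expression, n_top=3000):
    """Return the n_top gene symbols with the highest sample variance.

    expression: dict mapping gene symbol -> list of expression values.
    """
    variances = {gene: variance(values) for gene, values in expression.items()}
    return sorted(variances, key=variances.get, reverse=True)[:n_top]


def correlation_strength(r):
    """Map a correlation coefficient to the strength labels used in the text."""
    magnitude = abs(r)
    if magnitude >= 0.8:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.3:
        return "weak"
    return "weak or uncorrelated"


# Toy expression values for illustration only.
toy_expression = {"g1": [1, 1, 1], "g2": [0, 5, 10], "g3": [1, 2, 3]}
print(top_variance_genes(toy_expression, n_top=2))  # ['g2', 'g3']
print(correlation_strength(0.617))                  # moderate
```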

Construction of IDD diagnostic model

To construct the IDD diagnostic model, logistic regression analysis was conducted on hub genes to analyze the association between IDD and Control. Genes with a p-value < 0.05 were identified as GRDEGs. These genes were then used to construct a forest plot displaying the expression levels in the logistic regression model. Subsequently, a model was created utilizing the Support Vector Machine (SVM) algorithm, focusing on genes with the highest accuracy35. Finally, LASSO regression analysis was performed using the R package glmnet36, setting the seed to 500 and the family to “binomial” to reduce overfitting and improve generalization ability. The LASSO results were depicted in the diagnostic model and variable trajectory plot. The hub genes in the results were defined as key genes. The risk score formula based on the LASSO regression coefficients was calculated as follows:

$$\text{RiskScore} = \sum_{i} \text{Coefficient}(\text{gene}_{i}) \ast \text{mRNA Expression}(\text{gene}_{i})$$

Validation of IDD diagnostic model

Based on the LASSO regression results, a nomogram37 was drawn using the R package rms to display the relationships among hub genes. Calibration plots were used to evaluate the accuracy and discrimination of IDD diagnostic models. Decision curve analysis (DCA)38 was performed using the ggDCA package in R to assess the clinical utility of the predictive model. In addition, ROC curves based on the LASSO risk score and key genes were plotted using the R package pROC, and AUC values were calculated to assess the diagnostic performance. An AUC close to 1 indicated high diagnostic accuracy. Based on the risk score, samples with IDD were categorized into high-risk and low-risk groups. The Mann-Whitney U test was employed to assess the expression variances of key genes between these groups, with the outcomes presented in group comparison plots.
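The AUC used to judge diagnostic performance can be computed directly from its rank interpretation: the probability that a randomly chosen IDD sample scores higher than a randomly chosen control. The Python sketch below illustrates this. It is not the pROC implementation, it assumes that higher scores indicate IDD, and the function name and toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for IDD (case), 0 for Control. Ties count as half a win.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes are required")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Toy risk scores for illustration only.
toy_scores = [0.9, 0.2, 0.3, 0.1]
toy_labels = [1, 1, 0, 0]
print(roc_auc(toy_scores, toy_labels))  # 0.75
```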

Immune infiltration analysis between IDD and control groups

The CIBERSORT algorithm39 and LM22 signature gene matrix were employed to analyze the transcriptome expression matrix of the combined GEO dataset samples. This process estimated the composition and abundance of immune cells, with a focus on data displaying immune cell enrichment scores above zero. Stacked bar plots illustrating the proportions of LM22 immune cells in both the IDD and Control groups were generated using the R package ggplot2. Additionally, correlations between immune cells and key genes, as well as correlations among immune cells themselves, were calculated using the Spearman algorithm. Correlation heatmaps were then created using the R packages pheatmap and ggplot2. Based on the correlation heatmap, the four key genes with the strongest correlations with immune cells were selected and further displayed through scatter plots.

Construction of high and low glycosylation score groups

The ssGSEA algorithm40 was utilized with the R package GSVA to compute glycosylation scores (Gs) for all samples in the integrated GEO dataset. These scores were then used to categorize IDD samples into high (HighScore) and low (LowScore) score groups. ROC curves of glycosylation scores and key genes were generated using the R package pROC to assess their diagnostic performance for IDD. The AUC values varied from 0.5 to 1, where 0.5 ~ 0.7 indicated low accuracy, 0.7 ~ 0.9 indicated moderate accuracy, and above 0.9 indicated high accuracy. Subsequently, the Mann-Whitney U test was employed to compare the expression variances of key genes between the high and low score groups, and the outcomes were illustrated through group comparison plots.
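The grouping and the AUC interpretation described above can be sketched in Python. The median is assumed as the cutoff for the HighScore and LowScore groups, as reported in the Results. The treatment of values exactly at the median or at the 0.7 and 0.9 boundaries is an assumption of this sketch, and the function names and toy scores are hypothetical.

```python
from statistics import median


def split_by_median(scores):
    """Assign each sample to HighScore or LowScore by the median score.

    scores: dict mapping sample ID -> glycosylation score.
    Samples exactly at the median go to LowScore (an assumption).
    """
    cutoff = median(scores.values())
    return {
        sample: "HighScore" if value > cutoff else "LowScore"
        for sample, value in scores.items()
    }


def auc_accuracy(auc):
    """Map an AUC value to the accuracy labels used in the text."""
    if auc > 0.9:
        return "high"
    if auc >= 0.7:
        return "moderate"
    if auc >= 0.5:
        return "low"
    return "below chance"


# Toy scores for illustration only.
toy_gs = {"s1": 0.1, "s2": 0.4, "s3": 0.6, "s4": 0.9}
print(split_by_median(toy_gs))
print(auc_accuracy(0.8))  # moderate
```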

Immune infiltration analysis between high and low glycosylation score groups

Utilizing the CIBERSORT algorithm39 in conjunction with the LM22 signature gene matrix, IDD samples within the combined GEO dataset were analyzed to estimate the composition and abundance of immune cells. The analysis included only data with immune cell enrichment scores above zero. The results were visualized using stacked bar plots drawn with the R package ggplot2, showing the proportions of LM22 immune cells in the HighScore and LowScore groups. The Spearman algorithm was conducted to examine the relationship between hub genes and the abundance of immune cell infiltration, retaining results with p-value < 0.05, and a correlation heatmap was drawn using ggplot2. Finally, immune cells significantly correlated with key genes were selected, and lollipop plots were drawn using ggplot2 to further display these relationships.

Immune infiltration analysis and consensus clustering

ssGSEA40 quantitatively evaluated the infiltration abundance of immune cells in IDD samples from the comprehensive GEO dataset, encompassing diverse immune cell subtypes including activated CD8 T cells and dendritic cells. The IDD samples were subjected to analysis using consensus clustering41 based on the immune cell infiltration matrix, with the R package ConsensusClusterPlus42. The number of clusters ranged from 2 to 9, and the clustering process was repeated 50 times with 80% of the sample size. Furthermore, the analysis included examination of the expression variances of key genes across different IDD subtypes, as well as the expression disparities of immune cells within the various IDD subtypes. Finally, the Spearman algorithm was employed to explore the correlation between key genes and immune cells, with the resulting correlations visualized in heatmaps.

Construction of protein-protein interaction network

Based on central genes, the STRING 12.0 database was used to construct a protein-protein interaction (PPI) network of key genes43, selecting genes with a minimum interaction coefficient greater than 0.150 for in-depth analysis. Meanwhile, the GeneMANIA 3.5.1 database44 was utilized to predict and analyze functionally similar genes of key genes, further constructing a PPI network to assist in gene function analysis and prediction.

Construction of regulatory networks

Transcription factors (TFs) regulate gene expression by interacting with specific key genes. The regulatory effects of TFs on key genes were analyzed using the ChIPBase45 and hTFtarget46 databases, while the mRNA-TF regulatory network was visualized using Cytoscape 3.10.147. Additionally, the relationship between key genes and miRNAs was examined through the StarBase v3.0 database48, and the mRNA-miRNA regulatory network was visualized. Predictions for target RNA-binding proteins (RBPs) of key genes were made using the same database, and the mRNA-RBP regulatory network was visualized49. Finally, drug targets of key genes were predicted using the CTD database50, and the mRNA-Drug regulatory network was visualized through Cytoscape, completing the network construction.

Statistical analysis

All data processing and analysis in this study were conducted using R software (Version 4.3.1). When continuous variables were compared between two groups, the independent Student’s t-test was used for normally distributed variables, while the Mann-Whitney U test (Wilcoxon rank sum test) was employed for non-normally distributed variables. For comparisons involving three or more groups, the Kruskal-Wallis test was applied. Spearman correlation analysis was used to calculate correlation coefficients between different molecules. Unless otherwise specified, all statistical p-values were two-sided, with a significance level set at p < 0.05.
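Spearman correlation, used throughout this study, is the Pearson correlation of the ranks of the two variables. The Python sketch below illustrates the calculation with tied values assigned their average rank. It is a standard-library illustration rather than the R routine used here, it does not compute p-values, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```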

Results

Intervertebral disc degeneration-related glycosylation-related differentially expressed genes

Utilizing the R package limma for differential gene analysis, a total of 559 DEGs were screened, including 282 upregulated genes and 277 downregulated genes. A volcano plot was drawn based on the differential analysis results of this dataset (Fig. 3A).

According to the differential analysis method, DEGs and GRGs were obtained, and a Venn diagram was drawn by taking their intersection (Fig. 3B), yielding a total of 25 GRDEGs: IGFBP3, MUC1, MAN2B2, ST6GALNAC2, ST8SIA1, HEXA, CNIH3, PIGT, PTGDS, MAN1A1, DPAGT1, GALNT7, SERPINA1, PDPN, EDEM3, RAPGEF5, C1GALT1C1, TSPAN1, GLA, TLR4, PLOD2, ATP6AP2, GALNT3, CHI3L1, and THBS1. Detailed information on GRDEGs is listed in Table 2. Based on the intersection results, the locations of the 25 GRDEGs on human chromosomes were analyzed using the R package RCircos, and a chromosomal location map was drawn (Fig. 3C). The chromosomal location map shows that the 25 GRDEGs are located on chromosomes 1, 2, 3, 4, 6, 7, 9, 11, 12, 14, 15, 17, 20, and X.

Table 2 List of GRDEGs of differential expression analysis.

A simple value heatmap (Fig. 3D) and group comparison plot (Fig. 3E) were drawn using the R package ggplot2 to display the analysis results. All 25 GRDEGs showed significant differences between the different sample groups. The genes IGFBP3, ST6GALNAC2, CNIH3, PTGDS, MAN1A1, DPAGT1, GALNT7, PDPN, EDEM3, C1GALT1C1, GLA, TLR4, PLOD2, ATP6AP2, CHI3L1, and THBS1 were significantly upregulated in the IDD group, while the genes MUC1, MAN2B2, ST8SIA1, HEXA, PIGT, SERPINA1, RAPGEF5, TSPAN1, and GALNT3 were more highly expressed in the Control group.

Fig. 3

Differential gene expression analysis. (A) Volcano plot of DEGs analysis between IDD and Control in the integrated GEO dataset, with GRDEGs marked. (B) Venn diagram of DEGs and GRGs in the integrated GEO dataset. (C) Chromosomal location map of GRDEGs. (D, E) Simple value heatmap (D) and group comparison plot (E) of GRDEGs expression levels between IDD and Control groups in the integrated GEO dataset. IDD, Intervertebral Disc Degeneration; DEGs, Differentially Expressed Genes; GRGs, Glycosylation-Related Genes; GRDEGs, Glycosylation-Related Differentially Expressed Genes. * represents p-value < 0.05, indicating statistical significance; ** represents p-value < 0.01, indicating high statistical significance; *** represents p-value < 0.001, indicating extreme statistical significance. In the grouping, red represents IDD and blue represents Control; in the simple value heatmap, red represents high expression and blue represents low expression.

GO and KEGG enrichment analysis

The detailed results are shown in Supplementary Table S2. The results indicate that 12 GRDEGs are mainly enriched in protein glycosylation (BP), lysosomal lumen (CC), and hydrolase activity (MF) in IDD (Fig. 4A–C). They are also enriched in the mucin type O-glycan biosynthesis pathway (KEGG) (Fig. 4D).

Fig. 4

GO and KEGG enrichment analysis for GRDEGs. (A–D) Bar plots of the Gene Ontology (GO) biological process (BP), cellular component (CC), molecular function (MF), and pathway (KEGG) enrichment analysis results of GRDEGs: BP (A), CC (B), MF (C), and KEGG (D). The horizontal axis represents GO terms and KEGG terms. GRDEGs, Glycosylation-Related Differentially Expressed Genes; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; BP, Biological Process; CC, Cellular Component; MF, Molecular Function. The screening criteria for Gene Ontology (GO) and pathway (KEGG) enrichment analysis were adj.p < 0.05 and FDR value (q value) < 0.25, and the p-value correction method was Benjamini-Hochberg (BH).

GSEA between IDD and control groups

The GSEA results (Fig. 5A) are shown in Table 3. The results demonstrate that the genes in the integrated GEO dataset are significantly enriched in iron metabolism in placenta (Fig. 5B), adaptation to hypoxia down (Fig. 5C), apoptosis by serum deprivation up (Fig. 5D), integrated TGF-β EMT up (Fig. 5E), and other biologically relevant functions and signaling pathways.

Fig. 5

GSEA for intervertebral disc degeneration between IDD and control groups. (A) Enrichment plot of 7 biological functions from the gene set enrichment analysis (GSEA) of the combined GEO datasets. (B–E) GSEA showed that all genes were significantly enriched in Hypoxia Dn (B), Apoptosis By CDKN1A Via TP53 (C), Emt Breast Tumor Dn (D), and Circadian Rhythm Genes (E). IDD, Intervertebral Disc Degeneration; GSEA, Gene Set Enrichment Analysis. The screening criteria of GSEA were adj.p < 0.05 and FDR value (q value) < 0.25, and the p-value correction method was Benjamini-Hochberg (BH).

Table 3 Results of GSEA gene set enrichment analysis between IDD and control groups in combined datasets.

Weighted gene co-expression network analysis

The WGCNA results (Fig. 6A) show that the top 3,000 genes with the highest variance were clustered and annotated with grouping information through a clustering tree (Fig. 6C). The genes were aggregated into 12 modules (Fig. 6B). Using |r value| > 0.40 as the criterion for modules, two modules were selected for subsequent analysis: MEpink and MEgreen. The 25 GRDEGs were intersected with the genes contained in the two modules, and a Venn diagram was drawn (Fig. 6D), yielding a total of 9 hub genes: IGFBP3, MAN2B2, PTGDS, MAN1A1, SERPINA1, RAPGEF5, GLA, PLOD2, and CHI3L1.

Finally, the correlation heatmap of hub gene expression levels (Fig. 6E) shows that the gene SERPINA1 has the strongest significant positive correlation with the gene MAN2B2 (r value = 0.617, p-value < 0.001), while the gene GLA exhibits the most robust and significant negative correlation with the gene MAN2B2 (r value = – 0.679, p-value < 0.001).

Fig. 6

WGCNA for combined datasets. (A) Display of the scale-free network with the optimal soft threshold in weighted gene co-expression network analysis (WGCNA). The left plot shows the optimal soft threshold, and the right plot shows the network connectivity under different soft thresholds. (B) Display of the correlation analysis results between clustering modules of the top 3,000 genes with the highest variance and the Control and IDD groups. (C) Display of the module aggregation results of the top 3,000 genes with the highest variance. (D) Display of the Venn diagram of 25 GRDEGs and genes contained in the MEpink and MEgreen modules. (E) Correlation heatmap of expression levels between hub genes. IDD, Intervertebral Disc Degeneration; WGCNA, Weighted Gene Co-Expression Network Analysis; GRDEGs, Glycosylation-Related Differentially Expressed Genes. * represents p-value < 0.05, indicating statistical significance; ** represents p-value < 0.01, indicating high statistical significance; *** represents p-value < 0.001, indicating extreme statistical significance. The absolute value of the correlation coefficient (r value) below 0.3 is weak or uncorrelated, between 0.3 and 0.5 is weakly correlated, between 0.5 and 0.8 is moderately correlated, and above 0.8 is strongly correlated. Red represents positive correlation and blue represents negative correlation.

Construction of IDD diagnostic model

The logistic regression model was constructed using the 9 hub genes and displayed through a forest plot (Fig. 7A). Detailed information is shown in Table 4. The results indicate that all 9 hub genes have statistical significance in the logistic regression model (p-value < 0.05). Next, an SVM model was constructed based on the 9 hub genes and the SVM algorithm, obtaining the genes with the lowest error rate (Fig. 7B) and the highest accuracy (Fig. 7C). The results show that when the number of genes is 8, the accuracy of the SVM model is the highest. These 8 hub genes are MAN2B2, IGFBP3, MAN1A1, CHI3L1, PLOD2, RAPGEF5, GLA, and PTGDS.

Fig. 7

Diagnostic model of intervertebral disc degeneration. (A) Forest Plot of the nine hub genes included in the Logistic regression model in the diagnostic model of intervertebral disc degeneration (IDD). (B, C) Visualization of the number of genes with the lowest error rate (B) and the number of genes with the highest accuracy (C) obtained by the SVM algorithm. (D, E) Diagnostic model plot (D) and variable trajectory plot (E) of LASSO regression model. IDD, Intervertebral Disc Degeneration; SVM, Support Vector Machine; LASSO, Least Absolute Shrinkage and Selection Operator.

Table 4 Results of univariate logistic regression.

A LASSO regression analysis was conducted using the 8 hub genes from the SVM model to develop an IDD diagnostic model. The LASSO regression model diagram (Fig. 7D) and the LASSO variable trajectory diagram (Fig. 7E) were plotted. The analysis identified 7 key genes in the LASSO regression model: MAN2B2, IGFBP3, MAN1A1, CHI3L1, PLOD2, RAPGEF5, and GLA.

Validation of IDD diagnostic model

To further substantiate the value of the IDD diagnostic model, a nomogram was drawn based on the key genes to display the relationships among key genes in the integrated GEO dataset (Fig. 8A). The findings suggest that the expression level of the critical gene MAN2B2 is significantly more useful for the IDD diagnostic model compared to other variables. In contrast, the expression level of GLA has notably less utility for the IDD diagnostic model compared to other variables.

The calibration curve plot of the IDD diagnostic model indicates that the calibration line, represented by the dashed line, slightly deviates from the diagonal line of the ideal model but closely coincides with it (Fig. 8B). The DCA plot demonstrates that the model line consistently outperforms All positive and All negative within a specific range, indicating that the model provides greater benefits and superior performance (Fig. 8C). Furthermore, the ROC curve reveals that the risk score’s expression level in the integrated GEO dataset exhibits high accuracy (AUC > 0.9) across different groups (Fig. 8D). Simultaneously, drawing an ROC curve based on the expression levels of 7 critical genes in the integrated GEO dataset shows moderate accuracy (0.7 < AUC < 0.9) between different groups in this dataset (Fig. 8E). The formula for calculating risk score (Eq. 1) is as follows:

$$\begin{aligned}\text{RiskScore} & = \text{MAN2B2} \ast (-39.650) + \text{IGFBP3} \ast (7.466) + \text{MAN1A1} \ast (8.530) + \text{CHI3L1} \ast (6.682)\\ & \quad + \text{PLOD2} \ast (-11.957) + \text{RAPGEF5} \ast (-1.705) + \text{GLA} \ast (-2.332)\end{aligned}$$
(1)
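As a worked illustration of Eq. 1, the following Python sketch computes the risk score for one sample from the reported coefficients. The third term is taken to refer to the key gene MAN1A1, the formula is assumed to have no intercept because none is reported, and the function name and the toy expression values are hypothetical.

```python
# Coefficients as reported in Eq. 1 (third term assumed to be MAN1A1).
LASSO_COEFFICIENTS = {
    "MAN2B2": -39.650,
    "IGFBP3": 7.466,
    "MAN1A1": 8.530,
    "CHI3L1": 6.682,
    "PLOD2": -11.957,
    "RAPGEF5": -1.705,
    "GLA": -2.332,
}


def risk_score(expression):
    """Sum of coefficient * expression over the 7 key genes.

    expression: dict mapping gene symbol -> expression value for one sample.
    """
    missing = sorted(set(LASSO_COEFFICIENTS) - set(expression))
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(
        coefficient * expression[gene]
        for gene, coefficient in LASSO_COEFFICIENTS.items()
    )


# Toy sample with every gene at expression 1.0, for illustration only.
toy_sample = {gene: 1.0 for gene in LASSO_COEFFICIENTS}
print(round(risk_score(toy_sample), 3))  # -32.966
```

With all expression values set to 1.0, the score reduces to the sum of the coefficients, which is a quick check that the terms have been entered correctly.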

Subsequently, group comparison plots were utilized to explore the expression differences of critical genes in IDD. The results of the differential analysis of the expression levels of the 7 essential genes in the HighRisk and LowRisk groups of IDD are displayed in Fig. 8F. It was found that the essential genes MAN2B2, IGFBP3, and GLA exhibit statistically significant differences in expression levels between the HighRisk and LowRisk groups of IDD (p-value < 0.05). Specifically, IGFBP3 and GLA are highly expressed in the high-risk group, while MAN2B2 is highly expressed in the low-risk group.

Fig. 8

Diagnostic and validation analysis of intervertebral disc degeneration. (A) Nomogram of key genes in the IDD diagnostic model in the integrated GEO dataset. (B) Calibration curve plot of the IDD diagnostic model based on key genes in the integrated GEO dataset. (C) DCA plot of the IDD diagnostic model based on the risk score (RiskScore) in the integrated GEO dataset. (D) ROC curve of the risk score (RiskScore) in the integrated GEO dataset. (E) ROC curve of key genes in the integrated GEO dataset. (F) Group comparison plot of key genes in the HighRisk and LowRisk groups of IDD. The vertical axis of the DCA plot represents the net benefit, and the horizontal axis represents the threshold probability. IDD, Intervertebral Disc Degeneration; DCA, Decision Curve Analysis; ROC, Receiver Operating Characteristic; AUC, Area Under the Curve; CI, Confidence Interval; TPR, True Positive Rate; FPR, False Positive Rate. ns represents p-value ≥ 0.05, indicating no statistical significance; * represents p-value < 0.05, indicating statistical significance; ** represents p-value < 0.01, indicating high statistical significance. When AUC > 0.5, it indicates a trend of the molecule’s expression promoting the occurrence of the event, and the closer the AUC is to 1, the better the diagnostic effect. AUC between 0.5 and 0.7 indicates low accuracy, AUC between 0.7 and 0.9 indicates moderate accuracy, and AUC above 0.9 indicates high accuracy. Red represents the HighRisk group and blue represents the LowRisk group.

Immune infiltration analysis between IDD and control groups

The CIBERSORT algorithm was used to calculate the immune infiltration abundance in the IDD and Control groups. The results show that 17 immune cell types are enriched in IDD samples (Fig. 9A). According to the correlation heatmap of immune cell infiltration abundance (Fig. 9B), follicular helper T cells have the strongest significant positive correlation with activated dendritic cells (r value = 0.556, p-value < 0.001), while activated NK cells have the strongest significant negative correlation with activated mast cells (r value = – 0.584, p-value < 0.001). According to the correlation heatmap between key genes and immune cell infiltration abundance (Fig. 9C), in IDD samples, the gene GLA has a significant positive correlation with eosinophils (r value = 0.581, p-value < 0.001), the gene PLOD2 has a significant positive correlation with M0 macrophages (r value = 0.544, p-value < 0.001), the gene GLA has a significant negative correlation with activated dendritic cells (r value = – 0.438, p-value < 0.01), and the gene RAPGEF5 has a significant negative correlation with regulatory T cells (Tregs) (r value = – 0.429, p-value < 0.05). Finally, scatter plots (Fig. 9D–G) were drawn to further display the correlations between the gene GLA and eosinophils (Fig. 9D), the gene PLOD2 and M0 macrophages (Fig. 9E), the gene GLA and activated dendritic cells (Fig. 9F), and the gene RAPGEF5 and regulatory T cells (Tregs) (Fig. 9G).

Fig. 9

Immune infiltration analysis between IDD and control groups (CIBERSORT). (A) Stacked bar plot of the proportions of LM22 immune cells in the integrated GEO dataset. (B) Correlation heatmap of immune cell infiltration abundance in the integrated GEO dataset. (C) Correlation heatmap between immune cell infiltration abundance and key genes in the integrated GEO dataset. (D) Scatter plot of the correlation between the gene GLA and eosinophils. (E) Scatter plot of the correlation between the gene PLOD2 and M0 macrophages. (F) Scatter plot of the correlation between the gene GLA and activated dendritic cells. (G) Scatter plot of the correlation between the gene RAPGEF5 and regulatory T cells (Tregs). IDD, Intervertebral Disc Degeneration. * represents p-value < 0.05, indicating statistical significance; ** represents p-value < 0.01, indicating high statistical significance; *** represents p-value < 0.001, indicating extreme statistical significance. The absolute value of the correlation coefficient (r value) below 0.3 is weak or uncorrelated, between 0.3 and 0.5 is weakly correlated, between 0.5 and 0.8 is moderately correlated, and above 0.8 is strongly correlated. In the grouping, red represents the IDD group and blue represents the Control group. In the correlation heatmap, red represents positive correlation and blue represents negative correlation, with the color depth representing the strength of the correlation.

Immune infiltration analysis between high and low glycosylation score groups

Based on the expression levels of the 7 key genes in the integrated GEO dataset, glycosylation scores (Gs) were calculated for all samples using the ssGSEA algorithm. ROC curves were then generated from the Gs using the R package pROC. The ROC curve (Fig. 10A) indicates that Gs distinguishes the IDD and Control groups in the integrated GEO dataset with moderate accuracy (0.7 < AUC < 0.9). A group comparison plot created with the R package ggplot2 (Fig. 10B) shows a highly significant difference in Gs between the IDD and Control groups (p-value < 0.001). Subsequently, the IDD samples were divided into high (HighScore) and low (LowScore) glycosylation score groups according to their median Gs. Differential expression analysis between the HighScore and LowScore groups was conducted using the R package limma, and the results are displayed in group comparison plots (Fig. 10C). The key genes CHI3L1, MAN1A1, and PLOD2 differed significantly between the HighScore and LowScore groups of IDD samples (p-value < 0.05), and all three were more highly expressed in the LowScore group.
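The scoring steps above (ROC evaluation of a per-sample score, then a median split) can be sketched in Python. This is an illustrative stand-in, not the authors' R workflow: it takes the scores as given rather than computing ssGSEA, it assumes that a higher score indicates the positive class (pROC can choose the direction automatically), and it assigns samples equal to the median to the LowScore group, which the text does not specify. All names and numbers are hypothetical.

```python
from statistics import median

def auc_from_scores(scores, labels):
    """AUC as the probability that a random positive sample scores higher
    than a random negative sample (ties count as 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def split_by_median(scores):
    """Assign each sample to HighScore or LowScore relative to the median.
    Samples equal to the median go to LowScore (an assumption)."""
    cut = median(scores)
    return ["HighScore" if s > cut else "LowScore" for s in scores]

# Toy values only: hypothetical scores, with 1 = IDD and 0 = Control.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
print(auc_from_scores(toy_scores, toy_labels))

# The median split in the text is applied to IDD samples only.
idd_scores = [s for s, l in zip(toy_scores, toy_labels) if l == 1]
print(split_by_median(idd_scores))
```

The pairwise definition of AUC used here is equivalent to the area under the empirical ROC curve; it is quadratic in sample count, which is acceptable for a dataset of a few dozen samples.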

Subsequently, the immune infiltration results were used to draw a stacked bar plot of the immune cell proportions in the integrated GEO dataset (Fig. 10D), and a correlation heatmap was generated to show the correlations between key genes and immune cell infiltration abundance (Fig. 10E). In IDD samples, the gene PLOD2 showed a significant moderate positive correlation with M0 macrophages (r value = 0.736, p-value < 0.001), whereas the gene MAN2B2 showed a significant moderate negative correlation with eosinophils (r value = -0.534, p-value < 0.05). Eosinophils were the immune cell type significantly correlated with the largest number of key genes (MAN2B2, GLA, and CHI3L1), and a lollipop plot (Fig. 10F) was drawn to further display these correlations.

Fig. 10

Immune infiltration analysis between high and low score groups (CIBERSORT). (A) ROC curve of the glycosylation score (Gs) in the integrated GEO dataset. (B) Group comparison plot of the glycosylation score (Gs) between IDD and Control groups in the integrated GEO dataset (Combined Datasets). (C) Group comparison plot of key genes in the HighScore and LowScore groups of IDD samples. (D) Stacked bar plot of the proportions of LM22 immune cells in IDD samples. (E) Correlation heatmap between immune cell infiltration abundance and key genes in IDD samples. (F) Lollipop plot of the correlations between key genes MAN2B2, GLA, CHI3L1, and eosinophils. A positive correlation coefficient indicates a positive correlation between the two variables, while a negative correlation coefficient indicates a negative correlation between the two variables. IDD, Intervertebral Disc Degeneration. ns represents p-value ≥ 0.05, indicating no statistical significance; * represents p-value < 0.05, indicating statistical significance; ** represents p-value < 0.01, indicating high statistical significance; *** represents p-value < 0.001, indicating extreme statistical significance. An absolute correlation coefficient (r value) below 0.3 indicates weak or no correlation, between 0.3 and 0.5 weak correlation, between 0.5 and 0.8 moderate correlation, and above 0.8 strong correlation. In the group comparison plot (B), red represents the IDD group and blue represents the Control group; in the group comparison plot (C) and stacked bar plot (D), red represents the HighScore group and blue represents the LowScore group; in the correlation heatmap, red represents positive correlation and blue represents negative correlation, with the color depth representing the strength of the correlation.

Immune infiltration analysis and consensus clustering based on immune characteristics of IDD samples

Using the k-means unsupervised clustering method, based on the infiltration levels of 28 immune cell types, all samples were clustered into two IDD subtypes (Fig. 11A): subtype 1 (Cluster1) and subtype 2 (Cluster2). The PCA results (Fig. 11B) show that in the reduced dimensionality space, there is a clear and distinct boundary between the two groups of samples, indicating good clustering performance.
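As a rough illustration of the clustering step, the following Python sketch runs plain k-means with k = 2 on per-sample feature vectors. It is a simplified stand-in and not the authors' procedure: it omits the consensus (resampling) step, uses the first k samples as initial centroids for determinism (an assumption; practical implementations use repeated random starts), and works on two-dimensional toy vectors rather than 28 immune cell scores. All names and values are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k=2, max_iter=100):
    """Plain k-means. Initial centroids are the first k points (assumed,
    chosen only to keep this sketch deterministic)."""
    if len(points) < k:
        raise ValueError("need at least k points")
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy values only: six hypothetical samples described by two infiltration scores.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(samples, k=2)
print(labels)
```

In a consensus clustering workflow, a base clusterer like this would be run repeatedly on resampled data, and samples that co-cluster consistently would define the final subtypes.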

Subsequently, the expression differences of key genes between subtype 1 (Cluster1) and subtype 2 (Cluster2) are shown in a volcano plot (Fig. 11C). The results indicate that the genes MAN1A1 and IGFBP3 are highly expressed in subtype 2. At the same time, the group comparison of 28 immune cell types between subtype 1 and subtype 2 (Fig. 11D) demonstrates that 12 immune cell types are enriched in IDD samples and have significant statistical differences between subtype 1 and subtype 2: activated CD4 T cells, activated dendritic cells, CD56dim natural killer cells, central memory CD4 T cells, effector memory CD8 T cells, eosinophils, MDSCs, memory B cells, plasmacytoid dendritic cells, regulatory T cells, T follicular helper cells, and type 1 T helper cells. Subsequently, a simple value heatmap (Fig. 11E) was used to further display the differences in the infiltration levels of the 12 immune cell types between subtype 1 and subtype 2 in IDD samples.

Finally, the correlation heatmap between key genes and immune cell infiltration abundance (Fig. 11F) shows that, in IDD samples, the gene MAN2B2 has a significant moderate positive correlation with memory B cells (r value = 0.649, p-value < 0.01), whereas the gene GLA has a significant moderate negative correlation with memory B cells (r value = -0.645, p-value < 0.01).

Fig. 11

Immune infiltration analysis between Cluster1 and Cluster2 groups (ssGSEA). (A) Consensus clustering results of IDD samples from the integrated GEO dataset based on the infiltration levels of 28 immune cell types calculated by the ssGSEA algorithm. (B) PCA plot of the two IDD disease subtypes. (C) Volcano plot of the differential analysis results in IDD subtypes, with key genes marked. (D) Group comparison plot of the infiltration levels of 28 immune cell types between the two IDD subtype groups. (E) Simple value heatmap of the infiltration levels of the selected 12 immune cell types between the two IDD subtype groups. (F) Correlation heatmap between key genes and the infiltration abundance of 12 immune cell types. IDD, Intervertebral Disc Degeneration; PCA, Principal Component Analysis. ns represents p-value ≥ 0.05, indicating no statistical significance; * represents p-value < 0.05, indicating statistical significance; ** represents p-value < 0.01, indicating high statistical significance. In the grouping, blue represents subtype 1 (Cluster1) and red represents subtype 2 (Cluster2); in the simple value heatmap, red represents high expression and blue represents low expression; in the correlation heatmap, red represents positive correlation and blue represents negative correlation, with the color depth representing the strength of the correlation.

Construction of PPI network

First, the PPI network results from the STRING database (Fig. 12A) indicate that 6 key genes are related: MAN2B2, IGFBP3, MAN1A1, CHI3L1, PLOD2, and GLA. Cytoscape was then used to draw the PPI network of these 6 key genes (Fig. 12B). Subsequently, the GeneMANIA website was used to predict and construct the interaction network of the 6 key genes and their functionally similar genes (Fig. 12C). The colors of the connecting lines represent co-expression, shared protein domains, and other types of association between them. The network includes the 6 key genes and 20 functionally similar genes.

Fig. 12

PPI network of key genes. (A) PPI network of key genes calculated by the STRING database. (B) PPI network of these 6 key genes drawn using Cytoscape software. (C) Interaction network of key genes and their functionally similar genes predicted by the GeneMANIA website. The circles in the figure represent the key genes and functionally similar genes, and the colors of the connecting lines represent the functions that connect them. PPI, Protein-protein Interaction.

Construction of regulatory networks

First, the ChIPBase and hTFtarget databases were used to obtain TFs that bind to the 6 key genes, and an mRNA-TF regulatory network was constructed and visualized using Cytoscape (Fig. 13A). The network includes 6 key genes and 49 TFs; detailed information is provided in Supplementary Table S3. Subsequently, the StarBase database was used to identify miRNAs related to these key genes, and an mRNA-miRNA regulatory network was constructed and visualized using Cytoscape (Fig. 13B). This network involves 3 key genes and 32 miRNAs; detailed information is provided in Supplementary Table S4. The StarBase database was also used to predict RBPs related to these key genes, and an mRNA-RBP regulatory network was constructed and visualized using Cytoscape (Fig. 13C). This network contains 6 key genes and 116 RBPs; detailed information is provided in Supplementary Table S5.

In addition, the CTD database was used to identify potential drugs or molecular compounds associated with the 6 key genes. Cytoscape was used to construct and visualize an mRNA-drug regulatory network (Fig. 13D), which includes 6 key genes and 20 drugs or molecular compounds. Detailed information is provided in Supplementary Table S6.
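The node counts reported for these networks amount to counting unique nodes on each side of a gene-regulator edge list. The Python sketch below shows one way this could be done; it is not taken from the study. The function name is hypothetical, and the regulator labels (TF_A, TF_B, TF_C) are placeholders rather than results from ChIPBase, hTFtarget, StarBase, or CTD.

```python
from collections import defaultdict

def summarize_network(edges):
    """Summarize a bipartite network given as (gene, regulator) pairs.

    Returns the number of distinct genes, the number of distinct
    regulators, and the number of regulators linked to each gene."""
    regulators_by_gene = defaultdict(set)
    for gene, regulator in edges:
        regulators_by_gene[gene].add(regulator)
    all_regulators = set()
    for regs in regulators_by_gene.values():
        all_regulators |= regs
    return {
        "n_genes": len(regulators_by_gene),
        "n_regulators": len(all_regulators),
        "degree": {g: len(r) for g, r in regulators_by_gene.items()},
    }

# Placeholder edges only: the gene symbols come from the text, while the
# regulator names are invented labels used purely for illustration.
toy_edges = [
    ("IGFBP3", "TF_A"),
    ("IGFBP3", "TF_B"),
    ("PLOD2", "TF_A"),
    ("GLA", "TF_C"),
]
print(summarize_network(toy_edges))
```

The same edge list can be written out as a two-column table for import into a network visualization tool.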

Fig. 13

Regulatory networks of key genes. (A) mRNA-TF regulatory network of key genes. (B) mRNA-miRNA regulatory network of key genes. (C) mRNA-RBP regulatory network of key genes. (D) mRNA-drug regulatory network of key genes. TF, Transcription Factor; RBP, RNA-Binding Protein. mRNAs are shown in purple, TFs in blue, miRNAs in red, RBPs in yellow, and drugs in green.

Discussion

IDD is a major global health problem associated with severe pain and disability, affecting hundreds of millions of people worldwide51. Existing treatment methods, such as surgery and medication, have limitations and often require a balance between treatment effects and side effects52. Therefore, in-depth research on the molecular mechanisms of IDD, especially gene expression changes in the glycosylation process, is crucial for developing new diagnostic and therapeutic strategies. This study conducted bioinformatic analysis of gene expression in IDD based on datasets downloaded from public databases.

Hypoxia is a typical feature of the microenvironment of nucleus pulposus (NP) tissue, especially when intervertebral disc degeneration occurs. Due to annular rupture, reduced blood vessels, and increased metabolic demands of NP cells, the local partial pressure of oxygen drops sharply53. GSEA showed that gene sets such as Adaptation to Hypoxia (down) and Hypoxia (down) were significantly enriched in the integrated dataset, indicating that the hypoxia adaptation ability of NP tissue is significantly impaired in IDD. GSEA also showed significant enrichment of the Apoptosis By Serum Deprivation pathway (up), suggesting that the apoptotic pathway is activated in IDD, which is highly consistent with the pathological features of NP tissue degeneration. As IDD progresses, the nutrient channels of the intervertebral disc calcify, and NP cells face a lack of nutrients such as glucose and amino acids, thereby initiating the apoptotic program54. Finally, GSEA showed significant enrichment of the Integrated TGF-β EMT pathway (up), suggesting that the TGF-β-induced EMT-related pathway is activated in IDD. When IDD occurs, the TGF-β signaling pathway may drive fibrosis and NP dehydration by inducing an imbalance in the synthesis and degradation of ECM components55.

The diagnostic model constructed in this study (Fig. 7D, E) included 7 glycosylation-related key genes: MAN2B2, IGFBP3, MAN1A1, CHI3L1, PLOD2, RAPGEF5, and GLA. According to the results of the differential expression analysis (Fig. 3D, E), the diagnostic model verification (Fig. 8F), and the glycosylation score analysis (Fig. 10B, C), MAN2B2 and RAPGEF5 showed expression patterns suggestive of a protective association with IDD, whereas IGFBP3, GLA, MAN1A1, PLOD2, and CHI3L1 were correlated with IDD in this dataset. These associations require experimental validation to confirm any functional roles. Among the 7 genes, MAN2B2 had the highest diagnostic efficiency in the IDD diagnostic model. Several of these genes, including MAN2B2, IGFBP3, PLOD2, and CHI3L1, have been confirmed by related studies with high accuracy and specificity. Eva Morava and Xue Zhang56,57 reported two cases of patients with a MAN2B2 mutation defect who showed severe arthritis, malformation, and immune deficiency; after transduction of wild-type MAN2B2, the related symptoms were relieved. Grad et al.58 showed that IGFBP3 can affect the dynamic balance of matrix synthesis and degradation by regulating IGF-1 activity, and that polymorphism of the IGFBP3 gene is closely related to lumbar disc degeneration. Levi59 showed that PLOD2 encodes a collagen lysine hydroxylase that is highly expressed after tissue injury and can regulate extracellular matrix remodeling by affecting collagen fiber crosslinking. Huan Wang60 showed that CHI3L1 secreted by M2a macrophages promoted an imbalance of extracellular matrix metabolism by activating the IL-13Rα2/MAPK pathway, thus promoting IDD. The remaining genes, RAPGEF5, GLA, and MAN1A1, have not yet been reported in relation to IDD; their potential roles remain to be clarified through experimental studies.

Immune cell infiltration in IDD is another highlight of this study. A large number of studies have shown that immune cell infiltration and the inflammatory response are important factors leading to IDD61,62. We used the CIBERSORT algorithm to identify multiple infiltrating immune cell subpopulations in degenerative tissues and analyzed the correlations between key genes and immune cells. We then examined the possible roles of immune cells in IDD (Figs. 9C, 10E and 11F) and found that resting memory CD4 T cells, neutrophils, and memory B cells behaved as protective immune cells, whereas M0 macrophages, CD56dim natural killer cells, CD8 T cells, and MDSCs behaved as pathogenic immune cells. Eosinophils, regulatory T cells, activated dendritic cells, and follicular helper T cells showed conflicting effects on IDD. The results in Fig. 9 indicate that follicular helper T cells have the highest correlation coefficient with activated dendritic cells (r > 0.5, p < 0.001), while activated NK cells and activated mast cells have the strongest negative correlation (r < -0.5, p < 0.001). These closely associated immune cells might reflect a specific immune regulatory network within the microenvironment of IDD. Resting memory CD4 T cells are moderately positively correlated with regulatory T cells (Tregs). This finding suggests that, in the context of IDD, resting T cells may be activated and differentiate into Tregs, thereby participating in the suppression of inflammation. This interplay embodies the dynamic equilibrium of immune regulation. A weak positive correlation exists between eosinophils and mast cells, which could be attributed to their collaborative secretion of anti-inflammatory cytokines such as IL-4 and IL-13. This interaction may represent an anti-inflammatory compensatory mechanism in IDD. This is consistent with the negative correlation between eosinophils and MAN2B2 (a protective gene) observed in “Immune infiltration analysis between high and low glycosylation score groups”, further corroborating the protective role of eosinophils in IDD.

Ming-Xiang Zou63 conducted single-cell RNA sequencing of intervertebral discs in IDD, and the results suggested that neutrophils interfere with nucleus pulposus cells to promote the progression of IDD. In addition, a bioinformatics study64 found that imbalances in neutrophils and γδT cells were significantly associated with IDD progression. In contrast, the present study suggests that the MAN1A1 gene may inhibit IDD by regulating neutrophils. Zhengxu Ye65 suggested that M1 polarization of macrophages could accelerate disc degradation and promote IDD. Our analysis suggests a possible association between M0-to-M1 macrophage polarization and IDD, and genes such as CHI3L1, MAN1A1, and PLOD2 may be linked to this process. However, these hypotheses require experimental confirmation. MDSC expression is elevated in inflammatory and chronic diseases, and this study suggests that the CHI3L1 and PLOD2 genes may promote IDD by regulating MDSCs. Juan Du66 showed that circulating MDSCs were significantly positively correlated with the severity of the clinicopathological stages of LDH. Yang Sun67 showed that eosinophils regulate the polarization of macrophages by secreting cytokines such as IL-4 and IL-13 and have anti-inflammatory effects. It is therefore speculated that eosinophils may play a protective role in IDD through this mechanism, and that MAN2B2, GLA, CHI3L1, and other genes may participate in this process by regulating eosinophils. For other immune cells, such as resting memory CD4 T cells, memory B cells, and CD8 T cells, no studies in IDD were found, and their mechanisms remain unclear. In summary, this study found that a variety of immune cells have different effects on IDD and that some have contradictory effects, which points to the importance and complexity of immune cell infiltration in IDD and the need for further study.

On the other hand, this study used the ssGSEA algorithm to perform clustering analysis on IDD samples based on the infiltration levels of 28 immune cell types and divided the samples into two immune subtypes. By comparing the expression differences of key genes and immune cell infiltration characteristics between the two subtypes, it was found that MAN1A1 and IGFBP3 were highly expressed in subtype 2, and the two subtypes had significant differences in the infiltration levels of 12 immune cell types, such as activated CD4 T cells, activated dendritic cells, and effector memory CD8 T cells. Studies have shown that immune cell infiltration patterns in the IDD process have stage specificity, with macrophage infiltration predominating in the early stage, while T lymphocyte and dendritic cell infiltration increases in the middle and late stages68. The results of the clustering analysis in this study support this view, suggesting that changes in the immune microenvironment play an important role in the formation of IDD heterogeneity, and different immune subtypes may correspond to different stages or severities of the disease. Therefore, immune phenotypes may become a new type of molecular marker for judging the degree of degeneration and guiding the selection of treatment plans.

With regard to the biological functions of the key genes, we found that multiple TFs, such as CEBPB and FOXA1, may be involved in the disease process by regulating the expression of IGFBP3, PLOD2, and other genes. In addition, miRNAs such as miR-19a-3p and miR-96-5p may affect intervertebral disc homeostasis by targeting IGFBP3, PLOD2, and other genes. Studies have found that circARL15 plays a key role in IDD by regulating DISC1 expression through miR-431-5p69. This study further analyzed the regulatory role of key miRNAs in IDD and their association with previous research. A review of the existing literature on the identified miRNAs showed that some of them (such as miR-19a-3p and miR-96-5p) have previously been confirmed to be closely related to the occurrence and progression of IDD. miR-19a-3p can influence the survival of disc cells by regulating apoptosis and inflammatory responses, while miR-96-5p plays a significant role in extracellular matrix metabolism and tissue repair70. The miRNAs predicted in this study are highly consistent with these known functions, which supports the reliability and biological significance of the bioinformatic screening results. However, the reported roles of some miRNAs vary across studies, potentially because of differences in tissue type, sample source, or analysis strategy. Therefore, in this study, we supplemented the regulatory networks and potential target genes associated with IDD for these miRNAs and explored the functions of newly identified miRNAs and their possible involvement in IDD pathogenesis. Overall, the miRNA network identified in this study provides a theoretical foundation and data support for understanding the molecular regulatory mechanisms of IDD and for identifying novel diagnostic and therapeutic targets.

Common environmental pollutants such as bisphenol A and tetrachlorodibenzodioxin can act on multiple key genes. Epidemiological studies have shown that exposure to environmental toxins such as tobacco and dioxins is a risk factor for IDD71. This suggests that exogenous chemical substances may promote the disease process by interfering with the expression of key genes, although the specific mechanisms still require more experimental research. In summary, this study explored the molecular mechanisms of IDD from the perspective of glycosylation abnormalities using bioinformatic methods, which may provide new methods and strategies for improving the prognosis of IDD patients.

Limitations of the study

This study has several limitations. First, although the sample size was expanded by integrating multiple public datasets, the combined dataset included only 21 IDD samples and 13 control samples. This limited sample size may not fully represent the gene expression characteristics of IDD patients, which affects the generalizability and reliability of the results. Although the present study mitigated the risk of overfitting by removing batch effects, reducing gene redundancy, and conducting cross-validation, future studies still need to incorporate external datasets for validation to further enhance the robustness and reliability of the results. Second, due to research constraints, it was not possible to independently collect clinical specimens or conduct related experimental validation, and all analyses were based on data from public databases. Consequently, validation at the protein level and functional experiments for the key genes are lacking, which prevents further elucidation of their specific mechanisms of action. Additionally, the limitations in sample size and experimental design made it impossible to systematically compare the expression of key genes across different degenerative tissue regions (such as the nucleus pulposus and annulus fibrosus). This study primarily focused on gene expression analysis of nucleus pulposus tissue, whereas the common degenerative phenotypes observed in clinical imaging often originate from annulus fibrosus lesions. In future research, we plan to expand the collection of clinical samples, incorporating multicenter and multi-type tissue samples, and to conduct protein-level and functional experimental validation to confirm and extend the conclusions of this study more comprehensively.