Introduction

Alzheimer’s Disease (AD) is the most common cause of dementia, accounting for 60–70% of cases1. The gold standard for AD diagnosis is neuropathologic examination, which identifies hallmark features such as extracellular senile plaques composed of amyloid β (Aβ) and neurofibrillary tangles formed by hyperphosphorylated tau in different brain regions2. Numerous AD risk factors have been identified, including age, genetic risk variants, stress, immune system dysfunction, and infectious diseases3. Building on current knowledge of AD pathogenesis, both academia and the pharmaceutical industry are actively engaged in research and development to achieve a breakthrough in effective AD treatments4. Aβ has been investigated as the primary therapeutic target for many years. The US Food and Drug Administration has approved anti-amyloid antibodies such as aducanumab and lecanemab for AD treatment5,6. Additionally, there is a growing focus on therapies targeting the tau protein7. AD biomarkers serve a confirmatory role in clinical decision-making and are particularly important for advancing disease-modifying therapies targeting Aβ and tau. These biomarkers are also essential for improving both AD diagnosis and prognosis8.

Recent studies have shown that innate immune genes and cells directly or indirectly affect AD. Even without crossing the blood–brain barrier, T cells regulate brain homeostasis through a cascade of immune signals and secretory molecules. A recent study showed that CD8+ T cells are abnormally expanded in the brains of patients with mild cognitive impairment (MCI) and AD, indicating that CD8+ T cells may affect neurodegeneration and cognitive impairment in AD9. CD4+ T cells infiltrate the brain to promote Aβ clearance and neuronal repair10. Removal of B cells can significantly reduce Aβ and reverse memory deficits in a 3xTg AD mouse model11. Blocking the transforming growth factor β (TGF-β)-mediated signaling pathway in peripheral macrophages can reduce Aβ levels in the brain of the Tg2576 mouse model, which may represent a potential treatment for AD12. However, the roles of innate and adaptive immune cells have not been fully clarified, and there is an urgent need to identify new immune-related biomarkers to further explain the neuroinflammation and pathogenesis of AD.

The analysis and prediction of biomarkers for AD can significantly increase our understanding of its pathology and improve the development of new drug targets, clinical trials, and overall diagnosis13. Weighted gene co-expression network analysis (WGCNA) has been used to screen for potential biomarkers for various diseases14,15, as it analyzes the correlation between genes rather than being limited to a single gene16. Machine learning, a branch of computer science and statistics, is instrumental in the detection, diagnosis, and treatment of diseases17. The combination of WGCNA with machine learning considerably improves the accuracy of identifying potential disease biomarkers18. Using this combined approach, small nuclear ribonucleoprotein polypeptide G (SNRPG) has been identified as a critical gene in AD and metabolic syndrome19. Additionally, guidance receptor Plexin-B1 (PLXNB1), GRAM domain containing 3 (GRAMD3), and gap junction protein alpha (GJA) have been identified in different cortex and cerebellum regions associated with the Braak NFT stage in AD20. Similar genomic transcription patterns in different cortex regions, according to the Braak 0-VI phases, may participate in the pathological progression of AD through the oxidation pathway21. Therefore, specific biomarkers have been screened for different research purposes using a combination of these two methodologies.

In this study, we aimed to discover new biomarkers for AD by combining WGCNA with machine learning algorithms. Initially, we identified 109 key differentially expressed genes (DEGs) by combining differential expression analysis of datasets from the GEO database with WGCNA. We then employed a protein–protein interaction (PPI) network and combined machine learning algorithms, together with the GSE109887 dataset, to identify and verify 13 potential biomarkers related to AD. By intersecting immune-related genes selected via the InnateDB database with these 13 potential AD-related genes, we identified UBE2N as an immune-related biomarker. Transcription factor prediction and gene set enrichment analysis (GSEA) were used to explore the biological processes and pathways involving UBE2N in the context of AD. MFUZZ cluster analysis revealed that UBE2N is involved in T cell and B cell functions, as well as synaptic vesicle cycle signaling pathways. Single-cell data analysis of cerebrospinal fluid (CSF) and peripheral blood of AD patients demonstrated a close association between UBE2N and T cell function, with UBE2N levels being downregulated in CD4+ T cells from patients with MCI and AD. Additionally, we observed decreased UBE2N expression levels in TauP301S transgenic mice. Together, these results suggest that UBE2N is a novel biomarker for AD.

Results

Screening of DEGs in the brain of patients with AD

To screen for DEGs in the brains of patients with AD, we first removed batch effects among the datasets and compared principal component analysis (PCA) plots before and after correction, laying the foundation for subsequent analysis (Fig. 1A,B). The merged data included 124 samples from healthy individuals and 157 samples from patients with AD. Using the screening criteria |logFC| ≥ 0.5 and p < 0.05, 86 upregulated and 153 downregulated genes were identified. The expression of the DEGs is shown in a heat map and a volcano plot (Fig. 1C,D). GO analysis revealed that the DEGs were mainly enriched in vesicle-mediated synaptic transport, the synaptic vesicle cycle, and other related pathways (Fig. S1A). KEGG enrichment analysis showed that the DEGs were mainly enriched in the Alzheimer’s disease, cAMP, and MAPK signaling pathways (Fig. S1B).
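The screening rule described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the pipeline used in the study: the record layout `GeneStat`, the function name `screen_degs`, and the toy gene values are assumptions made for the example; only the two cutoffs (|logFC| ≥ 0.5 and p < 0.05) come from the text.

```python
from typing import NamedTuple


class GeneStat(NamedTuple):
    """Differential-expression summary for one gene (hypothetical layout)."""
    gene: str
    log_fc: float   # log fold change, AD vs. control
    p_value: float


def screen_degs(stats, min_abs_log_fc=0.5, max_p=0.05):
    """Split genes into up- and downregulated DEGs using the stated cutoffs."""
    up, down = [], []
    for s in stats:
        if abs(s.log_fc) >= min_abs_log_fc and s.p_value < max_p:
            (up if s.log_fc > 0 else down).append(s.gene)
    return up, down


# Toy values for illustration only; they are not taken from the study.
toy = [
    GeneStat("GENE_A", 0.82, 0.001),
    GeneStat("GENE_B", -0.64, 0.010),
    GeneStat("GENE_C", 0.30, 0.001),   # fails the logFC cutoff
    GeneStat("GENE_D", -0.90, 0.200),  # fails the p-value cutoff
]
up, down = screen_degs(toy)
print("up:", up, "down:", down)
```

Whether the p-value is raw or adjusted is not stated in the text, so the sketch simply applies the cutoff to whatever value is supplied.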

Fig. 1

Data preprocessing for DEG analysis. (A) PCA of the raw GSE13903, GSE5281, and GSE844223 datasets. (B) PCA of the three datasets after batch-effect removal with ComBat. (C) Heat map showing the top 30 upregulated genes and the top 30 downregulated genes ranked by logFC. (D) Volcano plot showing DEGs with |logFC| > 0.5; the significant DEGs are marked. PCA, principal component analysis; DEGs, differentially expressed genes.

Screening characteristic genes of AD by construction of WGCNA

A total of 2,626 genes with expression variance in the top 25% were included in the WGCNA. Soft-threshold selection showed that the average connectivity was high and the scale-free topology fit was optimal at β = 7 (R² = 0.85) (Fig. 2A). Subsequently, we obtained nine independent modules (with the minimum module size set to 30 genes; genes not assigned to any module are shown in gray) (Fig. 2B) by setting the clustering height to 0.25, merging highly correlated modules, and confirming the independence of each module (Fig. 2C). Next, the correlation between the module genes and AD was analyzed. The turquoise module showed the highest correlation: it was positively correlated with the control group (r = 0.43, p = 2e-14) and negatively correlated with the AD group (r = −0.43, p = 2e-14) (Fig. 2D). Genes within the turquoise module were then screened using thresholds of GS ≥ 0.35 and ME ≥ 0.8, yielding 195 genes (Table S1). In total, 109 characteristic genes related to AD were identified by overlapping these 195 genes with the DEGs (Fig. 2E). GO analysis showed that the characteristic genes were mainly enriched in exocytosis, glomerular development, and the synaptic vesicle cycle (Fig. S2A). KEGG analysis showed that the characteristic genes were mainly enriched in the GABAergic synapse, synaptic vesicle cycle, and cAMP signaling pathways (Fig. S2B).
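To make the role of the soft threshold concrete, the following Python sketch raises absolute pairwise correlations to the power β and sums them into a per-gene connectivity. It is a simplified illustration under stated assumptions, not the WGCNA R package itself: an unsigned network is assumed (the text does not say which network type was used), and the function names and toy expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def soft_threshold_adjacency(expr, beta=7):
    """Unsigned adjacency a_ij = |cor(x_i, x_j)| ** beta for every gene pair."""
    genes = list(expr)
    adj = {g: {} for g in genes}
    for i, gi in enumerate(genes):
        for gj in genes[i + 1:]:
            a = abs(pearson(expr[gi], expr[gj])) ** beta
            adj[gi][gj] = a
            adj[gj][gi] = a
    return adj


def connectivity(adj):
    """Connectivity k_i: the sum of a gene's adjacencies to all other genes."""
    return {g: sum(neighbors.values()) for g, neighbors in adj.items()}


# Toy expression profiles (gene -> values across four samples); not study data.
expr = {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [4, 1, 3, 2]}
adj = soft_threshold_adjacency(expr, beta=7)
print(connectivity(adj))
```

Raising correlations to a power such as β = 7 suppresses weak correlations far more than strong ones, which is what pushes the connectivity distribution toward a scale-free shape.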

Fig. 2

Construction of the WGCNA co-expression network. (A) Soft threshold (β = 7) and scale-free topological fit index (R²). (B) Original and combined modules shown using a clustering tree. The cluster dendrogram is the result before module cutting, while the Merged Dynamic is the result after module cutting. (C) Cluster tree plot of the module feature genes. (D) Heat map of module-trait correlations. The values inside the brackets represent the p-values, whereas the values outside the brackets represent the correlations with the Control and AD groups. (E) Venn diagram of key module genes vs. DEGs. The overlapping region represents the 109 genes shared by the DEGs and the WGCNA module. AD, Alzheimer’s Disease; DEGs, differentially expressed genes; WGCNA, weighted gene co-expression network analysis.

Screening for potential biomarkers of AD by machine learning algorithms

A protein–protein interaction (PPI) network was constructed from the 109 potential targets, yielding 48 gene nodes connected by 80 edges (Fig. 3A). First, least absolute shrinkage and selection operator (LASSO) regression selected 19 of the 48 genes as predictors of AD (Fig. 3B). Then, support vector machine recursive feature elimination (SVM-RFE) identified 36 of the 48 genes that together gave the highest accuracy (0.787) and the lowest error rate (0.213) (Fig. 3C). The random forest (RF) algorithm selected 33 AD-related genes from the 48 (Fig. 3D,E). Thirteen genes were selected by all three machine learning algorithms (Fig. 3F). ATP6V1E1, CCKBR, DYNC1I1, NRN1, SV2B, SYT1, TUBB2A, and UBE2N were expressed at low levels in the brains of patients with AD, whereas INPPL1, ITPKB, ITSN1, RAPGEF3, and TBL1X were highly expressed (Fig. S3). The validation set exhibited similar results (Fig. S4).
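The final step, keeping only the genes chosen by every algorithm, amounts to a set intersection. The sketch below shows that step in Python; the function name `shared_features` and the placeholder gene names are hypothetical, since the per-algorithm gene lists are not given in the text.

```python
def shared_features(*selections):
    """Return the features selected by every method, sorted by name."""
    if not selections:
        return []
    common = set(selections[0])
    for selected in selections[1:]:
        common &= set(selected)
    return sorted(common)


# Placeholder names; the real per-method gene lists are not given in the text.
lasso_selected = {"G1", "G2", "G3", "G5"}
svm_rfe_selected = {"G1", "G2", "G4", "G5"}
rf_selected = {"G2", "G3", "G5", "G6"}

print(shared_features(lasso_selected, svm_rfe_selected, rf_selected))
```

Requiring agreement across methods is conservative: a gene survives only if each algorithm, with its own selection bias, independently retains it.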

Fig. 3

Screening for potential biomarkers of AD using machine learning algorithms. (A) PPI network diagram; red and green represent upregulated and downregulated genes, respectively. (B) The least absolute shrinkage and selection operator (LASSO) model was used to select the characteristic genes. (C) The SVM-RFE algorithm selected biomarker feature genes. The red circle at the lowest point in the left image indicates a minimum error rate of 0.213 for 36 genes, whereas the red circle at the highest point in the right image represents a maximum accuracy rate of 0.787 for 36 genes. (D) The random forest algorithm was used to evaluate characteristic genes. (E) The top 20 most important genes identified using the random forest algorithm. (F) Venn diagram of the three algorithms used to screen genes. The overlapping part of the three circles represents the 13 common genes obtained from the three machine learning algorithms. AD, Alzheimer’s Disease; SVM-RFE, support vector machine recursive feature elimination.

Identifying UBE2N as a biomarker for AD

ROC curve analysis showed that the AUC of each of the 13 biomarkers was greater than 0.7 (Fig. 4C). A total of 1,696 immune genes were retrieved from the InnateDB database and intersected with the 13 AD biomarkers, identifying UBE2N as the overlapping gene (Fig. 4A). Biomarker correlation analysis revealed that UBE2N expression positively correlated with seven genes (TUBB2A, SV2B, NRN1, CCKBR, DYNC1I1, ATP6V1E1, and SYT1) and negatively correlated with five genes (INPPL1, ITSN1, ITPKB, RAPGEF3, and TBL1X) (Fig. 4B). A nomogram based on UBE2N was constructed and assessed using calibration curves; the differences between the actual and predicted values were small, indicating that the model was accurate (Fig. 4D,E). In the decision curve analysis (DCA), the model curve was above the grey line, implying that patients could benefit from the model within the threshold range (Fig. 4F). The clinical impact curve also demonstrated a better overall net benefit in the threshold range (Fig. 4G). This suggests that the UBE2N nomogram model constructed in this study can be used to assess AD prognosis.
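For readers unfamiliar with the AUC criterion, the Python sketch below computes it directly from its rank interpretation: the probability that a randomly chosen AD sample receives a higher score than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = AD, 0 = control; scores are hypothetical model outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))
```

For a marker that is lower in AD than in controls, the raw expression value would give an AUC below 0.5; in that case the negated expression (or a fitted model's predicted probability) would be used as the score.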

Fig. 4

Evaluation of UBE2N as a biomarker of AD. (A) Venn diagram of the immune genes and diagnostic markers. (B) Correlation between genes. Red squares show positive correlations and blue squares show negative correlations. (C) ROC curves of the diagnostic markers in the training set. (D) Nomogram used to predict AD incidence. (E) The ROC curve evaluates the clinical application value of the nomogram model. (F) The DCA curve evaluates the clinical application value of the nomogram model. (G) Clinical impact curve: the red curve (number of high-risk individuals) represents the number of individuals classified as positive (high-risk) by the model at each threshold probability; the blue curve (the number of high-risk individuals with the outcome) is the number of true positives at each threshold probability. ROC, Receiver Operating Characteristic; DCA, decision curve analysis.

Analysis of potential regulatory mechanisms of UBE2N

Based on the pooled median expression value, samples from patients with AD were divided into two groups with high and low UBE2N expression. KEGG analysis was performed on the DEGs between the two groups. The results showed that UBE2N might participate in the activation of five pathways: nicotine addiction; endocrine and other factor-regulated calcium reabsorption; the synaptic vesicle cycle; oxidative phosphorylation; and alanine, aspartate, and glutamate metabolism. In addition, UBE2N may be involved in the inhibition of five signaling pathways, including viral protein interaction with cytokine and cytokine receptor, graft rejection, malaria, and Staphylococcus aureus infection (Fig. 5A). GSEA showed that the remaining six genes that were positively correlated with UBE2N were mainly enriched in the synaptic vesicle cycle pathway (Fig. S5). These results indicate that UBE2N may be related to vesicular function. In addition, the potential regulatory network analysis of UBE2N showed that hsa-miR-128-3p, hsa-miR-149-5p, hsa-miR-221-3p, hsa-miR-222-3p, hsa-miR-5010-5p, hsa-miR-522-3p, and hsa-miR-96-5p regulated UBE2N expression, and that the transcription factor ATF1 was also involved in the transcriptional regulation of UBE2N (Fig. 5B).
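The median split used to define the two expression groups can be sketched as follows. This is a minimal Python illustration under stated assumptions: samples whose value equals the median are assigned to the low group (the text does not say how ties were handled), and the function name and toy expression values are hypothetical.

```python
from statistics import median


def split_by_median(expression):
    """Split samples into high/low groups around the median expression value.

    Assumption: samples exactly at the median go to the low group.
    """
    cutoff = median(expression.values())
    high = sorted(s for s, v in expression.items() if v > cutoff)
    low = sorted(s for s, v in expression.items() if v <= cutoff)
    return cutoff, high, low


# Toy expression values per sample; not study data.
ube2n_expression = {"S1": 5.0, "S2": 6.0, "S3": 4.0, "S4": 7.0}
cutoff, high, low = split_by_median(ube2n_expression)
print(cutoff, high, low)
```

The two groups would then be compared with the same differential-expression and enrichment steps applied earlier to AD versus control samples.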

Fig. 5

Analysis of potential regulatory mechanisms of UBE2N. (A) Butterfly diagram showing the top five upregulated and downregulated pathways in the GSEA. (B) Regulatory network of UBE2N. The blue nodes represent microRNAs associated with UBE2N; the transcription factor ATF1 (green triangle) is also associated with UBE2N. GSEA: Gene Set Enrichment Analysis.

Cluster analysis of MFUZZ expression patterns

MFUZZ clustering based on expression patterns yielded 50 clusters. The ssGSEA scores of the clusters were compared between the control and AD groups and correlated with UBE2N expression; cluster 27 was the module most closely related to UBE2N (Fig. 6A-C). Functional enrichment analysis showed that the cluster 27 module genes were mainly enriched in the T cell receptor, B cell receptor, and synaptic vesicle signaling pathways (Fig. 6D). Subsequently, the genes in the cluster 27 module were intersected with the 13 AD biomarkers, and seven genes (ATP6V1E1, CCKBR, DYNC1I1, NRN1, SV2B, SYT1, and TUBB2A) were highly correlated with UBE2N expression (Fig. 6E).

Fig. 6

Cluster analysis of MFUZZ expression patterns. (A) Clustering results for MFUZZ expression patterns. (B) ssGSEA scores and expression characteristics of the clustering modules in the control (green column) and AD (red column) groups. (C) Correlation between clustering modules and UBE2N. Red indicates a positive correlation with the cluster and green indicates a negative correlation with the cluster. The darker the color, the stronger the correlation. (D) KEGG enrichment analysis of genes in cluster 27 (refs. 64,65): the darker the color, the more genes enriched in the pathway, and the longer the column, the smaller the p-value. (E) Core genes co-expressed with UBE2N in AD. Eight core genes were obtained by intersecting cluster 27 with the AD biomarkers. GSEA: Gene Set Enrichment Analysis.

Validating and interpreting machine learning models

To determine the best machine learning model for predicting AD, we used 281 samples for the training set (124 normal and 157 AD samples) and 78 samples for the validation set (32 normal and 46 AD samples). Seven machine learning models, namely eXtreme Gradient Boosting (XGBoost), Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (Logistic), Gaussian Naive Bayes (GNB), and Multilayer Perceptron (MLP), were built to predict the results using the model genes (UBE2N, ATP6V1E1, CCKBR, DYNC1I1, NRN1, SV2B, SYT1, and TUBB2A). By comparing the performance (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and Kappa score) of the models in the training and validation cohorts, we found that the XGBoost model demonstrated strong performance in the test set, achieving an accuracy of 0.69, a sensitivity of 0.96, a negative predictive value of 0.96, and an F1 score of 0.79 (Fig. S7A,B). The XGBoost model had an AUC value of 0.77 and a PR value of 0.81 in the test cohort (Fig. 7A,B). Next, we used SHapley Additive exPlanations (SHAP) values to investigate how the model predicted AD onset and to elucidate the impact of each feature variable on the predictive model. Importance ranking of the feature variables based on the SHAP summary plot of the XGBoost model showed that UBE2N contributed most to the XGBoost model (Fig. 7C). In addition, we employed SHAP dependency analysis to characterize how a single feature variable affects the results of the XGBoost prediction model (Fig. 7D). The SHAP values of the feature variables were positively correlated with the accuracy of the model. The accuracy of the model can be improved by a positive value of UBE2N but reduced by a negative value of ATP6V1E1. The normal and AD samples in the testing cohort were further depicted using SHAP force plots and the Local Interpretable Model-agnostic Explanations (LIME) algorithm. The prediction outcomes for the normal sample (randomly selected from the test set) by the XGBoost model showed a 6% probability of classifying this particular sample as normal and a 94% probability of classifying it as AD (Fig. 7E,G). At the same time, the XGBoost model predicted an AD sample (randomly selected from the test set) with an 88% probability of being AD and a 12% probability of being normal (Fig. 7F,H). These results are consistent with the actual status of the samples. After excluding the feature gene UBE2N from the model, the performance of the XGBoost model declined, with the AUC decreasing to 0.72, a reduction of 0.05 from its initial value. The P-R value also dropped by 0.02 (Fig. S8).
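The performance measures listed above all derive from the four cells of a binary confusion matrix. The Python sketch below shows the standard definitions; the function name `classification_metrics` and the counts in the example are hypothetical and are not the study's confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Chance agreement for Cohen's kappa, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),   # recall for the AD class
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "f1": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=20, tn=30, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity and the negative predictive value both depend on the number of false negatives, while accuracy also depends on false positives, a model can combine high sensitivity with only moderate accuracy when specificity is low.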

Fig. 7

Validating and interpreting machine learning models. (A) Specific values of AUC in the XGBoost model. (B) Specific values of P-R in the XGBoost model. (C) The importance matrix shows the contribution of each feature gene in the XGBoost model. (D) The SHAP summary plot shows the contribution of each feature to the XGBoost model. (E, F) The left side of each panel shows the predicted results of LIME. The ten variables with the greatest impact on normal or AD onset are listed on the right side. The length of each feature bar indicates the importance of the corresponding feature in the prediction. (G, H) Predicted probabilities of normal and AD onset based on SHAP force plots, respectively. Red and blue bars represent increased and decreased likelihood of AD, respectively.

Consensus clustering identifies two AD subclasses related to UBE2N

Utilizing the “ConsensusClusterPlus” software package, we conducted a consensus cluster analysis involving UBE2N and its related genes: tubulin beta 2A class IIa (TUBB2A), synaptic vesicle glycoprotein 2B (SV2B), neuritin 1 (NRN1), cholecystokinin B receptor (CCKBR), dynein cytoplasmic 1 intermediate chain 1 (DYNC1I1), ATPase H+ transporting V1 subunit E1 (ATP6V1E1), and synaptotagmin 1 (SYT1). Following a comprehensive analysis, k = 2 was determined to be the optimal cluster number. Consequently, AD samples were stratified into two distinct subclasses, designated C1 and C2, as illustrated in the cumulative distribution function (CDF) plot (Fig. 8A). Principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (t-SNE) demonstrated significant differences between these two subclasses (Fig. 8B). A differential expression heatmap indicated that the expression levels of UBE2N and the seven related genes were significantly lower in the C1 group than in the C2 group (Fig. 8C). Furthermore, we employed the Gene Set Variation Analysis (GSVA) method to examine the functional implications of differential gene regulation between the subclasses. The results showed that the differentially expressed genes were enriched in several signaling pathways, including apoptosis, the mammalian target of rapamycin (mTOR) signaling pathway, and the T cell/B cell receptor signaling pathways (Fig. 8D).
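The idea behind consensus clustering can be illustrated with a short Python sketch: the samples are clustered repeatedly on random subsamples, and for each pair of samples the consensus value is the fraction of runs containing both samples in which they fall into the same cluster. The sketch assumes the per-run cluster labels are already available; the function name `consensus_matrix`, the input layout, and the toy runs are hypothetical and do not reproduce the ConsensusClusterPlus implementation.

```python
from itertools import combinations


def consensus_matrix(runs, samples):
    """Pairwise consensus from repeated clusterings.

    runs: list of dicts mapping each subsampled sample to its cluster label.
    Returns {(a, b): fraction of runs with both a and b in which they co-cluster}.
    """
    pairs = list(combinations(samples, 2))
    together = dict.fromkeys(pairs, 0)
    sampled = dict.fromkeys(pairs, 0)
    for labels in runs:
        for a, b in pairs:
            if a in labels and b in labels:
                sampled[(a, b)] += 1
                if labels[a] == labels[b]:
                    together[(a, b)] += 1
    return {p: together[p] / sampled[p] for p in pairs if sampled[p]}


# Toy runs; label values are arbitrary, only co-membership within a run matters.
samples = ["S1", "S2", "S3", "S4"]
runs = [
    {"S1": 0, "S2": 0, "S3": 1},
    {"S1": 1, "S2": 1, "S4": 0},
    {"S1": 0, "S2": 0, "S3": 1, "S4": 1},
]
consensus = consensus_matrix(runs, samples)
print(consensus)
```

A clean partition gives consensus values close to 0 or 1, which is why the CDF of these values is inspected across candidate values of k when choosing the number of clusters.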

Fig. 8

Consensus clustering identifying two AD subclasses. (A) Consensus Clustering CDF for k = 2–9, and consensus clustering for k = 2. (B) tSNE, PCA, and UMAP analyses between the two subgroups. (C) Analysis of differences in gene expression between the two subgroups. (D) Pathway activity between two different subgroups based on the ssGSEA algorithm. *p < 0.05; **p < 0.01; ***p < 0.001; ns, no statistical significance.

Immune cell infiltration analysis

To evaluate the infiltration status of immune cells in the control and AD groups, we compared the infiltration levels of 28 types of immune cells. Compared with the control group, the AD group showed a higher infiltration rate of T cells, including natural killer, gamma delta, central memory CD8+, effector memory CD8+, and central memory CD4+ T cells. However, the infiltration rate of effector memory CD4+ T cells was lower in the AD group than in the control group (Fig. 9A). Subsequently, correlations between the eight genes and immune cells were analyzed. UBE2N and DYNC1I1 positively correlated with activated CD4+ T cells, SYT1 negatively correlated with activated CD8+ T cells, and CCKBR negatively correlated with immature dendritic cells (Fig. 9B). UBE2N had the highest correlation with effector memory CD4+ T cells (r = 0.7) (Fig. 9C, D).

Fig. 9

Immune cell infiltration analysis. (A) Comparison of 28 immune cell types; green and orange represent the control and AD groups, respectively. (B) Correlation between characteristic genes and immunity. (C) Correlation between UBE2N and the 28 types of immune cells. (D) Correlation between UBE2N and effector memory CD4+ T cells.

Comprehensive analysis of single-cell transcriptome data from cerebrospinal fluid (CSF) of patients with AD and healthy controls

To further investigate the role of UBE2N in the immune cells of AD, we conducted a comprehensive analysis of 59 CSF samples, which included 45 healthy controls, 8 patients with MCI, and 6 AD patients. Following the filtering of single-cell sequencing data, we obtained 70,391 high-quality cells. Using the highly variable gene-based uniform manifold approximation and projection (UMAP) algorithm of the Seurat package, these cells were distributed into 18 distinct clusters labeled by 23 immune markers. Excluding the indeterminate clusters (13th and 17th clusters), we identified the following 10 cell types for subsequent analysis: B Cells, CD4+ T Cells, CD4+/CD8+ T Cells, CD8+ T Cells, CD14+/CD68hi Monocytes, CD14+/CD68lo Monocytes, dendritic cells (DCs), Natural Killer (NK) Cells, Plasma Cells, and Regulatory T (Treg) Cells (Fig. 10A-C). Quantification of immune cell proportions in each subpopulation revealed that the numbers of CD4+ T Cells, CD4+/CD8+ T Cells, and CD8+ T Cells were lower in MCI and AD patients compared with healthy individuals. Conversely, the number of Treg cells was higher in MCI and AD (Fig. 10D). To validate the accuracy of the classified cell types, we analyzed the top five differentially expressed genes within each subpopulation (Fig. 10E). UBE2N was found to be highly expressed in CD4+ T cells and DCs. Moreover, the expression of UBE2N in CD4+ T cells of MCI and AD patients was significantly reduced compared with healthy controls (Fig. 10F,G).
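Quantifying cell-type proportions per group, as described above, reduces to counting annotated cells within each group and normalizing by the group total. The Python sketch below shows this step; the function name, the `(group, cell_type)` input layout, and the toy cells are assumptions for illustration, not the study's Seurat workflow or data.

```python
from collections import Counter, defaultdict


def cell_type_proportions(cells):
    """Fraction of each cell type within each group.

    cells: iterable of (group, cell_type) pairs, one pair per cell.
    """
    counts = defaultdict(Counter)
    for group, cell_type in cells:
        counts[group][cell_type] += 1
    return {
        group: {ct: n / sum(by_type.values()) for ct, n in by_type.items()}
        for group, by_type in counts.items()
    }


# Toy annotations; not study data.
cells = [
    ("HC", "CD4+ T"), ("HC", "CD4+ T"), ("HC", "CD4+ T"), ("HC", "Treg"),
    ("AD", "CD4+ T"), ("AD", "Treg"),
]
proportions = cell_type_proportions(cells)
print(proportions)
```

Normalizing within each group matters here because the groups differ greatly in size (45 healthy controls versus 8 MCI and 6 AD patients), so raw cell counts would not be comparable.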

Fig. 10

Comprehensive analysis of single-cell transcriptome data from cerebrospinal fluid of AD patients and healthy samples. (A) UMAP projections of 70,391 CSF cells from patients with AD. (B) UMAP projections for different subgroups after the removal of batch effects. (C) Annotation of individual cell clusters using marker genes for different cell types. (D) Proportions of different cell types in each subgroup. (E) Heatmap of the top five DEGs in each cell population. (F) UMAP projections of the UBE2N expression distribution in each cell population. (G) Violin plots of differences in UBE2N expression in each subgroup. *p < 0.05; **p < 0.01; ***p < 0.001; ns, no statistical significance.

Expression levels of UBE2N were decreased in the brain of an AD mouse model

Quantitative real-time PCR (qRT-PCR) results revealed that the gene expression levels of UBE2N, ATP6V1E1, CCKBR, SV2B, and TUBB2A were significantly reduced in the cerebral cortex of TauP301S mice, whereas the expression levels of DYNC1I1, NRN1, and SYT1 remained unchanged compared to the control group (Fig. 11A). Additionally, the protein levels of UBE2N decreased significantly in both the cerebral cortex and hippocampus of TauP301S mice (Fig. 11B,C). However, no significant change in UBE2N expression was observed in the cerebral cortex of APP/PS1 mice (Fig. S6). Immunofluorescence studies demonstrated that UBE2N co-localized with NeuN in the hippocampus and temporal cortex of TauP301S mice. Compared to C57BL/6 mice, TauP301S mice exhibited a marked reduction in UBE2N fluorescence intensity in both the hippocampus and cortex (Fig. 11D,E).
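Relative expression from qRT-PCR is commonly computed with the 2^-ΔΔCt method, normalizing the target gene to a reference gene and then to the control group. The text does not state which quantification method was used, so the Python sketch below assumes the 2^-ΔΔCt method with GAPDH as the reference; the function name and all Ct values are hypothetical.

```python
from statistics import mean


def relative_expression(target_ct, reference_ct,
                        control_target_ct, control_reference_ct):
    """Fold change by the 2^-ΔΔCt method (assumed here, not stated in the text).

    Each argument is a list of Ct values from replicate measurements.
    """
    delta_sample = mean(target_ct) - mean(reference_ct)
    delta_control = mean(control_target_ct) - mean(control_reference_ct)
    return 2 ** -(delta_sample - delta_control)


# Hypothetical Ct values: target gene vs. GAPDH in model and control animals.
fold = relative_expression(
    target_ct=[26.0, 26.0], reference_ct=[18.0, 18.0],
    control_target_ct=[25.0, 25.0], control_reference_ct=[18.0, 18.0],
)
print(fold)
```

In this toy case the target needs one more cycle than in controls after normalization, which corresponds to half the control expression level.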

Fig. 11

Expression of UBE2N in the brains of TauP301S mice. (A) Differential expression of UBE2N and related genes in the cortex of control (n = 8) and TauP301S mouse brains (n = 8) was analyzed by RT-qPCR using GAPDH as a control. (B) UBE2N expression in the temporal cortex of the control (n = 8) and TauP301S mice (n = 8). (C) UBE2N expression in the hippocampi of control (n = 8) and TauP301S mice (n = 8). Positive staining for UBE2N was observed in the cortex (D) and hippocampus (E), and it colocalized with red-labeled NeuN in the cortex and hippocampus in immunofluorescence images. The scale bar for immunohistochemistry images is 50 μm, and the scale bar for cortical and hippocampal immunofluorescence images is 200 μm and 500 μm, respectively. All data from at least three independent experiments are presented as mean ± SEM. * p < 0.05, ** p < 0.01 with respect to the control group.

Discussion

AD is a neurodegenerative disease that impairs cognitive function, mainly affecting brain regions related to learning and memory, such as the temporal lobe and hippocampus22. In the current study, we performed a comprehensive and in-depth analysis of gene expression profiles in the temporal lobe to identify AD-specific genes, resulting in the identification of 13 hub genes. Among them, UBE2N was validated in the cerebral cortex and hippocampus of TauP301S mice and emerged as the most robust biomarker for AD in our study.

By analyzing microarray data from the temporal lobes of AD patients in the GEO database, we identified 239 DEGs, consisting of 86 upregulated and 153 downregulated genes. Overlapping the key module genes obtained by WGCNA with the DEGs yielded 109 key DEGs. These DEGs were mainly enriched in the GABAergic synapse, B-cell receptor, and synaptic vesicle cycle signaling pathways, all of which are involved in critical pathological changes in the pathogenesis of AD23. Subsequently, machine learning algorithms were used to identify 13 hub genes. Cross-referencing immune genes in the InnateDB database revealed the UBE2N gene. Furthermore, gene correlation analysis demonstrated that ATP6V1E1, CCKBR, SV2B, DYNC1I1, NRN1, SYT1, and TUBB2A were positively correlated with UBE2N expression. The areas under the ROC curves (AUCs) were all greater than 0.7, indicating that the constructed model could accurately predict the onset of AD24.

UBE2N plays an important role in several neurodegenerative diseases. Overexpression of UBE2N increases the aggregation of mutant huntingtin protein25. In contrast, knockdown of the E2 enzymes UBE2N, UBE2L3, UBE2D2, and UBE2D3 (UBE2D2/3) significantly reduces the autophagic clearance of depolarized mitochondria; furthermore, UBE2N, UBE2L3, and UBE2D2/3 synergistically promote Parkin-mediated mitochondrial autophagy26. Single-cell analysis revealed reduced expression of UBE2N in cerebrospinal fluid CD4+ T cells from patients with AD. Notably, the abnormal downregulation of UBE2N impairs the immunosuppressive function of regulatory T cells in vivo, leading to abnormal activation of T cells and induction of various inflammatory responses27. Therefore, we analyzed the infiltration of 28 immune cell types and found that the infiltration rate of T cells was significantly higher in AD brains, suggesting that the balance of T cells may be dysregulated in AD. One reason could be the decrease in tight junction molecules in the vascular endothelium during AD progression, which increases the permeability of the blood–brain barrier. Together with the increase in T cell chemokines in the brains of patients with AD, these changes promote T cell infiltration28,29. The accumulation of activated T cells has been demonstrated to induce neuronal death and exacerbate neuroinflammation30. Activated T cells can also promote the release of the pro-inflammatory factors TNF-α, IL-1, and IL-6 from peripheral blood mononuclear cells (PBMCs), exacerbating the inflammatory response31. In addition, a large number of CD8+ T cells are found in the hippocampus of AD patients, and tau-specific CD4+ T cells are widely distributed in the peripheral blood of AD patients, suggesting that T cells may be closely associated with AD progression, particularly in terms of tau pathology32,33. Therefore, we speculate that UBE2N may affect AD pathology by regulating T cells, which needs to be confirmed by further experiments.

Furthermore, MFUZZ cluster analysis showed that cluster 27 had the highest correlation with UBE2N, and the functions of its module genes were mainly related to immunity and synapses, as expected. Notably, cluster 27 overlapped with 8 of the 13 identified hub genes (UBE2N, ATP6V1E1, CCKBR, DYNC1I1, NRN1, SV2B, SYT1, and TUBB2A), 7 of which were positively associated with UBE2N. Our qRT-PCR and Western blot results showed that UBE2N expression was significantly reduced in the cerebral cortex of TauP301S mice, but not in APP/PS1 mice, indicating that UBE2N may play a specific role in certain models of AD, particularly in tau-related pathways in AD pathogenesis.

In addition to UBE2N, we found that the mRNA levels of ATP6V1E1, CCKBR, SV2B, and TUBB2A were decreased in the TauP301S cortex. ATP6V1E1 encodes a subunit of the vacuolar-type ATPase (V-ATPase), a large multi-subunit complex divided into a peripheral domain (V1) and a proton-translocating transmembrane domain (V0); it is upregulated in early AD and downregulated in late-stage AD34. As part of this proton pump, ATP6V1E1 mediates the acidification of endosomes, lysosomes, the Golgi, and synaptic vesicles35. Its dysfunction, therefore, disrupts pH homeostasis, affecting organelle acidification and, consequently, contributing to AD. ATP6V1E1 is reduced not only in the brain but also in the peripheral blood of patients with AD, suggesting that ATP6V1E1 may play an important role in the diagnosis and treatment of AD36. Cholecystokinin (CCK) mediates its action through two G-protein-coupled receptors, CCKAR and CCKBR. Its absence leads to abnormalities in the development of the cerebral cortex and corpus callosum and further affects the migration of cortical interneurons37. Synthetic CCK analogs can effectively reduce Aβ load in the brain and normalize the levels of protein kinase A (PKA), cAMP response element binding protein (CREB), brain-derived neurotrophic factor (BDNF), and TrkB receptors, thereby improving cognition in APP/PS1 mice38. Synaptic vesicle glycoprotein 2B (SV2B) is a synaptic protein involved in APP/Aβ metabolism39. There is evidence that SV2B knockout protects against Aβ-induced memory deficits and ameliorates cholinergic system dysfunction caused by Aβ injection. However, another study found that Aβ levels were significantly higher in the hippocampus of SV2B knockout mice than in that of WT mice40. Here, we observed downregulation of SV2B in TauP301S mice, suggesting that SV2B may also be related to tau pathology. TUBB2A is a microtubule protein. Tau from AD brains increased endogenous tau in cortical neurons; furthermore, transcriptome sequencing results showed that TUBB2A is abundant in neurons41. Here, we demonstrated for the first time that TUBB2A is reduced in the cerebral cortex of TauP301S mice, indicating that TUBB2A might play a role in the tau-related pathway in AD.

Our study integrates four AD datasets from the GEO database, which enhances the robustness and generalizability of the findings. However, these existing datasets may carry inherent biases or limitations in sample diversity and size, so our findings need to be validated in larger and more diverse cohorts. In future studies, incorporating heterogeneous biological data could enhance prediction accuracy. Owing to advances in high-throughput techniques, numerous heterogeneous biological datasets have emerged, and they have the potential to provide valuable insights into the mechanisms underlying AD onset. Moreover, the current analysis is based on cross-sectional data, which limits the ability to infer causal relationships. Longitudinal studies are needed to establish the temporal dynamics of UBE2N expression during AD progression. Machine learning models have powerful predictive capabilities; however, they are prone to overfitting on small datasets and are sensitive to noise and outliers, which may degrade performance.

Among the various machine learning techniques, matrix factorization (MF), as used in recommender systems, has received significant attention for predicting potential disease biomarkers. Previous studies predicted miRNAs as disease biomarkers using related computational models, such as IMIPMF and SMAP42,43,44,45,46. Further efforts to integrate these methods into the discovery of novel AD biomarkers could therefore improve prediction accuracy. While our study links UBE2N to certain pathways, the precise functional mechanisms through which UBE2N influences AD pathology, such as its role in CD4+ T cell function in AD, remain to be fully elucidated.

In conclusion, we identified an immune-related gene, UBE2N, as a biomarker for AD by combining WGCNA with machine learning approaches. UBE2N may play a role in synaptic vesicle function, neuroinflammation, and T-cell infiltration, thereby influencing AD pathology, particularly tau-related pathology. Our findings may not only improve the understanding of disease pathogenesis but also contribute to the treatment and detection of AD.

Materials and methods

Data acquisition and pre-processing

Four microarray datasets (GSE5281, GSE84422, GSE132903, and GSE109887) related to AD were obtained from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). The “sva” package in R (4.2.1) was used to remove batch effects among the datasets. Table 1 lists these datasets47. Differential gene expression was analyzed with the “limma” package using the screening criteria “adjusted p < 0.05 and |logFC| ≥ 0.5”48. Volcano plots and heatmaps were created with the “ggplot2” and “pheatmap” packages, respectively.
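
The screening criteria above can be sketched in a few lines of Python. This is a minimal illustration rather than the limma workflow itself: it assumes that the adjusted p-values are Benjamini-Hochberg corrected (limma's default adjustment), and the gene names and numbers are hypothetical toy values, not results from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log_fc, p_values, alpha=0.05, min_abs_lfc=0.5):
    """Keep genes with adjusted p < alpha and |logFC| >= min_abs_lfc."""
    adj = benjamini_hochberg(p_values)
    return [g for g, lfc, q in zip(genes, log_fc, adj)
            if q < alpha and abs(lfc) >= min_abs_lfc]


# Hypothetical toy values, not results from the study.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
log_fc = [-0.8, 0.3, 0.6, -0.55]
p_values = [0.001, 0.002, 0.04, 0.5]

adjusted = benjamini_hochberg(p_values)
degs = select_degs(genes, log_fc, p_values)
print(degs)
```

In this toy input, GENE_C has a raw p-value below 0.05 but no longer passes after adjustment, and GENE_B is significant but falls short of the fold-change cutoff, so both filters matter.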

Table 1 Source and sample size of data used in the article.

Weighted gene co-expression network analysis (WGCNA) to screen target genes

WGCNA was performed to identify co-expression modules using the R package “WGCNA” (version 1.72.1). The top 25% of genes with the highest variance were used for the WGCNA. As a quality check, the data were screened for missing values and the samples were clustered. The soft-threshold power (β) was calculated to construct a biologically meaningful scale-free topological network. A topological overlap matrix was then constructed from the adjacency matrix, modules were identified with a dynamic tree-cutting algorithm, and similar modules were merged. Gene significance (GS), module membership (MM), and correlation coefficients between the gene modules and clinical features were calculated to visualize the characteristic gene network. Finally, the potential gene targets for Alzheimer’s disease were obtained by intersecting the DEGs with the genes within the significant gene module14.
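
The core quantities in this step can be written out on toy data. The sketch below is not the WGCNA package: it assumes an unsigned network with adjacency a_ij = |cor(i, j)|^β and the standard topological overlap formula, and the power β = 6 and all expression values are hypothetical placeholders.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def top_variance_genes(expr, fraction=0.25):
    """expr maps gene -> expression values across samples."""
    ranked = sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]


def adjacency(expr, genes, beta):
    """Unsigned soft-threshold adjacency a_ij = |cor(i, j)| ** beta, zero diagonal."""
    n = len(genes)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = a[j][i] = abs(pearson(expr[genes[i]], expr[genes[j]])) ** beta
    return a


def topological_overlap(a):
    """TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), with l_ij = sum_u a_iu * a_uj."""
    n = len(a)
    k = [sum(row) for row in a]  # connectivity of each gene
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(a[i][u] * a[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = tom[j][i] = (shared + a[i][j]) / (min(k[i], k[j]) + 1 - a[i][j])
    return tom


# Hypothetical toy matrix: 3 variable genes and 9 nearly flat genes, 4 samples.
expr = {
    "g1": [1.0, 5.0, 2.0, 8.0],
    "g2": [2.0, 9.0, 3.0, 7.0],
    "g3": [9.0, 1.0, 6.0, 2.0],
}
expr.update({f"flat{i}": [5.0, 5.0 + 0.01 * i, 5.0, 5.0] for i in range(1, 10)})

selected = top_variance_genes(expr, fraction=0.25)  # 12 genes -> 3 retained
adj = adjacency(expr, selected, beta=6)             # beta = 6 is an assumed value
tom = topological_overlap(adj)
print(selected)
```

In the WGCNA workflow, one minus the topological overlap is then used as the distance for hierarchical clustering and tree cutting, which this sketch omits.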

Protein–protein interaction (PPI) networks construction

The STRING database (https://string-db.org/) was used to construct PPI networks for the potential AD targets with a confidence threshold of 0.7, and the network was visualized with Cytoscape (3.8.2). Genes in this network were taken as candidate biomarker genes in the pathological process of AD for subsequent screening.

Screening of AD markers by the machine learning algorithm

Machine learning algorithms were used to analyze the central genes in the PPI network and identify characteristic markers of AD. First, Lasso regression analysis was conducted using the “glmnet” (4.1.6) R package for variable selection. This method introduces an L1 regularization term to reduce the number of model variables while maintaining predictive accuracy. Cross-validation was performed to determine the optimal value of the regularization parameter lambda, yielding 19 important genes49. Next, Support Vector Machine Recursive Feature Elimination (SVM-RFE) analysis was carried out using the R package “e1071” (version 1.7.13). It was configured with a linear kernel function, and the penalty parameter C was adjusted to optimize the trade-off between the bias and variance of the model. This analysis identified 34 genes with significant discriminatory power, representing the most informative features in our model50. Finally, Random Forest (RF) analysis was performed using the “randomForest” (version 4.7-1.1) R package. The model’s performance was optimized by adjusting the number of trees and other relevant parameters. Genes were evaluated based on their importance scores, and only those with scores greater than two were retained, as they contributed substantially to the predictive ability of our model51. The genes identified by all three methods were considered potential biomarkers of AD.
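
To make the effect of the L1 term concrete, the following sketch fits a Lasso by cyclic coordinate descent on toy data and then intersects three gene sets. It is an illustration under stated assumptions, not the glmnet, e1071, or randomForest code used here: it uses a squared-error loss for simplicity (a classifier would use a penalized logistic loss), and the data, the value lam = 0.5, and the gene names are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n) * ||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that excludes feature j's current contribution.
            resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                     for i in range(n)]
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
    return beta


# Hypothetical toy data: y depends strongly on feature 0 and only weakly on feature 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)  # the weak feature is dropped: its coefficient is exactly zero

# Hypothetical selections from the three methods; the final candidates are their overlap.
lasso_genes = {"GENE_A", "GENE_B", "GENE_C"}
svm_rfe_genes = {"GENE_A", "GENE_C", "GENE_D"}
rf_genes = {"GENE_A", "GENE_C", "GENE_E"}
candidates = sorted(lasso_genes & svm_rfe_genes & rf_genes)
print(candidates)
```

Requiring a gene to be selected by all three methods is deliberately conservative: each method has different failure modes, so the intersection favors features that are robust to the choice of algorithm.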

Building and interpreting machine learning models

Based on the feature genes, we used the Scikit-learn (version 1.3.2) Python package to construct seven machine learning models: eXtreme Gradient Boosting (XGBoost), RF, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (Logistic), Gaussian Naive Bayes (GNB), and Multilayer Perceptron (MLP). XGBoost is an ensemble learning method built on the gradient boosting framework, known for its proficiency in handling large datasets and for its speed and performance52. The RF algorithm is an ensemble learning approach that improves prediction performance by constructing multiple decision trees and aggregating them by voting or averaging. Its advantages include robustness against noise, effective feature selection, insensitivity to missing values, and efficient training53. KNN is an instance-based learning method that makes predictions by identifying the nearest neighbors among samples. Its strengths include simplicity, intuitive interpretation, and the absence of an explicit training phase54. SVM is a robust classifier that discriminates between classes by identifying the optimal separating hyperplane. It handles nonlinear problems efficiently through the use of kernel functions55. Logistic is a linear model used for binary classification, in which the output of a linear function is mapped to the interval [0, 1] via a sigmoid function. Its advantages include simplicity and interpretability. GNB is a classifier grounded in Bayes’ theorem that assumes independence among features. This simplifying assumption of conditional independence makes it computationally efficient56. MLP is a feed-forward neural network that processes information by passing it through multiple layers of interconnected neurons. It is notable for its capacity for nonlinear modeling and its inherent flexibility57.
Subsequently, we employed Shapley Additive exPlanation (SHAP) values to analyze the importance of individual features and demonstrate their impact on the final machine learning model. Additionally, we used the Local Interpretable Model-agnostic Explanations (LIME) algorithm to interpret the predictions of the optimal machine learning model58.
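
The idea behind SHAP values can be illustrated with an exact Shapley computation on a tiny model. This is a sketch of the underlying definition, not the SHAP library, which relies on efficient approximations. The scoring function, the input, and the all-zero baseline that stands in for "absent" features are hypothetical assumptions.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)  # start with every feature "absent"
        prev = predict(current)
        for i in order:
            current[i] = x[i]     # switch feature i to its observed value
            new = predict(current)
            phi[i] += new - prev
            prev = new
    return [p / len(orders) for p in phi]


def toy_model(features):
    """Hypothetical score with two linear terms and one interaction term."""
    a, b, c = features
    return 0.5 * a - 0.2 * b + 0.1 * a * c


x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for x and the prediction for the baseline, and the interaction term is split equally between the two features involved. Exact enumeration grows factorially with the number of features, which is why practical tools approximate it.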

Receiver operating characteristic (ROC) curve analysis

The “corrplot” (0.92) R package was used to analyze the correlations among the AD biomarkers screened through machine learning. We used the “pROC” (1.18.0) R package to create ROC curves and calculated the area under the curve (AUC) to evaluate the clinical diagnostic value of the biomarkers59.
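
The AUC has a simple probabilistic reading that a short sketch can make explicit: it is the probability that a randomly chosen case receives a higher score than a randomly chosen control. The function below is an illustration, not the pROC implementation; it assumes that higher scores indicate disease, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [s for lab, s in zip(labels, scores) if lab == 1]
    controls = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: 1 = AD, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

For a marker that is lower in disease, the scores would be negated (or the direction reversed) before applying the same calculation.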

Diagnostic nomogram construction and validation

The immune gene dataset, comprising 1696 immune genes, was obtained from the InnateDB database (http://www.innatedb.com)60. These immune genes were intersected with the identified potential biomarkers of AD to screen for immunity-related biomarkers. We used the “rms” (6.5.0) R package to construct a nomogram model to predict the risk of AD61. Calibration curves were used to evaluate the accuracy of the nomogram model, and decision curves and clinical impact curves were used to evaluate its clinical utility62.

Enrichment and regulatory mechanism analysis of UBE2N

We performed GSEA on the selected immune biomarker UBE2N and used the Enrichr database (https://maayanlab.cloud/Enrichr/) to analyze the transcription factors (TFs) of UBE2N. The miRTarBase (https://mirtarbase.cuhk.edu.cn/), Starbase (https://starbase.sysu.edu.cn/), and TargetScan (www.targetscan.org) databases were used to predict the miRNAs that regulate UBE2N translation. The regulatory network diagram of UBE2N was then constructed with Cytoscape (3.8.2).

Animals

TauP301S transgenic mice [B6C3-Tg (Prnp-MAPT*P301S) PS19 Vle/J] and APP/PS1 mice were originally purchased from the Jackson Laboratory (Bar Harbor, ME, United States), and C57BL/6 mice were obtained from Beijing HuaFuKang Bioscience Co., Ltd. (Beijing, China) for breeding. Among the offspring, TauP301S transgenic and wild-type mice born in the same month were identified by genotyping. The mice (n = 8 in each group) were housed under a 12-h light/dark cycle (8:00/20:00) with controlled temperature (24 ± 2 °C) and humidity (40–70%) for 9 months and were then sacrificed under isoflurane anesthesia. All authors complied with the ARRIVE guidelines. All treatments and experimental procedures were performed in accordance with the National Institutes of Health guidelines and were approved by the Northeastern University Laboratory Animal Ethical Committee (EC-2023A012).

Immunohistochemistry

The mice in the two experimental groups (n = 8 in each group) were anesthetized, and half the brains were removed, fixed using 4% paraformaldehyde, and embedded in paraffin. Serial 5-µm coronal sections were incubated with blocking solution (5% bovine serum albumin and 1% normal goat serum) for 1 h and then incubated overnight at 4 °C with rabbit anti-UBE2N (1:200, Abcam). The following day, the sections were incubated with biotinylated goat anti-rabbit IgG (1:500) for 1 h at room temperature (RT), followed by incubation with the avidin–biotin-peroxidase complex (1:100) for 30 min at RT. After washing with phosphate-buffered saline, the sections were immersed in 3,3′-diaminobenzidine for development. One section was incubated with normal rabbit serum (1:100) for nonspecific staining and served as the negative control. Images of immunohistochemical staining were captured using a light microscope (DM4000B; Leica, Wetzlar, Germany).

Immunofluorescence staining and confocal laser scanning microscopy

The 5-μm coronal sections were preincubated with blocking buffer for 1 h and then incubated with rabbit anti-UBE2N (1:200, Abcam) and mouse monoclonal anti-NeuN (1:200, Thermo Fisher Scientific) antibodies overnight at 4 °C. Alexa Fluor® 488- and Alexa Fluor® 594-conjugated secondary antibodies were mixed and applied to the sections for 2 h, and the sections were finally stained with DAPI (1:500). After mounting with an anti-fade mounting medium, the sections were imaged using a laser scanning confocal microscope (Leica, TCS, SP8, Wetzlar, Germany).

Consensus clustering and principal component analysis

Consensus clustering (CC) analysis was carried out using the R package “ConsensusClusterPlus” (version 1.60.0)63. The algorithm was configured to use partitioning around medoids (“pam”) clustering with a Spearman correlation distance metric, resampling 80% of the samples over a total of 1000 iterations.

scRNA-seq data preprocessing and analysis

The human AD scRNA-seq dataset was obtained from the GEO database (GSE200164). The raw gene expression data were converted to Seurat objects using the Seurat package (v4.3.0.1) in R (v4.3.1), retaining cells with less than 10% mitochondrial gene content. These Seurat objects were subsequently converted to h5ad objects using the “sceasy” package (v0.0.7), and samples were batch-corrected and integrated using scvi (v0.6.8). Clustering was performed using the Leiden algorithm with the resolution parameter set to 0.8. After these steps, the h5ad object was converted back to a Seurat object. The highly expressed genes in each cell subpopulation were then analyzed in R using the “FindAllMarkers” function with default parameters.

Western blot

The temporal cortex and hippocampus from the half-brains of C57BL/6 and TauP301S mice were lysed in RIPA buffer to extract the protein supernatant. Proteins (10 μg) were separated by 4–12% SDS-PAGE and transferred to polyvinylidene fluoride membranes (Millipore, Burlington, MA, USA). The membranes were blocked in 5% BSA solution at room temperature for 1 h. Subsequently, the membranes were incubated overnight at 4 °C with rabbit anti-UBE2N (1:2000, Abcam) and mouse anti-GAPDH (1:10000, A1978, Sigma, Burlington, MA, USA) antibodies. Finally, the membranes were washed and incubated with horseradish peroxidase (HRP)-conjugated secondary antibodies for 2 h. Bands were detected using a chemiluminescence imaging analysis system (Tanon 5500, Shanghai, China) and enhanced chemiluminescence (ECL) kits (EMD Millipore, Burlington, MA, USA). Each experiment was repeated at least three times.

Quantitative reverse transcription polymerase chain reaction (qRT-PCR)

Total RNA was extracted from the cortex of C57BL/6 and TauP301S mice using Total RNA Kit I (R6834-02, OMEGA, USA), and 500 ng of template RNA was reverse transcribed into cDNA using the GoScript™ Reverse Transcription System (Promega, A5001) according to the manufacturer’s instructions. PCR reactions were performed with 20 ng of cDNA template in a 10-μL reaction mixture using the Bio-Rad CFX PCR system. The sequences of GAPDH and the selected differentially expressed genes were obtained from the GenBank database, and specific primers were designed using Primer Premier 5.0 (Table 2). Relative mRNA expression was calculated using the 2^(−ΔΔCT) method.
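
As a worked illustration of the 2^(−ΔΔCT) calculation, the sketch below normalizes a target gene to a reference gene (GAPDH in this study) and then to the control group. The CT values are hypothetical and the function name is ours.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of the target gene in the sample relative to the control."""
    delta_sample = ct_target_sample - ct_ref_sample      # dCT in the sample
    delta_control = ct_target_control - ct_ref_control   # dCT in the control
    delta_delta = delta_sample - delta_control           # ddCT
    return 2 ** -delta_delta


# Hypothetical CT values: the target amplifies one cycle later in the transgenic sample.
fold_change = relative_expression(
    ct_target_sample=26.0, ct_ref_sample=18.0,
    ct_target_control=25.0, ct_ref_control=18.0,
)
print(fold_change)
```

A ΔΔCT of 1 corresponds to a halving of expression, since each PCR cycle ideally doubles the product; the method therefore assumes roughly 100% amplification efficiency for both primer pairs.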

Table 2 Primer sequences for qRT-PCR.

Statistical analysis

Data are presented as the mean ± SEM. Student’s t-test was used to analyze differences between groups. Analyses were performed using ImageJ and GraphPad Prism 9.0. Statistical significance was set at P < 0.05.
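
The summary statistics named above can be reproduced by hand, as in the following sketch. It assumes an unpaired two-sample Student's t-test with pooled variance, and the expression values are hypothetical. Converting the statistic to a P value requires the t distribution (for example via `scipy.stats.ttest_ind`), which is omitted to keep the example free of dependencies.

```python
from math import sqrt
from statistics import mean, stdev


def mean_sem(values):
    """Return the mean and the standard error of the mean."""
    return mean(values), stdev(values) / sqrt(len(values))


def student_t(a, b):
    """Unpaired Student's t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical relative expression values for two groups.
wild_type = [1.0, 1.2, 0.8, 1.0]
transgenic = [0.5, 0.7, 0.6, 0.6]

wt_mean, wt_sem = mean_sem(wild_type)
t_stat, df = student_t(wild_type, transgenic)
print(f"WT: {wt_mean:.2f} ± {wt_sem:.3f}; t = {t_stat:.3f}, df = {df}")
```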