Introduction

Alzheimer’s disease (AD) is a progressive neurodegenerative disorder influenced by an interplay of genetic and epigenetic factors, alongside gene-environment interactions that potentially contribute to its onset1. It is the most prevalent neurodegenerative disorder worldwide. Clinically, AD manifests as a profound impairment of executive and cognitive functions2. Pathologically, disease progression is characterized by escalating hippocampal and cortical atrophy, observable through neuroimaging and visual examination. This is accompanied by intracellular neurofibrillary tangles (NFTs) composed of hyperphosphorylated tau and by extracellular deposition of amyloid-beta (Aβ) 1–42 peptides, leading to neuronal and synaptic loss as well as reactive glial hyperplasia1,3,4,5. As the population ages, AD has become the predominant cause of dementia, accounting for 50–75% of cases, with its incidence doubling approximately every 5 years after the age of 65 (ref. 6). The escalating prevalence of AD, coupled with its growing social and economic burden, has made it a significant societal challenge that threatens the health of both urban and rural residents in China7. A recent national cross-sectional study in China reported 15.07 million dementia patients aged 60 and above, of whom 9.83 million have AD7. Given that most FDA-approved drugs exhibit optimal efficacy in the early or middle stages of AD, there is a pressing need to explore the risks associated with the early onset of this debilitating condition.

Modern high-throughput sequencing technologies have enabled the generation of vast datasets, providing a robust means of investigating the etiology of AD. Molecular genetic investigations have identified key causative genes in AD, including amyloid precursor protein (APP)8,9, presenilin 1 (PSEN1)9,10, and presenilin 2 (PSEN2)9,11. Several risk factors influencing AD development have also been identified, such as smoking12, stress13, depression14, and insufficient sleep15. Furthermore, extensive family studies have pinpointed a robust risk gene, the E4 allele of Apolipoprotein E (ApoE), which significantly heightens the risk of AD across diverse populations16,17,18. Additionally, the triggering receptor expressed on myeloid cells 2 (TREM2) gene has emerged as another noteworthy contributor, raising the risk of developing AD roughly 2.9-fold (refs. 19,20).

The upregulation or downregulation of genes can instigate changes in metabolic, immune, and other physiological processes, contributing to the onset of diseases21,22. Therefore, the identification of differentially expressed genes (DEGs) is a crucial avenue for unraveling altered biological pathways in various diseases, including neurological disorders and cancers23. Given that the selective vulnerability of specific brain regions plays a pivotal role in neurodegenerative disorders, and AD is characterized by profound neuronal damage in the hippocampus situated in the medial temporal lobe24, exploring DEGs becomes particularly pertinent. While the occurrence of AD is known to increase with age, we contend that abnormal transcriptional alterations may also contribute to disease-related mechanisms25,26. In essence, it is imperative to investigate whether these DEGs in the hippocampus may exert an influence on the onset of AD.

Consistent with our previous research, the integration of uniform data from multiple studies has been shown to increase the statistical power of genetic analysis27. In this study, we performed a thorough search and retrieved three transcriptomic datasets of the hippocampus from the Gene Expression Omnibus. By leveraging integrated data from matched baselines between AD patients and controls, we conducted a meticulous analysis of DEGs to precisely investigate the pivotal role of gene transcription in the pathogenesis of AD. The objective of this investigation is to uncover novel risk genes associated with the pathogenesis of AD.

Materials and methods

Data source and processing

We searched the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo) using the keywords “Alzheimer” and “hippocampus” and identified three microarray datasets generated on the GPL570 platform: GSE48350, GSE36980, and GSE5281 (Table S1). The raw “CEL” files for each dataset were obtained from GEO and processed and normalized using the R package affy (version 1.68.0). The datasets were then integrated, with batch effects mitigated using the R package sva (version 3.38.0). The merged dataset encompassed 37 AD samples and 66 control (CTR) samples. We retrieved all the transcriptomic data related to the hippocampus of AD patients from the GEO database. Although the sample size is limited, these samples are invaluable because they were derived from the hippocampus of actual patients.

The hippocampus, a crucial region of the brain, is responsible for memory processing and spatial orientation. AD patients often exhibit hippocampal damage and functional decline. During the progression of AD, the hippocampus undergoes a range of pathological changes, including neuronal death, neurofibrillary tangles, and amyloid protein deposition. These alterations lead to hippocampal atrophy and loss of function, which in turn impair the patient’s memory and cognitive abilities. Hence, there is a close association between AD and the hippocampus.

The procedural details are visually represented in Figure S1, elucidating the workflow undertaken in our analysis.

Identification of DEGs and enrichment analysis

When multiple probes mapped to the same gene, the gene’s expression value was calculated as the average across those probes. Conversely, when a probe mapped to multiple genes, each of those genes was assigned the expression value of the probe. DEGs between AD and CTR samples were identified using the R package limma (version 3.54.2). We set thresholds at an adjusted P value less than 0.05 and an absolute log2 fold change (log2FC) greater than 0.585, which corresponds to a fold change of about 1.5. Subsequently, down-regulated and up-regulated DEGs were separately subjected to Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses28,29,30 using the R package clusterProfiler (version 4.6.2).

Similarly, DEGs between Braak stages (i.e., III_IV vs V_VI) were computed using the R package limma (version 3.54.2) and screened based on thresholds of a P value less than 0.05 and an absolute log2FC greater than 0.585.
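The probe-to-gene mapping rule and the DEG thresholds described above can be illustrated with a minimal Python sketch. It is not the limma workflow used in this study: the probe identifiers, gene names, and values are hypothetical, and the function names `collapse_probes` and `classify_deg` are introduced only for illustration.

```python
from collections import defaultdict
from statistics import mean

LOG2FC_CUTOFF = 0.585  # threshold from the text; 2**0.585 is about 1.5
ADJ_P_CUTOFF = 0.05    # adjusted P value threshold from the text


def collapse_probes(probe_values, probe_to_genes):
    """Average probes that map to the same gene.

    A probe that maps to several genes contributes its value to each of them.
    """
    per_gene = defaultdict(list)
    for probe, value in probe_values.items():
        for gene in probe_to_genes.get(probe, []):
            per_gene[gene].append(value)
    return {gene: mean(values) for gene, values in per_gene.items()}


def classify_deg(log2fc, adj_p):
    """Label a gene as 'up', 'down', or 'ns' (not significant)."""
    if adj_p < ADJ_P_CUTOFF and abs(log2fc) > LOG2FC_CUTOFF:
        return "up" if log2fc > 0 else "down"
    return "ns"


# Hypothetical probes: p1 and p2 map to GENE_A, p3 maps to two genes.
probe_values = {"p1": 8.0, "p2": 10.0, "p3": 6.5}
probe_to_genes = {"p1": ["GENE_A"], "p2": ["GENE_A"], "p3": ["GENE_B", "GENE_C"]}
gene_values = collapse_probes(probe_values, probe_to_genes)
```

In this sketch a gene passes only when both criteria hold, so a large fold change with a non-significant adjusted P value is still labeled "ns".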

Correlation analysis of DEGs

Pearson’s correlations among pivotal DEGs were computed and visualized using the R package corrplot (version 0.92).
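As an illustration of the quantity being plotted, the following Python sketch computes pairwise Pearson correlations from per-gene expression vectors. It is a stand-in for the R workflow, not a reproduction of it; the gene names and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


def correlation_matrix(expression):
    """Return {(gene_a, gene_b): r} for every ordered pair of genes."""
    genes = sorted(expression)
    return {(g, h): pearson(expression[g], expression[h]) for g in genes for h in genes}


# Hypothetical expression values across four samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(expression)
```

Here GENE_A and GENE_B rise together and GENE_C falls as they rise, which mirrors the positive and negative correlation patterns a correlation plot is meant to reveal.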

Protein–protein interaction (PPI) network analysis

The DEGs between AD and CTR samples were submitted to the STRING database (https://string-db.org/) to identify PPIs with a combined score exceeding 0.7, indicative of high confidence. Subsequently, the PPI network was retrieved and visualized with Cytoscape software (version 3.6.1). In the resulting network, nodes represented genes, and their colors distinguished up-regulated from down-regulated genes. Edges represented interactions between genes, with their thickness reflecting the strength of these relationships.
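The high-confidence filter can be sketched in Python as follows. This is only an illustration of the thresholding step, assuming combined scores on a 0 to 1 scale; the protein names and scores are hypothetical placeholders, not STRING output.

```python
from collections import Counter

HIGH_CONFIDENCE = 0.7  # combined-score cutoff from the text


def filter_edges(edges, min_score=HIGH_CONFIDENCE):
    """Keep interactions whose combined score exceeds the cutoff."""
    return [(a, b, score) for a, b, score in edges if score > min_score]


def node_degree(edges):
    """Count how many retained interactions each node takes part in."""
    degree = Counter()
    for a, b, _ in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Hypothetical interactions: (protein_a, protein_b, combined_score).
edges = [
    ("P1", "P2", 0.95),
    ("P1", "P3", 0.72),
    ("P2", "P3", 0.40),
    ("P3", "P4", 0.70),  # exactly 0.7 does not exceed the cutoff
]
kept = filter_edges(edges)
degree = node_degree(kept)
```

Counting node degree after filtering is one simple way to see which genes remain well connected in the high-confidence network.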

Evaluation of immune cells infiltration

The transcriptome expression matrix, post-removal of batch effects, was utilized to assess immune cell infiltration using the web-based tool ImmuCellAI (http://bioinfo.life.hust.edu.cn/ImmuCellAI/#!/analysis). This analysis involved estimating the abundance of twenty-four immune cell types within the microenvironment of subjects’ hippocampus. These twenty-four cell types were categorized into two layers: Layer 1, encompassing DC, B cell, Monocyte, Macrophage, NK, Neutrophil, CD4 T, CD8 T, NKT, Tgd and Layer 2, comprising CD4 naive, CD8 naive, Tc, Tex, Tr1, nTreg, iTreg, Th1, Th2, Th17, Tfh, Tcm, Tem, MAIT.

Construction of classification model

The DEGs were individually utilized as features to construct classification models using the R package caret (version 6.0-94). Three machine learning methods were employed to develop predictive models, each exploring different tuning parameters: Random Forest (rf), Neural Network (nnet), and Support Vector Machines with Radial Basis Function Kernel (svmRadial). Details regarding these methods, including their associated R packages and tuning parameters, can be found in Table S2. For model training, 70% of the samples were randomly selected as the training set, with the remaining 30% constituting the test set. Receiver operating characteristic (ROC) curves were generated using the R package pROC (version 1.18.0), and the area under the curve (AUC) was employed to assess the predictive performance of the classifiers.
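The 70/30 split and the AUC metric can be illustrated with a short Python sketch. It does not reproduce caret or pROC: the random seed, the sample identifiers, and the scores are hypothetical, and the AUC is computed with the rank-based definition (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half).

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Randomly assign samples to a training set and a test set."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def auc(scores, labels):
    """Rank-based AUC for binary labels (1 = positive, 0 = negative)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical sample identifiers and classifier scores.
train, test = train_test_split(range(83))
example_auc = auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

The scores passed to `auc` stand for a classifier's predicted probability of AD on the held-out test set; an AUC of 0.5 corresponds to chance and 1.0 to perfect separation.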

Construction of prognostic model

The identified pivotal DEGs were employed as features in a multivariable Cox proportional hazards regression model, executed using the R package survival (version 3.5-3). Kaplan–Meier survival curves were then constructed with the R package survminer (version 0.4.9), and the significance of survival differences was assessed through log-rank tests. The risk score was computed as follows:

$$\text{Risk score}=\sum_{i=1}^{n}\text{Coef}_{i}\times \text{Feature}_{i}=\sum_{i=1}^{n}\ln(\text{HR}_{i})\times \text{Feature}_{i}$$

where ‘n’ is the number of features retained in the fitted Cox model, ‘i’ denotes the i-th feature, ‘Feature’ represents its expression value, and ‘Coef’ stands for its coefficient in the fitted Cox model. The ‘Coef’ corresponds to the natural logarithm of the hazard ratio (HR).

In our dataset, the median of the risk score was utilized to categorize samples into high- and low-risk groups. Subsequently, univariable Cox proportional hazards regression analysis was performed to ascertain the HR, along with its 95% confidence interval (CI) and associated P-value, comparing the high- and low-risk groups.
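The risk score and the median split can be sketched in Python as follows. The feature names, hazard ratios, and expression values are hypothetical and are not taken from the fitted model; assigning samples that fall exactly on the median to the low-risk group is also an assumption of this sketch.

```python
import math
from statistics import median


def risk_score(expression, hazard_ratios):
    """Sum of ln(HR_i) * Feature_i over the features of the Cox model."""
    return sum(math.log(hr) * expression[name] for name, hr in hazard_ratios.items())


def split_by_median(scores):
    """Label each sample 'high' if its score is above the median, else 'low'."""
    cutoff = median(scores.values())
    return {sample: ("high" if score > cutoff else "low") for sample, score in scores.items()}


# Hypothetical hazard ratios for two features of a fitted Cox model.
hazard_ratios = {"FEATURE_1": 2.0, "FEATURE_2": 0.5}

# Hypothetical expression values for four samples.
samples = {
    "S1": {"FEATURE_1": 1.0, "FEATURE_2": 1.0},
    "S2": {"FEATURE_1": 2.0, "FEATURE_2": 1.0},
    "S3": {"FEATURE_1": 1.0, "FEATURE_2": 2.0},
    "S4": {"FEATURE_1": 3.0, "FEATURE_2": 1.0},
}
scores = {name: risk_score(expr, hazard_ratios) for name, expr in samples.items()}
groups = split_by_median(scores)
```

Because the coefficient is ln(HR), a feature with HR above 1 raises the score as its expression increases, while a feature with HR below 1 lowers it.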

Results

Identification of differentially expressed genes in AD compared with CTR

As detailed in Table S1, three datasets of hippocampal transcriptomic profiles were acquired from GEO. All samples were merged after mitigating batch effects. To identify the differentially expressed genes in AD relative to CTR, we selected age- and sex-matched samples. This subset comprised 37 AD samples and 46 CTR samples. Notably, only samples from subjects older than 60 years were retained for subsequent analyses (Table 1).

Table 1 The characteristics of all subjects used in this study.

In this analysis, a total of 316 DEGs were identified in AD compared with CTR, comprising 25 up-regulated and 291 down-regulated genes (Fig. 1A, Table S3). The predominance of down-regulated DEGs suggests that the biological pathways associated with these genes may be suppressed, potentially linking them with the pathogenesis of AD. The up-regulated and down-regulated genes were then separately subjected to KEGG and GO enrichment analyses. Owing to the limited number of up-regulated genes, no pathways were significantly enriched for them. For the down-regulated DEGs, as anticipated, the KEGG results revealed significant enrichment in pathways such as Alzheimer’s disease and Pathways of neurodegeneration - multiple diseases, aligning with prior studies31,32. GO enrichment analysis further highlighted biological processes (BP) related to AD, including vesicle-mediated transport in synapses, neurotransmitter transport, learning or memory, exocytosis, and axonogenesis (Fig. 1B, Table S4B). Significantly enriched cellular components (CC) encompassed distal axon, neuronal cell body, and synaptic membrane, while molecular functions (MF) were associated with transmembrane transporter binding, channel regulator activity, and metal ion transmembrane transporter activity (Fig. 1B). These findings point to an association of the down-regulated DEGs with the transmission and transport of synaptic signals, potentially contributing to the onset of disease in AD patients.

Fig. 1

Identification and analysis of DEGs between age-matched AD patients and controls. (A) Volcano plot showing the DEGs between AD samples and controls. (B) The significant GO BP, CC, and MF terms enriched among down-regulated genes in AD. (C) The down-regulated DEGs involved in the top 10 BP. (D) High-confidence PPI network constructed from all DEGs. Nodes in red and blue represent up-regulated and down-regulated genes, respectively. Edges represent the interaction score between proteins.

Furthermore, we highlighted the down-regulated DEGs involved in the top 10 BP (Fig. 1C), revealing numerous genes participating in multiple pathways. This observation may stem from the shared functionality of these signaling pathways, indirectly bolstering the credibility of the association between these DEGs and AD. For instance, SNCA has been identified as a key player in a multitude of BPs, encompassing vesicle-mediated transport in synapses, synaptic vesicle cycle, neurotransmitter transport, synapse organization, and regulation of membrane potential. Similarly, YWHAZ, TUBB, and GLRB were all implicated in synapse organization. Furthermore, we constructed a PPI network using all DEGs with a combined score exceeding 0.7 (Fig. 1D). This underlines the high-confidence interactions among these DEGs at the protein level.

Screening of Braak stage-related DEGs in AD

Furthermore, building upon the analysis of AD samples categorized by Braak stage (i.e., III_IV, V_VI), we delved deeper into the exploration of DEGs associated with the Braak stage (Table 1). In this context, we identified 46 up-regulated genes and 98 down-regulated genes (Fig. 2A, Table S5). Subsequently, these genes underwent intersection analysis with the DEGs derived from AD and CTR samples. Intriguingly, we observed that 4 DEGs consistently exhibited up-regulation, while 23 DEGs consistently displayed down-regulation (Fig. 2B). This suggests that these genes are not only implicated in the onset of AD but also play a role in the disease’s progressive development.
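The intersection step can be illustrated with a minimal Python sketch that keeps only genes changing in the same direction in both comparisons. The gene names and labels are hypothetical, and the sketch assumes both comparisons are encoded so that "up" and "down" carry the same meaning in each.

```python
def consistent_overlap(deg_a, deg_b):
    """Return genes that are 'up' in both comparisons and 'down' in both."""
    shared = deg_a.keys() & deg_b.keys()
    up = sorted(g for g in shared if deg_a[g] == "up" and deg_b[g] == "up")
    down = sorted(g for g in shared if deg_a[g] == "down" and deg_b[g] == "down")
    return up, down


# Hypothetical DEG calls from two comparisons (gene -> direction).
ad_vs_ctr = {"GENE_A": "up", "GENE_B": "down", "GENE_C": "down", "GENE_D": "up"}
braak_stage = {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up", "GENE_E": "down"}

up_genes, down_genes = consistent_overlap(ad_vs_ctr, braak_stage)
```

A gene such as GENE_C, which appears in both lists but changes in opposite directions, is excluded by this rule.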

Fig. 2

Identification and analysis of DEGs between Braak stages of AD patients. (A) Volcano plot showing the DEGs between Braak stages of AD patients. (B) Venn diagram of the overlapping DEGs. (C) The abundance of the 27 overlapping DEGs (KTN1, TOB2, EGFR, ANKRD36B, EPB41L3, PTPN3, TAC1, SST, LAMP5, YWHAH, YWHAZ, SLC39A10, KIFAP3, GLRB, MDH1, SCG2, SYNPR, SCG5, SNCA, KCNQ5, PTPN5, CALY, PNMAL1, PITHD1, TUBB, UQCRC2, SGIP1). (D) Pearson’s correlation at the transcriptomic level among the 27 overlapping DEGs.

As depicted in Fig. 2C, the 27 consistent DEGs were visually presented. Notably, intriguing patterns of positive correlations among the down-regulated DEGs and negative correlations between the down-regulated DEGs and the up-regulated DEGs were observed (Fig. 2D). This observation hints at a potential regulatory interplay wherein the four up-regulated DEGs may inhibit the expression of the down-regulated genes. This, in turn, could suppress the activity of certain pathways, contributing to the onset and progression of AD. Simultaneously, this intriguing correlation pattern suggests the potential of these 27 genes as biomarkers, indicating their candidacy for further exploration in the context of AD.

Evaluation and comparison of immune cells infiltration in hippocampus from AD subjects

The transcriptomic profiling derived from age- and sex-matched samples, comprising 37 AD samples and 46 CTR samples, was employed to assess immune cell infiltration in the microenvironment of the subjects’ hippocampus tissue. Utilizing ImmuCellAI, twenty-four cell types were estimated and categorized into two layers: layer 1 (DC, B cell, monocyte, macrophage, NK, neutrophil, CD4 T, CD8 T, NKT, Tgd) and layer 2 (CD4 naive, CD8 naive, Tc, Tex, Tr1, nTreg, iTreg, Th1, Th2, Th17, Tfh, Tcm, Tem, MAIT) (Fig. S2). Initially, the percentages of these twenty-four immune cells were individually compared between AD and CTR samples, revealing significant differences in six cell types (Fig. 3A). Specifically, neutrophil cells and cytotoxic T cells were observed to be enriched in AD, while CD4 T cells, Th1, Th2, and follicular helper T cells (Tfh) were found to be depleted in AD. Moreover, when comparing these immune cells based on the Braak stage of AD patients, no significant associations were identified. This suggests that the variations in immune cells appear to be linked to the AD disease itself rather than being correlated with the Braak stage of AD patients.

Fig. 3

The proportion of immune cells in the subjects’ hippocampus evaluated by ImmuCellAI. (A) Comparison of several immune cells between AD patients and CTR. (B) Pearson’s correlation between the 24 immune cells and the 27 overlapping DEGs.

Furthermore, we examined Pearson’s correlations between the twenty-four immune cells and the aforementioned twenty-seven DEGs. Notably, as illustrated in Fig. 3B, we observed positive correlations between the down-regulated genes and the depleted immune cells (e.g. Th1, Th2, central memory cell), while negative correlations were evident with the enriched immune cells (e.g. neutrophil). This finding suggests that the differential infiltration of immune cells in the microenvironment may be a consequence of the altered expression patterns of these specific genes.

Construction of classifiers and Survival analysis based on the key DEGs in AD

To assess their potential as discriminative features for AD, the 27 identified key DEGs were employed to construct classification models using three machine learning algorithms: random forest (rf), neural network (nnet), and support vector machines with radial basis function kernel (svmRadial). The AUC for each model was then individually evaluated. We found that fourteen classifiers, each based on one of the genes KTN1, YWHAZ, KIFAP3, PNMAL1, MDH1, PITHD1, SCG5, UQCRC2, YWHAH, SLC39A10, TUBB, SNCA, GLRB, and PTPN3, consistently achieved an AUC exceeding 0.7 across all three machine learning algorithms. This observation underscores the potential of these feature genes for distinguishing AD samples from controls (Fig. 4A).

Fig. 4

Classification model and Cox regression model constructed from the characteristic genes. (A) Fourteen classifiers based on the genes KTN1, YWHAZ, KIFAP3, PNMAL1, MDH1, PITHD1, SCG5, UQCRC2, YWHAH, SLC39A10, TUBB, SNCA, GLRB, and PTPN3, respectively. Three machine learning algorithms were used: random forest (rf), neural network (nnet), and support vector machines with radial basis function kernel (svmRadial). (B) Forest plot of the Cox regression model constructed from nine feature genes. (C) Kaplan–Meier survival curve fitted by nine feature genes. (D) Kaplan–Meier survival curve fitted by six immune cells: cytotoxic, neutrophil, CD4 T, Tfh, Th1, and Th2.

Given that the hippocampal tissues used in this study were sampled within a few hours post-mortem, we proceeded to explore the potential of the identified key DEGs for survival analysis in AD. Specifically, we selected AD samples and used the patients’ age as the survival time. Using the fourteen feature genes mentioned earlier, we conducted a multivariable Cox proportional hazards regression analysis and employed a backward stepwise algorithm for model selection based on the Akaike Information Criterion (AIC). This yielded a prognostic model featuring nine genes, with a concordance index of 0.76 and a P value less than 0.001 (Fig. 4B). Notably, the genes YWHAZ, PITHD1, SCG5, YWHAH, and TUBB exhibited a hazard ratio (HR) less than 1, indicating that higher expression levels of these genes are associated with a lower risk of AD and slower disease progression. Conversely, the genes PNMAL1, SLC39A10, GLRB, and PTPN3 presented a significant HR greater than 1, signifying that higher expression of these genes is linked to a greater risk of AD and faster disease progression.

After calculating the risk score for each AD sample using the nine characteristic genes, we observed that the median risk score effectively stratified the AD samples into distinct high- and low-risk groups (Fig. 4C, HR = 2.72, 95% CI 1.94 ~ 3.81, P = 3.6e–10). Notably, the high-risk group exhibited a significantly poorer overall survival age.

Similarly, leveraging the six immune cells (i.e. Cytotoxic, Neutrophil, CD4_T, Tfh, Th1, Th2) that demonstrated significant differences in AD, we constructed a Cox prognostic model. Consistently, based on the median risk score, these AD samples were notably categorized into high- and low-risk groups with statistical significance (Fig. 4D, HR = 2.72, 95% CI 1.37 ~ 5.38, P = 0.0035).

Discussion

In this study, we conducted an integrative analysis of three datasets, employing bioinformatics methods to identify DEGs in the hippocampus between AD patients and controls (Figs. 1, 2). A predominant proportion of those DEGs exhibited down-regulation, and these down-regulated genes were notably enriched in processes related to the transmission and transport of synaptic signals, including neurotransmitter secretion and transport, synaptic vesicle cycle, and vesicle-mediated transport in synapses. Significantly, a multitude of studies has underscored the pivotal role of declined neurotransmission function and synaptic degeneration in the initiation and progression of cognitive decline33,34,35. Moreover, a recent scientific report in 2022 highlighted the interplay between extracellular endocytosis and autophagosome biogenesis at presynaptic sites, influencing activity-dependent synaptic vesicular cycling36. Additionally, research has demonstrated that the loss of neural network connections contributes to synaptic loss34.

Here, a comprehensive analysis revealed a total of 27 DEGs (Fig. 2C,D), comprising 23 down-regulated and 4 up-regulated genes (Fig. 2B), that were identified not only in AD but also correlated with the Braak stage of AD patients (Fig. 2C). This implies a potential association of these genes with both the onset and progression of AD. Specifically, KTN1, YWHAZ, KIFAP3, PNMAL1, MDH1, PITHD1, SCG5, UQCRC2, YWHAH, SLC39A10, TUBB, SNCA, GLRB, and PTPN3 were individually employed to construct classifiers, each achieving an AUC exceeding 0.7 (Fig. 4A). This underscores their potential for distinguishing AD samples from controls. Furthermore, a nine-gene prognostic model was established that significantly stratified AD patients (Fig. 4B). The risk score derived from these nine genes demonstrated a capacity to indicate the progression of AD. Notably, SNCA, recognized as a causative gene for Parkinson’s disease, has also been reported to be associated with AD37,38,39. SNCA binds Aβ peptides and facilitates their aggregation40, and it modulates the activity of BACE1 to regulate APP processing37,41. Dysregulated levels of SNCA have been linked to cognitive performance42, and the effect of SNCA protein on memory has been documented43. Moreover, SST and TAC1 were identified as hub genes through PPI analysis of DEGs in the hippocampus of AD patients44, adding another layer to the complexity of AD. Additionally, we observed that TUBB, GLRB, and YWHAZ were enriched in the synapse organization pathway.

Several prior studies have highlighted the pivotal roles of YWHAZ45,46 and TUBB47,48 in the development of AD, with reduced protein expression of YWHAZ reported in the hippocampus of AD patients46. Previous studies also showed that EGFR is a potential dual molecular target for AD49, and EGFR protein degradation has been proposed as a therapeutic strategy for AD50. TAC1 has been identified as a hub gene that may be related to synaptic function and inflammation, and it was also identified as a key gene in the frontal cortex in AD51. TAC1 was also shown to be more abundant in AD-resilient than in AD-dementia brains52. Excessive SST-14 release accumulates near SST-positive interneurons (SST-INs) in the form of amyloids, which bind to Aβ to form toxic mixed oligomers. Conversely, chronic stimulation of postsynaptic SST2/4 on glutamatergic neurons by hyperactive SST-INs promotes intense MAPK p38 activity, leading to somatodendritic p-tau staining and apoptosis/neurodegeneration53. Loss of LAMP5 interneurons drives neuronal network dysfunction in AD54. SCG5, a copper metabolism biomarker, has been reported to alter AD progression55. KCNQ5 was identified as an attractive drug target in neuropsychiatric diseases including AD56. PITHD1 exhibits a strong association with AD: PITHD1-/- mice exhibit olfactory bulb (OB) proteome changes related to synaptic transmission, cognition, and memory, and OB PITHD1 expression increases with age in wild-type (WT) mice and decreases in Tg2576 AD mice at late stages57.

The infiltration of immune cells and the subsequent neuroinflammatory response within the brain parenchyma and adjacent structures are believed to play a pivotal role in the onset and progression of AD58. Recent discoveries, notably the association of immune receptor genes such as TREM2 and CD33 with AD, further underscore the significance of immune-related mechanisms59,60. Clinical analyses of pre-AD conditions, including Mild Cognitive Impairment (MCI), provide additional evidence of early and substantial involvement of inflammation in the disease pathogenesis61,62. Consistent with this, our analysis revealed distinct infiltration patterns of several immune cell types in the hippocampus of AD patients compared to controls, encompassing cytotoxic T cells, neutrophils, CD4 T cells, Th1, Th2, and Tfh (Fig. 3A). Intriguingly, these immune cells exhibited significant correlations with the aforementioned DEGs, implying a potential role of these DEGs in immune infiltration (Fig. 3B). Notably, no differences were observed in these immune cells based on the Braak stage of AD patients. However, this finding warrants further validation in larger cohorts.

This study has two limitations. First, the sample size was relatively small, and all samples were downloaded from GEO; our findings need to be tested in larger, real-world cohorts. Second, laboratory experiments need to be designed to verify the above results in the future.

Conclusion

We explored novel genes linked to the onset and progression of AD, anticipating that they may prove to be effective biomarkers for AD onset. The novel genes identified are linked to the Braak stage in AD patients and hold the potential to effectively characterize AD. They can also significantly stratify AD patients and indicate the progression of AD. Considering that the major drugs for AD exhibit optimal efficacy in the early or intermediate stages, these risk genes associated with early onset have significant implications for guiding clinical medication. This study offers a new perspective that could contribute to enhancing strategies for the prevention and treatment of AD.