Introduction

With the aging of the global population, the prevalence of dementia is expected to reach an estimated 152.8 million individuals worldwide by 2050 1. Alzheimer’s disease (AD) is the leading cause of dementia, accounting for 60–80% of all cases2. The majority of AD cases are sporadic and are typically diagnosed in people over 65 years of age (late-onset AD). AD is characterized by its heterogeneity, with intricate interactions among various genetic and environmental risk factors. The heritability of AD is estimated to be in the range of 58–79% 3. The ε4 polymorphism in the protein encoded by the apolipoprotein E (APOE) gene, located on chromosome 19, is recognized as the most potent genetic risk factor for AD4,5. Recently, additional genetic risk factors influencing the development of AD have been identified by genome-wide association studies (GWAS) using a large number of samples6. However, a substantial proportion of the heritability remains unexplained. Genetic variants can affect gene expression, and perturbations in gene expression are frequently involved in the onset of various diseases. Additionally, perturbations in gene expression due to aging can be factors that ultimately lead to disease.

One of the most powerful tools for the comprehensive analysis of all gene expression in an organism is whole RNA sequencing (RNA-seq) using a next-generation sequencer7. The identification of differentially expressed genes (DEGs) between patients and healthy individuals in RNA-seq data, and subsequent gene set enrichment analysis (GSEA) of the expression signatures, contributes to our understanding of the biological mechanisms underlying disease pathogenesis and to the identification of potential disease biomarkers. Recently, RNA-seq data from human blood have offered promising biomarkers for the diagnosis and treatment of a wide variety of diseases, including cancers8, circadian sleep-wake disorders9, and depression10. For AD, some potential biomarkers (e.g., PHGDH, BASE-AS1, NEAT1, and GAS5) have been identified from blood-based RNA-seq data11,12. Our previous studies also reported several candidate AD biomarkers (e.g., EEF2, RPL7, and WDR37) through analyses of blood-based RNA-seq data13,14 and miRNA expression data15. However, AD progresses through three stages: cognitively normal (CN), mild cognitive impairment (MCI), and dementia. Stage-specific gene expression during the transitions from CN to MCI (CN-MCI) and from MCI to AD (MCI-AD) would shed light on what drives the progression from one stage to the next, but it has not yet been sufficiently investigated.

Here, we explored the stage-specific gene expression patterns in CN-MCI and MCI-AD by using a large-scale retrospective Japanese cohort dataset. We observed the involvement of ribosomal pathway-related genes in CN-MCI and of genes associated with the immune system, cell cycle, and protein processing in MCI-AD. Our findings indicate that the onset of AD may be associated with gene expression changes in the immune system, cell cycle, and protein processing, following alterations in the expression of ribosomal protein genes during the MCI stage. Given the effectiveness of delaying MCI progression in preventing AD16, ribosome-related genes may serve as potential biomarkers through their expression levels. We believe that these characteristics will help to elucidate the mechanism of AD onset.

Results

Sample collection

A total of 1227 samples were included in this study, from 424 individuals with AD, 543 with MCI, and 260 CN individuals. The mean age and APOE ε4 allele frequency (AF) were highest for AD (age = 79.5 years, AF = 0.23), followed by MCI (age = 76.8 years, AF = 0.19), and CN (age = 72.6 years, AF = 0.085). Each category had more samples from females than from males, with the largest difference in the MCI group (Table 1). Similar allele frequencies were reported by Farrer et al.17.

Distribution of white blood cell types

We investigated the differences in cell type distribution between CN and MCI samples and between MCI and AD samples. By using CIBERSORT (see Methods), we estimated the relative proportions of 12 major leukocyte cell types in each sample from TPM expression values (Fig. 1a). Of those, only γδ T cells showed a statistically significant difference between CN and MCI samples (FDR = 7.88 × 10⁻³). Four cell types (B cells, FDR = 2.78 × 10⁻⁷; plasma cells, FDR = 6.28 × 10⁻³; γδ T cells, FDR = 1.64 × 10⁻⁷; neutrophils, FDR = 3.62 × 10⁻²) showed statistically significant differences between MCI and AD samples.

Fig. 1

Proportions of immune cell types and distribution of differentially expressed genes.

Proportions of the 12 major immune cell types among CN, MCI, and AD samples (a). Statistical significance was set at a false discovery rate < 0.05. *: <0.05; **: <0.01; ***: <0.001, NS: not significant. Data are represented as box and whisker plots, depicting minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum values. Distribution of differentially expressed genes in CN-MCI (b) and MCI-AD (c) comparisons. Each point represents a gene. Red and blue dots represent upregulated and downregulated genes, respectively. FC, fold change; FDR, false discovery rate.

Detection of DEGs

To identify DEGs between CN and MCI, and between MCI and AD, we conducted two-group comparisons. A total of 883 and 1169 statistically significant DEGs with Entrez Gene IDs were identified in CN-MCI and MCI-AD, respectively, with statistical significance defined as FDR < 0.05 and |FC| > 1.2 (Supplementary Table 1). In CN-MCI, 293 DEGs were upregulated and 590 were downregulated (Fig. 1b). In MCI-AD, 1027 DEGs were upregulated and 142 were downregulated (Fig. 1c). Notably, most DEGs in MCI-AD were upregulated, whereas most DEGs in CN-MCI were downregulated.

Gene set enrichment analysis of DEGs

To obtain further insight into the biological functions of the DEGs, we performed GSEA by using the DAVID gene functional classification tool. Our analysis revealed that DEGs were enriched in two KEGG biological pathways (coronavirus disease – COVID-19 and ribosome) for CN-MCI and in five pathways (TCR signaling pathway, oocyte meiosis, protein processing in endoplasmic reticulum, salmonella infection, and NF-κB signaling pathway) for MCI-AD. The significance level was set at FDR < 0.05 (Table 2). The number of unique genes across the enriched pathways was 38 in CN-MCI and 101 in MCI-AD.

Table 1 Summary of characteristics of AD, MCI, and CN samples
Table 2 Pathways and numbers of genes in pathways from gene set enrichment analysis.

In CN-MCI, a considerable number of genes were involved in the COVID-19 pathway, with more than 70% (26/36) overlapping with those in the ribosomal pathway (Supplementary Fig. 1a). In contrast, in MCI-AD, there were fewer gene overlaps among the identified pathways. However, more than 57% of genes (58/101) were associated with immune-related pathways, including the TCR signaling, salmonella infection, and NF-κB signaling pathways (Supplementary Fig. 1b). Given the involvement of many immune-related genes in the progression to AD, we further investigated potential differences in the diversities of TCR and BCR repertoires. However, no statistically significant difference was observed in either MCI-AD or CN-MCI (Supplementary Fig. 2).

Extraction of representative genes

We detected 38 candidate genes in CN-MCI samples and 101 in MCI-AD samples through GSEA. To determine the optimal n representative genes from these candidates (n ≤ 38 for CN-MCI; n ≤ 101 for MCI-AD, see the Methods), we applied a logistic regression method. Gene selection was based on hierarchical clustering, with nine-tenths of the dataset for clustering and model fitting in each cross-validation step. The adjusted model was then evaluated on the remaining one-tenth of the dataset, with the process repeated 10 times (10-fold cross-validation, Fig. 2). Consequently, we identified 8 representative genes for CN-MCI and 22 for MCI-AD (Supplementary Fig. 3).

Fig. 2

Outline of the representative genes’ extraction.

AD, Alzheimer’s disease; CN, cognitively normal; CV, cross-validation; MCI, mild cognitive impairment.

Functional modules in MCI and AD samples

All eight representative genes in CN-MCI were associated with ribosome functions, while among the 22 representative genes in MCI-AD, 11 were linked to the immune system, 7 were related to the cell cycle, and 4 were involved in protein processing (Fig. 3 and Supplementary Table 2). There were no overlapping genes between CN-MCI and MCI-AD, suggesting distinct molecular mechanisms in the progression of MCI and AD.

Fig. 3

Functional classification of representative genes identified.

The descriptions obtained from the KEGG pathway were further classified into four categories (Ribosome, Immune, Cell cycle, and Protein processing). Red and blue bars represent beta coefficients obtained from a logistic regression method.

To further investigate these molecular mechanisms, we performed a PPI network analysis by using NetworkAnalyst 3.0 18 with the STRING Interactome database19. The resulting PPI network comprised 929 nodes and 2056 edges, identifying 18 hub genes (node degrees ≥ 25) (Fig. 4a). These hub genes were categorized into four groups: CN-MCI (ribosome) and MCI-AD (immune, cell cycle, and protein processing). Alterations in CN-MCI were observed in eight ribosomal protein genes, while the remaining 10 genes were involved in NF-κB mediated immune response, cell cycle, and protein processing in MCI-AD. Interestingly, seven of the eight hub genes were down-regulated in CN-MCI, whereas nine of the ten were up-regulated in MCI-AD (Fig. 4a). Additionally, several ribosomal protein genes (RPL7, RPL11, RPS14, and RPL21) and immune-related genes (ACTR2 and CDC42) showed opposing responses between CN-MCI and MCI-AD (Fig. 4a). These findings suggest that ribosome and immune functions play a crucial role in the pathogenesis of AD.

Fig. 4

Functional modules obtained from PPI network analysis.

(a) PPI network diagram including representative genes. Gene names are shown for nodes with ≥ 25 edges. The size of the circle representing each gene reflects the number of edges. Green and magenta dashed circles represent the modules in CN-MCI and MCI-AD, respectively. Pairs of arrows indicate genes that are upregulated (↑ in red), unchanged (→ in yellow), or downregulated (↓ in blue) in CN-MCI and MCI-AD, respectively.

(b) Validation by quantitative PCR analysis. Each dot represents one CN (blue), MCI (green), or AD (red) sample. The results were consistent with the RNA-seq results for all genes. Data are represented as box and whisker plots, depicting minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum values. The data are normalized such that the mean in CNs is 1 for each gene. P values are shown in parentheses, and statistical significance was set at P < 0.05. *: <0.05; **: <0.005; ***: <0.0005, NS: not significant.

To validate the RNA-seq results, we performed qPCR analysis for RPL11 and RPS14 from the ribosomal group and for CDC42 from the immune group. The qPCR results (Fig. 4b) were overall consistent with the trend observed in the RNA-seq results (Supplementary Table 2). Although there was no significant difference in either RPS14 or CDC42 between CN and MCI, this may have been due to the small number of samples (n = 10 each).

Discussion

Through transcriptome analysis and subsequent pathway and PPI network analyses, we revealed distinct molecular mechanisms differentiating MCI from CN and AD from MCI. The involvement of genes associated with the ribosome function was observed in the CN-MCI comparison, whereas the MCI-AD comparison highlighted genes associated with the immune system, cell cycle, and protein processing, although validation using brain tissue samples will be necessary in the future.

Most of the ribosomal protein genes detected in this study (i.e., RPL7, RPL11, RPL19, RPL21, RPS8, and RPS14) were also identified in our previous study, where we observed their downregulation in AD through transcriptomic comparison of CN with AD13. This finding is consistent with previous reports indicating changes in ribosomal protein abundance not only in AD20,21, but also in Parkinson’s disease22, spinal muscular atrophy23, and recently in tauopathy24,25. These observations suggest an early and widespread involvement of ribosomal proteins in neurodegenerative diseases. However, our stage-specific transcriptome analyses revealed that the downregulation of these ribosomal protein genes occurred only in CN-MCI, whereas a slight upregulation was observed in MCI-AD. Also, ribosome-associated quality control deposition at the core of amyloid plaques in AD brains may result from translational stalling due to ribosomal defects in AD pathogenesis26. Moreover, a recent proteome analysis of AD patients identified the upregulation of ribosomal proteins, including RPL7, RPL11, RPL14, and RPS8, which is consistent with our results, although it was observed specifically in brain capillaries and not in brain parenchyma27. These results suggest that the expression of ribosomal protein genes, especially RPL7, RPL11, RPL14, and RPS8, might be a potential target for AD prevention. Furthermore, considering the reported association between mitochondrial dysfunction and the pathogenesis of AD28,29, mitochondrial ribosomal protein genes differentially expressed between CN and MCI, such as MRPS11 and MRPS14, could also serve as potential targets for AD prevention.

Many genes involved in the immune system, cell cycle, and protein processing were upregulated in MCI-AD. Abnormal activation of NF-κB in the immune system is involved in synaptic dysfunction30 and is currently attracting attention as a therapeutic target for AD31,32. Cdc42, a member of the small G protein family, plays a crucial role in regulating synaptic plasticity. Activation of the Cdc42 signaling pathway is a key factor in mediating the progression of AD-like phenotypes33. Although ACTR2 protein expression is reported to be decreased in the postmortem parietal cortex of AD patients34, our stage-specific investigations indicated that ACTR2 expression had already decreased at the MCI stage. Cell-cycle reactivation is involved in apoptosis in neurons of AD patients35,36. Protein processing, including by ubiquitin ligases such as SKP1 and UBE2D3, has a key role in the pathogenesis of AD37,38. Overall, our findings are consistent with previous reports, even though they are based on blood transcriptome analysis.

The primary limitation of this study lies in the small sample size and the sample variation between the groups (AD, n = 424; MCI, n = 543; CN, n = 260). Collecting samples from healthy individuals ≥ 60 years old has been challenging owing to the need to verify their cognitive normality. Increasing the number of samples could improve resolution and provide other important factors in the CN-MCI-AD process. Furthermore, our findings were identified from blood samples and should be validated using brain tissues to elucidate AD pathology. However, they do align with a study27 demonstrating similar expression trends among certain ribosomal proteins in brain or brain capillaries, adding strength to their usefulness.

Finally, our transcriptome analyses among three phenotypes (CN, MCI, and AD) provided detailed insights into the complicated changes in gene expression from CN to AD through MCI. Unlike the ribosomal protein genes observed in CN-MCI, none of the genes involved in immune, cell cycle, and protein processing in MCI-AD were detected in our previous CN-AD comparison13. This suggests that changes in MCI-AD may have been masked by the large variations seen in CN-MCI, highlighting the importance of conducting stepwise investigations in both CN-MCI and MCI-AD. Moreover, the genes identified in this study could serve as valuable biomarkers, as they are associated with the progression of MCI and/or AD. Notably, several ribosomal protein genes and immune-related genes showed significant responses in both CN-MCI and MCI-AD comparisons. These findings suggest that the ribosomal protein genes (RPL7, RPL11, RPS14, and RPL21) and immune-related genes (ACTR2 and CDC42) may present promising targets for early intervention not only for AD but also for MCI.

Methods

Clinical samples

All blood samples and their associated clinical data were obtained from the National Center for Geriatrics and Gerontology (NCGG) Biobank. Of the 1227 samples, 424 were from patients with AD, 543 from individuals with MCI, and 260 from CN donors. The AD and MCI subjects were diagnosed with probable or possible AD according to the criteria of the National Institute on Aging and the Alzheimer’s Association workgroups39,40. For this study, only patients with probable AD were used as AD subjects. The CN samples were obtained from individuals who had subjective cognitive complaints but normal cognition on a neuropsychological assessment, including a score > 23 on the Mini-Mental State Examination, a comprehensive neuropsychological test. All samples were obtained from adults aged 60 years or older.

This study protocol was approved by the ethics committee of the NCGG of Japan. All participants were volunteers and completed informed consent in writing before registering with the NCGG Biobank.

RNA sequencing

Isolation of buffy coat from whole blood and of total RNAs from the buffy coat was conducted in accordance with the standard operating procedure of the NCGG Biobank13. Only high-quality samples with an RNA integrity number (RIN) ≥ 6.0 were used to construct the sequencing library. Sequencing libraries were prepared by using 500 ng of total RNA for each sample; Illumina TruSeq Stranded Total RNA with Ribo-Zero Globin and IDT for Illumina TruSeq UD Indexes (Illumina, San Diego, CA) were used in accordance with the manufacturer’s instructions. The libraries were subsequently sequenced on the Illumina NovaSeq6000 platform with paired-end reads of 151 bp in accordance with the manufacturer’s instructions.

RNA sequencing data analysis

All RNA-seq data were downloaded from the NCGG Biobank database13. The quality of read sequences (fastq files) was assessed by using FastQC (version 0.11.7). Low-quality reads (< Q20) and trimmed reads with adaptor sequences (shorter than 50 bp) were discarded by using Cutadapt (version 1.16). The remaining clean, sequenced reads were mapped to the human reference genome (GRCh37) by using STAR41 (two-pass option, version 2.5.2b). Read counts for each gene were calculated with the featureCounts program42 from the subread package (version 1.6.6) to generate expression levels. Outlier read counts, defined as the top and bottom 5% of read counts for each gene, were replaced with the maximum and minimum, respectively, of the remaining effective counts.
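The outlier handling described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name clip_outlier_counts, the integer rounding of the 5% cutoff, and the example counts are all assumptions made for illustration.

```python
def clip_outlier_counts(counts, percent=5):
    """Replace the top and bottom `percent` of one gene's read counts with the
    maximum and minimum of the remaining counts (hypothetical helper)."""
    n = len(counts)
    k = (n * percent) // 100  # assumed rounding: floor of the outlier count per tail
    if k == 0 or n - 2 * k <= 0:
        return list(counts)   # too few samples to trim anything
    ordered = sorted(counts)
    low = ordered[k]           # minimum of the retained counts
    high = ordered[n - k - 1]  # maximum of the retained counts
    # Clip every value into the retained range, preserving sample order.
    return [min(max(c, low), high) for c in counts]


# Hypothetical read counts for one gene across 20 samples.
example_counts = [3, 250] + list(range(40, 58))
clipped = clip_outlier_counts(example_counts)
print(clipped[:2])  # the two extreme values are pulled to 40 and 57
```

With 20 samples, one value is treated as an outlier at each end; larger cohorts trim proportionally more values per gene.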

An average of 39.1 million raw read sequences were obtained from AD samples, with 99.4% being high-quality (i.e., > Q20). For MCI this was 99.4% of an average of 39.4 million raw read sequences, and for CN it was 99.3% of an average of 35.6 million raw read sequences. After we had discarded low-quality read sequences and trimmed reads with adaptor sequences, 38.6 million reads of cleaned data remained for the AD samples, with 82.5% uniquely mapping to the human reference genome (GRCh37). For MCI, this was 79.1% of 39.1 million reads of cleaned data, and for CN, it was 82.1% of 35.1 million reads of cleaned data (Supplementary Table 3).

Proportion of immune cell types

After the alignment of RNA-seq reads to the human reference genome by using STAR, RSEM43 (version 1.3.0) was employed to quantify transcripts per million (TPM) that were suitable for use with CIBERSORT44 (version 1.0.1). We reclassified 22 immune cell types with CIBERSORT into 12 major cell types13: (1) B cells (naive and memory), (2) plasma cells, (3) CD8+ T cells, (4) CD4+ T cells (CD4+ T cells naive, memory resting, and memory activated; T cells follicular helper; and T cells regulatory), (5) γδ T cells, (6) NK cells (resting and activated), (7) monocytes, (8) macrophages (M0, M1, and M2), (9) dendritic cells (resting and activated), (10) mast cells (resting and activated), (11) eosinophils, and (12) neutrophils. Differences in the proportions of each cell type between CN and MCI samples and between MCI and AD samples were evaluated with Welch’s t test. False discovery rate (FDR) values were calculated with the Benjamini–Hochberg method. An FDR value less than 0.05 was considered statistically significant.
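As an illustration of the multiple-testing correction named above, the following Python sketch computes Benjamini–Hochberg adjusted values from a list of P values. It is a generic textbook implementation, not the code used in this study; the function name and the example P values are hypothetical, and the Welch's t test that would produce the P values is not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (FDR), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of the FDR.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P values for four cell types (not data from this study).
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in fdr]
print(fdr, significant)
```

Each adjusted value is the smallest FDR at which that test would be called significant, so the 0.05 cutoff can be applied to it directly.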

Differential gene expression analysis

The read counts from each sample were aggregated into a count file for subsequent differential expression analysis by using edgeR45 (version 3.18.1). Genes with CPM (counts per million reads mapped) > 1 in more than one-fourth of all sequenced samples were used for subsequent analysis. The calcNormFactors function in edgeR45 was employed to obtain trimmed mean of M-values (TMM) normalization factors to account for library sizes. Dispersion was calculated with the estimateCommonDisp and estimateTagwiseDisp functions in edgeR45. The exactTest function in edgeR45 was applied to obtain DEGs between CN and MCI samples (CN-MCI, MCI compared with CN) and between MCI and AD samples (MCI-AD, AD compared with MCI). DEGs with |log2(fold change: FC)| > log2(1.2) and an FDR < 0.05 were defined as statistically significant. This definition was based on the small sample size, following the approach used in the previous reports13,14.
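The expression filter and the significance thresholds stated above can be written out as a short Python sketch. This is an illustration only: the analysis itself was run in edgeR, the function names are hypothetical, and the CPM here uses raw library sizes without TMM scaling.

```python
import math


def cpm(counts, library_sizes):
    """Counts per million for one gene, given each sample's library size."""
    return [c / lib * 1e6 for c, lib in zip(counts, library_sizes)]


def passes_expression_filter(counts, library_sizes, min_cpm=1.0, min_fraction=0.25):
    """Keep a gene if CPM > min_cpm in more than min_fraction of the samples."""
    values = cpm(counts, library_sizes)
    n_expressed = sum(1 for v in values if v > min_cpm)
    return n_expressed > min_fraction * len(values)


def is_significant_deg(log2_fc, fdr, fc_cutoff=1.2, fdr_cutoff=0.05):
    """Apply |log2 FC| > log2(1.2) and FDR < 0.05."""
    return abs(log2_fc) > math.log2(fc_cutoff) and fdr < fdr_cutoff


# Hypothetical gene: counts in four samples with 10 million reads each.
libs = [10_000_000] * 4
print(passes_expression_filter([50, 50, 0, 0], libs))  # expressed in 2 of 4 samples
print(is_significant_deg(math.log2(1.25), 0.01))       # 1.25-fold change, FDR 0.01
```

Because the fold-change cutoff is symmetric on the log scale, a 1.2-fold decrease (log2 FC of about −0.26) is treated the same way as a 1.2-fold increase.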

Gene set enrichment analysis of DEGs

A GSEA was implemented by using the Database for Annotation, Visualization and Integrated Discovery (DAVID)46,47 (version 2021, http://david.ncifcrf.gov). Kyoto Encyclopedia of Genes and Genomes (KEGG)48,49 biological pathways with an FDR < 0.05 were identified as statistically significant.

Detection of immune receptor repertoires

Immune receptor repertoires in T cells and B cells were detected from RNA-seq data by using TRUST4 software50 (v1.0.5). The estimated CDR3 clonotypes included αβ/γδ T cell receptors (TCRs: TRA, TRB, TRG, and TRD) and B cell receptors (BCRs: IGH, IGK, and IGL). Clonal diversity of TCRs and BCRs was estimated by using an inverse Simpson index calculated with VDJtools51 (v1.2.1). A linear regression model was employed to identify significant increases or decreases in the proportion of TCR and BCR repertoires in CN-MCI and MCI-AD. An FDR value < 0.05 was considered statistically significant.
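For reference, the inverse Simpson index is the reciprocal of the sum of squared clonotype frequencies, 1 / Σ p_i². The Python sketch below shows the textbook formula only; VDJtools may apply additional resampling or normalization that is not reproduced here, and the function name and clonotype counts are hypothetical.

```python
def inverse_simpson(clonotype_counts):
    """Inverse Simpson diversity: 1 / sum(p_i^2) over clonotype frequencies."""
    total = sum(clonotype_counts)
    if total <= 0:
        raise ValueError("at least one clonotype with a positive count is required")
    return 1.0 / sum((c / total) ** 2 for c in clonotype_counts)


# Hypothetical repertoires: four equally abundant clonotypes versus one
# dominant clonotype with three rare ones.
print(inverse_simpson([10, 10, 10, 10]))  # even repertoire, maximal diversity for 4 clones
print(inverse_simpson([97, 1, 1, 1]))     # skewed repertoire, value close to 1
```

The index equals the number of clonotypes when all are equally frequent and approaches 1 as a single clonotype dominates.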

Extraction of representative genes

All datasets were classified for each category (MCI from CN-MCI and AD from MCI-AD). The CN-MCI and MCI-AD datasets were split into nine-tenths for a training set and one-tenth for a test set, respectively (Fig. 2). For GSEA-detected candidate genes, we performed logistic regression analysis, adjusting for clinical information (age, sex, and number of APOE ε4 alleles) and immune cell types obtained from the CIBERSORT results in the training set. P values and beta coefficients of the candidate genes were obtained. Subsequently, a hierarchical clustering analysis was conducted on the candidate genes within the training set. The number of clusters (n) was selected stepwise (n = 1, 2, …, maximum number of candidate genes), and a representative gene with the minimum P value was selected from each cluster.

Logistic regression analysis, incorporating clinical information, cell types, and the n representative genes, was applied in the training set and evaluated on the remaining one-tenth (the test set). This process was repeated 10 times. The average area under the curve (AUC) values from these iterations were used to determine the optimal n representative genes. Representative genes were finally defined using logistic regression analysis on the entire dataset, including both the training and test sets. The logistic regression method was implemented using the R statistical software packages stats (version 3.4.3) and ROCR (version 1.0.7).
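Two steps of the selection procedure described above lend themselves to a compact illustration: choosing the gene with the minimum P value from each cluster, and choosing the number of clusters n with the highest average AUC across folds. The Python sketch below assumes the cluster assignments, P values, and per-fold AUCs have already been computed; it is not the R code used in this study, and the function names, gene labels, numbers, and the tie-break toward smaller n are all assumptions.

```python
def pick_representatives(p_values, cluster_of):
    """From each cluster, keep the gene with the smallest P value.

    p_values:   {gene: P value from the logistic regression}
    cluster_of: {gene: cluster label from hierarchical clustering}
    """
    best = {}
    for gene, p in p_values.items():
        cluster = cluster_of[gene]
        if cluster not in best or p < p_values[best[cluster]]:
            best[cluster] = gene
    return sorted(best.values())


def best_cluster_number(auc_by_n):
    """Return the n whose mean cross-validated AUC is highest.

    auc_by_n: {n: [AUC in each cross-validation fold]}
    Ties are broken in favour of the smaller n (an assumption).
    """
    mean_auc = {n: sum(a) / len(a) for n, a in auc_by_n.items()}
    return max(mean_auc, key=lambda n: (mean_auc[n], -n))


# Hypothetical candidate genes grouped into two clusters.
p_values = {"geneA": 0.001, "geneB": 0.02, "geneC": 0.0005, "geneD": 0.3}
cluster_of = {"geneA": 1, "geneB": 1, "geneC": 2, "geneD": 2}
print(pick_representatives(p_values, cluster_of))

# Hypothetical per-fold AUCs for n = 1, 2, 3 clusters.
print(best_cluster_number({1: [0.60, 0.62], 2: [0.70, 0.72], 3: [0.69, 0.71]}))
```

In the full procedure these two steps would be repeated inside each cross-validation fold so that gene selection never sees the held-out tenth.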

Network-based meta-analysis

A network-based analysis was performed by using NetworkAnalyst 3.0 (https://www.networkanalyst.ca)18 with the STRING Interactome database19. The PPI network was constructed by using first-order interaction network analysis with a default confidence cutoff score of 900 and was visually represented by using Cytoscape v3.10.1 (https://cytoscape.org)52.
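Hub genes in this study were defined by node degree in the PPI network (degree ≥ 25, see Results). A minimal Python sketch of that degree count on an undirected edge list follows; it does not reproduce NetworkAnalyst or Cytoscape, and the function names and the toy network are hypothetical.

```python
from collections import Counter


def node_degrees(edges):
    """Degree of each node in an undirected network, ignoring duplicate
    edges and self-loops."""
    unique = {tuple(sorted(edge)) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for a, b in unique:
        degree[a] += 1
        degree[b] += 1
    return degree


def hub_nodes(edges, min_degree=25):
    """Nodes whose degree is at least min_degree, in alphabetical order."""
    return sorted(n for n, d in node_degrees(edges).items() if d >= min_degree)


# Toy network: one node connected to 25 partners, plus a duplicated edge
# between two of those partners.
toy_edges = [("hub", f"n{i}") for i in range(25)] + [("n0", "n1"), ("n1", "n0")]
print(hub_nodes(toy_edges))
```

Deduplicating edges before counting matters because interaction databases often list the same pair in both directions.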

Quantitative PCR validation

cDNA was synthesized by using a PrimeScriptII 1st Strand cDNA Synthesis Kit (Takara Bio, Shiga, Japan). Quantitative PCR (qPCR) was performed by using TaqMan Fast Advanced Master Mix (Thermo Fisher Scientific, Waltham, MA), TaqMan Probes (Thermo Fisher Scientific), and the Quantstudio7 Flex Real-Time PCR System (Thermo Fisher Scientific) in accordance with the manufacturer’s instructions. The TaqMan gene expression assays used were commercially available: RPL11 (Hs00831112_s1), RPS14 (Hs00735285_m1), CDC42 (Hs00918044_g1), and ACTB (Hs99999903_m1). The qPCR conditions comprised one cycle of 50 °C for 2 min and 95 °C for 20 s, followed by 42 cycles of 95 °C for 1 s and 60 °C for 20 s. Each gene was assayed at least in duplicate. ACTB was preselected as a reference gene for normalization of target gene expression levels, and relative expression levels were calculated by using the ΔΔCt method. Gene expression was obtained for 10 AD patients, 10 individuals with MCI, and 10 CN individuals, all randomly selected.
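The ΔΔCt calculation mentioned above can be sketched in Python as follows. The sketch assumes roughly 100% amplification efficiency (a doubling per cycle), uses the ΔCt of a calibrator group as the baseline, and works on made-up Ct values; the function name is hypothetical and this is not the analysis script used in this study.

```python
def relative_expression(ct_target, ct_reference, calibrator_delta_ct):
    """Relative expression by the delta-delta-Ct method.

    ct_target:           Ct of the gene of interest in the sample
    ct_reference:        Ct of the reference gene (e.g. ACTB) in the same sample
    calibrator_delta_ct: delta-Ct (target minus reference) of the calibrator group
    """
    delta_ct = ct_target - ct_reference
    delta_delta_ct = delta_ct - calibrator_delta_ct
    return 2.0 ** (-delta_delta_ct)  # assumes a doubling of product per cycle


# Hypothetical Ct values: the calibrator group has a delta-Ct of 7.0.
print(relative_expression(25.0, 18.0, 7.0))  # same delta-Ct as the calibrator
print(relative_expression(26.0, 18.0, 7.0))  # one cycle later than the calibrator
```

A sample whose target amplifies one cycle later than the calibrator, relative to the reference gene, is reported at half the calibrator's expression level.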