Introduction

Parkinson’s disease (PD) is among the most rapidly proliferating neurological conditions globally, with an anticipated 12 million individuals affected by 20401. A lack of accurate biomarkers and diagnostic measures significantly limits the timely and effective treatment of PD, contributing to its growing global impact2. PD diagnosis is predominantly based on clinical features that frequently overlap with other neurological disorders, leading to high rates of misdiagnosis2,3. Consequently, dependable diagnostic biomarkers could substantially improve patient prognosis through early detection and possible therapeutic targets2,3.

Recent progress in PD diagnostic biomarkers includes the cerebrospinal fluid (CSF)-based alpha-synuclein (α-syn) seed amplification assay (SAA), exhibiting elevated sensitivity4,5,6. Blood-derived transcriptomic biomarkers in PD, however, are still extensively investigated7,8,9. The benefits of blood-based biomarkers include the less invasive and more broadly applicable nature of blood draws, in contrast to the more invasive lumbar puncture necessary for acquiring CSF samples7. Examining transcriptomic variation in PD may also enhance our understanding of the disorder’s fundamental biological underpinnings. Previous GWAS analyses have identified 90 distinct signals that account for 16–36% of PD’s heritable risk, suggesting a strong genetic component likely to be seen in the transcriptome10. Prior research has revealed significant gene-level alterations in PD blood samples, such as in pathways associated with immune activity, inflammation, mitochondrial function, cell death, etc.9. However, their small sample sizes reduce statistical power and make the reproduction of differentially expressed genes difficult9.

The Parkinson’s Progression Markers Initiative (PPMI) and Parkinson’s Disease Biomarkers Program (PDBP) are two multicenter, longitudinal observational studies developed to identify PD biomarkers11,12. The two datasets include whole blood RNA-sequencing data for control and PD participants spanning multiple visits, including PD participants with known PD risk mutations (i.e., mutations in SNCA, GBA1, and LRRK2)11,12,13. In a prior study limited to the PPMI cohort, Craig et al. found a significant enrichment of immune-related genes and pathways, including upregulated ‘Neutrophil Degranulation’13. This finding aligns with previous research demonstrating a connection between PD and immunological activity, with multiple studies identifying neutrophil count or a neutrophil-to-lymphocyte ratio as potential biomarkers for PD14.

The role of neutrophils in the etiology of PD remains ambiguous. Certain studies suggest PD may result from dysregulated inflammatory responses that induce α-syn aggregates or an overall overexpression of the protein in dopaminergic neurons. Increased α-syn expression may worsen inflammation, triggering a cycle of protein aggregation that eventually leads to neurodegeneration15,16,17. In the brain, the buildup of α-syn has been linked to pro-inflammatory factors, alterations in astrocyte activity, and microglia hyperactivity18. However, evidence linking α-syn and neutrophil expression is scarce. According to the Human Protein Atlas, blood cell types such as neutrophils, monocytes, and dendritic cells do express the SNCA gene. However, SNCA expression is relatively low in neutrophils (9.8 protein-coding transcripts per million or pTPM) and highest in plasmacytoid dendritic cells (115.5 pTPM) (https://v19.proteinatlas.org/ENSG00000145335-SNCA/blood)19.

In this study, we examined the role of neutrophil expression in PD by analyzing 3965 PPMI and 2932 PDBP whole blood RNA-seq samples from 1497 PPMI and 1279 PDBP participants. Samples were collected from control participants with no diagnosis of PD (sample n = 2325), participants with idiopathic PD (n = 3006), and those with a germline mutation in SNCA (n = 57), LRRK2 (n = 834), or GBA1 (n = 675). Because only 1254 PPMI samples had complete blood counts (CBC), we built a regression model to estimate neutrophil percentage for the remaining 5643 samples. We performed a differential gene expression analysis, adjusted for predicted neutrophil percentage, on control and case cohorts only (n = 6490), then examined the results to find potentially relevant pathways. When we included neutrophil percentage as a covariate, SNCA showed a strong and significant differential expression signal, suggesting that PD suppresses SNCA in the whole blood transcriptome independent of neutrophil-related inflammation. We additionally identified a potential mitochondrial gene expression signature in PD cohorts distinct from control samples.

Results

Out of 8461 total samples, we identified 6897 whole-blood RNA-sequencing profiles from 2776 participants in PPMI and PDBP that passed quality control metrics and met our filtering criteria (Table 1 and Supplementary Tables 1 and 2). These criteria included excluding BioFIND samples due to their limited sample size (see “Methods”). Among the qualifying data, 1254 PPMI samples included documented neutrophil percentages from complete blood count (CBC) data. We used these known values to train machine learning models that predicted neutrophil percentages for the remaining 5643 PPMI and PDBP samples without CBC data. Among the 6897 screened samples, 407 had diagnoses other than ‘Case’ (clinical PD) or ‘Control’ (no PD diagnosis) at enrollment and were excluded from subsequent differential expression and pathway analyses. Separately, participants were tested for PD risk variants in SNCA, LRRK2, and GBA1 through clinical assessments and whole-genome sequencing.

Table 1 Demographic information and genetic status of PPMI and PDBP participants and samples

Neutrophil percentage linear modeling and prediction

We constructed several machine learning models from whole blood gene expression and evaluated their performance to identify the optimal model for neutrophil percentage prediction. To avoid data leakage between the training and testing sets, we ensured that samples from the same participant were not present in both sets simultaneously. This was achieved by performing a 0.8–0.2 train-test split based on participant IDs, after which all samples belonging to a given participant were assigned exclusively to the training or testing set. We additionally considered splitting samples by visit to prevent sample dependencies within sets (i.e., only samples taken at month 12 are in the training set). The resulting models demonstrated performance that was functionally equal or marginally inferior to that of models segmented by individuals (Supplementary Fig. 1). As a result, we chose to divide the samples by participant rather than by visit month.
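
The participant-level split described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_by_participant`, the toy sample identifiers, and the fixed seed are assumptions made purely for illustration.

```python
import random

def split_by_participant(sample_to_participant, test_fraction=0.2, seed=0):
    """Assign every sample of a participant to exactly one of train/test."""
    participants = sorted(set(sample_to_participant.values()))
    rng = random.Random(seed)
    rng.shuffle(participants)
    # Split on participants, not samples, so repeated visits never straddle sets
    n_test = max(1, round(len(participants) * test_fraction))
    test_ids = set(participants[:n_test])
    train, test = [], []
    for sample, participant in sample_to_participant.items():
        (test if participant in test_ids else train).append(sample)
    return train, test

# Hypothetical toy data: 10 participants with two visits each (months 0 and 12)
samples = {f"P{p}_V{v}": f"P{p}" for p in range(10) for v in (0, 12)}
train, test = split_by_participant(samples)
train_participants = {samples[s] for s in train}
test_participants = {samples[s] for s in test}
```

Because the shuffle operates on participant IDs, the two participant sets are disjoint by construction, which is the property that prevents leakage between repeated visits.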

For both the training and testing sets, gene counts were normalized and transformed independently using the DESeq2 variance stabilizing transformation (VST) with the design formula set to ~1. This ensured that the normalization process was not influenced by any sample metadata.

The initial model we created was a linear model founded on genes known to be enriched in blood cell transcriptomes (Fig. 1a). Genes enriched in blood cells were identified using the Human Blood Atlas RNA-seq data for neutrophils, lymphocytes (T-cells and B-cells), monocytes, eosinophils, basophils, and dendritic cells19. To address linear dependencies between cell types, separate linear models were initially developed for neutrophils, basophils, eosinophils, monocytes, lymphocytes, and dendritic cells. These models were used to identify the genes most predictive of neutrophil percentage within each cell type. We applied backward elimination until all genes in each model exhibited a p-value of less than 0.05, ensuring that only genes having a significant association with neutrophil percentage were included in the model. The 118 significant genes were combined to create a final linear model, once again employing backward elimination until all remaining genes exhibited a p-value less than 0.05. The final blood cell-based linear model contained 27 significant genes (Supplementary Table 3).

Fig. 1: Workflow diagram of regression model development for predicting neutrophil percentage from gene expression data.

1254 passing samples with CBC test results were used to create machine learning regression models to predict neutrophil percentage. a, b, d Train-test splits for regression model development were created by randomly splitting the 600 unique participants between an 80% train set and 20% test set, then assigning the respective samples to the corresponding set. Three different linear models were created to compare the performance of different methods of feature selection: a biology-based via selection of only blood cell enriched genes, b data-driven via mutual information feature selection from all genes, and d combining the methods to include genes from both biology-based and data-driven selection. c Additionally, an XGBoost regression model was developed with all 58,780 transformed gene counts. We used the best-performing model to predict neutrophil percentage for 2932 PDBP samples and 2711 PPMI samples with no known neutrophil percentage.

The second model was a linear model that utilized data-driven feature selection to investigate whether genes not enriched in blood cells could be informative of neutrophil percentage (Fig. 1b). We employed mutual information (MI) feature selection to determine which of the 58,780 total genes exhibited the greatest dependency on neutrophil percentage. We implemented an MI score threshold of 0.3, yielding 352 genes with high dependency. Like the first model, these 352 genes were utilized to develop a linear model, and backward elimination was applied until all p values were less than 0.05. The final MI-based linear model contained 17 genes, of which only 2 were present in the blood cell-based model (Supplementary Table 4). However, all genes in the model were enriched in neutrophils, based on the Human Blood Atlas19.
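
To make the thresholding step concrete, the following sketch scores each gene by its mutual information with neutrophil percentage and keeps those above 0.3. The estimator used in the study is not specified in this section, so the equal-width histogram estimator here is an assumption, as are the function names, the bin count, and the simulated data; absolute MI values depend on the estimator, so the 0.3 cutoff is reused only for illustration.

```python
import math
import random
from collections import Counter

def bin_values(values, n_bins):
    """Map continuous values onto equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(x, y, n_bins=8):
    """Plug-in MI estimate (in nats) from a 2-D histogram of x and y."""
    bx, by = bin_values(x, n_bins), bin_values(y, n_bins)
    n = len(x)
    count_x, count_y, count_xy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log(c * n / (count_x[i] * count_y[j]))
        for (i, j), c in count_xy.items()
    )

def select_genes(expression, target, threshold=0.3):
    """Keep genes whose MI with the target exceeds the threshold."""
    return [gene for gene, values in expression.items()
            if mutual_information(values, target) > threshold]

# Simulated data (hypothetical): one gene tracks neutrophil percentage, one does not
rng = random.Random(0)
neutrophil_pct = [rng.uniform(40, 80) for _ in range(500)]
expression = {
    "gene_dependent": [0.1 * p + rng.gauss(0, 0.2) for p in neutrophil_pct],
    "gene_independent": [rng.gauss(0, 1) for _ in range(500)],
}
selected = select_genes(expression, neutrophil_pct)
```

Unlike a correlation filter, mutual information also captures non-linear dependence, which is one plausible reason to use it when screening all genes before fitting a linear model.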

Alongside linear modeling, we employed an XGBoost regression model, which utilized all 58,780 genes to predict neutrophil percentage (Fig. 1c). XGBoost is a method of gradient tree boosting that has demonstrated enhanced predictive performance across various fields and applications, including expression-related predictions20,21. Nonetheless, this improved predictive performance may come at the cost of decreased interpretability. In our implementation, we used the XGBoost R package to create a regression model with hyperparameters nrounds = 10, eta = 0.3, and max_depth = 3.

We developed a final third linear model utilizing significant genes in the preceding blood cell-based and MI-based models (Fig. 1d). The combined 42 unique genes were used to train the combined model, once again applying backward elimination until all p values were less than 0.05. The combined model contained 31 genes: 21 from the blood cell-based model, 8 from the MI-based model, and the 2 genes that were present in both (Supplementary Table 5).

We evaluated and compared the four models by calculating the average R-squared, root mean squared error (RMSE), and mean absolute error (MAE) in test sets across 100 train-test splits (0.8–0.2 by participant). The MI-based model exhibited the worst performance across all three measures, with statistically significant deficiencies in each category (Fig. 2). When evaluated by R-squared in test sets, the combined linear model markedly surpassed all alternative models (Fig. 2a). Regarding RMSE, although the combined model remained a leading performer, the difference between the combined and cell-based models was not statistically significant (Fig. 2b). The same applied to MAE (Fig. 2c). The combined model occasionally showed outlier behavior in terms of RMSE and MAE across certain train-test splits. This is likely due to the inclusion of genes from the MI-based model, which may contribute to higher prediction errors, and/or the use of underpowered train-test splits. Increasing the sample size could help mitigate these issues. Nonetheless, the test set R-squared values of the combined model remained consistently higher than those of all other models. Consequently, we opted for the combined linear model to predict neutrophil percentage in our samples with no CBC data.
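
For reference, the three test-set metrics can be computed as in the sketch below. The function name and the toy observed and predicted values are hypothetical; the formulas are the standard definitions of R-squared, RMSE, and MAE.

```python
import math

def regression_metrics(observed, predicted):
    """Return R-squared, RMSE, and MAE for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "r2": 1 - ss_res / ss_tot,        # fraction of variance explained
        "rmse": math.sqrt(ss_res / n),    # penalizes large errors more heavily
        "mae": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
    }

# Hypothetical neutrophil percentages for four test samples
observed = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 73.0, 79.0]
metrics = regression_metrics(observed, predicted)
```

Because RMSE squares the residuals, a handful of poorly predicted samples in a given split can inflate it while leaving R-squared comparatively stable, which is consistent with occasional RMSE and MAE outliers across splits.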

Fig. 2: Comparison of different machine learning models to predict neutrophil percentage in PPMI and PDBP patients.

a–c Each model type was trained and tested on 100 train-test splits of 1254 samples with known neutrophil percentage, where samples were split 0.8–0.2 by participants. a plots the R² value of each model when applied to the test sets, b plots the root mean squared errors, and c plots the mean absolute errors. A Wilcoxon signed-rank test was used to test the statistical significance of differences between the models for each metric. *, **, *** indicate p values less than 0.05, 0.01, and 0.001, while N.S. indicates no significance.

A final linear model was fitted utilizing all 1254 samples and 31 genes. Subsequently, we employed the model to predict neutrophil percentage in the 5643 samples without known blood cell counts. A total of 1254 known neutrophil percentages and 5236 predicted values were utilized in downstream analyses. Four hundred seven predicted neutrophil percentages were not used, as the associated samples were obtained from participants with a diagnosis of ‘Other’ at baseline.
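
The sample accounting in this section can be verified with a few lines of arithmetic. The variable names are illustrative; all counts are taken from the text.

```python
total_passing = 6897     # samples passing QC and filtering
with_cbc = 1254          # PPMI samples with measured neutrophil percentage
other_diagnosis = 407    # samples with a diagnosis of 'Other' at baseline

predicted = total_passing - with_cbc           # samples needing prediction
predicted_used = predicted - other_diagnosis   # predictions kept downstream
de_samples = with_cbc + predicted_used         # samples entering DE analysis
```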

Analysis of sample variation

Before performing differential expression analysis, we assessed the validity of our covariate design and the effect of incorporating predicted neutrophil percentage with principal component analysis (PCA). PCAs were generated using the plotPCA() function in DESeq2. Initially, we conducted PCA on VST counts from all 6897 samples, revealing a strong and statistically significant correlation between PC1 and sex of the participant (Supplementary Fig. 2a, b). We observed notable correlations between PC2 to PC5 and sample QC metrics such as percent intronic bases, percent mRNA bases, percent usable bases, etc. A slight correlation with percent chimeric reads was observed; therefore, we implemented a filter for >3% chimeric reads in the passing samples (see “Methods”). The predicted neutrophil percentages were most highly correlated with PC8, suggesting that the neutrophil percentages predicted by our model do appear to correlate with variation in gene expression. In a second PCA of the VST counts utilizing the limma removeBatchEffect() function in R to correct for our chosen design covariates (i.e. disease status, sex, percent mRNA bases, participant age, and neutrophil percentage), we effectively mitigate the influence of neutrophil percentage and other covariates/confounders (Supplementary Fig. 2c, d). Eliminating these effects significantly reduced the percentage of total variation explained by the first nine principal components, illustrating the beneficial impact of accounting for these covariates in the differential expression design matrix (Supplementary Fig. 2e, f).
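
The idea behind covariate correction before PCA can be sketched for a single gene and a single continuous covariate. This is a deliberately simplified stand-in for the multi-covariate linear model fitted by removeBatchEffect(), not a reimplementation of it; the function name and toy values are assumptions.

```python
def remove_covariate(expression, covariate):
    """Subtract the least-squares linear effect of one covariate from one gene."""
    n = len(expression)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    # Centering the covariate preserves the gene's mean expression level
    return [y - slope * (x - mean_x) for x, y in zip(covariate, expression)]

# Hypothetical VST values for one gene and matching neutrophil percentages
neutrophil_pct = [50.0, 55.0, 60.0, 65.0, 70.0]
vst_expression = [8.0, 8.6, 8.9, 9.6, 10.0]
corrected = remove_covariate(vst_expression, neutrophil_pct)
```

After correction, the gene's values are uncorrelated with the covariate, so any principal component computed from corrected values can no longer be driven by that covariate's linear effect.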

Differential gene expression analysis with predicted neutrophil percentage

Craig et al. previously conducted neutrophil-corrected differential expression analysis using 1254 PPMI samples with known neutrophil percentage13. This study aimed to extend their analysis to a larger number of samples by using the neutrophil percentages predicted by our linear model. We performed differential gene expression analysis using 6490 samples on multiple PD cohort vs. control comparisons. The full design = ~clinical diagnosis + sex + age squared + percent mRNA bases + predicted neutrophil percentage was determined through variance analysis and design testing (Supplementary Table 6). For each cohort comparison, we conducted the DE analyses using all PPMI and PDBP samples, as well as with samples segregated by study and at baseline (i.e., age at initial blood draw upon enrollment into PPMI or PDBP) (Figs. 3 and 4 and Supplementary Figs. 3–5).

Fig. 3: Differential expression in patients comparing cases to controls, with and without controlling for predicted neutrophil percentage.

Volcano plots of differentially expressed genes, without controlling for predicted neutrophil percentage (a) and with controlling for predicted neutrophil percentage (b) in the design matrix. Genes with a log2 fold change of >0.1 or <−0.1 and adjusted p value < 0.05 are considered differentially expressed. Genes that are either known PD causal variants or mitochondrial genes are colored purple and orange, respectively. PD causal variants and mitochondrial genes that are differentially expressed are additionally labeled by their gene name. Histograms showing the distribution of significantly differentially expressed neutrophil- and lymphocyte-enriched genes in the differential expression analysis without (c) and with (d) controlling for predicted neutrophil percentage.

Fig. 4: Differential expression analysis of all samples by genetic cohort.

PD causal variants are colored purple, and mitochondrial genes are colored orange. Differentially expressed PD causal variants and mitochondrial genes are additionally labeled by gene name. a Idiopathic case samples with no SNCA/LRRK2/GBA1 mutations were compared to control samples with no PD-related mutations (i.e., HC). HC samples were compared to case samples with GBA1+ (b), LRRK2+ (c), and SNCA+ (d) mutations. e LRRK2+ case samples were compared to LRRK2− case samples. f GBA1+ case samples were compared to GBA1− case samples. g log(CPM) SNCA expression in HC, IPD, SNCA+ case, GBA1+ case, and LRRK2+ case samples corrected by predicted neutrophil percentage, stratified by genetic cohort. Adjusted p values are labeled according to the adjusted p value of SNCA differential expression from respective DE analyses. The dotted red line represents the median SNCA expression in HC samples. *, **, *** indicate p values less than 0.05, 0.01, and 0.001, while N.S. indicates no significance.

After adjusting for predicted neutrophil percentage, we observed a substantial reduction in the number of differentially expressed genes between case and control samples from both PPMI and PDBP studies (Fig. 3a, b). A subset of the eliminated genes were neutrophil-enriched, indicating that the predicted neutrophil percentage serves as an effective proxy in DE analysis for actual neutrophil percentage (Fig. 3c, d). Neutrophil correction additionally appeared to correct for leukocyte-enriched genes overall, not exclusively those within neutrophils (Supplementary Fig. 6).
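
Counting differentially expressed genes before and after correction rests on a fixed calling rule. The sketch below assumes the conventional reading of the thresholds, namely an absolute log2 fold change above 0.1 together with an adjusted p value below 0.05; the gene names and result values are hypothetical.

```python
def is_differentially_expressed(log2_fold_change, adjusted_p,
                                lfc_cutoff=0.1, alpha=0.05):
    """Apply the effect-size and significance thresholds to one gene."""
    return abs(log2_fold_change) > lfc_cutoff and adjusted_p < alpha

# Hypothetical DE results: gene -> (log2 fold change, adjusted p value)
toy_results = {
    "GENE_A": (-0.25, 1e-6),  # passes both thresholds
    "GENE_B": (0.05, 1e-8),   # significant but effect too small
    "GENE_C": (0.40, 0.20),   # large effect but not significant
}
de_genes = [gene for gene, (lfc, padj) in toy_results.items()
            if is_differentially_expressed(lfc, padj)]
```

Applying the same rule to the uncorrected and corrected result tables makes the two gene counts directly comparable.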

SNCA was the only identified PD causal variant that maintained differential expression after neutrophil correction. Not only did SNCA retain statistical significance, but the exclusion of genes related to neutrophil percentage further enhanced the DE signal of SNCA: SNCA ranked as the 973rd most significant DE gene without neutrophil correction and the 27th after neutrophil correction (Fig. 3b and Supplementary Table 6). Two other PD causal variants, LRRK2 and LRP10, were no longer differentially expressed after neutrophil correction (Fig. 3a, b).

Neutrophil correction eliminated the DE signals of blood cell-enriched or related genes while simultaneously improving the DE signal of SNCA in our case vs. control comparison. We further established that SNCA expression appeared uncorrelated with predicted neutrophil percentage, suggesting that the mechanisms behind lower SNCA expression in PD whole blood occur independently from immune cell-related activity (Supplementary Fig. 7). The absence of correlation was further corroborated by the increased significance of SNCA downregulation in DE analyses after neutrophil correction, as genes known to be associated with neutrophil expression (i.e., neutrophil-enriched genes, LRRK2, and LRP10) were predominantly rendered statistically insignificant (Fig. 3).

Subsequently, we performed DE analysis among distinct PD disease cohorts by genetic status (i.e., PD-SNCA+, PD-GBA1+, and PD-LRRK2+) and compared these case samples to healthy control (HC) samples, defined as participants classified as ‘Control’ at baseline without SNCA/GBA1/LRRK2 mutations. Only HC samples were employed to isolate the impact of known variants on disease expression. We continued to see a statistically significant downregulation of SNCA in each PD cohort relative to HC, with a similar improvement in the SNCA DE signal compared to uncorrected analysis (Fig. 4a–d and Supplementary Figs. 3–5). In baseline samples, SNCA was significantly downregulated in the DE analysis of case vs. controls; however, this was not observed in comparisons split by genetic cohort vs. HC, likely due to insufficient sample sizes (Supplementary Fig. 5). The statistically significant depression of SNCA was also observed when graphing logged counts per million (CPM) of the gene counts, corrected for predicted neutrophil percentage (Fig. 4g and Supplementary Fig. 8). When analyzed by age, SNCA depression was largely insignificant in younger participant cohorts, such as those younger than 50, most likely due to limited sample size (Supplementary Fig. 8b).

SNCA exhibited no differential expression in the PD-LRRK2+ vs. PD-LRRK2− and PD-GBA1+ vs. PD-GBA1− comparisons, aligning with our observation that SNCA was solely downregulated in contrast to control samples (Fig. 4e, f). The number of blood cell-enriched genes was also not meaningfully reduced in PD-LRRK2+ vs. PD-LRRK2− after correcting for predicted neutrophil percentage (Supplementary Fig. 6k, l). A reduction was noted in PD-GBA1+ vs. PD-GBA1− (Supplementary Fig. 6g, h). LRRK2 appeared to associate with genes enriched in blood cells, somewhat independent of PD-related blood cell activity22.

Pathway analysis highlights mitochondrial dysfunction in PD

To identify pathways with altered expression in the PD cohorts, we performed Ingenuity Pathway Analysis (IPA) on differential expression results from each PD cohort vs. control cohort analysis. As identified in Craig et al., the most significantly enriched pathway in all cases vs. controls without correcting for neutrophil percentage was ‘Neutrophil Degranulation’ (Supplementary Fig. 9). The IPA of neutrophil-corrected case vs. control DE results successfully eliminated enrichment of the ‘Neutrophil Degranulation’ pathway, with the most significantly enriched pathway identified as ‘Mitochondrial RNA Degradation’ (Supplementary Fig. 10). In PD-LRRK2+ vs. HC and PD-SNCA+ vs. HC, the ‘Neutrophil Degranulation’ pathway was statistically significant, but exhibited depression rather than enrichment in the PD cohorts (Supplementary Figs. 13 and 14).

Neutrophil correction reduced the number of pathways in the case vs. control contrast, with z-scores greater than 2 or less than −2 and p values less than 0.05 from 44 to 12. Notably, the ‘Parkinson’s Signaling Pathway’ exhibited significant expression in case vs. control and PD-LRRK2+ vs. HC with neutrophil correction (Supplementary Figs. 9, 10, 14). However, in IPD, PD-GBA1+, and PD-SNCA+ pathway analysis, ‘Parkinson’s Signaling Pathway’ was present in the results but was insignificant based on our z-score and p value thresholds (Supplementary Figs. 11–13). The absence of significance was likely due to the small sample size, given that numerous ‘Parkinson’s Signaling Pathway’ genes exhibit differential expression in each cohort.

In pathway analyses by genetic cohort, IPA continued to indicate significant enrichment of blood cell and immune-related pathways, even following neutrophil correction. As gene expression data was sequenced from blood samples, high enrichment of these pathways could be attributed to tissue type. Considering this possibility, we identified alternative categories of pathways associated with PD, specifically those related to mitochondrial function. Pathways related to mitochondrial dysfunction were significantly enriched in every PD cohort vs. HC analysis (Supplementary Figs. 10–14). We observed positive enrichment of the ‘Mitochondrial Dysfunction’ pathway in PD-GBA1+, PD-SNCA+, and PD-LRRK2+ samples. Additionally, ‘Granzyme A Signaling’ in PD-LRRK2+, ‘NAFLD Signaling Pathway’ in PD-SNCA+, and ‘Coronavirus Pathogenesis Pathway’ in PD-GBA1+ were pathways associated with mitochondrial dysfunction and positively enriched in PD cohorts. In the IPD vs. HC analysis, ‘BBSome Signaling Pathway’, a pathway positively associated with mitochondrial function, was one of nine significant pathways and was depressed in IPD samples8. ‘Leukocyte Extravasation Signaling’ was another mitochondria-related pathway enriched in the IPD analysis, involved in mitochondrial fission and fusion. This association between PD and mitochondrial dysfunction is further supported by differential expression analysis, as genes in the mitochondrial genome were consistently downregulated in the case vs. control contrasts (Fig. 3b). Mitochondrial gene downregulation strengthened after neutrophil correction (Fig. 3d). We again saw a similar overall downregulation of mitochondrial genes in DE analyses split by PD cohort (Fig. 4a–d). This downregulation is not as apparent in DE analysis split by PD cohort stratified by study (i.e., PPMI and PDBP), especially in GBA1+ samples, likely due to smaller sample sizes (Supplementary Fig. 3). 
PD-LRRK2+ samples appeared to have both lower mitochondrial gene expression and a larger number of significantly enriched mitochondria-related pathways compared to all other PD cohorts (Fig. 4c and Supplementary Fig. 14). Overall, we see consistent evidence of altered mitochondrial activity in relation to PD, with a specific trend towards increased dysfunction in disease samples.

PD cohorts exhibit niches of gene expression in unsupervised analysis

The presence of differentially expressed genes and enriched pathways from DE and IPA analyses indicated that PD cohorts may exhibit distinct overarching transcriptomic profiles. To further investigate any global transcriptomic differences between PD and control cohorts, we applied UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction to identify clusters of HC, IPD, PD-SNCA+, PD-GBA1+, and PD-LRRK2+ samples. As IPA revealed an enrichment of mitochondrial dysfunction pathways, we performed UMAP dimensionality reduction on VST counts of genes in the ‘Parkinson’s Signaling Pathway’ (Fig. 5a), the ‘Mitochondrial Dysfunction’ pathway (Fig. 5b), ‘BBSome Signaling Pathway’ (Fig. 5c), and ‘Leukocyte Extravasation Signaling’ (Fig. 5d), as listed in the IPA database.

Fig. 5: UMAP dimensionality reduction with pathway-specific genes by disease status and genetic cohort.

UMAP embeddings were created from VST gene counts. Counts were then corrected for participant age and sex, as well as sample mRNA percentage and predicted neutrophil percentage. a UMAP of 36 genes found in the ‘Parkinson’s Signaling Pathway’, with samples labeled by disease and genetic status, excluding samples from participants with unknown genetic status (n = 5470). Corresponding density plots were made, stratified by disease and genetic status. b A UMAP and set of density plots were created from 143 genes in the ‘Mitochondrial Dysfunction’ pathway (n = 5339). c A UMAP of 59 ‘BBSome Signaling Pathway’ genes and density plots (n = 5358). d UMAP and density plots of 52 ‘Leukocyte Extravasation Signaling Pathway’ genes (n = 5345). In b–d, a small cluster of ‘Healthy Control’ and ‘Idiopathic PD’ samples was removed for visualization purposes.

None of the four UMAPs demonstrated discernible clustering by cohort, including between control and case samples. As hierarchical clustering similarly failed to discriminate between cohorts, we can infer that PD whole blood gene expression may have no unique global transcriptomic structure compared to controls, even in disease-specific pathways (Supplementary Fig. 15). However, by generating density plots stratified by sample cohort for each UMAP, specific niches of gene expression emerged, particularly within PD-SNCA+, PD-GBA1+, and PD-LRRK2+ samples. In the UMAP of ‘Parkinson’s Signaling Pathway’ genes, we observed that HC and IPD samples were distributed consistently throughout the plot, whereas genetic cohorts clustered near the lower right quadrant (Fig. 5a). In the ‘Mitochondrial Dysfunction’ genes, we observed an even more pronounced division between HC/IPD and PD-SNCA+/PD-GBA1+/PD-LRRK2+, where genetic samples predominantly clustered in a central band, while HC/IPD samples clustered towards the upper and lower extremes of the UMAP (Fig. 5b). PD-SNCA+ samples typically exhibited the most distinct niche of expression for both the ‘Parkinson’s Signaling Pathway’ and ‘Mitochondrial Dysfunction’ genes. Differences between IPD and HC samples, however, were comparatively minor. Subtle distinctions were still discernible, including in the UMAP density plots derived from the IPD-enriched ‘BBSome Signaling Pathway’ and ‘Leukocyte Extravasation Signaling’ (Fig. 5c, d).

Demographic, clinical, and biological factors influencing SNCA gene expression

The most notable finding from our analysis of PD whole blood RNA sequencing data was the significant downregulation of SNCA in DE analyses. We further investigated which phenotypic characteristics of PD could be responsible for the SNCA signal by plotting CPM normalized and logged SNCA gene counts by mutation status and age at baseline.
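
As an illustration of the normalization used for these plots, counts per million followed by a log transform can be sketched as follows. The exact transform is not stated in this section, so the log base (2) and the pseudocount of 1 are assumptions, as are the function name and the example counts.

```python
import math

def log_cpm(gene_count, library_size, pseudocount=1.0):
    """Counts per million for one gene in one sample, then log2(CPM + pseudocount)."""
    cpm = gene_count / library_size * 1_000_000
    return math.log2(cpm + pseudocount)

# Hypothetical sample: 50 reads assigned to a gene out of 10 million mapped reads
example_value = log_cpm(50, 10_000_000)
zero_value = log_cpm(0, 10_000_000)
```

Scaling by library size makes samples sequenced to different depths comparable, and the pseudocount keeps genes with zero reads defined on the log scale.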

The reduction of SNCA expression relative to all controls was most apparent in PD-SNCA+ samples, both in those from participants with a clinical PD diagnosis and without (Fig. 6a). This decreased expression was statistically significant even when solely analyzing baseline samples (Fig. 6b). However, the SNCA+ sample size is much smaller than other PD cohorts, and our results should be interpreted with caution and merit additional validation. Mutations in SNCA are typically missense, indicating that transcriptomic depression of the gene may be due to some indirect mechanism that potentially becomes active in PD23. The observation that PD-SNCA− samples also demonstrated SNCA downregulation suggests that SNCA mutations alone are not solely responsible for diminished gene expression. When examining other potential genetic drivers, we observed a more modest decrease in SNCA expression in PD-LRRK2+ and PD-GBA1+ samples. This decrease was statistically significant when analyzing all samples but became non-significant in analyses of only samples at baseline (Figs. 4g and 6a, b). This is likely due to the relatively smaller pool of PD-LRRK2+ and PD-GBA1+ participants compared to controls and a more subtle decrease in SNCA expression compared to PD-SNCA+ participants (Fig. 6b). The overall behavior of SNCA expression was consistent with our findings in differential expression analysis by genetic cohort (Supplementary Fig. 3). It remains likely that LRRK2 and GBA1, along with SNCA, may indirectly impact the mechanism responsible for decreased SNCA in PD whole blood.

Fig. 6: SNCA expression stratified by demographic, clinical, and biological factors.

a–c Gene counts were log(CPM) normalized and corrected for predicted neutrophil percentage. a SNCA expression in samples stratified by genetic status and diagnosis. The dotted red line represents the median SNCA expression in HC samples (i.e., ‘Control’ and ‘SNCA−/GBA1−/LRRK2−’). b SNCA expression of HC, IPD, PD-SNCA+ samples, PD-GBA1+ samples, and PD-LRRK2+ samples at baseline only. c SNCA expression of IPD samples over age at baseline. All p values were calculated using a Wilcoxon rank-sum test. *, **, *** indicate p values less than 0.05, 0.01, and 0.001, while N.S. indicates no significance.

Further analysis of SNCA expression by participant age at baseline also indicated a possible SNCA-age dependency in PD. Notably, though SNCA expression was only moderately decreased in PD-GBA1+ participants overall, younger PD-GBA1+ participants (i.e. 30s and 40s) appeared to exhibit a larger decrease in SNCA expression similar to PD-SNCA+ participants (Fig. 6a). In IPD samples (i.e. case with SNCA−/GBA1−/LRRK2−), there was a statistically significant decrease in SNCA expression as the baseline age of the participant increases (Fig. 6c and Supplementary Fig. 8). This steady age-related decrease of SNCA was not present in the control samples of any genetic cohort, indicating a possible disease-specific behavior of SNCA expression in whole blood. Overall, it appears that some combination of genetic and age-related factors may contribute to a systemic downregulation of SNCA in the PD whole blood transcriptome.

Discussion

This study aimed to examine the influence of blood cell-enriched gene expression on the differential expression analysis of RNA sequencing samples from whole blood in Parkinson’s disease. We previously identified an over-representation of neutrophil-enriched genes and the ‘Neutrophil Degranulation’ pathway in case-versus-control differential expression (DE) and IPA analyses. To address this, we developed a linear model using 1254 PPMI samples with CBC data to predict neutrophil percentages in an additional 5643 PPMI and PDBP samples. We constructed four models: a linear model using blood cell-enriched gene expression, a second linear model with genes selected through MI feature selection, a third linear model using a combined set of the most significant genes from the prior two models, and an XGBoost regression model trained on the entire gene expression dataset. We chose the combined model for neutrophil prediction due to strong performance in the R-squared, RMSE, and MAE assessments.

We included both the 1254 known neutrophil percentages and the 5643 predicted percentages as a design covariate in our differential expression analysis. This approach successfully eliminated many neutrophil-enriched genes, as well as genes enriched in other blood cells (e.g., lymphocytes, monocytes, basophils, eosinophils, and dendritic cells). After neutrophil correction, IPA analysis of case-versus-control DE results no longer showed an enrichment of the ‘Neutrophil Degranulation’ pathway. This finding suggests that using predicted neutrophil percentages successfully adjusts for neutrophil-related pathway activity. PD-SNCA+ and PD-LRRK2+ did, however, exhibit significant depression of the ‘Neutrophil Degranulation’ pathway, which may be a consequence of neutrophil correction or an area of further investigation. We additionally see significant upregulation of the ‘Parkinson’s Signaling Pathway’ post-correction, further highlighting how disease-specific features of the whole blood transcriptome are more apparent after neutrophil correction. The ‘Parkinson’s Signaling Pathway’ is present, though non-significant, in pathway analysis stratified by PD cohort, likely due to insufficient sample size.

While lowered SNCA has been observed previously in cerebrospinal fluid (CSF)22, this study demonstrates that whole blood not only expresses SNCA at substantial levels but also shows a marked decrease in PD patients. With neutrophil correction, we continued to see a consistent depression and DE signal improvement of SNCA expression in the whole blood transcriptome for all disease cohorts. We no longer see differential expression of PD risk variants LRRK2 and LRP10 in the case vs. control contrast, likely due to the strong association both genes have with neutrophil expression. LRRK2 was identified as a contributor to neutrophil chemotaxis in Mazaki et al.24. LRP10 has been demonstrated to associate with genes implicated in neutrophil degranulation in RNA analysis performed by the Human Blood Atlas19. Consequently, reduced differential expression following neutrophil correction aligned with the biological function of LRRK2 and LRP10.

Why SNCA downregulation occurs is an open question. For one, SNCA appears to be relevant in PD beyond the gene’s involvement in blood cell function. While SNCA is not highly expressed in neutrophils, SNCA is expressed in plasmacytoid dendritic cells (115.5 pTPM), classical monocytes (62.7 pTPM), and basophils (36.8 pTPM)25. We demonstrated that neutrophil correction decreased the number of differentially expressed blood cell-enriched genes across all cell types, which may explain the greater magnitude and significance of SNCA downregulation after correction. Including neutrophil percentage in our analyses uncovered a stronger gene expression signal of SNCA in PD cohorts that occurs independently of immune cells. In each PD cohort, we observed consistent SNCA downregulation relative to healthy controls, including in IPD samples lacking SNCA+/GBA1+/LRRK2+. This pattern suggests that lowered SNCA expression in whole blood may be a significant hallmark of PD.

Although this reduction reaches statistical significance, its utility as a biomarker is limited: mRNA-based measures are prone to degradation, and the lack of specificity is problematic for a relatively rare disorder such as PD. We also investigated whether PD might involve specific SNCA isoforms. However, because our random-priming assay detects both spliced and pre-spliced mRNA, we found no clear evidence of differential splicing. It should be noted, though, that ruling out isoform-specific variations conclusively is challenging. Importantly, our results suggest that other protein-based biomarkers in blood may warrant further exploration, especially considering promising α-synuclein seed amplification assays (SAA) in CSF. More recently, Kluge and colleagues have shown potential for SAA in PRKN-PD26. Despite the limited promise of RNA-based biomarkers in this study, these data underscore important biological insights into SNCA and its role in PD.

We further identified multiple mitochondria-related pathways enriched in PD cohorts with IPA. Most notable was the positive enrichment of the ‘Mitochondrial Dysfunction’ pathway in PD-SNCA+, PD-GBA1+, and PD-LRRK2+ samples. SNCA is associated with mitochondrial function and is present in the ‘Mitochondrial Dysfunction’ pathway, suggesting that mitochondrial activity may contribute to the SNCA DE signal in PD. Dysfunction of mitochondria has previously been tied to neuronal cell death and neurodegenerative disease27,28,29. Mitochondrial inhibition in DA neurons was also demonstrated to induce Parkinsonian motor symptoms in both primates and humans30. In the same PPMI samples used in our study, Barnhoorn et al. observed reduced mitochondrial function in PD patients and found that the severity of mitochondrial dysfunction correlates with disease severity in gene set enrichment analysis31. UMAP density plots also provide evidence of altered mitochondrial activity, with PD cohorts clustering in specific regions of gene expression rather than forming a distinct transcriptomic profile. The lack of distinct global structural differences between PD cohorts and HC samples corresponds with the high degree of complexity in PD development and expression. More sophisticated methods of unsupervised analysis may provide a better understanding of the highly nuanced transcriptomic differences in PD related to mitochondrial activity and dysfunction.

This study does have notable limitations. Although we used the largest compiled PD transcriptomic dataset available, our analyses were confined to clinically diagnosed PD participants and controls. The relatively small number of prodromal and SWEDD patients prevented detailed investigation of those groups. We also conducted analyses by sample rather than by participant, since individuals with SNCA/GBA1/LRRK2 mutations compose a relatively small proportion of our total pool of participants. As such, participant-associated sample dependencies may be unaccounted for in some statistical tests. While SNCA may be a strong signal of PD, significant transcriptomic differences that typify IPD are still unclear and will likely require further single-cell level analysis to elucidate. Long-read RNA-sequencing may also provide a more detailed picture of what specific transcripts are being expressed. Transcriptomic differences between control and PD samples may exist in gene isoforms that are difficult to parse from short-read sequencing, even with splice junction information. Overall, future research should employ both single-cell and long-read transcriptomic analyses to pinpoint which differentially expressed genes and pathways are causally linked to PD pathogenesis, as opposed to only associated with the disease.

In conclusion, this study incorporated neutrophil percentage correction into differential transcriptomic analysis of PD whole blood samples. We see a stronger depression of SNCA expression, which may be caused by mitochondrial dysfunction and other related mitochondrial pathway activity. Correcting for immune cell-enriched genes in PD whole blood RNA analysis can uncover more relevant pathways in the PD transcriptomic profile, which will help guide future work in blood-based analysis of the disease.

Methods

Data

All data used in the study were collected and processed according to the protocols outlined by the Parkinson’s Progression Markers Initiative (PPMI) and Accelerating Medicines Partnership Parkinson’s Disease (AMP PD®) program11,12. Patient samples from both PPMI and the Parkinson’s Disease Biomarkers Program (PDBP, provided through AMP PD®) were labeled by their disease status (PD or non-PD). Genetic group (GBA1+/−, LRRK2+/−, and SNCA+/− germline mutation status) was determined by genetic testing or whole genome sequencing variant data, as provided within AMP PD® and assigned by Craig et al.13. LRRK2+ participants carried at least one of the following mutations: LRRK2 R1441G, LRRK2 R1441C, LRRK2 R1628P, LRRK2 Y1699C, LRRK2 G2019S, or LRRK2 G2385R. GBA1+ participants carried at least one of the following mutations: GBA1 N370S, GBA1 T408M, GBA1 E365K, GBA1 IVS2, GBA1 84GG, or GBA1 L444P. SNCA+ participants carried at least one of the following mutations: SNCA A53T, SNCA E46K, or SNCA A30P. A subset of 1254 PPMI patient samples included complete blood count (CBC) and neutrophil percentage.

Written informed consent to data collection and sharing was obtained from all participants by both PPMI and PDBP. Ethical guidelines on human data collection were adhered to. The PPMI project was approved by the Institutional Review Board or Independent Ethics Committee of all participating sites in Europe, including Attikon University Hospital (Greece), Hospital Clinic de Barcelona and Hospital Universitario Donostia (Spain), Innsbruck University (Austria), Paracelsus-Elena-Klinic Kassel/University of Marburg (Germany), Imperial College London (UK), Pitié-Salpêtrière Hospital (France), University of Salerno (Italy), and in the USA, including Emory University, Johns Hopkins University, University of Alabama at Birmingham, PD and Movement Disorders Center of Boca Raton, Boston University, Northwestern University, University of Cincinnati, Cleveland Clinic Foundation, Baylor College of Medicine, Institute for Neurodegenerative Disorders, Columbia University Medical Center, Beth Israel Medical Center, University of Pennsylvania, Oregon Health and Science University, University of Rochester, University of California at San Diego, and University of California, San Francisco. The PDBP project was approved by the Institutional Review Board or Independent Ethics Committee of all participating sites in the USA, including University of Pennsylvania, University of Texas Southwestern Medical Center, Johns Hopkins University, Penn State Milton S. Hershey Medical Center, Harvard Medical School, Brigham and Women’s Hospital, University of Florida, University of Washington, Columbia University, Pacific Northwest National Laboratory, and University of Alabama at Birmingham.

Genome and transcriptome alignment, quantification, and quality control

Samples were sequenced, aligned, and quantified as part of PPMI. Sequencing was done using the Illumina NovaSeq 6000 platform, after which FASTQ files were aligned to the GRCh38 human genome using STAR 2.6.1d. Gene counts were created using featureCounts 1.6.2 and GENCODE 29 annotations.

A total of 8461 samples were provided by PPMI (labeled as PPMI) and AMP PD® (labeled as either PDBP or BioFIND). Only samples with a RIN value of greater than 6, usable bases percentage between 20% and 65%, and chimeric reads percentage below 3% were included in both the neutrophil percentage prediction model development and differential gene expression analysis. Additionally, 166 BioFIND samples were removed due to the relatively small sample size compared to PPMI and PDBP. Of the 6897 passing samples, 1254 were used for neutrophil prediction model development. A total of 6490 samples from participants labeled as ‘Case’ or ‘Control’ at baseline were included in differential expression analyses.
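The inclusion criteria above can be sketched as a simple per-sample filter. This is a minimal illustration, not the pipeline used in the study: the `SampleQC` class and its field names are hypothetical, the toy values are invented, and treating the 20% to 65% usable-bases range as inclusive is an assumption, since the text only says "between".

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """QC metrics for one RNA-seq sample (field names are hypothetical)."""
    sample_id: str
    study: str                 # "PPMI", "PDBP", or "BioFIND"
    rin: float                 # RNA integrity number
    usable_bases_pct: float    # percentage of usable bases
    chimeric_reads_pct: float  # percentage of chimeric reads


def passes_qc(s: SampleQC) -> bool:
    """Apply the inclusion criteria described in the text."""
    return (
        s.study != "BioFIND"                  # BioFIND samples were removed
        and s.rin > 6                         # RIN strictly greater than 6
        and 20 <= s.usable_bases_pct <= 65    # assumed inclusive bounds
        and s.chimeric_reads_pct < 3          # chimeric reads below 3%
    )


# Toy samples with invented values, one per failure mode.
samples = [
    SampleQC("S1", "PPMI", 7.5, 45.0, 1.2),     # passes every criterion
    SampleQC("S2", "PDBP", 5.9, 45.0, 1.2),     # fails on RIN
    SampleQC("S3", "PPMI", 8.0, 70.0, 1.2),     # fails on usable bases
    SampleQC("S4", "PDBP", 8.0, 30.0, 3.5),     # fails on chimeric reads
    SampleQC("S5", "BioFIND", 8.0, 30.0, 1.0),  # excluded study
]
kept = [s.sample_id for s in samples if passes_qc(s)]
```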

Neutrophil percentage linear modeling and prediction

Genes enriched in white blood cells (neutrophils, eosinophils, basophils, monocytes, lymphocytes, and dendritic cells) were identified using annotations from the Human Blood Atlas25. Linear models were developed using the lm() function in R. Backward elimination was applied recursively until the only genes used in the model had p-values less than 0.05. Mutual information feature selection was conducted using SelectKBest() and mutual_info_regression from sklearn in Python. The XGBoost regression model was built using the xgboost R package with the parameters nrounds = 10, eta = 0.3, and max_depth = 3. All four models were compared using the Pearson R-squared, root mean squared error, and mean absolute error across 100 train-test splits.
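The comparison procedure (Pearson R-squared, RMSE, and MAE averaged over repeated train-test splits) can be sketched as follows. This is an illustrative Python sketch only: the study fitted its linear models with lm() in R, whereas this example uses NumPy least squares on synthetic data. The function names, the 20% test fraction, and the synthetic predictors are all assumptions that do not come from the study.

```python
import numpy as np


def evaluate_split(X, y, rng, test_frac=0.2):
    """Fit ordinary least squares on a random training split, score the rest."""
    n = len(y)
    idx = rng.permutation(n)
    n_test = max(1, int(round(n * test_frac)))  # test fraction is an assumption
    test, train = idx[:n_test], idx[n_test:]

    # Add an intercept column and solve the least-squares problem.
    A_train = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

    A_test = np.column_stack([np.ones(len(test)), X[test]])
    pred = A_test @ coef
    err = pred - y[test]

    r = np.corrcoef(pred, y[test])[0, 1]  # Pearson correlation
    return {
        "r2": float(r ** 2),                        # Pearson R-squared
        "rmse": float(np.sqrt(np.mean(err ** 2))),  # root mean squared error
        "mae": float(np.mean(np.abs(err))),         # mean absolute error
    }


def compare_over_splits(X, y, n_splits=100, seed=0):
    """Average each metric over n_splits random train-test splits."""
    rng = np.random.default_rng(seed)
    scores = [evaluate_split(X, y, rng) for _ in range(n_splits)]
    return {k: float(np.mean([s[k] for s in scores])) for k in scores[0]}


# Synthetic stand-in for gene expression (X) and neutrophil percentage (y).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 60 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=1.0, size=200)
summary = compare_over_splits(X, y)
```

Running the same function on each candidate feature set would give comparable metric summaries, which is the kind of side-by-side assessment the text describes.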

Variance analysis

PCA was calculated using the plotPCA() function from DESeq2, and batch correction was conducted with the removeBatchEffect() function from limma. PCs were correlated to the technical (study, plate, usable bases, etc.) and biological (neutrophil percentage, age, sex, etc.) variables of the samples using a Spearman’s rank correlation or intraclass correlation if the variable was continuous or categorical, respectively. The significance of each correlation was evaluated using either a Spearman’s or an ANOVA test p value for continuous or categorical variables, respectively. Categorical variables with singular unique values (sample_id, participant_id, diagnosis_at_baseline, diagnosis_latest) could not be evaluated using either method; as such, p values were set to 0 for all PCs.
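For a continuous variable, the PC-to-covariate correlation can be sketched as below. This is a rough Python illustration under stated assumptions: the study used plotPCA() from DESeq2 in R, while this example computes principal components with a plain SVD. The helper names and the synthetic data are hypothetical.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


# Synthetic example: one covariate drives most of the expression variance.
rng = np.random.default_rng(0)
neutrophil_pct = rng.uniform(40, 80, size=50)
loadings = rng.normal(size=20)
expr = np.outer(neutrophil_pct, loadings) + rng.normal(scale=1.0, size=(50, 20))

# PCA via SVD of the column-centered matrix; rows are samples.
centered = expr - expr.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pc_scores = U * S  # sample coordinates on each principal component

rho_pc1 = spearman_rho(pc_scores[:, 0], neutrophil_pct)
```

The sign of a principal component is arbitrary, so only the magnitude of `rho_pc1` is meaningful here.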

Differential expression analysis

All differential expression analyses were conducted using the well-developed limma-voom 3.58.1 framework in R. We used a p value threshold of 0.05 and a log fold change threshold of 0.1. Each comparison used the design = ~0 + clinical diagnosis + sex + percent mRNA bases + predicted neutrophil percentage + age squared, where age is determined by the age at patient enrollment. The design matrix was determined through successive testing and variance analysis to identify the most significant and biologically relevant covariates (Supplementary Fig. 2 and Supplementary Table 6).
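To make the design formula concrete, the sketch below builds one design-matrix row per sample and applies the stated thresholds. It is only an illustration in Python, not the limma-voom code used in the study. The function names, the "Case"/"Control" and "Male" labels, and the indicator coding for sex are assumptions. The text does not say whether the 0.05 threshold here applies to raw or adjusted p values, so the argument is named generically.

```python
def design_row(diagnosis, sex, pct_mrna_bases, neutrophil_pct, age_at_enrollment,
               diagnosis_levels=("Case", "Control")):
    """One row of a design matrix for
    ~0 + diagnosis + sex + percent mRNA bases + neutrophil percentage + age^2."""
    # With no intercept ("~0"), each diagnosis level gets its own indicator column.
    diagnosis_cols = [1.0 if diagnosis == level else 0.0
                      for level in diagnosis_levels]
    sex_male = 1.0 if sex == "Male" else 0.0  # assumed indicator coding
    return diagnosis_cols + [
        sex_male,
        float(pct_mrna_bases),
        float(neutrophil_pct),
        float(age_at_enrollment) ** 2,        # age enters as a squared term
    ]


def is_differentially_expressed(p_value, log_fc, p_max=0.05, lfc_min=0.1):
    """Apply the p value and absolute log fold change thresholds from the text."""
    return p_value < p_max and abs(log_fc) > lfc_min


row = design_row("Case", "Male", 0.8, 61.5, 65)
```

Because the diagnosis columns replace the intercept, a case-versus-control contrast is the difference between the first two coefficients.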

Genes were labeled in volcano plots by category of interest. Causal variants of PD were determined by evidence in previous studies. Mitochondrial genes were labeled based on genes in the HUGO Gene Nomenclature Committee’s (HGNC) mitochondrial genome list32. Pathway genes were compiled and identified from statistically significant pathways in Ingenuity Pathway Analysis (IPA) analyses. Leukocyte-enriched genes were identified per the Human Blood Atlas.

Ingenuity pathway analysis

Differential expression analysis results from multiple comparisons were used to conduct pathway analysis with QIAGEN Ingenuity Pathway Analysis (IPA) software (QIAGEN Inc., https://digitalinsights.qiagen.com/IPA). For DE results from each comparison, we ran IPA Expression Analysis, which compares given differential gene expression to IPA’s libraries of well-characterized pathways. Based on a given p value and log fold change threshold, IPA calculates an activation z-score for each pathway. Pathways with an absolute activation z-score greater than 2 and a p value less than 0.5 were considered significantly expressed.

For IPA of DE results without neutrophil correction, an adjusted p value threshold of 0.05 and log fold change thresholds of <−0.1 and >0.1 were applied to identify significant genes. For IPA of DE results with neutrophil correction, an adjusted p value threshold of 0.05 and log fold change thresholds of <−0.085 and >0.065 were used to avoid biased z-score calculations (i.e., to ensure that the number of upregulated genes is approximately equivalent to the number of downregulated genes) and to include a similar number of differentially expressed genes as the uncorrected analysis (~500 upregulated and ~500 downregulated). Log fold change thresholds of <−0.1 and >0.075 and an adjusted p value threshold of 0.05 were used for the IPD vs. HC analysis (394 downregulated and 396 upregulated genes). The PD-GBA1+ vs. HC analysis was conducted with the same adjusted p value threshold and log fold change thresholds of <−0.1 and >0.09 (1140 downregulated and 1115 upregulated genes). The PD-LRRK2+ vs. HC analysis was conducted with the same adjusted p value threshold and log fold change thresholds of <−0.14 and >0.15 (1420 downregulated and 1421 upregulated genes). Finally, the PD-SNCA+ vs. HC analysis was conducted with the same adjusted p value threshold and log fold change thresholds of <−0.1 and >0.14 (1618 downregulated and 1625 upregulated genes). Genes from significant pathways related to mitochondrial function were then compiled and used in UMAP dimensionality reduction.
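The balancing step can be illustrated by counting up- and downregulated genes under a pair of asymmetric log fold change thresholds. This is a minimal sketch: the `DEResult` type, the gene names, and the toy values are hypothetical and are not taken from the study's results.

```python
from typing import NamedTuple


class DEResult(NamedTuple):
    """One gene from a differential expression table (hypothetical layout)."""
    gene: str
    log_fc: float
    adj_p: float


def count_up_down(results, lfc_down, lfc_up, adj_p_max=0.05):
    """Count genes passing the adjusted p value and asymmetric logFC thresholds."""
    up = sum(1 for r in results if r.adj_p < adj_p_max and r.log_fc > lfc_up)
    down = sum(1 for r in results if r.adj_p < adj_p_max and r.log_fc < lfc_down)
    return up, down


# Toy results with invented values.
toy = [
    DEResult("GENE_A", 0.20, 0.001),
    DEResult("GENE_B", 0.07, 0.010),
    DEResult("GENE_C", -0.09, 0.020),
    DEResult("GENE_D", -0.12, 0.001),
    DEResult("GENE_E", 0.30, 0.200),   # fails the adjusted p value threshold
    DEResult("GENE_F", -0.05, 0.001),  # log fold change too small
]

symmetric = count_up_down(toy, lfc_down=-0.1, lfc_up=0.1)
asymmetric = count_up_down(toy, lfc_down=-0.085, lfc_up=0.065)
```

Scanning a small grid of threshold pairs with a function like this is one way to find cutoffs that yield roughly equal numbers of up- and downregulated genes.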

Dimensionality reduction and density plots

Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction was conducted using the umap() function in the R package umap version 0.2.10.0. Gene counts were normalized and transformed using DESeq2 vst() before applying dimensionality reduction. To make the density plots, samples were labeled with ‘Healthy Control’ if the participant was not diagnosed with PD and was SNCA−/GBA1−/LRRK2−, ‘Idiopathic PD’, if the participant was diagnosed with PD and was SNCA−/GBA1−/LRRK2−, PD-SNCA+ if the participant was diagnosed with PD and had a SNCA mutation, PD-GBA1+ if the participant was diagnosed with PD and had a GBA1 mutation, and PD-LRRK2+ if the participant was diagnosed with PD and had a LRRK2 mutation. Density plots were created using ggplot2 and the stat_density_2d() function. Mitochondrial genes were identified using the HUGO Gene Nomenclature Committee (HGNC) Mitochondrial genome gene group. ‘Mitochondrial Dysfunction’, ‘BBSome Signaling Pathway’, ‘Leukocyte Extravasation Signaling’, and ‘Parkinson’s Signaling Pathway’ genes were identified based on the corresponding molecule list in IPA.
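The labeling rules for the density plots can be written as a small function. This is a sketch under assumptions: the function name and boolean arguments are hypothetical. Because the text does not state how a participant carrying more than one mutation was handled, the function returns every applicable label rather than imposing a priority. Undiagnosed mutation carriers receive no label here because the text defines none for them.

```python
def cohort_labels(diagnosed_pd, snca, gba1, lrrk2):
    """Return the density-plot labels that apply to one participant."""
    carrier = snca or gba1 or lrrk2
    if not diagnosed_pd:
        # Healthy controls are undiagnosed and SNCA-/GBA1-/LRRK2-.
        return [] if carrier else ["Healthy Control"]
    if not carrier:
        return ["Idiopathic PD"]
    labels = []
    if snca:
        labels.append("PD-SNCA+")
    if gba1:
        labels.append("PD-GBA1+")
    if lrrk2:
        labels.append("PD-LRRK2+")
    return labels
```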

Statistical software

Statistical analyses were conducted in either R version 4.3.1 or Python version 3.7.16. Linear models were created in R using the lm() function from the R stats package, and the XGBoost model was built using the R package xgboost version 1.7.8.1. The package scikit-learn version 1.0.2 was used for mutual information feature selection in Python. Differential expression analysis and variance analysis were conducted in R using DESeq2 version 1.40.2, limma version 3.58.1, and edgeR version 3.42.4. Plots were made using either ggplot2 version 3.4.4 or EnhancedVolcano version 1.18.0. The geom_signif() function in ggpubr version 0.6.0 was used whenever a two-tailed Wilcoxon rank sum test was applied.
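For readers unfamiliar with the test, a two-tailed Wilcoxon rank sum test and the significance labels used in the figures can be sketched in plain Python. This is an illustrative normal approximation without tie or continuity correction, so it will not exactly reproduce the p values reported by the R functions used in the study. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank sum p value via the normal approximation."""
    x, y = list(x), list(y)
    n1, n2 = len(x), len(y)
    ranks = average_ranks(x + y)
    rank_sum_x = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / sd
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * P(Z > |z|)


def significance_label(p):
    """Map a p value to the star notation used in the figure legends."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "N.S."


p_separated = rank_sum_p_value(range(1, 11), range(11, 21))
p_identical = rank_sum_p_value([1, 2, 3, 4], [1, 2, 3, 4])
```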