Abstract
Understanding diseases as the result of continuous transitions from a healthy system is more realistic than understanding them as discrete states. Here, we designed the spectrum formation approach (SFA), a machine learning-based method that extracts key features contributing to disease state continuity. We applied the SFA to transcriptomic data from patients with progressive liver disease and neurodegenerative movement disorders to examine its effectiveness in identifying biologically relevant gene sets. The SFA identified transcription factors that potentially regulate liver inflammation and voluntary movement. In neurodegenerative disorders, the SFA also identified genes regulated by ETS-1, with unclear effects on movement. In functional assessment using human iPSC-derived neurons, ETS-1 overexpression disrupted dopamine receptor balance, reduced GABA-producing enzyme levels, and promoted cell death. These findings suggest that the SFA enables the discovery of regulatory factors capable of modifying disease states and provides a framework for the continuity-based interpretation of biological systems.
Similar content being viewed by others
Introduction
With recent advances in molecular biology, numerous efforts have been made to discover new drugs and therapeutic targets by gaining a deeper understanding of the pathophysiological mechanisms underlying these diseases. A predominant approach in disease characterization involves identifying disease-specific features that distinguish “healthy states” from “disease states” and leveraging these distinctions for drug development and treatment strategies. For instance, in Alzheimer’s disease, the accumulation of amyloid-β in the brain is a well-established hallmark of disease progression. Several pharmaceutical interventions targeting this accumulation have demonstrated some efficacy in improving cognitive function1.
Many studies aimed at identifying disease-related features assume a discrete classification of health and disease, where the absence of a disease-specific feature is represented as “0” (healthy state) and its presence as “1” (disease state)2,3. This discrete interpretation is also commonly adopted in machine learning-based analyses, which have been widely applied in the life sciences to identify disease biomarkers from large-scale omics data4,5. However, except in certain congenital disorders, disease progression generally occurs as a continuous transition from a healthy state to a diseased state, particularly in aging-related or progressive diseases. Thus, the strict dichotomy between healthy and diseased states may not adequately reflect actual disease classification.
Progressive liver diseases, such as metabolic dysfunction-associated steatotic liver disease (MASLD) and steatohepatitis (MASH) exemplify this continuous transition (These conditions, formerly referred to as non-alcoholic fatty liver disease (NAFLD) and non-alcoholic steatohepatitis (NASH), have been redefined to encompass broader metabolic disorders independent of alcohol consumption or fat accumulation. Terms in the text are described based on the time of data acquisition). Progression from a healthy liver to MASLD, and subsequently to MASH, involves a continuum of pathological changes, including fat accumulation, inflammatory responses, fibrosis, and cirrhosis6,7. Notably, such continuous disease progression is observed not only at the individual level but also at the population level. Even among patients diagnosed with the same disease, variability in disease severity, progression rate, and genetic predisposition results in diverse disease progression trajectories. From a macroscopic perspective, this diversity suggests that the transition from a healthy state to a diseased state forms a spectrum rather than a strict dichotomy.
Therefore, a continuous rather than binary interpretation of disease features may provide a more accurate representation of disease mechanisms. Identifying the factors that contribute to this continuum could enhance our understanding of disease pathology and facilitate the discovery of novel therapeutic targets. The concept of continuous disease progression aligns with the notion that recognizing “pre-disease states” can significantly contribute to disease management. Various mathematical and data-driven approaches, such as dynamic biomarkers8 and statistical restoration of fragmented time course9, have been proposed to capture pre-disease states and predict disease progression over time. Additionally, a multistate disease progression model was introduced to probabilistically estimate state transitions based on clinical and epidemiological data10,11. This model defines multiple discrete disease states and predicts the transitions between them, providing valuable insights into disease prognosis for medical decision-making12.
Despite the growing emphasis on analyzing disease progression as a continuum, challenges remain in identifying the key factors governing these transitions, particularly when temporal data are limited. Furthermore, in addition to sequential transitions over time, analyzing the relationships between adjacent states in a generalized manner cannot be ignored. Factors governing the adjacency of specific symptom states, independent of disease progression, could serve as effective therapeutic targets. However, no established method currently exists for the efficient extraction of transition-related factors from limited datasets. Developing an analytical framework capable of extracting key contributors to disease state continuity from large-scale biological information, including genomic, transcriptomic, proteomic, and metabolomic data, can facilitate disease understanding and drug discovery.
In this study, we developed a simple methodological approach to extract information related to disease state transitions from transcriptome data. We focused on time-series data representing the progression of steatohepatitis. Additionally, we analyzed neurodegenerative diseases, including Parkinson’s disease (PD) and Huntington’s disease (HD), as models of symptom similarities and symmetry. PD is characterized by the degeneration of dopaminergic neurons in the substantia nigra, leading to increased inhibitory output of the basal ganglia and hypokinetic symptoms13,14. In contrast, HD is characterized by the degeneration of striatal neurons in the indirect pathway, resulting in decreased inhibitory output and hyperkinetic symptoms14,15,16,17. From the perspective of motor symptoms, PD and HD form a spectrum, with the non-PD/HD group positioned between them. Although these diseases are primarily caused by neurodegeneration, their effects propagate to the surrounding neuronal circuits, thereby influencing motor control. This suggests that interventions targeting factors other than neurodegeneration may modify the trajectory of motor symptoms. By investigating the factors contributing to this symptom spectrum, our study aimed to evaluate whether treating diseases as a continuum rather than as discrete entities offers novel insights into disease understanding and therapeutic intervention strategies.
Results
Design of a loss function to represent overlap between states
When utilizing biological data presumed to include important factors for disease onset to distinguish between disease and non-disease groups, the inherent complexity of biological systems renders the complete separation of disease and non-disease groups highly unlikely. Therefore, it is more realistic to conceptualize these groups as forming a contiguous spectrum. Based on this premise, a machine learning approach was designed to extract disease-related information from biological data by assuming a spectral distribution. Conventionally, classification problems are addressed by minimizing a loss function, such as log loss (or cross-entropy loss), which aims to drive the predicted class probabilities to zero or one. Here we consider multiclass classification, where \(N\) is the total number of samples and \(M\) is the number of classes. Let \({y}_{{ij}}\) represent the ground-truth label indicating whether the \(i\) th sample belongs to the \(j\) th class, and let \({p}_{{ij}}\) denote the probability that the classifier assigns the \(i\) th sample to the \(j\) th class. The multiclass log-loss function is computed as follows:
If the \(i\) th sample belongs to the \(k\) th class, the corresponding ground-truth label vector can be expressed as follows:
and it is generally represented as the following one-hot vector:
However, in disease analysis, the feature space often exhibits a partial overlap between disease and non-disease groups. Consequently, the optimal ground-truth labels are not necessarily restricted to zero or one. To mimic such overlaps, we define the set \({M}_{{adj},k}\) of groups adjacent to the \(k\) th group and introduce the overlap weights \({w}_{k,l}\) as hyperparameters representing the fraction of the \(k\) th group’s feature space overlapping with \(l\) th group. Thus, for a sample from the \(k\) th group, the corresponding ground-truth label is defined as:
with both \({y}_{i,j}\) and \({w}_{k,l}\) constrained to the interval [0,1]. This implementation is almost the same approach known as label smoothing18,19,20,21. Considering the three groups A, B, and C, where overlaps exist between A and B, as well as between B and C, but not between A and C, the ground-truth label vectors for the samples from each group are as follows:
This enables us to inherently define a state transition sequence (e.g., A→B→C or the reverse) that is particularly useful for exploring background factors associated with disease progression and symptom severity. Based on these modified true labels, we constructed a spectrum formation classifier (SFC) that optimizes a modified multiclass log-loss function incorporating the overlap weights \({w}_{k,l}\) (denoted as \({SFC}({w}_{{k}{,}{l}})\)) and a general classifier (GC) that optimizes a general multiclass log-loss function. This is based on the assumption that information prioritized in the SFC, derived by evaluating the differences between important information in SFC and that in GC, is critical for extracting factors that are important for the formation of overlaps and/or spectra. Based on this information, an approach for investigating the factors that contribute to spectrum and/or overlap formation was developed, called the spectrum formation approach (SFA) (Fig. 1A). Incorporating researchers’ domain knowledge into the overlap weights allows SFA to uncover expert-relevant factors that span multiple groups, in contrast to conventional data-driven clustering methods (Fig. 1B).
A The SFA is devised to extract information that is prioritized in the SFC. For a given dataset, the GC and SFC are constructed and evaluated the feature importance separately. The differences between GC and SFC was focused on. B SFA enables knowledge-guided analysis by allowing researchers to define conceptual relationships between states (e.g., through overlap weights), in contrast to conventional classification analyses, which are entirely data-driven.
SFC prioritizes dimensions determining state continuity
When applying the SFA to real disease data, the degree of overlap between the classified state groups is often unclear. To evaluate whether the SFA can extract information related to state continuity even when the set overlaps in the SFC differ from the actual data, we generated a synthetic dataset (DS1) assuming transcriptomic data with 20,000 dimensions. In DS1, all dimensions except two (“gene 1” and “gene 2”) were randomly sampled from a uniform distribution, while gene 1 and gene 2 were sampled from uniform distributions with partial overlaps between groups (Fig. 2A). For gene 1, the overlap weights were set as \({w}_{{AB}}=0.1\), \({w}_{{BA}}=0.1\), \({w}_{{BC}}=0.1\), \({w}_{{CB}}=0.1\), \({w}_{{AC}}=0\), and \({w}_{{CA}}=0\), and there was no overlap between groups A and C. For gene 2, the overlap weights were set as \({w}_{{AB}}=0.1\), \({w}_{{BA}}=0.1\), \({w}_{{BC}}=0.1\), \({w}_{{CB}}=0.1\), \({w}_{{AC}}=1.0\), and \({w}_{{CA}}=1.0\). This situation mimics the case in which a certain phenotype, such as neurodegeneration, is common but another phenotype, such as the movement phenotype, exhibits a spectrum between states A and C, with B being a control condition (Fig. 2B). We constructed the multiple \({SFC}({w}_{AB}={w}_{BA}={w}_{BC}={w}_{CB}={w}_{hyper},\,{w}_{AC}={w}_{CA}=0)\) and GC, where the hyperparameter \({w}_{{hyper}}\) was chosen from \(\left[0.05,\,0.1,\,0.15,\,0.2,\,0.25,\,0.3,\,0.35,\,0.4,\,0.45,\,0.5\right]\). For each classifier, we evaluated the extent to which genes 1 and 2 were emphasized for classification, based on feature importance. The results indicated that, compared with the GC, all SFCs assigned relatively higher importance to gene 1 (Fig. 2C, D). Nevertheless, the classification accuracy was apparently decreased when \({w}_{{hyper}}\ge 0.35\), a value over threefold higher than the actual value (Fig. 2E). These results suggest that the SFC places a relatively greater emphasis than the GC on the dimensions that determine state continuity. We also evaluated the detectability using a single \({SFC}\left({w}_{{AB}}={w}_{{BA}}={w}_{{BC}}={w}_{{CB}}=0.05,\,{w}_{{AC}}={w}_{{CA}}=0\right)\) on synthetic datasets with varying degrees of overlap (Supplementary Fig. 1). In these cases, the SFC placed relatively greater emphasis than the GC on the dimensions that define state continuity. From these results, even when the preset overlap hyperparameters differed from the actual data, the SFC maintained its detection power for dimensions defining the disease continuum. However, its reliability should be examined based on classification accuracy. In addition, under classification by the SFC with sufficient accuracy, it is considered that assuming \({w}_{k,l}={w}_{l,k}\) in the analysis remains a viable approach for extracting information related to the formation of the spectrum and overlap, although such symmetry is not generally guaranteed.
For dataset DS1, SFC \(\left({w}_{{AB}}={w}_{{BA}}={w}_{{BC}}={w}_{{CB}}=0.05 \sim 0.5,{w}_{{AC}}={w}_{{CA}}=0\right)\) and the GC were constructed, and the feature importance outputs from each classifier were compared. A Violin plots showing the distribution of group data for each dimension in generated dataset DS1, which comprises 20,000 genes mimicking real disease data and exhibits continuity among groups A–C. B Scatter plot showing the group data distribution for genes 1 and 2 in DS1. C Feature importance of genes 1 and 2 as the outputs of the SFC and GC. D p-values calculated using Welch’s t-test based on the feature importance outputs from the SFC and GC. E Test data accuracy for each classifier. Mean ± SD, n = 5.
Impact of multiple disease continuum-related dimensions on the detection power of the SFA
When multiple features contribute to both the formation of state continuity and state classification, the feature importance output by the SFC may diminish, potentially reducing the detection power for information related to the formation of the spectrum and the overlap in the SFA. To examine this, we generated a new dataset (DS2) comprising multiple features contributing to both the formation of inter-state continuity and group classification (Fig. 3A) and evaluated the detection capability of the SFA for the features that determine state continuity. In DS2, genes 1 and 2 were designed to be distributed continuously across groups A, B, and C with overlap weights \({w}_{{AB}}=0.05\), \({w}_{{BA}}=0.05\), \({w}_{{BC}}=0.05\), \({w}_{{CB}}=0.05\), \({w}_{{AC}}=0\), and \({w}_{{CA}}=0\). Based on DS2, we built \({SFC}\left({w}_{{AB}}={w}_{{BA}}={w}_{{BC}}={w}_{{CB}}=0.05,\,{w}_{{AC}}={w}_{{CA}}=0\right)\) and GC. Next, we evaluated the extent to which genes were emphasized for classification based on their feature importance. As a result, it was suggested that, even in datasets containing multiple disease features, SFC significantly prioritizes gene 1 and gene 2—features contributing to state continuity—over GC (Fig. 3B). However, when the difference in feature importance for genes contributing to state continuity between the SFC and GC in DS1 was compared to that in DS2 using Welch’s t-test, the corresponding p-value for DS2 was higher (Fig. 3C). These results suggest that when multiple factors contribute to state continuity, the probability of detecting spectral features decreases. For example, in transcriptome data analysis, expression correlations among specific genes are anticipated, potentially leading to reduced detection power for spectral features. Therefore, when analyzing actual disease data, it may be necessary to employ strategies such as reducing the dimensionality of the data as much as possible. In addition, we also tested cases where data are composed of multiple subpopulations, in each of which different genes are contributing to state continuity (Supplementary Fig. 2). By increasing the number of subpopulations, the accuracy of classifiers decreased, however, when we focused on the rank order of the differences of feature importance value between SFC and GC, genes contributing to state continuity were intensified. Thus, the detection power is expected to be maintained by proper data compression together with relative evaluation of genes.
For dataset DS2, \({SFC}\left({w}_{{AB}}={w}_{{BA}}={w}_{{BC}}={w}_{{CB}}=0.05,{w}_{{AC}}={w}_{{CA}}=0\right)\) and the GC were constructed, and the feature importance outputs of each classifier were compared. A Violin plots showing the distribution of groups A, B, and C for each dimension in the generated dataset DS2, which includes multiple features that form a continuity among groups A–C. In the violin plot, group A is shown in blue, group B in green, and group C in red. B Feature importance for each feature as an output of \({SFC}\left({w}_{{AB}}={w}_{{BA}}={w}_{{BC}}={w}_{{CB}}=0.05,\,{w}_{{AC}}={w}_{{CA}}=0\right)\) and the GC for DS2. C Table of p-values calculated using Welch’s t-test based on the feature importance values obtained from the GC. Mean ± SD, n = 5, **p < 0.01 vs. GC.
Application of SFA to NAFLD data
In many cases—especially in experimental biology—researchers possess an intuitive or conceptual sense that certain pathophysiological states are more similar or adjacent than others. This perspective is typically informed by an integrated understanding of overall phenotypes, clinical observations, or mechanistic insights, even when the precise molecular basis remains unclear. The primary aim of the SFA is to uncover such molecular features through conceptual modeling, guided by the researchers’ domain knowledge (Fig. 1B). To validate the utility of the SFA for real-world data, NAFLD, a progressive liver disease, was selected as the target. The publicly available NAFLD dataset (GSE126848) used in this analysis comprised transcriptomic data derived from liver biopsies obtained from individuals with normal body weights or obesity and with non-alcoholic fatty liver (NAFL) or NASH. In this analysis, data from subjects with normal weight, NAFL, and NASH were extracted and analyzed. NAFL represents fat deposition, whereas NASH involves not only fat deposition but also inflammation and fibrosis. Consequently, deviation from the normal liver state is expected to be greater in NASH than in NAFL, suggesting the formation of a spectrum corresponding to disease progression (Fig. 4A). Accordingly, we assumed an overlap between the normal-weight group and NAFL, as well as between NAFL and NASH, by setting the overlap hyperparameters to \({w}_{{Normal},{NAFL}}={w}_{{NAFL},{Normal}}={w}_{{NAFL},{NASH}}={w}_{{NASH},{NAFL}}=0.05,\,{w}_{{Normal},{NASH}}={w}_{{NASH},{Normal}}=0\) when we analyzed the data using the SFA. These hyperparameter values were heuristically set, but inspired by the conventional 5% significance level commonly used to define statistically distinguishable groups. Given that those pathological states are typically treated as distinct, we assigned an overlap value of 0.05 to pairs of adjacent states and 0 to those with no direct adjacency. Given that applying the SFA to high-dimensional data may reduce its feature detection power due to the presence of numerous features associated with disease continuity, we performed dimensionality reduction on a dataset containing 17,251 genes using an autoencoder. The data were compressed into a 20-dimensional latent space, which was used for subsequent analyses. The autoencoder model output exhibited a strong correlation with the input data (Fig. 4B), indicating that the input data were compressed into a low-dimensional latent space with minimal loss. Next, using latent-space data, we constructed the following NAFLD classifiers: \({SFC}{(w}_{{Normal},{NAFL}}={w}_{{NAFL},{Normal}}={w}_{{NAFL},{NASH}}={w}_{{NASH},{NAFL}}=0.05\), \({w}_{{Normal},{NASH}}={w}_{{NASH},{Normal}}=0\)) and GC. To assess whether the constructed classifiers properly captured the features of the data, we selected model sets 4 and 5 (Fig. 4C), which achieved test accuracies more than double that of random classification (33% accuracy). Subsequently, we explored the common latent-space features that assigned higher importance to the SFC than to the GC based on the rank sum. Consequently, feature 10 was extracted (Fig. 4D). To extract the gene information associated with feature 10, the gradient values of the genes with respect to the features in the latent space were computed using the decoder part of the autoencoder. After identifying genes with large absolute gradient values, we searched for those that made significant contributions to feature 10. Among the 17,251 genes, the top and bottom 5% in terms of the gradient values, amounting to 1726 genes, were extracted as a group. To extract the biological information associated with this gene group, an enrichment analysis was performed based on the DisGeNET database. As a result, NAFLD-related disease terms such as “FATTY LIVER DISEASE, NONALCOHOLIC, SUSCEPTIBILITY TO, 1,” were enriched in the extracted gene group (Fig. 4E). In contrast, using feature 11, which exhibited relatively high importance in the GC, as a control, the relevant gene group was similarly extracted and subjected to enrichment analysis. The top 20 enriched terms did not include any disease terms directly associated with NAFLD (Fig. 4F), suggesting that NAFLD-related information was effectively extracted using the SFA. Furthermore, to evaluate the relationship between the gene group extracted by the SFA and the disease in greater detail, we assessed the association between transcription factors linked to the obtained gene group and NAFLD. Enrichment analysis based on the TRRUST database revealed that the gene group extracted by the SFA was significantly enriched for transcription factors, including E2F1, which has been implicated in promoting lipid accumulation and the malignant progression of hepatocellular carcinoma22,23, and TP53, a key regulator of inflammatory responses and cell death24 (Fig. 4G). Based on these results, the SFA may be useful for identifying factors associated with the pathogenesis of progressive liver disease from real-world data.
A Formation of state continuity along the progression of NAFLD. Using the deviation from the normal liver state as the axis, the deviation increases from normal-weight subjects to NAFL and further to NASH, thereby forming a spectrum corresponding to disease progression. B Performance of the constructed autoencoder model. The plot displays, on the vertical axis, the values output by the autoencoder, and on the horizontal axis, the values input to the autoencoder. Pearson’s correlation coefficient (r) was calculated based on the input and output data values. In addition, the green dotted line represents y = x. C Test data accuracy for the GC and SFC constructed based on NAFLD latent-space data is shown for each model set. Model sets indicated in red represent those with a test data accuracy that is more than twice that of random classification (random classification test accuracy: 33%). D Diagram showing the extraction process of feature 10. The “importance” in the left table indicates the difference in feature importances between the SFC and GC. For model sets 4 and 5, the common latent-space feature assigned greater importance by the SFC than by the GC was identified based on the rank sum. E Enrichment analysis based on the DisGeNET database for the gene set obtained from feature 10. The top 20 terms, sorted in ascending order based on p-values, are shown. F Enrichment analysis based on the DisGeNET database for the gene set obtained from feature 11, which is a control feature. G Enrichment analysis based on the TRRUST database for the gene set obtained from feature 10.
Application of SFA to PD and HD data
To examine whether SFA can be applied to explore information related to state continuity in diseases and symptoms in which a continuum is assumed, we focused on PD, which typically presents as a hypokinetic disorder with voluntary motor symptoms, and HD, which presents as a hyperkinetic disorders. When considering voluntary motor symptoms, PD, control, HD groups could be interpreted as forming a spectrum on the axis of voluntary movement (Fig. 5A). Using previously published human brain transcriptome data derived from patients with PD and HD25, we applied the SFA to explore the disease features associated with the continuity of voluntary motor states. First, autoencoder models were constructed to compress the dataset containing 21,861 genes into low-dimensional latent spaces of 10 and 20 dimensions, which were subsequently used for analysis. The output of the autoencoder models exhibited a strong correlation with the input data (Fig. 5B), indicating that the input data were compressed into a low-dimensional latent space with minimal loss. Next, we hypothesized a 5% overlap between the PD and control groups and between the control and HD groups. Accordingly, we constructed \({SFC}\left({w}_{{PD},{Control}}={w}_{{Control},{PD}}={w}_{{Contorl},{HD}}={w}_{{HD},{Control}}=0.05,{w}_{{PD},{HD}}={w}_{{HD},{PD}}=0\right)\) and a GC, respectively. The hyperparameter setting was also inspired by the conventional 5% significance level commonly used to define statistically distinguishable groups. Next, the accuracy of each classifier was evaluated. Model sets demonstrating a test accuracy at least twice that of random classification (33%) were selected for further analysis (Fig. 5C). Subsequently, latent-space features that were prioritized in the SFC were extracted using rank-sum analysis and compared with those in the GC. As a result, in the 20-dimensional latent space, feature 16 was identified, whereas features 7 and 9 were extracted in the 10-dimensional latent space based on the rank sum (Fig. 5D). To extract the genes associated with the identified features, gradient values for each gene were computed from the decoder part of each of the constructed autoencoders, and 2188 genes corresponding to the top and bottom 5% of these gradient values were extracted. A Gene Ontology (GO) analysis performed on the extracted gene groups consistently identified the term “locomotion,” which is associated with voluntary movement, across all analysis results (Fig. 5E). Furthermore, enrichment analysis based on the TRRUST database was conducted to identify transcription factors that were significantly enriched within the extracted gene groups. This analysis identified several transcription factors, including NF-κB1, Sp1, and JUN, that have been reported to be associated with the pathogenesis of PD and HD (Fig. 5F). These findings suggest that applying the SFA to PD and HD data successfully extracted genes related to PD and HD pathogenesis and voluntary movement.
A Schematic representation of the expected continuity in PD and HD pathology/symptoms. Voluntary movement forms a spectrum with PD, control, and HD in that order. B Performance evaluation of autoencoder models. (Left) Autoencoder compressing disease data into 10 dimensions. (Right) Autoencoder compressing disease data into 20 dimensions. Plots show autoencoder output values (y-axis) against input values (x-axis), with Pearson’s correlation coefficient (r) computed for evaluation. Green lines represent y = x. C Test data accuracy of the GC and the SFC for each model set. Red-highlighted model sets achieved a test accuracy at least twice that of random classification (33%). D Features with higher importance in the SFC than in the GC. (Left) In the 10-dimensional latent space, common features in model sets 7 and 9 were identified via rank-sum analysis. (Right) In the 20-dimensional latent space, model sets 16 exhibited common features with higher importance in the SFC than in the GC. E GO analysis of gene sets from feature 7 and feature 9 (10-dimensional space) and feature 16 (20-dimensional space). F Enrichment analysis using the TRRUST database.
Evaluation of the impact of ETS-1 perturbation on dopamine production function
When applying SFA to the PD and HD data, genes with unclear effects on voluntary movement were also identified. As the analysis simulated the spectrum of voluntary movement states, we investigated the impact of fluctuations in these gene groups on voluntary motor function. We used transcription factors as modulation tools to alter some of the extracted genes. Among the transcription factors associated with the gene group extracted by the SFA, we identified ETS-1 (Fig. 5F), which demonstrated a limited association with voluntary movement, PD, and HD pathogenesis. The association with a relevant axis—feature 16 in the 20-dimensional latent space—was consistently confirmed across different settings of the overlap hyperparameter in the SFC (Supplementary Fig. 3). Because voluntary movement is regulated by neurotransmission through dopamine-mediated pathways14,26, including direct and indirect pathways, we first assessed its impact on dopaminergic neuron function. Using midbrain neuronal cells differentiated from human iPS cells (hiPSCs) as a dopamine-producing model, we evaluated the effects of ETS-1 overexpression or knockdown on cell viability and expression levels of tyrosine hydroxylase (TH), the rate-limiting enzyme in dopamine production. The differentiation of dopaminergic neuron was confirmed by immunostaining against TH (Fig. 6A, Supplementary Fig. 4). The overexpression and upregulation of ETS-1 in midbrain neuronal cells were confirmed by mRNA quantification (Fig. 6B, C). Midbrain neuronal cells exogenously overexpressing ETS-1 exhibited a significant reduction in cell viability by ~15% compared with that in the control group (Fig. 6D). In contrast, the suppression of endogenous ETS-1 did not result in any significant changes in cell viability compared with that in the control group (Fig. 6E). Regarding the effect on TH mRNA expression, neither ETS-1 overexpression nor suppression resulted in significant changes (Fig. 6F, G). Since the suppression of ETS-1 did not exhibit any significant effects on cell viability or TH expression, the influence of endogenous ETS-1 on the dopamine production mechanism appears to be limited. In contrast, ETS-1 overexpression may weakly trigger cell death in dopamine-producing neurons.
Using iPSC-derived midbrain neurons, we evaluated the effects of ETS-1 perturbation on cell viability and the expression of TH, the rate-limiting enzyme in dopamine production. For real-time quantitative PCR, the mRNA expression of each gene were normalized to that of GAPDH, and the expression of the control group was set to 1. For cell viability measurements, the luminescence values were converted to ATP levels and normalized by setting the viability of the control group to 100%. A The differentiation of iPSC-derived midbrain neurons was confirmed by immunostaining. TH (green) was stained as a midbrain marker, while nucleus was visualized by Hoechst (blue dots). The white bar indicates 100 µm. B ETS-1 mRNA expression in midbrain neurons overexpressing ETS-1 (n = 3). C ETS-1 mRNA expression in midbrain neurons with suppressed endogenous ETS-1 expression (n = 3). D Viability of midbrain neurons overexpressing ETS-1 (n = 4). E Viability of midbrain neurons with suppressed endogenous ETS-1 expression (n = 4). F TH mRNA expression in midbrain neurons overexpressing ETS-1 (n = 3). G TH mRNA expression in midbrain neurons with suppressed endogenous ETS-1 expression (n = 3). Data are presented as Mean ± SD, *p < 0.05, **p < 0.01, N.S. not significant versus control.
Evaluation of the impact of ETS-1 modulation on dopamine receptor function
Voluntary movements are also influenced by changes in dopaminergic receptor function14,26. Therefore, we evaluated the effects of ETS-1 overexpression or suppression on GABAergic neurons expressing the dopamine receptors DRD1 and DRD2 using forebrain neurons differentiated from hiPSCs. The differentiation of GABAergic neuron was confirmed by immunostaining for GABA (Fig. 7A, Supplementary Fig. 4). The expression of ETS-1 in forebrain neuronal cells was confirmed by mRNA quantification (Fig. 7B, C). When ETS-1 was exogenously overexpressed, the mRNA expression of DRD1 in forebrain neurons decreased by ~30% compared with that in the control group (Fig. 7D). In addition, the suppression of endogenous ETS-1 resulted in no effect or a weak increase tendency in DRD1 mRNA expression relative to that in the control group (Fig. 7E). In contrast, regarding DRD2 mRNA expression, there was no significant change upon ETS-1 overexpression or suppression (Fig. 7F, G). Based on these results, in forebrain neurons, ETS-1 overexpression appears to disrupt the balance between DRD1 expression, which is associated with the activation of the direct pathway in the cortico-basal ganglia loop, and DRD2 expression, which is associated with the activation of the indirect pathway, potentially leading to an indirect pathway-dominant state. Next, to investigate whether the change in DRD1 mRNA expression induced by ETS-1 overexpression resulted from a reduction in the number of DRD1-positive cells due to cell death or alterations in DRD1 expression per cell, we measured the viability of forebrain neurons with ETS-1 perturbation to assess cell counts. Forebrain neurons with exogenously overexpressed ETS-1 exhibited a significant decrease in cell viability by ~25% 72 h after adenovirus infection compared with that in the LacZ overexpression control group (Fig. 7H). Conversely, forebrain neurons in which endogenous ETS-1 was suppressed showed a significant increase in viability of ~55% 24 h after adenovirus infection (Fig. 7I). These results suggest that ETS-1 induced cell death in forebrain neurons. Considering that ETS-1 overexpression resulted in an ~25% decrease in forebrain neuronal cell viability at 72 h, while DRD1 mRNA expression decreased by ~30% at 48 h, these findings suggest that the decline in DRD1 mRNA expression occurs at a faster rate than cell death. Therefore, the reduction in DRD1 mRNA expression cannot be explained solely by changes in the cell population due to neuronal cell death. After exogenous overexpression of ETS-1, DRD1-positive forebrain neurons were immunostained and counted. The results revealed that although there was no statistically significant difference in the number of DRD1-positive cells per 100 cells between the ETS-1 overexpression and control groups, a tendency toward an ~20% decrease was observed (Fig. 7J, Supplementary Fig. 5). Based on the above results in forebrain neurons, it is implied that the reduction in DRD1 expression following ETS-1 overexpression was due to both increased cell death and reduced per-cell expression.
Using iPSC-derived forebrain neurons, we evaluated the effects of ETS-1 modulation on dopamine receptor function. For real-time quantitative PCR, the mRNA expression of each gene was normalized to that of GAPDH, and the expression in the control group was set to 1. In the cell viability assay, luminescence values were converted to ATP levels and normalized by setting the viability of the control group to 100%. A The differentiation of iPSC-derived forebrain neurons was confirmed by immunostaining. GABA (green) was stained as a midbrain marker, while nucleus was visualized by Hoechst (blue dots). The white bar indicates 100 µm. B ETS-1 mRNA expression in forebrain neurons overexpressing ETS-1 (n = 3). C ETS-1 mRNA expression in forebrain neurons with suppression of endogenous ETS-1 expression (n = 3). D DRD1 mRNA expression in forebrain neurons overexpressing ETS-1 (n = 4). E DRD1 mRNA expression in forebrain neurons with endogenous ETS-1 suppression (n = 3). F DRD2 mRNA expression in forebrain neurons overexpressing ETS-1 (n = 4). G DRD2 mRNA expression in forebrain neurons with suppression of endogenous ETS-1 expression (n = 3). H Viability of forebrain neurons overexpressing ETS-1 (n = 4). I Viability of forebrain neurons after suppression of endogenous ETS-1 expression (n = 3). J Number of DRD1-positive cells in forebrain neurons overexpressing ETS-1. Data are presented as Mean ± SD, *p < 0.05, **p < 0.01, N.S. not significant versus control.
Evaluation of the impact of ETS-1 modulation on GABA production
Voluntary movements can also be influenced by the amount of GABA released by the dopamine-receiving neural system14,26. It has been reported that GABA levels in the motor cortex, a region of the forebrain, are negatively correlated with the severity of PD symptoms. Accordingly, using forebrain neurons differentiated from hiPSCs, we evaluated the effects of ETS-1 modulation on the mRNA expression of glutamic acid decarboxylase (GAD) 65 and GAD67, the enzymes responsible for GABA synthesis. As a result, 48 h after ETS-1 overexpression, compared with that in the control group, the mRNA expression of GAD65, an enzyme responsible for activity-dependent GABA synthesis, was significantly reduced by ~30%, whereas the mRNA expression of GAD67, which maintains basal levels, exhibited only a slight decreasing trend (Fig. 8A, B). When endogenous ETS-1 was suppressed, no significant differences were observed in the mRNA levels of either GAD65 or GAD67 compared with those in the control group (Fig. 8C, D). Based on these results, in forebrain neurons, exogenous ETS-1 overexpression reduces the expression of GAD65, the enzyme responsible for activity-dependent GABA synthesis, suggesting that the amount of GABA released after dopamine administration may be diminished.
Using iPSC-derived forebrain neurons, we evaluated the effects of ETS-1 modulation on GABA production using real-time quantitative PCR. For each gene, mRNA expression were normalized to that of GAPDH, and expression in the control group was set to 1. A GAD65 mRNA expression in the forebrain neurons overexpressing exogenous ETS-1 (n = 3). B GAD65 mRNA expression in forebrain neurons with suppressed endogenous ETS-1 expression (n = 3). C GAD67 mRNA expression in the forebrain neurons overexpressing exogenous ETS-1 (n = 3). D GAD67 mRNA expression in forebrain neurons with suppression of endogenous ETS-1 expression (n = 3). Data are presented as Mean ± SD, *p < 0.05, **p < 0.01, N.S. not significant versus control.
Discussion
Except in congenital conditions, the onset of a disease represents a transition from a healthy state to a diseased state and is a continuous change in condition. Previous studies have focused on this continuous transition by modeling the multistep transitions of disease states11,27. However, there is no well-established method for performing an integrated analysis of multiple states to extract the factors governing these continuous transitions. In this study, we developed a novel and simple disease feature exploration method called the SFA, inspired by the label smoothing technique18,19,20,21. For machine learning, we constructed a classifier using a loss function that considers the adjacency of multiple states. This classifier, the SFC, emphasizes the input information that is important in the disease continuum and disease-state transition compared with a standard classifier. We validated the utility of the SFA by applying it to the synthetic datasets DS1 and DS2, as well as transcriptome data from NAFLD, progressive liver disease, PD, and HD. The results suggest that the SFA is useful for extracting information related to the continuity of states, as originally intended. Owing to the characteristics of the method, the SFA requires the degree of overlap between groups to be set as a hyperparameter of the SFC. When there are multiple states of interest (i.e., different diseases), it is generally difficult to determine the extent of overlap in actual biological datasets. Interestingly, our investigation using synthetic data suggests that, even if there is a difference between the actual degree of group overlap and the overlap value set as a hyperparameter in the SFC, it is still possible to detect the factors that govern state continuity (Fig. 2, Supplementary Fig. 1). In addition, the robustness of the hyperparameter values for the assumed data overlap in the SFC was confirmed using the PD/HD dataset, a real-world example (Supplementary Fig. 3). Our findings imply that the method can be applied even when the exact degree of overlap in the real data is unknown although it is better to combine multiple analyses—such as SFC models with varying overlap hyperparameters or alternative dimensionality reduction approaches—to ensure robustness and reliability. This makes the SFA an intriguing approach for extracting novel modes of information in the classification learning of biological data.
A limitation of SFA is that its detection power decreases when several factors determine continuity (Fig. 3, Supplementary Fig. 2). As a countermeasure, this study incorporates a method that integrates highly correlated dimensions from multidimensional data and compresses them into a lower-dimensional space by autoencoders. We also performed data compression by primary component analysis (PCA), a linear compression method, against the PD/HD data, followed by SFA. This analysis also intensified transcriptional factors similar to those identified by autoencoder-based SFA (Supplementary Fig. 6). While the autoencoders achieved meaningful compression with only 10–20 dimensions, PCA required 45 components to capture a similar level of variance. This suggests that PCA tends to distribute the contribution of individual genes across many axes, potentially diminishing the sensitivity and robustness of SFA and that adequate compression of correlated genes may give us analytical stability. However, even if some dimensions of the analyzed data (such as the expression profiles of genes in the transcriptome data) are strongly correlated, the meaning of the biological information corresponding to each individual dimension is not necessarily identical. Therefore, even if the factors governing state continuity are extracted from the compressed low-dimensional information space, attention should be paid to the results because the interpretations in the original high-dimensional space may not be determined uniquely. In this study, focused on transcriptome analysis, we employed enrichment analysis to improve the biological interpretability of the data and attempted to reveal interesting and novel findings by distinguishing between established and potentially novel insights.
Data heterogeneity must also be cared as another possible limitation. If heterogeneity exists, dimensionality reduction alone may not resolve the issue of detectability. Interestingly, our analysis of synthetic data composed of multiple subpopulations suggests that the SFA concept is applicable to datasets with underlying heterogeneity (Supplementary Fig. 2). Therefore, emerging high-resolution datasets such as single-cell RNA-seq (scRNA-seq) may also be potential targets for SFA, although direct application to scRNA-seq data is currently not feasible. Instead, constructing pseudo-bulk datasets by integrating scRNA-seq data may allow for effective application of SFA, provided that classifier performance remains sufficiently high. In such cases, integrating multiple scRNA-seq datasets may help mitigate batch effects, provided that classifier accuracy remains sufficiently high. However, our synthetic data analysis (Supplementary Fig. 2) suggests that classifier performance—specifically classification accuracy—may decline when the target data include too many subgroups. Switching to a different classifier framework—other than LightGBM—may help address this issue, as long as the SFA-compatible custom loss function for the SFC can still be implemented. In our study, we demonstrated this flexibility by conducting an SFA analysis using the XGBoost framework on the PD/HD dataset (Supplementary Fig. 7).
To demonstrate the novelty of the information extracted by SFA, we compared its identified features with those obtained using existing analytical approaches. Unlike standard differential expression (DE) analysis—which detects genes with statistically significant changes between discrete groups—SFA is designed to identify features contributing to continuous or ordered transitions across states. To validate this distinction, we compared the genes identified by SFA with those reported in a previous PD/HD study25, which employed DE-based analysis. Notably, there was little overlap between the key enriched features reported in that study and those identified by SFA, supporting the notion that SFA captures distinct features associated with disease spectra. Furthermore, to assess whether SFA detects different types of biological signals than those identified by Weighted Gene Correlation Network Analysis (WGCNA), we applied WGCNA to the same PD/HD dataset. Genes associated with each group (PD, HD, and Control) were identified based on module–trait correlations. We then compared the gene sets identified by SFA (in both 10- and 20-dimensional latent spaces) with those identified by WGCNA28,29. As shown in Supplementary Fig. 8A and 8B, the overlap between SFA and WGCNA gene sets was minimal, indicating that SFA captures gene features largely distinct from those obtained through co-expression network analysis. In addition, we compared the gene expression trends among three gene sets: those identified exclusively by SFA, those identified exclusively by WGCNA, and the small subset of overlapping genes (Supplementary Fig. 8C). Although clear spectrum-like patterns were difficult to observe at the level of individual genes, the averaged expression profiles across pathological classes suggested that genes identified by SFA—rather than those identified by WGCNA—were more strongly associated with disease spectrum formation (Supplementary Fig. 8D). This observation is consistent with the fundamentally different goals and underlying assumptions of the two approaches: WGCNA is designed to detect gene modules based on co-expression across samples, typically reflecting network structure or shared regulation, which is not directly related to spectral continuity. We also assessed whether SFA-identified genes could be detected using simple Pearson correlation. We compared their mean expression levels between each pair of groups (PD vs. HD, PD vs. Control, HD vs. Control) and visualized them as scatter plots (Supplementary Fig. 9). Most genes aligned along the y = x line, indicating minimal group-specific differences. This suggests that these genes would be overlooked by standard correlation analysis. These results highlight the complementary nature of SFA to other methods such as WGCNA, with each capturing distinct aspects of the underlying biology.
Next, we examined the biological significance of the results obtained by applying the SFA to data from patients with NAFLD. In enrichment analysis using DisGeNET as a reference, several disease terms related to NAFLD—for example, “FATTY LIVER DISEASE, NONALCOHOLIC, SUSCEPTIBILITY TO, 1”—were enriched in the obtained gene set. This suggests that SFA-extracted gene groups are strongly associated with NAFLD. Furthermore, when we examined the transcription factors enriched in this gene group using TRRUST as a reference, “E2F1” was detected as the top-ranked transcription factor. Hepatic E2F1 expression is elevated in humans with fatty liver and in patients with NASH30. In the process of increased de novo lipid synthesis in the liver, which is considered one of the major factors in NAFLD progression, E2F1 plays an important role in lipid accumulation and related processes22,23. Moreover, E2F1-knockout mice exhibit inhibited NASH progression induced by 3,5-diethoxycarbonyl-1,4-dihydrocollidine, suggesting that E2F1 is deeply involved in both NAFLD progression and the determination of disease state continuity22,23,30. In enrichment analysis using TRRUST as a reference, “SIRT1” was detected among the top nine transcription factors (Fig. 4G). SIRT1 plays an important role in regulating hepatic lipid metabolism and NAFLD progression, as evidenced by numerous reports31,32,33,34. In addition to these two factors, many other transcription factors related to NAFLD progression were detected, and the involvement of their downstream genes was also suggested. Therefore, it is highly likely that by applying SFA to NAFLD data, factors related to the continuity of the NAFLD disease state were extracted. These results imply that using disease data in combination with previous research findings, a wide range of information concerning the causes and consequences of disease continuity can been extracted through using the SFA, thereby demonstrating its utility.
Next, we characterized the results of the exploration of factors related to voluntary movement based on transcriptome data from the brain tissues of patients with PD and HD. The SFA was used to search for factors involved in the formation of a spectrum corresponding to motor symptoms. When GO analysis was performed on the extracted gene sets, the term “locomotion” was commonly enriched, suggesting that, as intended, information related to voluntary movement was successfully obtained. Furthermore, in the enrichment analysis of related transcription factors, Sp1, and ETS-1 were commonly extracted across all analyses for both 10- and 20-dimensional latent space (Fig. 5F). In addition, NF-κB1, RELA, and JUN were identified in the analysis of the 10-dimensional latent space. These transcription factors are frequently reported to be involved in the development of PD and HD35,36,37,38. For example, regarding Sp1, it has been reported in HD mouse models that the disease-causing mutant form of HTT (mHTT) binds to Sp1, thereby impairing Sp1’s function as a transcription factor39. Moreover, the promoter region of the HTT gene contains an Sp1-responsive element, and Sp1 upregulation has been reported to promote HTT transcription39,40. Additionally, the inhibition of Sp1 function has been reported to decrease mHTT expression41, suggesting that Sp1 is strongly involved in HD pathogenesis. On the other hand, with regard to PD, it is known that the upregulation of downstream genes controlled by Sp1 induces dopaminergic neuronal dysfunction42. In addition, Sp1 inhibition reduces monoamine oxygenase B activity, exerts neuroprotective effects, and improves behavioral abnormalities in mouse models43. Sp1 inhibition can also attenuate damage in dopaminergic neurons and suppress oxidative stress and inflammation44. Thus, the involvement of these factors in the development of both PD and HD suggests that they are important in voluntary motor symptoms. This reaffirms the validity of our methodology and raises the expectation that the newly extracted information will include factors related to the regulation of motor symptoms.
When SFA was applied to the PD and HD patient data, gene groups that were not previously known to be related to voluntary movement were also extracted. ETS-1 emerged as a novel factor in the enrichment analysis. Although the suppression of ETS-1 expression has been reported to induce abnormal behavior in planarians45, few reports have linked ETS-1 to voluntary movement in mammals. In planarian studies, the suppression of ETS-1 has been shown to increase the number of TH-expressing cells in the brain and elevate TH mRNA expression. However, in midbrain neurons differentiated from hiPSCs, neither the overexpression of ETS-1 nor the suppression of endogenous ETS-1 led to significant changes in TH mRNA expression. In contrast, in forebrain neurons differentiated from hiPSCs, the exogenous overexpression of ETS-1 resulted in a selective and significant decrease in the mRNA expression of DRD1, which is responsible for the direct pathway (Fig. 7D). Consequently, the balance of DRD1 and DRD2 mRNA expression in the basal ganglia shifted in favor of DRD2, suggesting that the direct pathway in the motor loop was attenuated and that a state transition toward voluntary movement suppression occurred. In addition, in forebrain neurons, exogenous ETS-1 overexpression decreased the expression of GAD65, a key enzyme in activity-dependent GABA production (Fig. 8A), contributing to reduced GABA release following dopamine reception. Furthermore, ETS-1 overexpression decreased the viability of forebrain neurons (Fig. 7H). These phenomena—alterations in the balance of dopamine receptor subtypes, reduced GABA production capacity, and decreased cell viability in forebrain neurons—can be interpreted as causing a shift in voluntary movement toward a hypokinetic state. Conversely, the suppression of endogenous ETS-1 increased the viability of forebrain neurons and DRD1 expression, although not significantly. Based on these results, although endogenous ETS-1 may not be indispensable for controlling voluntary movement in humans, it appears that the perturbation of ETS-1-related genes by ETS-1 overexpression exerts a weak yet widespread effect on various factors that influence voluntary movement.
Finally, we discuss the types of datasets suitable for analysis by SFA. In practice, many existing datasets are biased toward the two extreme states: “healthy” and “very sick.” Nonetheless, we believe there are two meaningful scenarios in which SFA can be effectively applied. First, in the context of progressive diseases—such as metabolic-associated fatty liver disease (MAFLD)—data are often available for multiple intermediate stages. In such cases, SFA is directly applicable and can provide deeper insights into the molecular mechanisms underlying disease progression. Second, even when only data from “healthy” and “very sick” individuals are available, SFA remains valuable. If a phenotypic ordering is preserved, these classes can be treated as adjacent from a higher-level perspective. As demonstrated in our synthetic data experiments, SFA was able to identify relevant features even in the absence of actual data overlap, as long as ordering was maintained (Supplementary Fig. 1, overlap = –0.3). This suggests that even biased datasets could provide meaningful insights through SFA. Moreover, such analyses may allow inference of intermediate disease states based on patterns embedded in transcriptomic or molecular data.
Conventional methods have focused on factors that exert a strong influence on specific molecular mechanisms responsible for physiological functions. In contrast, the SFA shows promise for exploring factors that affect a wide range of elements associated with specific physiological functions. Even if the impact on individual molecular mechanisms is small, the overall functional and state changes may reach a level that can contribute to disease treatment. Therefore, we expect that the SFA, which can identify factors with information characteristics that differ from those of conventional approaches, will prove to be a useful methodology for disease data analysis.
Methods
Evaluation of SFC hyperparameter sensitivity using fixed-overlap synthetic data
To investigate the sensitivity of the SFC to its overlap hyperparameter, we constructed synthetic datasets in which the degree of class overlap was held constant. Each dataset comprised 30,000 samples and 19,998 background genes, with expression values sampled uniformly between 0 and 1. Two genes were designed to simulate interpretable features with distinct biological relevance. Gene 1 encoded a monotonic spectrum across the three classes (A > B> C), representing a progressive biological or clinical gradient. Its expression distribution was constructed using a fixed overlap coefficient of 0.1, defined as the proportion of distributional overlap across classes. Since all class distributions had the same width, this proportion uniquely determined the positional shift between classes. Gene 2, in contrast, was designed to vary non-monotonically across classes (e.g., high expression in class A and C, but not B), thereby modeling a feature unrelated to linear state progression. Genes 1and 2 were sampled uniformly from the designated ranges. Samples were assigned to one of the three classes in a round-robin fashion to maintain class balance. While the synthetic data distribution remained fixed, the overlap hyperparameter in the SFC was systematically varied across following values: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5. To evaluate the sensitivity of SFC to this parameter, we measured performance using multiple criteria: classification accuracy, feature importance scores, and the statistical significance (P-values) of feature importance scores in SFC relative to those obtained using GC among cross validation sets (The number of sets was 10 in this case). These evaluations provided insight into the robustness of SFC’s overlap parameter when applied to datasets containing both spectrum-like and non-spectrum-like features. See also Results section and Fig. 2.
Evaluation of SFC hyperparameter sensitivity using multi-dimensional spectrum-forming genes
To evaluate how the dimensionality of spectrum-forming gene structures influences the performance and sensitivity of the SFC, we conducted simulations in which the number of informative spectrum-forming genes was increased beyond two. Each synthetic dataset consisted of 30,000 samples and 19,998 background genes with uniformly distributed random expression values in the range [0,1]. Among these, six genes (genes 1–6) were designated as informative features. Genes 1 and 2 were designed to encode a monotonic, class-dependent expression gradient, forming a two-dimensional spectrum-like structure across three classes (A > B> C). In contrast, genes 3–6 were constructed to exhibit non-monotonic class-specific patterns, such as high expression in classes A and C but not B, thus modeling features unrelated to linear phenotypic progression. The expression distribution of each informative gene was constrained using a fixed spectrum overlap coefficient of 0.05, defined as the proportion of the distributional overlap between adjacent classes. Expression values for these genes were sampled uniformly from class-specific ranges that preserved the desired overlap structure. To assess the behavior of the SFC model under this multi-dimensional input, we fixed the algorithm’s overlap hyperparameter at 0.05 and evaluated its performance using several metrics. These included the feature importance scores generated by SFC, and the statistical significance (P-values) of those scores in comparison to those produced by GC. See also Results section and Fig. 3.
hiPSC culture
A cryopreserved hiPSC line derived from female blood cells (SCTi003-A; STEMCELL Technologies, Vancouver, Canada, 200-0511) was obtained. Undifferentiated hiPSCs were cultured on untreated 6-well plates (Corning, New York, United States, 3736) coated with 10 µg/mL Vitronectin-XF (STEMCELL Technologies, 07180) in mTeSR Plus medium (STEMCELL Technologies, 100-0276) until they reached 70–80% confluence. The medium was replaced every 3d. All cells were cultured at 37 °C in a humidified atmosphere containing 5% CO2 and 95% air.
Differentiation of hiPSCs into neural progenitor cells (NPCs)
To generate forebrain or midbrain neurons, it is necessary to differentiate hiPSCs into NPCs. hiPSCs were differentiated into NPCs using a STEMdiff SMADi Neural Induction Kit (STEMCELL Technologies, 08581) according to the manufacturer’s instructions. There are two main methods for differentiating iPSCs into NPCs: the monolayer culture and embryoid body protocols. In this study, the monolayer protocol was chosen because of the simplicity of the differentiation process. To generate single-cell suspensions, hiPSCs were dissociated through 10 min incubation with Gentle Cell Dissociation Reagent (STEMCELL Technologies, 100-0485) and suspended in STEMdiff Neural Induction Medium supplemented with a SMADi supplement (STEMCELL Technologies) supplemented with 10 µM Y-27632 (STEMCELL Technologies, 72302). Cells were seeded onto 6-well plates (Corning, 353046) coated with 15 µg/mL poly-L-ornithine (Sigma-Aldrich, MO, United States, P4957) and 10 µg/mL laminin (Wako, Osaka, Japan, 120-05751) at a density of 2.5 × 105 cells/cm2. The medium was changed daily with warmed (37 °C) STEMdiff Neural Induction Medium supplemented with the SMADi supplement, and cells were passaged every 7d until the third passage. After the third passage, more than 90% of the cells were characterized as PAX6, SOX1, and Nestin positive. Cells were maintained in 6-well plates coated with 15 µg/mL poly-L-ornithine and 10 µg/mL laminin in STEMdiff Neural Progenitor Medium (STEMCELL Technologies, 05833). The medium was replaced daily. Cells were passaged after 7d culture in STEMdiff Neural Progenitor Medium and used for forebrain or midbrain neuron differentiation.
Differentiation of NPCs into forebrain or midbrain neurons
Cells cultured in STEMdiff Neural Progenitor Medium were used for the differentiation of NPCs into forebrain or midbrain neurons. After 7d culture in STEMdiff Neural Progenitor Medium, cells were treated with warmed ACCUTASE (STEMCELL Technologies, 07920) at 37 °C for 10 min and washed with DMEM/F12 containing 15 mM HEPES (STEMCELL Technologies, 36254) and pelleted by centrifugation at 3000 × g for 5 min at room temperature. Cells were suspended in STEMdiff Neural Progenitor Medium and plated onto 6-well plates coated with 15 µg/mL poly-L-ornithine and 10 µg/mL laminin at a density of 1.25 × 105 cells/cm2. The day after passage, the medium was switched from the STEMdiff Neural Progenitor Medium to STEMdiff Forebrain Neuron Differentiation Medium (STEMCELL Technologies, 08600) or STEMdiff Midbrain Neuron Differentiation Medium (STEMCELL Technologies, 100-0038) supplemented with 200 ng/mL human recombinant sonic hedgehog (Wako, 198-18341) for differentiation. Cells were fed daily with STEMdiff Forebrain Neuron Differentiation Medium or STEMdiff Midbrain Neuron Differentiation Medium supplemented with 200 ng/mL human recombinant sonic hedgehog for 6–7d until the cells reached 90–95% confluence. The cells were detached using ACCUTASE, washed with DMEM/F12 containing 15 mM HEPES, and pelleted by centrifugation at 3000 × g for 5 min at room temperature. After resuspending the cells in STEMdiff Forebrain Neuron Maturation Medium (STEMCELL Technologies, 08605) or STEMdiff Midbrain Neuron Maturation Medium (STEMCELL Technologies, 100-0041), the cells were plated onto 6-well plates at a density of 7.2 × 105 cells/well (for western blot), 12-well plates (Corning, 353043) at a density of 2.4 × 105 cells/well or 6.0 × 105 cells/well (for real-time quantitative PCR (qPCR)), 8-well glass chamber slides (Matsunami glass, Tokyo, Japan, SCS-N08) at a density of 2.5 × 104 cells/well (for cell counts), or white 96-well plates at a density of 2.0 × 104 cells/well (for cell viability assay). All plates and slides were coated with 15 µg/mL poly-L-ornithine and 10 µg/mL laminin. Cells were fed every 3d with STEMdiff Forebrain Neuron Maturation Medium or STEMdiff Midbrain Neuron Maturation Medium until day 9 (forebrain neuron maturation) or day 14 (midbrain neuron maturation) from the start of maturation.
Construction of adenoviral vectors
The adenoviral vector for transducing ETS-1 was constructed using the pAd/CMV/V5-DEST vector (Invitrogen, V49320) and the ViraPower Adenoviral Expression System (Invitrogen, K493000), according to the manufacturer’s instructions. β-galactosidase (LacZ) was used as the reference adenoviral construct. To suppress endogenous ETS-1 expression, shRNA sequences targeting the ORF and 3′ regions of ETS-1 mRNA were designed using VectorBuilder (https://www.vectorbuilder.jp/tool/shrna-target-design.html). The adenovirus used for introducing the designed shRNA sequences to cells was constructed with the pAd/PL-DEST™ Gateway™ Vector Kit (Invitrogen, V49420) following the manufacturer’s instructions. The target sequences designed using VectorBuilder were as follows:
shETS-1: 5′- GACCGTGCTGACCTCAATAAG -3′.
The effect of the constructed adenovirus on ETS-1 mRNA expression was determined by RT-qPCR.
Adenoviral transduction
Cells were transduced with constructed adenoviruses at 1000 MOI in an appropriate culture medium. Transduced cells were analyzed at 24 h (for cell viability assay), 48 h (for qPCR, immunocytochemistry), or 96 h (for cell viability assay) after transduction.
RNA isolation, reverse transcription, and RT-qPCR
Total RNA was extracted using RNAiso Plus 9109 (TaKaRa, Shiga, Japan). The mRNA expression of the target genes were determined via RT-qPCR using THUNDERBIRD Next SYBR qPCR Mix (TOYOBO, Tokyo, Japan, QPX-201), an Eco Real-Time PCR System (Illumina, CA, United States, EC-101-1001), and a LightCycler 96 instrument (Roche Life Science, 05815916001) with the associated software according to the manufacturer’s instructions. The expression of each gene was normalized to that of human glyceraldehyde-3-phosphate dehydrogenase (hGAPDH). The following primer sequences were used for the analyses:
GAPDH (fwd 5′-TCCACTGGCGTCTTCACC-3′ and rev 5′-GGCAGAGATGATGACCCTTTT-3′)
ETS-1 (fwd 5′-GTTAATGGAGTCAACCCAGC-3′ and rev 5′-GGGTGACGACTTCTTGTTTGATA-3′)
DRD1 (fwd 5′-GTTTGTGTGGTTTGGGTGGG-3′ and rev 5′-TATTCGTCGCAGGGCAAAGT-3′)
DRD2 (fwd 5′-ACCATGCCCAATGGCAAAAC-3′ and rev 5′-TGATGAACACGCCGAGAACA-3′)
TH (fwd 5′-CAGCCCTACCAAGACCAGAC-3′ and rev 5′-TGTACGGGTCGAACTTCACG-3′).
GAD65 (fwd 5′- GGCTTTTGGTCTTTCGGGTC -3′ and rev 5′- GCACAGTTTGTTTCCGATGCC -3′).
GAD67 (fwd 5′- GCCAGACAAGCAGTATGATGT -3′ and rev 5′- CCAGTTCCAGGCATTTGTTGAT -3′).
Immunocytochemistry (ICC) and DRD1-positive cell counts
Cells cultured in 8-well glass chamber slides were fixed with 4% paraformaldehyde phosphate buffer solution (Wako, 163-20145) for 15 min at room temperature and washed three times with phosphate-buffered saline (PBS), each for 5 min. Subsequently, the cells were permeabilized with 0.1% Triton X-100 in PBS for 10 min at room temperature and washed three times with PBS. For blocking, the cells were incubated for 1 h at room temperature in PBS containing 5% BSA and washed three times with PBS. Primary antibodies were diluted in PBS containing 0.1% BSA, and cells were incubated in a solution containing a primary antibody overnight at 4 °C. The primary antibody used was a DRD1 Recombinant Rabbit Monoclonal Antibody (1:500; Invitrogen, Carlsbad, CA, USA; 702593). Subsequently, the cells were washed three times with PBS and incubated with the secondary antibody, goat anti-rabbit Alexa Fluor 546 (Invitrogen, A11010), diluted 1:500 in PBS containing 0.1% BSA for 1 h at room temperature. For nuclear staining, the cells were washed three times with PBS and incubated with Hoechst33342 diluted 1:1000 in PBS for 10 min at room temperature. The cells were mounted using Dako Fluorescent Mounting Medium (Agilent, CA, United States, USA; S3023) to prevent fluorescence fading. To estimate the number of total and DRD1-positive cells, images were randomly acquired using a BZ-X810 fluorescence microscope (Keyence, Osaka, Japan) at 100× magnification. The total number of cells and cells positive for DRD1 was counted manually. We also performed immunostaining using anti-TH antibody (Invitrogen, MA547435), anti-GABA antibody (Sigma-Aldrich, A0310), and anti-NeuN antibody (Invitrogen, PA5-143586) in accordance with manufacturer’s instruction to confirm neuronal cell differentiation.
Cell viability assay
Cell viability was measured using the CellTiter-Glo Luminescent Cell Viability Assay (Promega, WI, United States; G7572), which quantitatively detects live cells based on adenosine triphosphate (ATP). Cells cultured on the white 96-well plate were treated with 100 µL CellTiter-Glo Reagent. Subsequently, the plate was shaken using a Twin Mixer TM282 (AS ONE, Osaka, Japan, 1-1160-01) for 2 min to induce cell lysis, and the cells were incubated for 8 min to stabilize the luminescence. Luminescence was measured using a Varioskan Flash instrument (Thermo Fisher Scientific, Waltham, MA, USA). To estimate the ATP levels in the cells, a calibration curve was constructed, and the luminescence values were converted to ATP levels.
Data processing
NAFLD data
The publicly available dataset GSE126848 was employed as the NAFLD dataset for the SFA. GSE126848 comprises RNA-seq data for 19,786 genes derived from human liver biopsies, including samples from normal-weight (n = 14), obese (n = 12), NAFL (n = 15), and NASH (n = 16) patients. After obtaining the read count data for each gene, conversion to transcripts per million (TPM) was performed to correct for variability introduced by differences in RNA extraction and sequencing conditions. The necessary gene length data for TPM conversion were obtained from the Ensembl database using the BiomaRt package in R. Given concerns regarding the potential reduction in feature detection power of the SFA in high-dimensional data analysis, the high-dimensional NAFLD data were compressed into a low-dimensional latent space through dimensionality reduction. Dimensionality reduction was performed by constructing an autoencoder model based on TPM-normalized data. To ensure stable and efficient model construction, the TPM-normalized data were standardized. Specifically, genes with a total expression value of zero across all patients, deemed unlikely to significantly affect the disease, were removed, resulting in a dataset of 17,251 genes. Subsequently, 1 was added to all TPM-normalized values, followed by a logarithmic transformation, and the data for each gene were standardized to have a mean of 0 and a variance of 1 for autoencoder model construction.
PD and HD data
Transcriptome data from PD and HD patients were obtained from a previous study25. This dataset comprises raw RNA-seq data derived from postmortem human brain tissues from patients with PD (n = 29) or HD (n = 29) and non-PD/HD control subjects (n = 49). Based on gene length and related data obtained from the BiomaRt package, RNA-seq data were normalized to TPM. Subsequently, to ensure stable and efficient autoencoder model construction, the TPM-normalized values were log-transformed and standardized, similar to the NAFLD data. The examination of the standardized data distribution revealed that two genes, LINC02694 (ENSG00000175779) and LINC01162 (ENSG00000232790), exhibited markedly skewed distributions toward low values. These genes consistently showed almost zero expression in all patients, likely because they produce non-coding RNAs and are not linked to diseases such as PD or HD or to voluntary movement, meaning that they probably have little effect on the overall data interpretation. Including these genes in the autoencoder can bias the model. Therefore, LINC02694 and LINC01162 were excluded from this dataset. The standardized data after this exclusion were then used for autoencoder model construction.
Autoencoder model construction
The processed disease data were used to construct and train autoencoder models using the Keras package in Python with manual parameter tuning. During model construction, the rectified linear unit activation function, a nonlinear function commonly employed in neural networks, was utilized for nonlinear feature extraction. In our study, an autoencoder was developed to compress and reconstruct the 17,251-dimensional NAFLD data into a 20-dimensional latent space. Autoencoder models were similarly constructed for 21,861-dimensional PD and HD data, compressing them into either a 10-dimensional or a 20-dimensional latent space. The performance of the autoencoder models was evaluated by calculating the correlation coefficient between the input values and the output values of the constructed autoencoder models.
Hyperparameter tuning for the construction of disease classifiers
The latent space data from the autoencoder were used to construct a classifier based on LightGBM, an algorithm frequently employed for disease classification in life science data analysis. The parameters of the LightGBM-based classifier were optimized using the Optuna package in Python46. To facilitate this optimization, the training data were split into training and test sets using cross-validation (with n_split set to 5 or 10). A total of 100 trials were conducted using Optuna, and the parameters yielding the highest average accuracy for the test data were selected to construct the GC and SFC using LightGBM. To construct the SFC, a custom loss function was implemented in LightGBM.
Feature selection
The GC and SFC models constructed using the LightGBM from the same training data were grouped into a single model set. To extract the feature importance from classifiers that had sufficiently learned disease characteristics, we selected only those model sets in which both the GC and SFC achieved test accuracies at least twice that of a random classifier (66.67% or higher, given that a random classifier achieved 33.33%). For the selected model sets, we output the feature importance of the GC and SFC and identified latent-space features that were more important in the SFC than in the GC by evaluating the differences between feature importance in the SFC and GC based on the rank sum. The feature importance used in our analysis was the information gain provided by the LightGBM classifiers.
Feature analysis
It was hypothesized that the latent-space features, which were relatively more emphasized in SFC than in GC, are strongly associated with factors contributing to the disease continuum. The decoder component of the autoencoder was used to obtain gene information linked to these features. The effects of the extracted features on each gene were evaluated using gradient values. Here, the gradient is the partial derivative \(\frac{\partial {O}_{j}}{\partial {c}_{i}}\), representing the influence of the \(i\) th feature (\({c}_{i}\)) on the \(j\) th dimension of the decoder output (\({O}_{j}\)). In this study, gradients were computed, and based on these values, genes that significantly influenced the latent-space features identified by SFA were extracted. Note that the gradients computed between the autoencoder’s input and latent-space data may not adequately evaluate certain highly correlated features in the input data. Therefore, the decoder component of the autoencoder was adopted. For the gradient calculation, the decoder part of the constructed autoencoder model was extracted, and the GradientTape function provided by TensorFlow, which is a Python framework, was utilized. Consequently, the average gradient value across all the patients was computed for each gene in the disease dataset. Genes with average gradient magnitudes in the top 5% (exerting a positive influence on the identified features) and bottom 5% (exerting a negative influence) were identified as gene groups with a substantial impact on the latent-space features.
GO enrichment analysis
To extract biological information associated with the obtained gene groups, we performed GO enrichment analysis and examined relevant biological functions, diseases, and transcription factors. This analysis was conducted using Metascape47, a tool capable of analyzing data from multiple databases. In the Metascape enrichment analysis, DisGeNET and TRRUST were used as the reference databases for evaluating disease and transcription factor associations, respectively47,48,49.
Statistical analysis
All data are presented as mean ± S.D. Welch’s t-test was used for comparisons between two groups.
Synthetic data-based validation of SFA robustness (related to Supplementary Fig. 1)
To evaluate the robustness of the SFA in identifying features contributing to state continuity, we generated synthetic transcriptomic datasets under varying degrees of overlap between groups. Each dataset simulated three class labels (A, B, C) and included 20,000 features, with only two features (“gene 1” and “gene 2”) exhibiting structured group-specific variation.
Synthetic data generation
Datasets were constructed for eight different overlap settings: –0.3, 0.01, 0.025, 0.05, 0.10, 0.15, 0.20, and 0.30. For each condition, 1000 samples were simulated and evenly assigned to three classes: A, B, and C. To model a structured disease continuum, “gene 1” was designed to encode a monotonic progression across the classes in the order A > B> C, representing a feature involved in biological state transition. In contrast, “gene 2” displayed a different distribution pattern that did not follow this class ordering, serving as a non-spectrum-related gene for control. The values of gene 1 and gene 2 were sampled from uniform distributions with fixed widths, and the degree of class separation was manipulated by adjusting the overlap value. The overlap value was defined as the proportion of the distribution width that overlaps with the distribution of the adjacent class. Since all class distributions had the same width, this proportion uniquely determined the positional shift between classes. For example, an overlap value of 0.10 indicates that 10% of class A’s gene 1 distribution overlaps with that of class B, and likewise between B and C. The remaining 19,998 features were sampled independently from uniform distributions with no class-specific differences, serving as background noise. This setup ensured that only gene 1 and gene 2 contained relevant information, allowing a focused assessment of SFA’s ability to detect spectrum-relevant features under varying inter-class overlap conditions.
Model training and evaluation
For each dataset, classifiers were trained using both the SFC and a general classifier (GC), each implemented via LightGBM. The SFC incorporated predefined overlap hyperparameter (=0.05) between adjacent groups (e.g., A–B and B–C), while the GC relied on standard one-hot label encoding. Feature importance for each model was computed, and the relative emphasis on gene 1 and gene 2 was compared between SFC and GC. This allowed us to assess the sensitivity of the SFC to features responsible for inter-group continuity under different overlap assumptions. All models were trained and evaluated using tenfold stratified cross-validation to ensure robust performance estimation and to prevent overfitting to any specific data partition.
Purpose and interpretation
As shown in Supplementary Fig. 1, even when the overlap parameter used to construct the SFC differed from the true degree of overlap in the synthetic data, the classifier maintained the ability to prioritize features contributing to the disease spectrum. This suggests that the SFA is tolerant to moderate mismatch in the assumed overlap setting, and supports its applicability to real-world datasets where true inter-group relationships are not explicitly known.
Evaluation of SFA robustness under data heterogeneity (related to Supplementary Fig. 2)
To evaluate the robustness of the SFA in heterogeneous populations, we constructed synthetic transcriptomic datasets composed of multiple subpopulations, each contributing differently to the disease continuum. This experiment was designed to assess whether SFA can detect features contributing to state continuity even when such features vary across subpopulations.
Synthetic dataset design
Each dataset contained 5000 samples and 20,000 gene features, with samples evenly distributed across three classes (A, B, and C). We simulated four levels of population heterogeneity by varying the number of distinct subpopulations (2, 3, 4, and 5). Within each subpopulation, two synthetic genes were embedded among random noise features:
-
Gene X was designed to reflect a monotonic continuum across the three classes in the order A > B> C, representing a feature that encodes progressive disease severity or phenotypic transition.
-
Gene Y, in contrast, was designed to vary between classes independently of this order, modeling a feature unrelated to linear state progression (e.g., a phenotype affecting only class A and C).
The values of gene X and gene Y were drawn from partially overlapping uniform distributions to simulate realistic biological variability. Specifically, for each class, values were sampled from a distribution with a fixed width, and the overlap ratio was defined as the proportion of this width that is shared with adjacent classes. In this experiment, the overlap ratio was set to 0.2, representing 20% overlap between the feature distributions of neighboring classes (A–B and B–C). Each subpopulation was assigned a unique gene X/gene Y pair at different indices (e.g., gene1/gene 2 for subpopulation 1, gene 3/gene 4 for subpopulation 2), ensuring heterogeneity in spectrum-relevant features across the population. The remaining 19,998 gene features were generated as uninformative background noise.
Model training and feature importance analysis
For each synthetic dataset, classifiers were trained using both a general classifier (GC) and the SFC, implemented with LightGBM. The SFC was trained using label smoothing that encoded partial overlaps between adjacent classes, reflecting the underlying spectral assumptions (The value of hyperparameter for the overlap was set to 0.05). Feature importance was extracted based on information gain, and the detection power of the SFA was evaluated by tracking how consistently gene X was ranked higher by SFC than GC across all subpopulations and heterogeneity levels. To ensure reliable estimation of model performance and feature importance rankings, all experiments were conducted using fivefold stratified cross-validation.
Outcome and interpretation
As visualized in Supplementary Fig. 2, increasing the number of subpopulations resulted in reduced overall classification accuracy, indicating the challenge posed by population complexity. However, the SFC consistently prioritized gene X over gene Y in feature importance ranking, even as heterogeneity increased. These results suggest that SFA retains its ability to extract spectrum-related features under complex, heterogeneous data conditions, especially when relative importance rather than absolute accuracy is used as the evaluation criterion.
Sensitivity analysis of overlap hyperparameter in SFA (related to Supplementary Fig. 3)
To evaluate the sensitivity of the SFA to its key hyperparameter—state overlap—we conducted a systematic analysis using real transcriptome data from Parkinson’s disease (PD), Huntington’s disease (HD), and healthy controls. The overlap parameter governs the degree to which adjacent classes are assumed to share their underlying feature distributions.
Dataset and latent feature extraction
The input data consisted of transcriptome-wide TPM-normalized gene expression profiles. After log-transformation and standardization, dimensionality was reduced using a pretrained autoencoder, and the resulting 20-dimensional latent space was used as input for classification.
Classifier design and overlap sensitivity settings
We constructed SFCs using the LightGBM framework. For each model, we manually adjusted the overlap parameter, which defines the assumed degree of distributional continuity between adjacent classes (PD–control and control–HD). The parameter was defined as the proportion of each class’s distribution width that overlaps with its neighboring class. For example, an overlap of 0.1 implies that 10% of the feature distribution of one class intersects with the adjacent class. A series of SFC models were built using different overlap hyperparameter values (0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2, 0.25, 0.3), while keeping all other training settings constant. For each setting, the importance of the latent features were evaluated based on gain values from LightGBM. Model evaluation at each overlap setting was performed using tenfold stratified cross-validation to ensure consistent performance comparison across hyperparameter conditions.
Evaluation of feature robustness
We analyzed the consistency of feature prioritization across models by comparing the frequency with which each latent feature appeared among the top-ranked features under different overlap conditions. Only cross-validation sets with the accuracy score greater than 0.66 were subjected to feature ranking. A robust SFA model should consistently prioritize similar latent features despite variation in the overlap assumption. As shown in Supplementary Fig. 3, one specific latent dimension (e.g., feature 16) was repeatedly prioritized across most overlap settings, indicating that SFA’s ability to extract spectrum-relevant features is robust to moderate variation in the overlap parameter. This supports the practical applicability of SFA even when the precise degree of state continuity is unknown.
Classifier based on PCA and TF enrichment analysis (related to Supplementary Fig. 6)
To evaluate the performance of the SFA using features derived from conventional linear dimensionality reduction, we applied principal component analysis (PCA) to transcriptome data from Parkinson’s disease (PD), Huntington’s disease (HD), and control brain samples. This analysis aimed to test whether SFA could identify spectrum-relevant signals even when using orthogonal, unsupervised features such as principal components (PCs).
Data processing and PCA embedding
RNA-seq expression data were log-transformed after replacing zero values with a small constant, and standardized. PCA was performed on the resulting gene expression matrix, and the minimum number of principal components required to explain at least 85% of the total variance was selected for further analysis. In this dataset, 45 principal components met this criterion and were used as input features for classification.
Classifier training and comparison
SFCs and general classifiers (GCs) were implemented using the LightGBM framework. The SFC was trained using a custom loss function that encoded partial overlap between adjacent class labels (PD–control and control–HD), reflecting assumed biological continuity. The value of hyperparameter for the overlap was set to 0.05. The GC used standard multiclass objective functions. Optimization for LightGBM Hyperparameters was performed using Optuna. Model evaluation was performed by repeated tenfold stratified cross-validation (repeat number = 10).
Gene contribution analysis
To identify genes contributing to the spectrum-relevant features, we first identified principal components that were consistently prioritized by the SFC but not by the GC, based on feature importance scores. In this step, only cross-validation sets with the accuracy score greater than 0.66 were subjected to feature ranking. Then, for each selected PC, we extracted the top loading genes (i.e., genes with the largest absolute values in the PC loading matrix) as candidates contributing to disease spectrum variation.
Transcription factor enrichment via TRRUST
The union of genes identified from SFC-prioritized PCs was submitted to Metascape for enrichment analysis using the TRRUST transcription factor–target interaction database. Significantly enriched transcription factors were identified based on statistical enrichment (Fisher’s exact test with adjusted p-values), and the results were visualized as a clustered heatmap (Supplementary Fig. 6C), illustrating the regulatory architecture underlying the PCA-extracted spectrum-related gene set.
Classifier implementation using XGBoost (related to Supplementary Fig. 7)
To examine the flexibility of the SFA and evaluate its compatibility with classifiers beyond LightGBM, we implemented the SFC using the XGBoost framework. This analysis, corresponding to Supplementary Fig. 7, was conducted on the same 20-dimensional latent features derived from the PD, control, and HD transcriptome dataset described above.
Data input and preprocessing
TPM-normalized gene expression data were log-transformed after replacing zero values with a small constant (8.62 × 10⁻⁶) to ensure numerical stability. The transformed data were then standardized per gene (mean = 0, SD = 1). A subset of features was used as input for model training, consistent with the LightGBM-based analyses.
Model construction and optimization
We implemented the SFC in XGBoost by defining a custom multiclass objective function that incorporates overlap weights among class labels to represent state continuity. A symmetric overlap was assumed between PD–control and control–HD classes and set to 0.05 as a value for the overlap hyperparameter, while PD–HD overlap was set to zero, following the same spectrum modeling as in other SFC settings. For baseline comparison, a general classifier (GC) was also constructed using standard multiclass log-loss. XGBoost hyperparameters for both classifiers were optimized using the Optuna framework with tenfold stratified cross-validation. Model performance was assessed based on classification accuracy, and only those models exceeding twice the baseline (i.e., >0.66 accuracy) were retained for feature importance analysis.
Evaluation of classifier compatibility
Feature importance from XGBoost-based GC and SFC was computed using the built-in gain metric. Features more strongly prioritized by the SFC relative to the GC were identified using rank sum comparison. The consistency of these results with the LightGBM-based SFA findings demonstrates the methodological flexibility of SFA and validates the robustness of extracted spectrum-relevant features across classifier architectures (see Supplementary Fig. 7).
Comparison with weighted gene co-expression network analysis (related to Supplementary Fig. 8)
To determine whether the SFA identifies gene sets distinct from those uncovered using conventional co-expression-based methods, we conducted Weighted Gene Co-expression Network Analysis (WGCNA) on the same transcriptomic dataset derived from postmortem human brain tissues of Parkinson’s disease (PD), Huntington’s disease (HD), and control subjects.
Data preprocessing
RNA-seq data were normalized to transcripts per million (TPM), log-transformed following the replacement of zero values with a small constant (8.62 × 10⁻⁶), and standardized per gene. Genes with zero expression across all samples were excluded. The processed data were transposed such that genes corresponded to rows and samples to columns, in preparation for WGCNA.
Network construction and module detection
WGCNA was implemented using the PyWGCNA package in Python. A weighted co-expression network was constructed by calculating pairwise correlations between all genes. Soft-thresholding power was selected to approximate a scale-free network topology, and gene modules were defined using hierarchical clustering of the topological overlap matrix. Module eigengenes, representing the first principal component of each module’s expression pattern, were calculated for downstream trait association.
Trait association and gene selection
Module eigengenes were correlated with sample traits (PD, HD, or control) using Pearson correlation analysis. Modules showing significant positive or negative correlation with any of the three groups were retained. For each such group-associated module, the top 100 genes with the highest module membership (kME) values were selected as representative genes. These kME values reflect the strength of the correlation between each gene and the module eigengene. The selected top-100 genes from all modules significantly associated with each group (PD, HD, control) were then combined to form group-specific WGCNA-derived gene sets. That is, one gene set was constructed for PD, one for HD, and one for control, by aggregating the top genes from each group-correlated module. Also, a gene set constructed by aggregating PD, HD and control gene sets were referred to as WGCNA gene set.
Comparison with SFA
To assess the overlap and divergence between conventional and SFA-based gene selection strategies, we compared the WGCNA-derived gene sets with those obtained from SFA, which were based on gradient backpropagation from the decoder of the trained autoencoder. As shown in Supplementary Fig. 8, the overlap between the two sets was minimal, indicating that SFA identifies gene features involved in disease spectrum continuity that may be overlooked by correlation-based module analysis. This suggests that SFA provides complementary biological insights to WGCNA.
Comparative visualization of spectrum-associated gene expression patterns
To visualize and compare the expression patterns of genes associated with pathological spectrum progression, we generated heatmaps using Z-score–standardized gene expression values. For each gene, expression values were transformed using the Control group as the reference, such that the mean and standard deviation within the Control group became 0 and 1, respectively. To align the direction of gene expression changes with the pathological spectrum (HD → Control → PD), we applied a sign adjustment such that the average expression in the HD group was greater than that in the PD group. This convention facilitated consistent visual and statistical interpretation of monotonic trends across disease states. Genes containing any NaN, inf, or -inf values after transformation were excluded from downstream visualization and analysis. For each of the three gene sets—SFA-only, WGCNA-only, and their overlap—we calculated the group-wise mean and standard error of the signed Z-scores for the HD, Control, and PD samples. These summary statistics were used to support comparisons across gene sets.
Data availability
Raw analysis examples and a refactored code template are available on GitHub (https://github.com/kyoshiaki-tky/SFA.git).
Code availability
Raw analysis examples and a refactored code template are available on GitHub (https://github.com/kyoshiaki-tky/SFA.git).
References
Zhang, Y., Chen, H., Li, R., Sterling, K. & Song, W. Amyloid β-based therap1y for Alzheimer’s disease: challenges, successes and future. Signal Transduct. Target Ther. 8, 248 (2023).
Kubota, K. J., Chen, J. A. & Little, M. A. Machine learning for large-scale wearable sensor data in Parkinson’s disease: concepts, promises, pitfalls, and futures. Mov. Disord. 31, 1314–1326 (2016).
Odusami, M., Maskeliūnas, R., Damaševičius, R. & Misra, S. Machine learning with multimodal neuroimaging data to classify stages of Alzheimer’s disease: a systematic review and meta-analysis. Cogn. Neurodyn. 18, 775–794 (2024).
Pantaleo, E. et al. A Machine learning approach to Parkinson’s disease blood transcriptomics. Genes 13, 727 (2022).
Abdullah, M. N., Wah, Y. B., Abdul Majeed, A. B., Zakaria, Y. & Shaadan, N. Identification of blood-based transcriptomics biomarkers for Alzheimer’s disease using statistical and machine learning classifier. Inform. Med. unlocked 33, 101083 (2022).
Du, X., DeForest, N. & Majithia, A. R. Human genetics to identify therapeutic targets for NAFLD: challenges and opportunities. Front. Endocrinol.12, 777075 (2021).
Ramírez-Mejía, M. M. & Méndez-Sánchez, N. What is in a name: from NAFLD to MAFLD and MASLD—unraveling the complexities and implications. Curr. Hepatol. Rep. 22, 221–227 (2023).
Aihara, K., Liu, R., Koizumi, K., Liu, X. & Chen, L. Dynamical network biomarkers: theory and applications. Gene 808, 145997 (2022).
Ishida, T. et al. A Novel method to estimate long-term chronological changes from fragmented observations in disease progression. Clin. Pharmacol. Ther. 105, 436 (2018).
Matsena Zingoni, Z., Chirwa, T. F., Todd, J. & Musenge, E. A review of multistate modelling approaches in monitoring disease progression: Bayesian estimation using the Kolmogorov-Chapman forward equations. Stat. Methods Med. Res. 30, 1373–1392 (2021).
Attia, I. M. Novel approach of multistate Markov chains to evaluate progression in the expanded model of non-alcoholic fatty liver disease. Front. Appl. Math. Stat. 7, 766085 (2022).
McCullum, L. B. et al. Markov models for clinical decision-making in radiation oncology: a systematic review. J. Med. Imaging Radiat. Oncol. 68, 610–623 (2024).
Calabresi, P. et al. Alpha-synuclein in Parkinson’s disease and other synucleinopathies: from overt neurodegeneration back to early synaptic dysfunction. Cell Death Dis. 14, 176 (2023).
Rocha, G. S. et al. Basal ganglia for beginners: the basic concepts you need to know and their role in movement control. Front. Syst. Neurosci. 17, 1242929 (2023).
Wani, T. A., Al-Omara, M. A. & Zargar, S. Huntington disease: current advances in pathogenesis and recent therapeutic strategies. Int. J. Pharm. Sci. Drug Res. 3, 69–79 (2011).
Nopoulos, P. C. Huntington disease: a single-gene degenerative disorder of the striatum. Dialogues Clin. Neurosci. 18, 91–98 (2016).
Jankovic, J. MD. Treatment of hyperkinetic movement disorders. Lancet Neurol. 8, 844–856 (2009).
Zhang, C.-B. et al. Delving deep into label smoothing. IEEE Trans. Image Process. 30, 5984–5996 (2021).
Han, Y. et al. Robust visual tracking based on adversarial unlabeled instance generation with label smoothing loss regularization. Pattern Recognit. 97, 107027 (2020).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. CVPR 2016, 2818–2826 (2016).
Galdran, A. et al. Non-uniform label smoothing for diabetic retinopathy grading from retinal fundus images with deep neural networks. Trans. Vis. Sci. Technol. 9, 34 (2020).
Denechaud, P., Fajas, L. & Giralt, A. E2F1, a novel regulator of metabolism. Front. Endocrinol. 8, 311 (2017).
Denechaud, P. et al. E2F1 mediates sustained lipogenesis and contributes to hepatic steatosis. J. Clin. Investig. 126, 137–150 (2016).
Staib, F. et al. The p53 tumor suppressor network is a key responder to microenvironmental components of chronic inflammatory stress. Cancer Res. 65, 10255–10264 (2005).
Labadorf, A., Choi, S. H. & Myers, R. H. Evidence for a pan-neurodegenerative disease response in Huntington’s and Parkinson’s disease expression profiles. Front. Mol. Neurosci. 10, 430 (2018).
Gerfen, C. R. Segregation of D1 and D2 dopamine receptors in the striatal direct and indirect pathways: an historical perspective. Front. Synaptic Neurosci. 14, 1002960 (2023).
Tahami Monfared, A. A. et al. Estimating transition probabilities across the Alzheimer’s disease continuum using a nationally representative real-world database in the United States. Neurol. Ther. 12, 1235–1255 (2023).
Rezaie, N., Reese, F. & Mortazavi, A. PyWGCNA: a Python package for weighted gene co-expression network analysis. Bioinformatics 39, btad415 (2023).
Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinform. 9, 559–559 (2008).
Zhang, Y. et al. E2F1 is a novel fibrogenic gene that regulates cholestatic liver fibrosis through the Egr-1/SHP/EID1 network. Hepatology 60, 919–930 (2014).
Mariani, S. et al. Plasma levels of SIRT1 associate with non-alcoholic fatty liver disease in obese patients. Endocrine 49, 711–716 (2015).
Tian, C., Huang, R. & Xiang, M. SIRT1: harnessing multiple pathways to hinder NAFLD. Pharmacol. Res. 203, 107155 (2024).
Ding, R., Bao, J. & Deng, C. Emerging roles of SIRT1 in fatty liver diseases. Int. J. Biol. Sci. 13, 852–867 (2017).
Vilà, L. et al. AAV8-mediated Sirt1 gene transfer to the liver prevents high carbohydrate diet-induced nonalcoholic fatty liver disease. Mol. Ther. Methods Clin. Dev. 1, 14039 (2014).
Singh, S. S. et al. NF-κB-mediated neuroinflammation in Parkinson’s disease and potential therapeutic effect of polyphenols. Neurotox. Res. 37, 491–507 (2020).
García-García, V. A., Alameda, J. P., Page, A. & Casanova, M. L. Role of NF-κB in ageing and age-related diseases: lessons from genetically modified mouse models. Cells 10, 1906 (2021).
Perrin, V. et al. Implication of the JNK pathway in a rat model of Huntington’s disease. Exp. Neurol. 215, 191–200 (2009).
Hu, K. et al. c-Jun/Bim upregulation in dopaminergic neurons promotes neurodegeneration in the MPTP mouse model of Parkinson’s disease. Neuroscience 399, 117–124 (2019).
Bradford, J. et al. Expression of mutant huntingtin in mouse brain astrocytes causes age-dependent neurological symptoms. Proc. Natl Acad. Sci. 106, 22480–22485 (2009).
Li, S. et al. Interaction of Huntington disease protein with transcriptional activator Sp1. Mol. Cell. Biol. 22, 1277–1287 (2002).
Wang, R. et al. Sp1 regulates human Huntingtin gene expression. J. Mol. Neurosci. 47, 311 (2012).
Dong, L. & Gao, L. SP1-driven FOXM1 upregulation induces dopaminergic neuron injury in Parkinson’s disease. Mol. Neurobiol. 61, 5510–5524 (2024).
Yao, L. et al. Inhibition of transcription factor SP1 produces neuroprotective effects through decreasing MAO B activity in MPTP/MPP+ Parkinson’s disease models. J. Neurosci. Res. 96, 1663–1676 (2018).
Cai, L. et al. Up-regulation of microRNA-375 ameliorates the damage of dopaminergic neurons, reduces oxidative stress and inflammation in Parkinson’s disease by inhibiting SP1. Aging 12, 672–689 (2020).
Chandra, B., Voas, M. G., Davies, E. L. & Roberts-Galbraith, R. H. Ets-1 transcription factor regulates glial cell regeneration and function in planarians. Develop 150, dev201666 (2023).
Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: A next-generation hyperparameter optimization framework. KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, New York. 2623–2631 https://doi.org/10.1145/3292500.3330701 (2019).
Zhou, Y. et al. Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat. Commun. 10, 1523 (2019).
Han, H. et al. TRRUST: a reference database of human transcriptional regulatory interactions. Sci. Rep. 5, 11432 (2015).
Janet, P. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, 833 (2017).
Acknowledgements
This research was partially supported by AMED under grant number 24mk0121280 and JSPS KAKENHI Grant Numbers JP23K27336 and JP23K21464.
Author information
Authors and Affiliations
Contributions
Y.K. and T.T. designed the experiments. T.F., Y.K., and K.K. performed computational analyses. T.F., Y.K. and S.M. performed the cellular experiments. T.F., Y.K., K.K. and T.T. wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Fujiwara, T., Kariya, Y., Kobayashi, K. et al. Utility of the continuous spectrum formed by pathological states in characterizing disease properties. npj Syst Biol Appl 11, 100 (2025). https://doi.org/10.1038/s41540-025-00579-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41540-025-00579-x