Abstract
Multi-omics data provides a comprehensive view of biological systems and enables researchers to uncover intricate molecular mechanisms underlying complex diseases. However, multi-omics data is often incomplete, and joint modeling across -omics layers excludes a large portion of subjects. Furthermore, most current multi-omics studies pinpoint individual -omics markers, which may not interact, posing challenges for interpretation. In this study, we developed an interpretable deep trans-omic fusion neural network, TransFuse, to include incomplete -omics data in the training of prediction models. When evaluated using data from two Alzheimer’s disease cohorts, TransFuse generally showed superior or comparable performance relative to competing methods across a wide range of metrics, such as classification accuracy and F1 score. In addition, TransFuse yielded a subset of multi-omics features forming functional disease network modules, providing valuable insights into the underlying molecular mechanisms. Moreover, almost all the genetic variants identified by TransFuse are expression quantitative trait loci (eQTLs) specific to frontal cortex tissue, from which the gene and protein expression data were collected. This highlights the great potential of TransFuse in capturing tissue-specific information flow. The top enriched pathways include the VEGF and EPH pathways, both of which influence neural development and synaptic formation.
Introduction
Multi-omics data provides a comprehensive view of biological systems by integrating various molecular layers, such as genomics, proteomics, and metabolomics. This comprehensive approach enables researchers to uncover intricate disease mechanisms and identify potential therapeutic targets1. In parallel, databases such as Reactome and SNP2TFBS provide critical prior knowledge of regulatory links between SNPs, genes, and proteins. Although neither entirely accurate nor tissue specific, this prior knowledge can still be leveraged as an additional source of evidence on top of multi-omics data to improve precision in the discovery of molecular mechanisms. Recently, a growing effort has been dedicated to the integrative modeling of multi-omics data together with rich prior knowledge of biological interactions, such as protein interactions and biochemical interactions in pathways. Several studies have verified that machine learning models informed by prior biological interactions can achieve improved prediction performance and model interpretability2. A subset of these are integrative -omics analyses that jointly model multi-omics data and prior biological interactions3,4. This approach has yielded functionally connected -omics features, such as genetic variants, genes, and proteins, offering valuable insights into molecular mechanisms and providing potential targets for treatment.
However, a major challenge in integrative modeling of multi-omics data is the prevalence of missing data across -omics layers5. In disease cohorts, incomplete multi-omic profiles are common due to various factors such as technical limitations, limited tissue availability, and patient dropout. Most established integrative -omics modeling techniques rely on concatenated multi-omics features and thus require complete multi-omic profiles from all subjects4,6. For instance, MOGONET is a graph neural network framework designed to integrate omics-specific subject similarity networks for patient classification6. In addition, our previous work, MoFNet, is a graph neural network that integrates multi-omics data with prior trans-omic interactions to model the flow of information from DNA to gene and protein. However, these methods necessitate that all subjects have a complete set of omics data, leading to the exclusion of subjects with missing -omics types and restricting analysis to a small subset of the cohort. This limitation reduces the full potential of large multi-omic datasets, thereby diminishing predictive power and the ability to identify robust -omics features associated with complex diseases.
While imputation methods can be used to estimate missing values, they typically perform well only when a few values are missing and rely on the presence of partial data and observable patterns within each data type7,8. However, when an entire -omics data type is missing, imputation must depend solely on other -omics types, which is often insufficient because shared information across -omics is limited and may fail to capture modality-specific biological signals. Although some methods attempt to predict one -omics type from another3, such predictions risk amplifying noise or biases present in the observed data, thereby reducing robustness for downstream tasks such as classification or biomarker discovery. Imputation methods for substantial missingness remain underdeveloped and should be used with caution; while they may capture global patterns, they are generally unsuitable for precision tasks like patient classification.
In this study, we introduce TransFuse, an interpretable deep trans-omic fusion neural network that enables the inclusion of subjects with incomplete -omics data during model training, without requiring reconstruction of large missing data chunks. TransFuse integrates multi-omics data with prior knowledge of functional interactions among proteins, genes, and their upstream regulatory SNPs. As an extension of MoFNet, TransFuse adopts a modular network architecture consisting of three separate modules, each dedicated to processing only one -omics type (i.e., SNPs, gene expression, or proteins). This modular design allows each component to be pre-trained independently using subjects with missing -omics types. We evaluated TransFuse on multi-omics data from the ROS/MAP cohort using Alzheimer’s disease as a test case. TransFuse fine-tuned from pre-trained modules demonstrated improved or comparable prediction performance relative to other state-of-the-art methods, while also identifying biologically meaningful disease sub-networks associated with Alzheimer’s disease.
Results
Classification performance
Shown in Fig. 1 are the performance metrics for the proposed TransFuse model and other competing classification methods. We reported not only accuracy and AUC, but also F1 score, precision, and specificity to provide a comprehensive comparison of performance. Notably, the F1 score combines precision and sensitivity (recall) into a single metric and is widely used as a primary evaluation metric for imbalanced datasets. We found that fine-tuned TransFuse achieved the best average performance in 4 out of 5 evaluation metrics, with consistently low variation across the 5 folds (Supplementary Table S1). Following the approach in ref. 9, we conducted paired t-tests to assess the significance of this improvement. The tests showed that fine-tuned TransFuse delivered significantly higher or comparable performance relative to other methods (FDR-corrected p < 0.05; Fig. 1). In terms of accuracy, it significantly outperformed baseline TransFuse, GraphNet-constrained logistic regression, MOGONET, and MLP, and performed marginally better than the Lasso model. For specificity, it significantly outperformed competing methods including baseline TransFuse, MLP, Lasso, GraphNet, and Lasso-constrained logistic regression. Non-significant differences in other metrics are mainly due to high variability in the performance of competing methods, as reflected by the large error bars in Fig. 1. Additionally, we performed paired t-tests to assess whether any competing method significantly outperformed fine-tuned TransFuse on any metric. All p-values were greater than 0.92 (Supplementary Table S2), indicating that none of the competing methods demonstrated statistically superior performance compared to fine-tuned TransFuse.
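To make the relationship between the reported metrics concrete, the following minimal sketch computes accuracy, precision, sensitivity, specificity, and F1 score from binary predictions. The function name `classification_metrics` and the toy labels are illustrative assumptions and are not taken from the study; AUC is omitted because it requires predicted probabilities rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics from binary labels (1 = AD, 0 = CN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # also called recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Toy labels, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case there are three true positives, three true negatives, one false positive, and one false negative, so all five metrics equal 0.75.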
Multi-omic sub-networks for AD
Across all methods, Laplacian-logistic and Lasso-logistic selected only a small number of features, with few prior connections (Fig. S3 and S4). Other methods, such as modularity-constrained logistic regression, elastic net, random forest, and MLP, identified a much larger set of features without much prioritization (Fig. S1, S2, S5 and S7). In contrast, TransFuse (Fig. 2) and MOGONET (Fig. S6) returned a smaller, more connected set of disease-related modules, which are more manageable for downstream functional interpretation. The fine-tuned TransFuse model identified 20 peptides, 107 genes and 7 SNPs, most of which are part of one large cohesive sub-network and three smaller connected sub-networks (Fig. 2). The smaller groups highlighted localized functional interactions within the broader prior network, consisting of important AD-related genes and proteins such as the microtubule associated protein tau (MAPT) gene and the tau_PHF1_S404 peptide, which confirmed the important role of tau phosphorylation in neurodegeneration. The paired helical filament 1 (PHF1) antibody targets the specific phosphorylated epitope of tau at Ser396/404. Phosphorylation at this PHF1 site disrupts tau’s structural folding, promoting polymerization and contributing to neurofibrillary tangle formation10.
The largest sub-network includes 16 peptides, 104 genes, and 5 SNPs. The protein apolipoprotein E (APOE), the top risk factor for Alzheimer’s disease, was identified within this sub-network, directly connected to the early growth response protein 1 (EGR1) gene. EGR1 transcripts were found to be elevated in APOE-deficient mice, pointing to an inverse relationship between EGR1 and APOE levels11. The path connecting amyloid precursor protein (APP_2), aph-1 homolog A/B (APH1A/APH1B), cluster of differentiation-44 (CD44_2), and EGR1 covers another potentially critical aspect of AD. APP processing by the gamma-secretase complex (containing APH1A/APH1B) produces amyloid-beta, a primary driver of AD12. CD44, a mediator of cell adhesion, plays a complex role in neuroinflammation by controlling T-cell differentiation, adhesion, and blood-brain barrier permeability. In CD44-deficient mice, increased permeability and pro-inflammatory T-cell profiles highlight CD44’s role as a negative regulator of inflammation13. EGR1, a transcription factor, is involved in synaptic health and is impacted by neuroinflammation and amyloid-beta processing14, contributing to synaptic dysfunction and cognitive decline. Interactions between APP, APH1A/APH1B, CD44, and EGR1 reveal a pathological interaction network that may provide important hypotheses for therapeutic development.
Within the largest sub-network, hub peptides like phosphoinositide-3-kinase regulatory subunit 1 (PIK3R1_1) and growth factor receptor bound protein 2 (GRB2_1) exhibit high importance scores, demonstrating their role in integrating signals from upstream genes and other interacting proteins. These peptides are well known for their contribution to neuroinflammation, vascular permeability, and impaired cell signaling in AD15. The largest gene node angiopoietin 2 (ANGPT2) and its connection to hub peptide PIK3R1_1 suggest strong functional connectivity in signaling networks. Specifically, PIK3R1 modulates neuroinflammation through the PI3K/Akt pathway16, while ANGPT2 promotes vascular permeability and leukocyte infiltration by antagonizing Tie2 receptor signaling. A study showed that blocking ANGPT2 improved blood-brain barrier integrity, reducing proinflammatory cell activity and improving central nervous system function17.
Expression quantitative trait locus (eQTL) analysis
Considering that the gene and protein expression data are both from the prefrontal cortex, we further examined the tissue specificity of the functional effect associated with the identified SNPs. That is, we asked whether these SNPs are expression quantitative trait loci (eQTLs) associated with downstream transcriptomic changes in the corresponding tissue. Based on the Brain eQTL Almanac (BRAINEAC) database, we found that 5 out of the 7 identified SNPs were significant eQTLs in the frontal cortex (Table 1), and one SNP, rs10135521, could not be found in the database. The remaining SNP, rs9835340, showed a marginal eQTL effect with an adjusted p-value of 0.053.
Pathway enrichment analysis
Pathway enrichment analysis was performed on the largest sub-network of Fig. 2, which includes 16 peptides, 104 genes and 5 SNPs, using the g:Profiler platform. We identified several clusters of enriched functional pathways, like vascular endothelial growth factor (VEGF) signaling, erythropoietin-producing hepatocellular (EPH) signaling, cell communication and transcriptional regulation (Fig. 3). The largest cluster of enriched pathways is VEGF signaling pathways. This signaling family is involved in neuroprotection and speculated to contribute to the neurodegenerative process in AD18. Some research indicates that VEGF levels are decreased in the cerebrospinal fluid and brain tissue of Alzheimer patients, which might reflect impaired angiogenesis and neuroprotective functions19. In addition, our enrichment results highlighted the important role of EPH signaling pathway in AD (\(p = 4.54 \times 10^{-8}\)). The EPH signaling pathway, involving EPH receptors and ephrin ligands, is crucial for synaptic function and plasticity, impacting synapse formation and maintenance. Dysregulation of this pathway can lead to synaptic dysfunction and loss, which are the key features of AD20. Additionally, EPH signaling may influence amyloid-beta production and tau pathology, contributing to the progression of AD21.
Ablation analysis
In the ablation analysis, we evaluated the contribution of each -omics type to the final performance of the fine-tuned TransFuse model. Specifically, we tested six different input configurations by either removing one -omics type at a time (by setting its input values to zero) or retaining only one single -omics type. We used paired t-tests with FDR correction to assess the statistical significance of performance differences across these configurations (Supplementary Figure S8). When removing individual -omics types, we observed that omitting SNP or gene expression data led to only a slight decrease in performance. In contrast, removing protein data resulted in a significant drop, underscoring its critical importance to the predictive power of TransFuse. Similarly, among the single-omics models, the protein-only configuration consistently outperformed the SNP-only and gene-only models. Notably, the protein-only model achieved performance levels close to that of the full multi-omics model, further emphasizing the dominant role of protein data in this task. This finding is biologically plausible, as proteins are the functional products of gene expression and genetic variation, and thus may inherently encapsulate much of the downstream disease-relevant information. Our TransFuse model captures the information flow from SNP to gene and protein that likely contributes to the strong performance of protein features.
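The six input configurations can be sketched as follows. This is a minimal illustration that assumes the "retain only one" configurations are also obtained by zeroing the other two inputs; the helper `ablate`, the dictionary keys, and the toy values are hypothetical and not part of the published pipeline.

```python
OMICS = ("snp", "gene", "protein")


def ablate(inputs, keep):
    """Return a copy of `inputs` where every omics type not in `keep` is all zeros."""
    return {
        name: ([list(row) for row in X] if name in keep
               else [[0.0] * len(row) for row in X])
        for name, X in inputs.items()
    }


# Three leave-one-out and three single-omics configurations (six in total).
configs = {}
for omic in OMICS:
    configs[f"without_{omic}"] = tuple(o for o in OMICS if o != omic)
    configs[f"only_{omic}"] = (omic,)

# One toy subject per modality (values are made up).
toy = {"snp": [[0.0, 1.0, 2.0]], "gene": [[0.3, -1.2]], "protein": [[0.8]]}
ablated = {name: ablate(toy, keep) for name, keep in configs.items()}
```

Each ablated input set would then be passed through the already fine-tuned model without retraining, so that performance differences reflect only the missing input.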
Replication analysis
After module pre-training using ROS/MAP samples with missing -omics types, the learned weights from the three pre-trained modules were used to initialize the corresponding components of the TransFuse model, while the fully connected layers were initialized with random weights. The fine-tuning process was performed separately using complete samples from the ROS/MAP and MSBB cohorts, allowing TransFuse to adapt independently to each dataset. The TransFuse model fine-tuned using MSBB complete samples identified 3 SNPs, 64 genes, and 21 peptides associated with Alzheimer’s disease, among which 1 SNP, 17 genes, and 8 peptides overlapped with those identified in the ROS/MAP cohort.
Discussion
We proposed a new deep multi-omic fusion model that models the dynamic information flow from DNA to RNA and protein, and at the same time enables the training using subjects with missing -omics types. TransFuse, after fine-tuning, demonstrated significantly improved performance over a wide range of performance metrics. Compared to other competing methods, it also yielded a subset of multi-omics features with functional interactions mostly known in the prior knowledge, with reasonable success when replicated in an independent cohort. In addition, almost all the SNPs identified by TransFuse are eQTLs specific to frontal cortex tissue, from which the gene expression and protein expression data were collected. This highlights the great potential of TransFuse in capturing tissue-specific information flow from SNP to RNA. Our findings together underscore the important role of the VEGF and EPH pathways in AD. Both pathways influence neural development, with VEGF promoting neuron growth and survival, while Eph signaling is involved in axon guidance and synaptic formation. Their crosstalk is crucial in coordinating neural functions, and dysregulation of these pathways could contribute to neurological disorders, including Alzheimer’s disease.
There are several limitations of this work that merit further consideration. The prior SNP-gene interactions used in this study are limited to promoter regions. However, many of the AD risk SNPs identified in genome-wide association studies (GWASs) are intergenic and located in non-coding regions. Therefore, our prior network limits the capability of TransFuse to connect AD risk SNPs with AD-related proteins. Instead, it only searches for upstream regulatory SNPs near candidate genes, which likely leads to the small number of identified SNPs in both the discovery and replication analyses. This could be improved by incorporating emerging high-throughput chromosome conformation capture (Hi-C) data, which provide distant SNP-gene interactions and will likely reach intergenic regions with AD GWAS findings. Furthermore, while TransFuse achieved the highest average F1 score, its effectiveness in handling imbalanced data remains unclear, as the dataset in this study is not highly imbalanced. A more thorough evaluation would require testing on more imbalanced data when it becomes available.
Methods
Multi-omic data
Multi-omics data used in this study were downloaded from a cohort of Alzheimer’s disease (AD), the Religious Orders Study (ROS) and Memory and Aging Project (MAP) cohort22. Multi-omics data of 1717 participants were downloaded from the AMP-AD portal, including imputed genotypes, pre-processed RNA-Seq gene expression, and protein expression data, along with diagnostic information. RNA-Seq gene expression and protein expression data were collected from the prefrontal cortex region of postmortem brains. Table 2 outlines the demographics of the study participants, including the breakdown of cognitively normal (CN) and AD participants across all -omics data types. This dataset includes a substantial number of subjects with brain tissue data, totaling 683 CN and 581 AD in the genomics group, 354 CN and 268 AD in the transcriptomics group, and 641 CN and 541 AD in the proteomics group. Among those 1717 subjects, only 464 have a complete set of -omics types, including 263 CN and 201 AD cases. Considering the information flow from SNPs to genes and proteins, this rich multi-omics dataset allows for a multi-layered analysis of the genetic and molecular factors associated with AD. Detailed information regarding the ROS/MAP cohort and the processing steps for genotype, gene expression and protein expression data can be found in the supplementary text.
Pre-filtering of -omics features
This multi-omics analysis was designed to target the molecular mechanisms underlying 186 peptides measured in the ROS/MAP study, selected for their high relevance to AD23. Similar to our previous research, we employed a bottom-up approach to pre-filter SNPs and genes, which is also expected to mitigate the effect of small sample size4. Specifically, we mapped the 186 peptides to 126 unique genes (gene set A), which demonstrated functional interactions with 954 genes (gene set B) in the Reactome database. Among these 1080 (126 + 954) genes, 743 of them with available RNA-seq data were included in the subsequent analysis. Upstream SNPs within 5K base pairs of these genes were identified, and further filtered to include only those significantly associated with transcription factor-binding activity as per the SNP2TFBS database. This resulted in a comprehensive dataset comprising 822 SNPs, 743 genes, and 186 peptides (Fig. 4a). The functional relationships used to filter the genes and SNPs formed a prior multi-omic network, and will be embedded into the modules of TransFuse to guide the search for molecular sub-networks related to AD.
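The bottom-up pre-filtering described above amounts to a sequence of set operations, sketched below on toy data. All identifiers (`pepA_1`, `G1`, `s1`, and so on) and the toy lookup tables are made up for illustration; they stand in for the peptide annotation, Reactome interactions, RNA-seq gene list, upstream (within 5 kb) SNP map, and SNP2TFBS hits used in the study.

```python
# Toy stand-ins for the real resources (all values are hypothetical).
peptide_to_gene = {"pepA_1": "G1", "pepB_1": "G2", "pepB_2": "G2"}
interactions = {("G1", "G3"), ("G2", "G4"), ("G1", "G5"), ("G6", "G7")}
rnaseq_genes = {"G1", "G2", "G3", "G5", "G6"}
upstream_snps = {"G1": {"s1", "s2"}, "G5": {"s3"}, "G4": {"s4"}, "G6": {"s5"}}
tfbs_snps = {"s1", "s3", "s4", "s5"}

# Gene set A: genes encoding the measured peptides.
gene_set_a = set(peptide_to_gene.values())

# Gene set B: interaction partners of set A that are not already in set A.
gene_set_b = {
    gene
    for pair in interactions
    if gene_set_a & set(pair)
    for gene in pair
} - gene_set_a

# Keep only candidate genes with available expression data.
kept_genes = (gene_set_a | gene_set_b) & rnaseq_genes

# Keep upstream SNPs of retained genes that also affect TF binding.
snp_gene_edges = {
    (snp, gene)
    for gene in kept_genes
    for snp in upstream_snps.get(gene, set()) & tfbs_snps
}
```

The retained SNP-gene pairs, together with the gene-protein and protein-protein interactions, are what form the prior network in this sketch; in the study the same logic reduces 1080 (126 + 954) candidate genes to the 743 with RNA-seq data.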
Prediction outcome
The SNP genotype, gene expression, and protein expression data were extracted and utilized for the classification of AD patients from cognitive normal (CN) individuals. The clinical diagnosis of all participants at the time of brain tissue collection was used as the indicator of their disease status. Notably, the diagnosis time aligns with the time of data collection for the -omics datasets.
Architecture of TransFuse
The proposed deep multi-omic fusion model, TransFuse, is a graph neural network with a modular architecture that enables transfer learning. It is composed of three primary modules, each modeling one type of information flow or information integration (Fig. 4b). Module 1 takes the genotypes of pre-filtered SNPs as input, and its output nodes are pseudo-genes, each corresponding to one pre-filtered gene. Links in module 1 are simply the SNP-gene connections in the prior network. Each pseudo-gene in module 1 integrates information from the upstream SNPs of the corresponding gene. Module 2 takes the expression of pre-filtered genes as input, and its output nodes are pseudo-proteins, each corresponding to one input protein. Links in module 2 are gene-protein functional interactions in the prior network. Each pseudo-protein in module 2 integrates information from interacting genes, and this output is then integrated with the output of module 1. Module 3 takes the expression of pre-filtered proteins as input, and its output nodes are pseudo-proteins, the same as the output of module 2. Each pseudo-protein in this module integrates information from connected proteins, including itself. Finally, the outputs from these modules are integrated and passed to the fully connected layers.
1. Input layers: Input data was separated into three modalities, SNP genotype, gene expression and protein expression, denoted as \(X_1\in \mathbb{R}^{N_{subj} \times 822}\), \(X_2\in \mathbb{R}^{N_{subj} \times 743}\), and \(X_3\in \mathbb{R}^{N_{subj} \times 186}\) respectively. Here, \(N_{subj} = 464\) is the number of subjects.
2. Graph fusion layers: Three graph fusion layers (i.e., modules) were directly wired with prior biological interactions to encode functional connectivity among SNPs, genes, and proteins. For example, the adjacency matrix \(A_{3}\in \mathbb{R}^{186 \times 186}\) represents interactions between proteins, where \(A_{3}(i,j) = 1\) if protein \(i\) interacts with protein \(j\), and \(A_{3}(i,j) = 0\) otherwise. Self-connections within proteins are denoted by an identity matrix \(I_3\). The output of module 3, \(Z_{3}\), is computed as: \(Z_{3} = f\left( X_{3} \left( W_{3} \odot (A_{3} + I_{3}) \right) + b_{3}\right) ,\) where \(f(\cdot )\) is the activation function, \(W_{3}\) is the weight matrix, \(\odot\) represents the Hadamard product for element-wise multiplication of two matrices, and \(b_{3}\) is the bias term. The matrices \(A_{1} \in \mathbb{R}^{822 \times 743}\) and \(A_{2} \in \mathbb{R}^{743 \times 186}\) encode connections between SNPs and genes, and between genes and proteins respectively. Similarly, for \(k \in \left\{ 1, 2\right\}\), \(Z_{k} = f\left( X_{k} \left( W_{k} \odot A_{k} \right) + b_{k}\right)\).
3. Graph bridge layer: The graph bridge layer is designed to integrate the outputs from modules 1 and 2, i.e., \(Z_{1}\) and \(Z_{2}\). These outputs are concatenated and fed into the graph bridge layer, which has the same structure as the fusion layers. A combined adjacency matrix \(A_{b}\) is derived from the concatenation of gene-protein interactions \(A_{2}\in \mathbb{R}^{743 \times 186}\) and the self-connections among proteins \(I_{3}\in \mathbb{R}^{186 \times 186}\), i.e., \(A_{b} = \left[ A_{2}^T, I_{3}^T\right] ^T \in \mathbb{R}^{929 \times 186}\), where \([\cdot ]\) stands for concatenation, so that \(A_{b}\) stacks \(A_{2}\) on top of \(I_{3}\). The output of the bridge layer is \(Z_{b} = f\left( \left[ Z_{1}, Z_{2}\right] \left( W_{b} \odot A_{b} \right) + b_{b}\right)\), where \(W_{b}\) represents the weight matrix specific to the bridge layer and \(b_{b}\) is the bias term in this layer.
4. Fully connected layers: The architecture incorporates several fully connected layers for disease status classification, with dropout regularization to prevent overfitting. The initial fully connected layer takes the concatenated outputs from module 3 (\(Z_{3}\)) and the graph bridge layer (\(Z_{b}\)) as its input. Following this first layer, two additional fully connected layers further process the features before reaching the final prediction layer. The output of the last layer, \(Z_{L}\), uses a sigmoid activation function to classify the samples: \(Z_{L} = \sigma (Z_{L-1} w_{L} + b_{L}),\) where \(L\) indicates the layer number of the final prediction layer, and \(w_{L}\) stands for the weight of the \(L\)-th layer. The binary cross-entropy loss, \(\mathcal {L}_{\text {fuse}}\left( y, \hat{y}_{\text{ fuse }}\right) =-\frac{1}{n} \sum _{i=1}^n\left[ y_i \log \left( \hat{y}_{\text{ fuse }, i}\right) +\left( 1-y_i\right) \log \left( 1-\hat{y}_{\text{ fuse }, i}\right) \right]\), is used to measure the classification error across all \(n\) input samples.
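The layer definitions above can be traced end to end on a toy example. The sketch below uses only the Python standard library and makes several assumptions that are not stated in the text: the activation \(f\) is taken to be ReLU, the fully connected stack is collapsed into a single sigmoid unit, weights are random, and the toy adjacency matrices are made up. A practical implementation would use tensors in a deep learning framework instead of nested lists.

```python
import math
import random


def relu(v):
    return max(0.0, v)


def masked_layer(X, W, A, b, act=relu):
    """Compute act(X (W * A) + b), with * elementwise; one row of X per subject."""
    return [
        [act(sum(x[i] * W[i][j] * A[i][j] for i in range(len(x))) + b[j])
         for j in range(len(b))]
        for x in X
    ]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def hstack(Z_left, Z_right):
    """Concatenate two batches feature-wise (one row per subject)."""
    return [left + right for left, right in zip(Z_left, Z_right)]


rng = random.Random(0)
n_snp, n_gene, n_prot = 4, 3, 2   # toy sizes (822, 743 and 186 in the text)

# Toy prior network, made up for illustration.
A1 = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]   # SNP -> gene
A2 = [[1, 0], [1, 1], [0, 1]]                       # gene -> protein
A3 = [[0, 1], [1, 0]]                               # protein - protein
I3 = [[1, 0], [0, 1]]                               # protein self-connections
A3_self = [[a + s for a, s in zip(ra, rs)] for ra, rs in zip(A3, I3)]
Ab = A2 + I3                                        # A2 stacked on top of I3

# One toy subject per modality.
X1 = [[0.0, 1.0, 2.0, 1.0]]
X2 = [[0.5, -0.2, 1.3]]
X3 = [[0.9, -0.4]]

W1, b1 = rand_matrix(n_snp, n_gene, rng), [0.0] * n_gene
W2, b2 = rand_matrix(n_gene, n_prot, rng), [0.0] * n_prot
W3, b3 = rand_matrix(n_prot, n_prot, rng), [0.0] * n_prot
Wb, bb = rand_matrix(n_gene + n_prot, n_prot, rng), [0.0] * n_prot

Z1 = masked_layer(X1, W1, A1, b1)                 # module 1: pseudo-genes
Z2 = masked_layer(X2, W2, A2, b2)                 # module 2: pseudo-proteins
Z3 = masked_layer(X3, W3, A3_self, b3)            # module 3: pseudo-proteins
Zb = masked_layer(hstack(Z1, Z2), Wb, Ab, bb)     # bridge layer

# Stand-in for the fully connected stack: a single sigmoid unit.
features = hstack(Z3, Zb)
w_out = [rng.uniform(-1.0, 1.0) for _ in features[0]]
logit = sum(f * w for f, w in zip(features[0], w_out))
y_hat = 1.0 / (1.0 + math.exp(-logit))
```

Because each weight matrix is multiplied elementwise by its adjacency matrix, a weight at a position with no prior connection never influences the output, which is what ties every learned weight to one edge of the prior network.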
Omics-specific modules
Each module or graph fusion layer requires the input of only one -omics type, and therefore can be pre-trained using subjects without complete -omics types. Specifically, we pre-trained three graph neural networks for binary classification, each composed of a graph fusion layer followed with fully connected layers. Each single-modality graph neural network was pre-trained independently using subjects from each -omics layer, and only subjects with missing -omics types were included for pre-training. For each modality-specific graph neural network, \(k \in \{1,2,3\}\), the loss function \(\mathcal {L}_k\left( y, \hat{y}_k\right)\) is defined using binary cross-entropy, \(\mathcal {L}_k\left( y, \hat{y}_k\right) =-\dfrac{1}{{N_{k}}} \sum _{i=1}^{N_{k}}\left[ y_i \log \left( \hat{y}_{k, i}\right) +\left( 1-y_i\right) \log \left( 1-\hat{y}_{k, i}\right) \right] ,\) where \(N_{k}\) stands for the subjects available for pre-training of each modality, and \(\hat{y}_k\) represents the predicted diagnosis. The loss functions for SNP-, gene- and protein-specific neural networks were denoted as \(\mathcal {L}_{1}(\cdot ), \mathcal {L}_{2}(\cdot ), \mathcal {L}_{3}(\cdot )\) respectively.
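Taking the SNP-specific graph neural network as an example, the optimization problem is formulated as \(\min _{W_{1}} \; \mathcal {L}_{1}\left( y, \hat{y}_{1}\right) + \alpha _1 \Vert W_{1}\Vert _1,\)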
where \(W_1\) represents all the weights in the SNP-specific graph neural network, and \(\alpha _1\) is the regularization parameter. Since the prior network is not completely accurate and lacks tissue specificity, not all prior interactions hold the same importance. For instance, certain interactions may not occur in brain tissues, and thus, the corresponding information flow should not be modeled with brain multi-omics data. To address this, we employ a penalty term \(\Vert W_{1}\Vert _1\) to enforce zero weight on prior connections in the graph fusion layers, which helps exclude functional connections irrelevant to brain expression data and disease outcomes. The goal is to minimize the loss \(\mathcal {L}_1\) while promoting sparsity in the connections of the graph fusion layer. Weights from the graph fusion layers of all modality-specific neural networks are then transferred to the TransFuse model. Initially, these transferred weights were frozen to preserve the learned representations and stabilize the early phase of training on the complex integrated data. Later, all the modules were unfrozen, and the entire TransFuse model was fine-tuned using 464 subjects with a complete set of -omics types.
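As a numerical illustration of the penalized objective described above, the following sketch evaluates the binary cross-entropy plus an L1 penalty on a toy weight matrix. The function names, the clipping constant `eps`, and the toy values are assumptions made for illustration; only the value \(\alpha = 0.0005\) is borrowed from the reported best configuration.

```python
import math


def bce(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y, y_hat):
        p = min(max(p, eps), 1.0 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y)


def l1_norm(W):
    """Sum of absolute values of all entries of a weight matrix."""
    return sum(abs(w) for row in W for w in row)


def penalized_loss(y, y_hat, W, alpha):
    """Cross-entropy loss plus alpha times the L1 norm of the weights."""
    return bce(y, y_hat) + alpha * l1_norm(W)


# Toy values, made up for illustration.
y = [1, 0]
y_hat = [0.5, 0.5]
W = [[0.2, -0.3], [0.0, 0.5]]
loss = penalized_loss(y, y_hat, W, alpha=0.0005)
```

Here the cross-entropy equals \(\log 2 \approx 0.6931\) and the L1 norm equals 1.0, so the penalty contributes 0.0005 to the objective. Minimizing such an objective pushes many weights toward zero.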
Parameter optimization
During the module pre-training stage, three omics-specific graph neural networks (for proteins, genes, and SNPs) were independently trained using the Adam optimizer24, with optimal hyperparameters selected via a grid search. The predefined hyperparameter ranges included: dropout rate (0.1, 0.3, 0.5, 0.7), L1 regularization (0.0001, 0.0003, 0.0005, 0.0008, 0.0010, 0.0030, 0.0050), initial learning rate (0.0002, 0.0004, 0.0006, 0.0008, 0.0010, 0.0015, 0.0020), and weight decay (0.0001, 0.0002). The best-performing configuration included a dropout rate of 0.5, L1 regularization of 0.0005, an initial learning rate of 0.0006, and weight decay of 0.0002. The resulting modality-specific networks achieved prediction accuracies of 0.75 for proteins, 0.80 for genes, and 0.59 for SNPs. The architectures of the final two fully connected layers in each network were set to (186, 16) for proteins, (186, 32) for genes, and (743, 48) for SNPs.
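To give a sense of the size of this search space, the sketch below enumerates the grid with `itertools.product`. It assumes an exhaustive Cartesian product over the four listed ranges, which the text does not state explicitly; the dictionary keys are illustrative names.

```python
from itertools import product

# Hyperparameter ranges as listed in the text.
grid = {
    "dropout": [0.1, 0.3, 0.5, 0.7],
    "l1": [0.0001, 0.0003, 0.0005, 0.0008, 0.0010, 0.0030, 0.0050],
    "lr": [0.0002, 0.0004, 0.0006, 0.0008, 0.0010, 0.0015, 0.0020],
    "weight_decay": [0.0001, 0.0002],
}

# Every combination of one value per hyperparameter.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
n_configs = len(configs)

# The best-performing configuration reported in the text.
best = {"dropout": 0.5, "l1": 0.0005, "lr": 0.0006, "weight_decay": 0.0002}
```

Under this assumption the grid contains 4 × 7 × 7 × 2 = 392 configurations per modality-specific network.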
During the TransFuse fine-tuning stage, the learned weights from the three pre-trained modules were used to initialize the corresponding parts of TransFuse, while the fully connected layers were initialized with random weights. Fine-tuning was performed separately on the ROS/MAP and MSBB datasets using the same training pipeline. Initially, three modules in TransFuse were frozen, and only the fully connected layers and inter-module connections were trained using a relatively high learning rate of 0.0015. After this phase, the modules were unfrozen, and the entire network was fine-tuned using a smaller learning rate of 0.00006, selected from the range (0.00002, 0.00004, 0.00006, 0.00008). Early stopping and dropout (rate = 0.5) were applied to prevent overfitting. TransFuse was trained for up to 100 epochs, with early stopping triggered if validation loss failed to improve for 10 consecutive epochs or if training loss increased while validation loss decreased. The final two fully connected layers had dimensions of (186, 64) and (64, 1), respectively. All training was conducted in a Google Colab environment using an L4 GPU, PyTorch version 2.2.1, and CUDA version cu121.
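The stopping rule can be sketched as a small function over recorded loss curves. This is one possible reading of the criteria, not the authors' code: "failed to improve" is interpreted as not falling below the best validation loss so far, and the second criterion is interpreted as a comparison with the previous epoch. The function name and toy curves are hypothetical.

```python
def stopping_epoch(train_losses, val_losses, patience=10, max_epochs=100):
    """Return the 1-based epoch at which training stops under the assumed rules."""
    best_val = float("inf")
    stale = 0
    n_epochs = min(len(train_losses), len(val_losses), max_epochs)
    for epoch in range(n_epochs):
        # Criterion 1: no improvement in validation loss for `patience` epochs.
        if val_losses[epoch] < best_val:
            best_val = val_losses[epoch]
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch + 1
        # Criterion 2: training loss rose while validation loss fell.
        if (epoch > 0
                and train_losses[epoch] > train_losses[epoch - 1]
                and val_losses[epoch] < val_losses[epoch - 1]):
            return epoch + 1
    return n_epochs


# Toy curves: validation loss plateaus after the third epoch.
train_curve = [1.0 - 0.01 * i for i in range(30)]
val_curve = [0.9, 0.8, 0.7] + [0.75] * 27
stop_plateau = stopping_epoch(train_curve, val_curve)

# Toy curves: training loss increases while validation loss decreases.
stop_diverge = stopping_epoch([1.0, 0.9, 0.95], [1.0, 0.9, 0.8])
```

In the first toy case the best validation loss occurs at epoch 3, so ten stale epochs later training stops at epoch 13; in the second, the divergence criterion fires at epoch 3.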
Performance comparison
We compared the performance of TransFuse against MOGONET, a vanilla multilayer perceptron (MLP) neural network, random forest and three other logistic regression based classification models, using modularity, elastic net, and Lasso as penalty terms respectively4,25,26. These sparse logistic regression models were selected because they are designed for both classification and feature selection. Modularity and GraphNet constrained logistic regression (M-logistic) were implemented using MATLAB. Elastic net constrained logistic regression, traditional logistic regression with Lasso penalty, and random forest were implemented using the Python scikit-learn package. In addition, we evaluated the performance of both the baseline and fine-tuned TransFuse models. In the baseline model, initial weights were randomly assigned, whereas the fine-tuned model used weights initialized from the pre-trained modules. The ROS/MAP cohort includes 1717 subjects, of whom the 464 with complete multi-omics data were split into five folds for training and testing the baseline TransFuse and competing methods. The remaining subjects, missing at least one -omics type, were used for pre-training the individual modules. Classification accuracy and the Area Under the Receiver Operating Characteristic Curve (AUC) served as the primary evaluation metrics, complemented by F1 score, precision, and specificity for a more comprehensive performance assessment.
To ensure a robust and fair comparison, all methods underwent one single 5-fold cross-validation procedure for hyperparameter tuning and training, with identical training and test partitions. This approach was chosen because the limited sample size made nested cross-validation impractical, as smaller training folds led to unstable performance estimates. While we acknowledge that this may introduce some optimistic bias in the performance metrics, all competing methods were evaluated using the same single cross-validation strategy. Therefore, the comparisons between methods and the resulting conclusions should be fair and consistent. All methods were provided with the same input features, including genotype data for 822 SNPs, expression levels of 743 genes, and 186 peptides. Hyperparameters for all methods were carefully tuned across a comprehensive search space, detailed in Supplementary Table S3.
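One simple way to guarantee identical partitions across methods is to generate the fold assignment once with a fixed seed and reuse it everywhere. The sketch below assumes a plain (non-stratified) shuffled split, since the text does not say whether folds were stratified by diagnosis; `make_folds` and the seed value are illustrative.

```python
import random


def make_folds(n_subjects, k=5, seed=0):
    """Shuffle subject indices once and deal them into k disjoint test folds."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::k]) for i in range(k)]


n_complete = 464                      # subjects with complete multi-omics data
folds = make_folds(n_complete)

# (train, test) index pairs to be shared by every method being compared.
all_indices = set(range(n_complete))
splits = [(sorted(all_indices - set(test)), test) for test in folds]
```

With 464 subjects and five folds, four test folds contain 93 subjects and one contains 92. Passing the same `splits` to every method keeps the comparison on identical data.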
Model interpretation
TransFuse was interpreted in two ways. First, under the L1 penalty, most links in the graph fusion layers or modules receive zero coefficients. Since these links are uniquely mapped to the prior network, the weights within these modules can be used to prune the prior network. Second, each node obtains an importance score computed with integrated gradients27. This importance score measures the impact of a unit change in that node on the prediction outcome, providing a relative measure to prioritize the contribution of SNPs, genes, and proteins. The cut-off threshold for pruning prior network edges was set at 0.00003, determined by identifying the point at which the number of connected components increases significantly. This is based on the assumption that the top multi-omic features should be primarily functionally connected in a few large connected components, while less important ones are selected randomly and emerge as many small connected components or individual nodes. For the competing methods, features assigned a non-zero weight are regarded as contributing features.
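The threshold-selection heuristic described above can be illustrated with a small sketch: sweep candidate cut-offs, prune edges whose absolute weight falls below each cut-off, and count the connected components that remain. The toy network, node names, edge weights, and candidate thresholds below are all made up, and the sketch assumes that nodes left without any edge count as their own component; the exact procedure used in the study may differ.

```python
def count_components(nodes, edges, threshold):
    """Count connected components after dropping edges with |weight| < threshold.

    Uses a union-find structure; isolated nodes count as single components.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v, w in edges:
        if abs(w) >= threshold:
            parent[find(u)] = find(v)
    return len({find(n) for n in nodes})


# Hypothetical prior-network edges with learned weights (illustration only).
edges = [
    ("rs1", "GENE_A", 0.50),
    ("GENE_A", "PROT_A", 0.52),
    ("GENE_A", "GENE_B", 0.55),
    ("GENE_B", "PROT_B", 0.58),
    ("rs2", "GENE_B", 0.60),
    ("rs3", "GENE_C", 0.001),
    ("GENE_C", "PROT_B", 0.0),  # zeroed out by the L1 penalty
]
nodes = sorted({n for u, v, _ in edges for n in (u, v)})

thresholds = [0.0005, 0.01, 0.3, 0.51, 0.56]
sweep = [(t, count_components(nodes, edges, t)) for t in thresholds]
for t, n_components in sweep:
    print(f"threshold={t:<7} components={n_components}")
```

In this toy example the component count stays flat over a wide range of cut-offs and then rises once the cut-off starts breaking up the main module, which is the kind of transition the heuristic looks for.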
Finally, we performed an ablation analysis to evaluate the relative contribution of each -omics type within the fine-tuned TransFuse model. In this analysis, we systematically removed or retained one -omics type at a time and recorded the corresponding changes in model performance. This approach enabled us to assess the unique predictive value each data type contributes to the overall effectiveness of TransFuse.
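A simplified version of such an ablation loop is sketched below. It assumes that "removing" an -omics type means zeroing its input features for a fixed, already trained model, which is only one possible reading of the procedure; the feature layout `BLOCKS`, the linear scorer `WEIGHTS`, and the synthetic samples are hypothetical stand-ins rather than the TransFuse model.

```python
import random

# Hypothetical layout: index range of each -omics block in the input vector.
BLOCKS = {"SNP": (0, 2), "gene": (2, 4), "protein": (4, 5)}
# Toy linear scorer standing in for a trained model (illustration only).
WEIGHTS = [0.1, -0.1, 1.0, 0.5, 2.0]


def predict(x):
    """Predict 1 when the weighted sum of the features is positive."""
    return int(sum(w * v for w, v in zip(WEIGHTS, x)) > 0)


def mask(x, keep):
    """Zero every feature whose -omics block is not listed in `keep`."""
    out = list(x)
    for name, (start, stop) in BLOCKS.items():
        if name not in keep:
            out[start:stop] = [0.0] * (stop - start)
    return out


def accuracy(samples, labels, keep):
    hits = sum(predict(mask(x, keep)) == t for x, t in zip(samples, labels))
    return hits / len(labels)


rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0) for _ in WEIGHTS] for _ in range(400)]
# Labels are generated by the full toy model, so the baseline is perfect by design.
labels = [predict(x) for x in samples]

all_types = set(BLOCKS)
baseline = accuracy(samples, labels, all_types)
# Remove one -omics type at a time.
removed = {name: accuracy(samples, labels, all_types - {name}) for name in BLOCKS}
# Retain only one -omics type at a time.
retained = {name: accuracy(samples, labels, {name}) for name in BLOCKS}

for name in BLOCKS:
    print(f"{name}: drop when removed = {baseline - removed[name]:.3f}, "
          f"accuracy when retained alone = {retained[name]:.3f}")
```

Reporting both the drop after removal and the accuracy when a type is retained alone helps separate information that is unique to one -omics type from information that is shared across types.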
Replication analysis
The replication analysis used genotype, RNA-Seq gene expression, and protein expression data from the Mount Sinai Brain Bank (MSBB) cohort. Because protein quantification methods differed between the MSBB and the initial ROS/MAP cohort, direct peptide matching was not feasible. Consequently, the study included all peptides corresponding to the 126 unique proteins previously identified in ROS/MAP. In total, 107 peptides, 648 genes, and 695 SNPs across 159 MSBB participants were included. Detailed information on the MSBB cohort and the processing steps for genotype, gene expression, and protein expression data can be found in the supplementary text.
Overview of study cohorts and TransFuse architecture. (a) Pre-filtered multi-omic data collection in the ROS/MAP and MSBB cohorts. The image of the brain surface was generated using BrainNet Viewer 1.7 in MATLAB 2022b. (b) Architecture of the complete TransFuse model. Shaded areas are pre-trainable modules that require input of only a single -omics type.
Data availability
The source code is freely available through GitHub (https://github.com/JW-Yan/TransFuse). The data underlying this article is available via the AD Knowledge Portal (https://adknowledgeportal.org). The AD Knowledge Portal is a platform for accessing data, analyses, and tools generated by the Accelerating Medicines Partnership (AMP-AD) Target Discovery Program and other National Institute on Aging (NIA)-supported programs to enable open-science practices and accelerate translational learning. The data, analyses and tools are shared early in the research cycle without a publication embargo on secondary use. Data is available for general research use according to the following requirements for data access and data attribution (https://adknowledgeportal.synapse.org/Data%20Access). Data downloaded for the proposed analysis include: ROS/MAP protein expression (https://www.synapse.org/Synapse:syn21448467), ROS/MAP gene expression (https://www.synapse.org/Synapse:syn8456704), ROS/MAP genetic variants (https://www.synapse.org/Synapse:syn3221153), MSBB protein expression (https://www.synapse.org/Synapse:syn6100410), MSBB gene expression (https://www.synapse.org/Synapse:syn7391833), MSBB genetic variants (https://www.synapse.org/Synapse:syn11707204).
References
1. Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biol. 18, 1–15 (2017).
2. Fortelny, N. & Bock, C. Knowledge-primed neural networks enable biologically interpretable deep learning on single-cell sequencing data. Genome Biol. 21, 1–36 (2020).
3. Chandrashekar, P. B. et al. DeepGAMI: deep biologically guided auxiliary learning for multimodal integration and imputation to improve genotype-phenotype prediction. Genome Med. 15, 88 (2023).
4. Xie, L., He, B., Varathan, P. et al. Integrative-omics for discovery of network-level disease biomarkers: a case study in Alzheimer’s disease. Brief. Bioinform. 22, bbab121 (2021).
5. Flores, J. E. et al. Missing data in multi-omics integration: Recent advances through artificial intelligence. Front. Artif. Intell. 6, 1098308 (2023).
6. Wang, T. et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat. Commun. 12, 3445 (2021).
7. Thung, K.-H. et al. Neurodegenerative disease diagnosis using incomplete multi-modality data via matrix shrinkage and completion. NeuroImage 91, 386–400 (2014).
8. Hastie, T. et al. Imputing missing data for gene expression arrays. Technical report, Stanford University Statistics Department (1999).
9. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006).
10. Jeganathan, S. et al. Proline-directed pseudo-phosphorylation at AT8 and PHF1 epitopes induces a compaction of the paperclip folding of tau and generates a pathological (MC-1) conformation. J. Biol. Chem. 283, 32066–32076 (2008).
11. Qin, X., Wang, Y. & Paudel, H. K. Early growth response 1 (Egr-1) is a transcriptional activator of β-secretase 1 (BACE-1) in the brain. J. Biol. Chem. 291, 22276–22287 (2016).
12. O’Brien, R. J. & Wong, P. C. Amyloid precursor protein processing and Alzheimer’s disease. Annu. Rev. Neurosci. 34, 185–204 (2011).
13. Dzwonek, J. & Wilczynski, G. M. CD44: molecular interactions, signaling and functions in the nervous system. Front. Cell. Neurosci. 9, 175 (2015).
14. Duclot, F. & Kabbaj, M. The role of early growth response 1 (EGR1) in brain plasticity and neuropsychiatric disorders. Front. Behav. Neurosci. 11, 35 (2017).
15. Chu, E., Mychasiuk, R., Hibbs, M. L. & Semple, B. D. Dysregulated phosphoinositide 3-kinase signaling in microglia: shaping chronic neuroinflammation. J. Neuroinflammation 18, 1–17 (2021).
16. Takata, F., Nakagawa, S., Matsumoto, J. & Dohgu, S. Blood-brain barrier dysfunction amplifies the development of neuroinflammation: understanding of cellular events in brain microvascular endothelial cells for prevention and treatment of BBB dysfunction. Front. Cell. Neurosci. 15, 661838 (2021).
17. Li, Z. et al. Angiopoietin-2 blockade ameliorates autoimmune neuroinflammation by inhibiting leukocyte recruitment into the CNS. J. Clin. Investig. 130, 1977–1990 (2020).
18. Moore, A. M. et al. APOE ε4-specific associations of VEGF gene family expression with cognitive aging and Alzheimer’s disease. Neurobiol. Aging 87, 18–25 (2020).
19. Hohman, T. J. et al. The role of vascular endothelial growth factor in neurodegeneration and cognitive decline: exploring interactions with biomarkers of Alzheimer disease. JAMA Neurol. 72, 520–529 (2015).
20. Rosenthal, S. B. et al. Mapping the gene network landscape of Alzheimer’s disease through integrating genomics and transcriptomics. PLoS Comput. Biol. 18, e1009903 (2022).
21. Rosenberger, A. F. et al. Altered distribution of the EphA4 kinase in hippocampal brain tissue of patients with Alzheimer’s disease correlates with pathology. Acta Neuropathol. Commun. 2, 1–13 (2014).
22. Bennett, D. A., Schneider, J. A., Arvanitakis, Z. et al. Overview and findings from the Religious Orders Study. Curr. Alzheimer Res. 9, 628–645 (2012).
23. Bennett, D. A. et al. Religious Orders Study and Rush Memory and Aging Project. J. Alzheimers Dis. 64, S161–S189 (2018).
24. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
25. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 67, 301–320 (2005).
26. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Methodol. 58, 267–288 (1996).
27. Kokhlikyan, N., Miglani, V., Martin, M. et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896 (2020).
Acknowledgements
This research was supported by NIH grants R01 AG081951, R21 AG072101, U19 AG074879, U01 AG068057, and NSF 2345235, 1942394. The results published here are in whole or in part based on data obtained from the AD Knowledge Portal. ROS/MAP: Study data was provided by the Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago. Data collection was supported through funding by NIA grants P30AG10161, R01AG15819, R01AG17917, R01AG30146, R01AG36836, U01AG32984, U01AG46152, the Illinois Department of Public Health, and the Translational Genomics Research Institute.
Author information
Contributions
L.X.: Conceptualization, Methodology, Visualization, Formal analysis, Validation, Writing – original draft, Writing – review & editing. Y.R.: Methodology, Formal analysis, Validation. M.T.: Methodology, Formal analysis, Validation. K.N.: Data curation, Writing – review & editing. P.S.: Methodology, Supervision, Writing – review & editing. S.F.: Supervision, Writing – review & editing. A.S.: Data curation, Resources, Writing – review & editing. J.Y.: Conceptualization, Methodology, Visualization, Writing – original draft, Writing – review & editing, Supervision, Funding acquisition. All authors reviewed the manuscript.
Ethics declarations
Competing interests
Dr. Saykin has received support from Avid Radiopharmaceuticals, a subsidiary of Eli Lilly (in kind contribution of PET tracer precursor) and holds advisory roles with Siemens Medical Solutions USA, Inc., NIH NHLBI, and Eisai. His editorial commitments include serving as Editor-in-Chief for the journal “Brain Imaging and Behavior”, and he participates in various NIH/NIA advisory committees. All the remaining authors declare no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Xie, L., Raj, Y., Tong, M. et al. Deep fusion of incomplete multi-omic data for molecular mechanism of Alzheimer’s disease. Sci Rep 15, 30182 (2025). https://doi.org/10.1038/s41598-025-14636-2