Abstract
The integration of heterogeneous multi-omics datasets at a systems level remains a central challenge for developing analytical and computational models in precision cancer diagnostics. This paper introduces Multi-Omics Graph Kolmogorov–Arnold Network (MOGKAN), a deep learning framework that utilizes messenger-RNA, micro-RNA sequences, and DNA methylation samples together with Protein-Protein Interaction (PPI) networks for cancer classification across 31 different cancer types. The proposed approach combines differential gene expression with DESeq2, Linear Models for Microarray (LIMMA), and Least Absolute Shrinkage and Selection Operator (LASSO) regression to reduce multi-omics data dimensionality while preserving relevant biological features. The model architecture is based on the Kolmogorov–Arnold theorem principle and uses trainable univariate functions to enhance interpretability and feature analysis. MOGKAN achieves classification accuracy of 96.28% and exhibits low experimental variability in comparison to related deep learning-based models. The biomarkers identified by MOGKAN were validated as cancer-related markers through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. By integrating multi-omics data with graph-based deep learning, our proposed approach demonstrates robust predictive performance and interpretability with potential to enhance the translation of complex multi-omics data into clinically actionable cancer diagnostics.
Introduction
Cancer is a highly heterogeneous disease driven by genetic, epigenetic, and transcriptomic alterations. Advances in high-throughput sequencing technologies have enabled the generation of multi-omics datasets, offering deeper insights into the mechanisms underlying cancer development and patient outcomes. Among these, gene expression profiling plays a central role, allowing researchers to monitor gene activity within specific tissues and cell populations and to distinguish cancerous from healthy cells1. Messenger RNA (mRNA) levels reflect active gene transcription under specific conditions, providing valuable information on tumor progression and cellular behavior2. Genes often display altered expression patterns in tumors compared to healthy tissues, revealing key molecular changes associated with cancer3. Analyzing such patterns aids in identifying cancer-specific genes and discovering potential biomarkers for early detection. The integration of gene expression data with DNA methylation and microRNA (miRNA) expression profiles has further advanced cancer research4. Combining molecular layers uncovers complex regulatory interactions that contribute to tumorigenesis5. DNA methylation profiling highlights epigenetic modifications that can silence tumor suppressor genes or activate oncogenes6, while miRNA expression analysis reveals critical mechanisms of post-transcriptional gene regulation involved in cancer progression7.
Despite the wealth of information provided by multi-omics data, extracting meaningful insights remains a major challenge due to the high dimensionality, feature heterogeneity, and complexity of genomic structures8,9. Traditional machine learning approaches, such as Support Vector Machines (SVMs) and Random Forests (RF), have shown potential for multi-omics-based cancer classification but often struggle with modeling the complex relationships in high-dimensional datasets and providing interpretable results10,11.
Recent advances in deep learning, particularly Graph Neural Networks (GNNs), have demonstrated strong capabilities in capturing complex biological interactions12,13. Unlike conventional models that rely on Euclidean-based representations, GNNs naturally encode the relationships among genes, proteins, and regulatory elements within a graph structure, offering a biologically meaningful approach14,15. The Graph Kolmogorov-Arnold Network (GKAN) represents a significant advancement: by applying Kolmogorov-Arnold representation theory to graph learning, GKAN enhances both model interpretability and flexibility through the use of trainable univariate functions on graph edges16,17. Furthermore, the incorporation of spline-based transformations allows for precise feature extraction and greater transparency, making GKAN particularly well-suited for biomarker discovery in cancer diagnosis.
This article introduces the Multi-Omics Graph Kolmogorov–Arnold Network (MOGKAN), a deep learning framework that integrates graph-based modeling of mRNA, miRNA, and DNA methylation data to classify 31 distinct cancer types. Protein-Protein Interaction (PPI) network information is used for defining the graph structure of MOGKAN. The data preprocessing pipeline combines differential expression analysis, Linear Models for Microarray Analysis (LIMMA)18, and LASSO regression to extract the most informative multi-omics features. DESeq219 was applied to mRNA data to identify genes exhibiting significant changes in expression levels. For DNA methylation data, LIMMA employs empirical Bayes methods to stabilize variance estimates and improve the detection of differential signals, particularly for genes with low expression levels. This approach enables the identification of differentially methylated regions with high sensitivity and specificity, providing a robust foundation for epigenetic research and the discovery of novel cancer biomarkers.
The primary contributions of this work are as follows:
- Proposed MOGKAN, a novel deep learning framework for cancer classification with inherent feature interpretability through learnable activation functions.
- Constructed a graph-based model integrating a PPI network graph structure with multi-omics data from mRNA, miRNA, and DNA methylation profiles. The combined use of DESeq2, LIMMA, and LASSO enabled the selection of biologically relevant features critical for cancer classification.
- Identified key biomarkers driving cancer progression and validated their functional relevance through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses.
The rest of this paper is structured as follows. Section 2 discusses related work on graph-based network architectures and Kolmogorov–Arnold networks. Section 3 describes the datasets, preprocessing pipeline, multi-omics data integration, GKAN architecture, and experimental setup. Section 4 presents experimental results and analysis, and biomarker discovery. Section 5 concludes the paper.
Related works
Graph Neural Networks (GNNs) have demonstrated significant success in modeling structured data derived from graph-based relationships. While traditional GNNs offer strong predictive performance, they often face two major limitations: limited scalability and limited interpretability. To address these challenges, recent work has explored Kolmogorov–Arnold Networks (KAN) as an alternative architecture. GKAN extends this concept by integrating KAN into graph learning tasks, replacing conventional linear weights with trainable univariate functions, thereby enhancing both model interpretability and flexibility.
Recent studies have increasingly adopted GKAN to improve feature representation and model interpretability in graph-based learning tasks. Zhang et al.16 introduced GraphKAN, replacing standard activation functions with KAN-based structures to enhance feature extraction, demonstrating superior performance in both node and graph classification tasks compared to traditional GNN architectures. Similarly, Kiamari et al.20 incorporated spline-based activation functions between graph layers, achieving high performance across a variety of graph network types. Kolmogorov–Arnold Graph Neural Networks (KAGNNs) proposed by Bresson et al.21 extended message-passing operations using principles from the Kolmogorov–Arnold theorem to improve graph learning. Carlo et al.22 further refined GKAN by applying spline-based activation functions directly to graph edges, boosting both predictive accuracy and interpretability.
Beyond methodological improvements, several studies have successfully employed GKAN-based architectures in molecular and biomedical domains. Ahmed et al.23 demonstrated that GKANs can accurately predict small molecule–protein interactions, highlighting their potential in drug discovery. Li et al.24 developed GNN-SKAN, an architecture combining Swallow-KAN (SKAN) with basic GNNs, achieving state-of-the-art results across multiple molecular datasets. The increasing complexity of biomedical data in multi-omics cancer classification presents a compelling opportunity for GKAN to advance cancer-type prediction as a rapidly evolving research area. GKAN networks are particularly well-suited for applications that demand transparent, interpretable models, as they provide explicit insights into prediction decisions.
Recent approaches such as scRGCL25 and scMGATGRN26 combined GNNs with contrastive learning and multiview mechanisms for improved interpretability and learning high-order biological relationships in single-cell transcriptomic data. Furthermore, scAMZI27 presented attention-based autoencoders as an additional scRNA-seq clustering method, and iCRBP-LKHA28 used hybrid attention and deep convolutional kernels to predict circRNA-RBP interactions. While these models are employed for cell-type annotation or molecular interaction predictions, MOGKAN is scaled to multi-omics integration and cancer-type classification and enhances both performance and interpretability.
Materials and methods
Dataset
This study utilizes data from the Pan-Cancer Atlas database29, which comprises genomic, transcriptomic, and epigenomic information across a wide range of cancer types. Access to this comprehensive database is facilitated by Genomic Data Commons (GDC), which provides streamlined data retrieval through query tools, such as the TCGAbiolinks package29. Developed by the National Cancer Institute, GDC serves as a standardized platform that promotes collaborative cancer research by enabling consistent and cohesive data sharing. For the analysis in this work, the extracted data includes 9,171 DNA methylation samples, 10,668 mRNA expression samples, and 10,465 miRNA expression samples. The number of omics samples for 33 types of cancer and normal tissue is detailed in Table 1 (ref. 29). The compiled multi-omics data resource from the Pan-Cancer Atlas database enables the investigation of complex biological relationships and supports identification of biomarkers for studying tumor biology and clinical outcomes.
Data preprocessing
The data preprocessing pipeline in our work integrates dimensionality reduction techniques and feature selection methods to handle high-dimensional data and identify biologically relevant features in omics data. Specifically, we employed LIMMA and differential gene expression (DGE) analysis for feature selection. DGE analysis based on DESeq2 was applied to mRNA expression data, employing a negative binomial model to detect genes with significant expression changes30,31. LIMMA was used to analyze DNA methylation data and identify differentially methylated CpG sites31. To further reduce data dimensionality, we applied LASSO regression to mRNA and DNA methylation data32. The flow chart for data preprocessing is depicted in Fig. 1, with the following sections outlining the phases in the data processing workflow.
Differential gene expression (DGE) analysis
In genomics, DGE profiling is frequently used to compare the expression levels of genes in a particular organism under various settings or conditions (e.g., normal versus cancer, treatment versus control)33. The analysis helps elucidate gene regulation mechanisms, environmental influences on gene activity, and a variety of other underlying biological processes. In our investigation, we employed DESeq2 to perform differential gene expression analysis on the mRNA data. DESeq2 models gene-level count data using a negative binomial distribution, which effectively accounts for both biological variability and overdispersion. We assessed the statistical significance of gene expression changes using the Wald test, whose p-values indicate whether the estimated log fold changes differ significantly from zero. To identify genes potentially relevant to the biological processes under study, we applied a p-value threshold of 0.001.
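The Wald-test filtering step can be illustrated with a minimal Python sketch. It assumes that a log2 fold change and its standard error have already been estimated for each gene, and it computes a two-sided p-value against a standard normal reference. The gene names, numerical values, and the helper names `wald_p_value` and `select_significant` are hypothetical; the negative binomial fitting and dispersion estimation performed by DESeq2 are not reproduced here.

```python
import math


def wald_p_value(log2_fold_change, standard_error):
    """Two-sided p-value of the Wald statistic under a standard normal reference."""
    z = log2_fold_change / standard_error
    # For Z ~ N(0, 1), P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def select_significant(results, alpha=0.001):
    """Return the genes whose Wald p-value is below the chosen threshold."""
    return [gene for gene, lfc, se in results if wald_p_value(lfc, se) < alpha]


# Hypothetical per-gene estimates: (gene, log2 fold change, standard error).
toy_results = [
    ("GENE_A", 2.4, 0.5),    # z = 4.8, clearly significant
    ("GENE_B", 0.3, 0.4),    # z = 0.75, not significant
    ("GENE_C", -1.8, 0.45),  # z = -4.0, significant (down-regulated)
]
significant = select_significant(toy_results)
```

In this toy input, only the first and third genes pass the 0.001 threshold, because their fold changes are large relative to their standard errors.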
LIMMA
For differential methylation analysis, we applied the LIMMA technique by fitting a linear model to the methylation levels of CpG sites as a function of experimental sample groups34. The initial dataset derived from the Human Methylation 450K (HM450) array included 485,577 features across 9,171 samples35 (Table 2). Using LIMMA, we identified CpG sites that are significantly differentially methylated in tumor samples compared to normal controls. For each CpG site, LIMMA computes a moderated t-statistic and an effect size that captures the relative methylation differences between groups. The corresponding p-value indicates the statistical significance of each comparison. After applying a p-value cutoff of 0.05, the number of CpG features was reduced to 139,321, representing the most notable methylation alterations associated with the disease state.
LASSO Regression
LASSO regression is a linear regression technique that incorporates L1-norm regularization to enhance model performance and reduce overfitting. The algorithm minimizes the sum of squared residuals while imposing a penalty proportional to the absolute values of the model coefficients. This penalty encourages sparsity by shrinking some coefficients to zero, effectively performing feature selection by eliminating less important variables. In its standard form, the LASSO objective function is given by:
\[\hat{\beta} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_{i} - \sum_{j} x_{ij}\beta_{j} \Big)^{2} + \lambda \sum_{j} \left| \beta_{j} \right| \right\} \tag{1}\]
where \(y_{i}\) is the observed response variable for the \(i\)th sample, \(x_{ij}\) denotes the feature values, \(\beta_{j}\) are the regression coefficients, \(\lambda\) is the regularization parameter that controls the degree of sparsity, and \(n\) is the number of samples.
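The sparsity-inducing effect of the L1 penalty can be illustrated with a small coordinate-descent solver. The sketch below minimizes the sum of squared residuals plus \(\lambda\) times the sum of absolute coefficient values; it is a generic textbook procedure on toy data, not the solver used in this study, and the function names and values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize sum_i (y_i - sum_j x_ij * b_j)^2 + lam * sum_j |b_j|.

    Assumes every feature column contains at least one non-zero value.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            z = sum(X[i][j] ** 2 for i in range(n))
            # Closed-form minimizer of the one-dimensional subproblem.
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Toy data: the response depends only on the first feature (y = 2 * x0).
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
beta = lasso_coordinate_descent(X, y, lam=1.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

On this toy input the coefficient of the uninformative second feature is set exactly to zero, while the coefficient of the informative feature is shrunk slightly below its unpenalized value of 2.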
Multi-Omics data integration
To integrate mRNA (RNA-Seq), miRNA, and DNA methylation data into unified records, we used sample IDs as the linkage element. An inner join operation was performed on the common sample IDs across the three omics datasets, retaining only those samples that have complete data for all modalities. Cancer types lacking any of the omics layers were excluded from further analysis. Notably, two cancer types, LAML and GCT (Table 1), were excluded due to missing RNA-Seq and miRNA data, respectively. The final integrated dataset contains 8,464 samples spanning 31 cancer types and corresponding normal tissues, encompassing a total of 2,794 omics features (as summarized in Table 2).
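The inner join on sample IDs followed by feature concatenation can be sketched as follows. The sample IDs, feature values, and the function name `integrate_omics` are hypothetical; each omics table is assumed to be a mapping from sample ID to a list of already selected features.

```python
def integrate_omics(mrna, mirna, methylation):
    """Inner-join three {sample_id: features} tables and concatenate the features."""
    # Only samples present in all three modalities are retained.
    shared_ids = sorted(set(mrna) & set(mirna) & set(methylation))
    return {sid: mrna[sid] + mirna[sid] + methylation[sid] for sid in shared_ids}


# Toy tables: sample S2 has no miRNA profile and is therefore dropped.
mrna = {"S1": [5.1, 0.2], "S2": [3.3, 1.7], "S3": [4.0, 0.9]}
mirna = {"S1": [0.8], "S3": [0.1]}
methylation = {"S1": [0.45, 0.91], "S2": [0.30, 0.88], "S3": [0.52, 0.10]}

integrated = integrate_omics(mrna, mirna, methylation)
```

Each retained sample ends up with a single feature vector whose length is the sum of the per-modality feature counts, which corresponds to the early integration strategy.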
Besides the early integration strategy used here, in which multi-omics modalities are concatenated before being passed to a graph-based model, other integration strategies have been applied in prior work. Picard et al.36 categorized multi-omics integration into five types: early, mixed, intermediate, late, and hierarchical. These strategies involve trade-offs among model complexity, interpretability, and the ability to preserve modality-specific signals. Although simple and popular, early integration may not be ideal for handling heterogeneity and differences in feature dimensionality across omics layers. Conversely, mixed or intermediate integration can maintain modality-specific structures and enable more flexible modeling pipelines. Alternative strategies such as late or hierarchical integration, in which each omics type has its own encoder and the outputs are subsequently fused, may also enhance interpretability and predictive power. Different integration strategies will be investigated in future versions of our framework.
Graph Kolmogorov–Arnold networks (GKAN)
GKAN is a neural architecture that extends the Kolmogorov–Arnold representation theorem to graph-structured data, offering an alternative to traditional GNNs. Unlike traditional GNNs, which rely on message passing, GKAN uses functional decomposition to model interactions within graphs. By decomposing node and edge relationships into hierarchical, learnable transformations, GKAN can capture long-range dependencies, supported by the Kolmogorov–Arnold representation, while using learned edge activations to address the over-smoothing problem that often affects GNNs37. GKAN expresses multi-dimensional functions as summations of nonlinear one-dimensional functions, enabling the adaptive transformation of node embeddings based on information from neighboring nodes. This is grounded in the Kolmogorov–Arnold theorem, which states that any continuous multivariate function \(f: \mathbb{R}^{d} \to \mathbb{R}\) can be decomposed as:
\[f(x_{1}, \dots, x_{d}) = \sum_{q=0}^{2d} g_{q}\left( \sum_{p=1}^{d} h_{q,p}(x_{p}) \right) \tag{2}\]
where \(g_{q}\) and \(h_{q,p}\) are learnable nonlinear functions, \(x_{p}\) denotes the input features, and \(d\) is the input feature dimension.
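A small worked example may help to make the sum-of-univariate-functions structure concrete. The identity \(x_{1}x_{2} = \tfrac{1}{4}(x_{1}+x_{2})^{2} - \tfrac{1}{4}(x_{1}-x_{2})^{2}\) writes a genuinely bivariate function using only univariate outer functions applied to sums of univariate inner functions. The sketch below evaluates this hand-picked decomposition; it uses two outer terms for brevity rather than the number guaranteed by the theorem, and in a KAN the functions would be learned (for example, as splines) rather than fixed. The name `ka_evaluate` is hypothetical.

```python
def ka_evaluate(outer_fns, inner_fns, x):
    """Evaluate sum_q g_q( sum_p h_{q,p}(x_p) ) for an input vector x."""
    total = 0.0
    for g_q, h_q in zip(outer_fns, inner_fns):
        inner_sum = sum(h_qp(x_p) for h_qp, x_p in zip(h_q, x))
        total += g_q(inner_sum)
    return total


# Outer univariate functions g_1 and g_2.
outer_fns = [lambda u: u * u / 4.0, lambda u: -u * u / 4.0]
# Inner univariate functions h_{q,p}, one row per outer term q.
inner_fns = [
    [lambda a: a, lambda b: b],   # q = 1: inner sum is x1 + x2
    [lambda a: a, lambda b: -b],  # q = 2: inner sum is x1 - x2
]

product = ka_evaluate(outer_fns, inner_fns, [3.0, 5.0])
```

For the input (3, 5), the two outer terms evaluate to 16 and -1, so the decomposition recovers the product 15 without any explicitly bivariate operation.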
For a given graph \(G = (V, E)\), where \(V\) is the set of nodes and \(E\) is the set of edges, the node features \(h_{v}^{(l)}\) at a layer \(l\) are updated using:
In (3), \(h_{v}^{(l)}\) denotes the feature representation of node \(v\) at layer \(l\), \(\mathcal{N}(v)\) represents the set of neighboring nodes of \(v\), and \(g_{q}\) and \(h_{q,p}\) are trainable transformation functions applied to graph features. After several layers of hierarchical transformations, the final node representation is obtained as:
where \(\widehat{y}_{v}\) is the predicted class or regression output for node \(v\), \(\sigma\) is an activation function such as softmax (for classification) or sigmoid (for binary prediction), and \(L\) is the total number of layers in the network.
Graph structure
Protein-protein interactions are fundamental to biological systems, as they represent physical contacts or functional relationships between two or more protein molecules. The interactions play a central role in regulating cellular processes. In this study, we constructed a PPI network using the STRING database, which integrates both experimentally validated and computationally predicted protein interaction data38. STRING aggregates data from diverse biological sources, including high-throughput experimental assays, curated pathway databases, co-expression analyses, and text-mined associations extracted from scientific literature39. To enhance the reliability of interaction data, STRING assigns confidence scores to each interaction based on the strength and consistency of supporting evidence, thereby improving the robustness of biological network analyses37.
For our analysis, we focused on constructing a disease-specific protein network that connects proteins associated with multi-omics-derived genes, including mRNA, miRNA, and DNA methylation profiles. We utilized the STRING API (version 11.5) to automatically retrieve Homo sapiens (NCBI Taxonomy ID: 9606) protein interaction data. The results in TSV format were accompanied by confidence scores derived from co-expression data, experimental findings, curated databases, and literature mining. The resulting graph based on PPI networks was afterward used as input to the MOGKAN model, enabling biological graph representations that support cancer classification and biomarker discovery.
Experimental setting
The pipeline of the proposed framework is illustrated in Fig. 2. To construct the graph structure, we utilized a PPI-based edge index, derived by identifying highly interactive proteins within the PPI network. The selection is based on protein frequency counts: only proteins appearing at least 200 times in the dataset are retained, ensuring the inclusion of biologically significant hub proteins with prominent roles in cellular function and interaction networks. This filtering step also controls the size of the graph and the amount of data it contains. Excluding weakly connected proteins removes isolated nodes from the graph, resulting in a more cohesive and interpretable graph structure.
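The frequency-based hub selection can be sketched in a few lines of Python. The sketch assumes that "frequency" means the number of interactions in which a protein appears in the PPI edge list, which is one plausible reading of the filtering rule; the toy edges, the lowered threshold, and the function name `build_edge_index` are hypothetical.

```python
from collections import Counter


def build_edge_index(edges, min_count=200):
    """Keep proteins seen in at least min_count interactions and index their edges."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    hubs = sorted(protein for protein, c in counts.items() if c >= min_count)
    node_id = {protein: i for i, protein in enumerate(hubs)}
    sources, targets = [], []
    for a, b in edges:
        # An edge survives only if both endpoints are retained hubs.
        if a in node_id and b in node_id:
            sources.append(node_id[a])
            targets.append(node_id[b])
    return hubs, [sources, targets]


# Toy interaction list; the threshold is lowered to 2 so the example stays small.
toy_edges = [
    ("TP53", "MDM2"),
    ("TP53", "EGFR"),
    ("EGFR", "MDM2"),
    ("TP53", "BRCA1"),
    ("KRAS", "EGFR"),
]
hubs, edge_index = build_edge_index(toy_edges, min_count=2)
```

The returned pair of index lists follows the two-row source/target layout commonly used for graph edge indices; proteins that fall below the threshold are removed together with all of their edges.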
The layers in the MOGKAN architecture are depicted in Fig. 2. Multi-omics data for each sample, including mRNA, miRNA, and DNA methylation, are integrated into a unified feature vector that is assigned to the corresponding nodes in the graph. Information is propagated across the network using GATConv layers40, which dynamically assign weights to neighboring nodes based on an attention mechanism learned during training. This allows the model to prioritize the most informative interactions by computing relevance scores for each neighbor. Through multiple GAT layers, the network iteratively refines node representations by aggregating attention-weighted messages, effectively capturing both biological signals and meaningful connectivity patterns. The resulting node embeddings encode local gene-specific characteristics while also reflecting the broader structure of PPI networks.
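The attention-weighted aggregation described above can be illustrated for a single attention head in plain Python. The sketch follows the standard graph attention formulation (a score from a LeakyReLU of a learned vector applied to the concatenated transformed features, normalized with a softmax over each node's neighbors); it is not the GATConv implementation used in this work, and the weight matrix `W`, attention vector `a`, and toy graph are hypothetical fixed values rather than learned parameters.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU with a negative slope of 0.2."""
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (given as a list of rows) by vector h."""
    return [sum(w * v for w, v in zip(row, h)) for row in W]


def gat_head(features, neighbors, W, a):
    """One attention head: returns updated node features and attention weights."""
    z = {node: matvec(W, h) for node, h in features.items()}
    attention, updated = {}, {}
    for v, nbrs in neighbors.items():
        # Unnormalized score for each neighbor u: LeakyReLU(a . [z_v || z_u]).
        scores = [
            leaky_relu(sum(a_k * c_k for a_k, c_k in zip(a, z[v] + z[u])))
            for u in nbrs
        ]
        # Numerically stable softmax over the neighborhood of v.
        shift = max(scores)
        exps = [math.exp(s - shift) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[v] = dict(zip(nbrs, alphas))
        # New representation: attention-weighted sum of transformed neighbors.
        updated[v] = [
            sum(alpha * z[u][k] for alpha, u in zip(alphas, nbrs))
            for k in range(len(z[v]))
        ]
    return updated, attention


features = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [1.0, 1.0]}
neighbors = {"g1": ["g1", "g2", "g3"], "g2": ["g1", "g2"], "g3": ["g1", "g3"]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, 0.5, 1.0, -1.0]
updated, attention = gat_head(features, neighbors, W, a)
```

With multiple heads, the same computation is repeated with separate parameters per head, and the head outputs are then combined (for example, by concatenation or summation).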
To tune hyperparameters, we employed a grid search over the learning rate, weight decay, dropout rate, and number of attention heads. Model training was performed using the Adam optimizer over 100 epochs. Table 3 presents the top 10 hyperparameter combinations, ranked by mean accuracy and F1-score, obtained during grid search optimization of the MOGKAN model on multi-omics data. The search varied the main architectural and training settings, including the hidden dimension size, number of attention heads, number of hidden layers, dropout rate, learning rate, and L2 regularization strength. The model achieved consistently high performance across multiple settings, obtaining mean accuracies of 96.1% and F1-scores exceeding 95%. The best configuration used a hidden dimension of 2048, four attention heads, two hidden layers, a dropout rate of 0.2, a learning rate of 0.0001, and an L2 penalty of 0.0001, yielding an accuracy of 96.17% and an F1-score of 95.12%.
As depicted in Fig. 2, the framework employs 5-fold cross-validation to ensure robust evaluation across different data splits. The multi-omics data are processed by two consecutive GAT layers. The first GAT layer applies four attention heads, each producing a 2048-dimensional representation, followed by a second GAT layer that sums the outputs, with LeakyReLU as its activation. The output is then passed through three Kolmogorov–Arnold Network (KAN) layers that apply nested nonlinear transformations (ψ → tanh → ϕ → tanh), each followed by batch normalization, LeakyReLU activation, and dropout for regularization. Feature dimensions are progressively reduced from the initial hidden size to 1024 and then to 512. Lastly, a linear classifier maps the resulting features to 32 classes, corresponding to the 31 cancer types and normal tissue. This hybrid design enables MOGKAN to capture both topological dependencies and complex nonlinear patterns in multi-omics cancer data.
Performance evaluation was conducted using standard classification metrics, including accuracy, precision, recall, and F1-score, averaged across the folds of cross-validation. To interpret the model's predictions, feature importance was assessed based on the attention weights in the model's first GAT layer. For every attention head, the model learns a set of coefficients that determine the importance of neighboring genes (or nodes) when aggregating data. To estimate the importance of each gene, we averaged the attention scores assigned to it across the first GAT layer. Rather than relying on a single head, this approach captures a comprehensive view of the model's attention mechanism. After training, we obtained the average attention weights assigned to each node across all samples and heads. Genes that consistently received high attention scores across different samples were considered more influential, as they contributed more significantly to feature propagation and decision-making within the graph. The most influential features were mapped to their corresponding genes using a BioMart query, providing enhanced biological context and insight into their relevance in cancer classification.
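The averaging of attention scores over samples and heads can be sketched as follows. The nested-list layout of the attention scores, the node names, and the function names are assumptions made for illustration; in practice the scores would be extracted from the trained first GAT layer.

```python
def mean_node_attention(attention_per_sample):
    """Average the attention received by each node over all samples and heads.

    attention_per_sample[s][h] is a dict mapping node -> attention score
    for sample s and attention head h.
    """
    totals, counts = {}, {}
    for heads in attention_per_sample:
        for head in heads:
            for node, score in head.items():
                totals[node] = totals.get(node, 0.0) + score
                counts[node] = counts.get(node, 0) + 1
    return {node: totals[node] / counts[node] for node in totals}


def top_k(scores, k):
    """Return the k nodes with the highest mean attention (ties broken by name)."""
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]


# Toy scores for two samples and two heads over three nodes.
toy_attention = [
    [{"A": 0.6, "B": 0.3, "C": 0.1}, {"A": 0.5, "B": 0.4, "C": 0.1}],
    [{"A": 0.7, "B": 0.2, "C": 0.1}, {"A": 0.2, "B": 0.6, "C": 0.2}],
]
mean_attention = mean_node_attention(toy_attention)
most_influential = top_k(mean_attention, 2)
```

Averaging over heads prevents a single head with an unusual attention pattern from dominating the ranking, as illustrated by node B, which is favored by only one of the four head-sample combinations.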
Performance metrics
For model evaluation, we applied standard performance metrics for multi-class classification tasks, including accuracy, precision, recall, and F1-score. Accuracy measures the overall correctness of the model and is defined through the following equation:
Macro-averaging enables calculation of precision, recall, and F1-score per class before computing their collective average without preference to any class, as follows.
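These metrics can be computed directly from the true and predicted labels, as in the minimal sketch below. It uses the usual definitions (accuracy as the fraction of correct predictions; per-class precision, recall, and F1-score from true positives, false positives, and false negatives; macro averages as unweighted means over classes). The labels and the function name `classification_metrics` are hypothetical, and assigning 0.0 to a metric with a zero denominator is a convention assumed here.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy and macro-averaged precision, recall, and F1-score."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    precisions, recalls, f1_scores = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    # Macro averaging: every class contributes equally, regardless of its size.
    n_classes = len(classes)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1_scores) / n_classes,
    }


y_true = ["BRCA", "BRCA", "LUAD", "LUAD", "PRAD", "PRAD"]
y_pred = ["BRCA", "LUAD", "LUAD", "LUAD", "PRAD", "BRCA"]
metrics = classification_metrics(y_true, y_pred)
```

Because the macro F1-score here is the mean of the per-class F1-scores, it generally differs from the harmonic mean of the macro precision and macro recall.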
Results and discussion
The performance of the proposed MOGKAN framework evaluated using 5-fold cross-validation is summarized in Table 4. MOGKAN achieved a classification accuracy of 96.28% across 32 classes (31 cancer types and normal tissue) by integrating mRNA, miRNA, and DNA methylation data. The results demonstrate performance improvement ranging from 1.58 to 7.30% compared to related works employing deep learning architectures based on Convolutional Neural Network (CNN), Graph Convolutional Neural Network (GCNN), and Graph Transformer Network (GTN). In particular, Mostavi et al.41 achieved 95.70% accuracy using a CNN-based model, Ramirez et al.42 reported 94.61% accuracy with a GCNN-PPI approach, and Kaczmarek et al.43 implemented a GTN model that achieved 93.56% accuracy. Moreover, MOGKAN exhibits improved reliability, as evidenced by its low standard deviation of ± 0.0035 across the folds, which stands out against the variability in the results of related works.
Table 5 presents the experimental evaluation of MOGKAN with single-omics and multi-omics data for classification of 31 cancer types and normal tissues. Among the single-omics inputs, the model trained with mRNA data achieved the highest accuracy of 0.9562 along with 0.9524 precision, 0.9357 recall, and 0.9414 F1-score. The multi-omics MOGKAN model trained with combined DNA methylation and miRNA data performed similarly to the single-omics models, although the results remain slightly below top performance levels. The combined use of mRNA, DNA methylation, and miRNA data resulted in the most effective performance including 0.9628 accuracy, 0.9582 precision, 0.9445 recall, and 0.9489 F1-score. These results confirm that the integration of multiple omics modalities enhances the performance across all evaluation metrics.
Notably, the removal of miRNA data resulted in the most significant drop in model performance, compared to excluding either mRNA or DNA methylation data. While mRNA is typically considered central to assessing gene expression and tumor signals, several biological and technical factors explain the impact of miRNA. Biologically, miRNAs play a critical role in post-transcriptional gene regulation, often acting as oncogenes or tumor suppressors. Dysregulation of miRNAs can influence the expression of numerous genes simultaneously, making them key indicators of cancer progression and subtype differentiation. Additionally, miRNAs are fewer in number and exhibit more stable expression patterns than mRNAs, which may help the model identify more robust and generalizable patterns. From a technical standpoint, the attention mechanism of the MOGKAN model assigns dynamic weights to input features. The stronger influence of miRNA features suggests that they contributed more prominently during training, allowing the model to focus on their informative value more effectively.
To further evaluate the generalization capability of MOGKAN, we performed a type-blind evaluation that withheld TCGA-LUAD, TCGA-LUSC, and TCGA-PRAD during training and used them only for testing. Table 6 demonstrates that the model maintains good predictive performance even in this rigorous evaluation. With the TCGA-PRAD cancer type the model achieved an accuracy of 0.9847, precision of 0.9973, and recall of 0.9620, indicating excellent generalization. With the TCGA-LUAD cancer type the model also showed robust performance with an accuracy of 0.9590, and an F1 score of 0.9393. Although the model performance on TCGA-LUSC was comparatively lower (accuracy = 0.9237, recall = 0.7945), it still reflects effective model generalization under type-blind conditions.
Table 7 lists the top 10 biomarkers identified by the MOGKAN framework based on feature importance. The importance of each feature was quantified using the absolute sum of weights from the model’s linear transformation layer, enabling a discriminative selection of key biomarkers. BioMart was employed to map the features to their corresponding gene identifiers, enhancing biological interpretability. The ten identified biomarkers MCL1, LINC01410, GALNT6, MAML3, ITGB3, LINC01090, PKDCC, PCAT14, KIF16B, and PITPNM3 showed cancer-specific functional patterns that align with known mechanisms of carcinogenesis. For instance, MCL1 was reported to contribute to therapy resistance in breast cancer by regulating mitochondrial oxidative phosphorylation activity (PMID: 28978427)44. GALNT6 and PITPNM3 have been identified as dual-function proteins promoting both epithelial-mesenchymal transition and immune evasion (PMIDs: 39245709, 21481794)45,46. Several long non-coding RNAs, including LINC01410 (PMID: 32104067), LINC01090 (PMID: 34550610), and PCAT14 (PMID: 35003397), were found to participate in ceRNA regulatory networks. Notably, PCAT14 exhibited the highest diagnostic precision for prostate cancer47,48,49. Furthermore, ITGB3 and KIF16B were implicated in extracellular vesicle-mediated communication, with evidence supporting their role as potential biomarkers for metastatic colorectal cancer (PMIDs: 37040507, 35487942)50,51. The activity of MAML3 is regulated by hypoxia-inducible factors, activating Hedgehog (HH) and NOTCH pathways in gallbladder cancer (GBC), thereby promoting tumor growth, migration, and invasion, while also enhancing sensitivity to gemcitabine (PMID: 37351966)52. Lastly, PKDCC has been linked to non–small cell lung cancer progression (PMID: 35847849)53.
Figures 3 and 4 depict the top 10 enriched Gene Ontology (GO) terms identified through MOGKAN analysis, highlighting molecular systems that contribute to multi-cancer classification. Figure 3 shows the top 10 enriched GO terms for biological processes. Each horizontal bar corresponds to a GO term, and the length of the bar indicates the degree of enrichment: the longer the bar, the more enriched the term. The top 10 terms associated with our genes include positive regulation of respiratory burst, regulation of respiratory burst, and apoptotic cell clearance, which are biological processes significantly overrepresented in the analyzed set of top 50 genes. These GO terms reflect the role of reactive oxygen species (ROS) in modulating tumor microenvironment dynamics and immunotherapy outcomes54. Overall, the enriched GO terms validate the biological relevance of MOGKAN’s graph-based integration of multi-omics data, reinforcing its ability to uncover functionally significant pathways involved in cancer development and classification.
Figure 4 depicts the top 10 significantly enriched Gene Ontology (GO) molecular functions, primarily focused on lipid binding and ion channel regulation, suggesting roles in cellular signaling and membrane dynamics. The prominence of phosphatidylinositol (PI) binding terms—such as Phosphatidylinositol-3,5-Bisphosphate Binding (GO:0080025) and Phosphatidylinositol-3-Phosphate Binding (GO:0032266)—indicates involvement in phosphoinositide signaling, a pathway critical for membrane trafficking, autophagy, and cell survival. These findings align with established research indicating that dysregulation of phosphatidylinositol metabolism is prevalent across various cancers, as it activates the PI3K–AKT–mTOR signaling pathway55 and contributes to treatment resistance53.
Figure 5 presents the KEGG pathways enriched in the cancer-related gene set56,57,58, ranked by statistical significance using –log₁₀(p-value) scores. The “Mucin type O-glycan biosynthesis” pathway emerged as the most significantly enriched, highlighting its role in altering glycosylation patterns on tumor cells, a known contributor to cancer progression59,60. Closely following is the “Sphingolipid metabolism” pathway, which supports tumor cell survival and resistance to therapeutic agents61. The “Prolactin signaling pathway” ranks next in significance and is particularly relevant for its involvement in breast cancer regulation62. Additionally, approximately 15% of all cancer-associated genes in the dataset are mapped to the “PI3K-Akt signaling pathway”, a central regulator of cellular proliferation and apoptosis that is frequently dysregulated in cancer63. Further down the ranking, the “Rap1 signaling pathway” was identified, underscoring its role in cell adhesion and metastasis, which aligns with the invasive phenotypes observed in several cancers64. The pathways “Aldosterone-regulated sodium reabsorption”, “Type I diabetes mellitus”, and “Maturity onset diabetes of the young” were also enriched, suggesting shared metabolic disruptions between cancer and diabetes65,66. While several pathways show moderate enrichment with –log₁₀(p-value) scores around 0.6, “Mucin type O-glycan biosynthesis” exhibits the strongest enrichment signal. Collectively, these results reveal biological pathways that describe mechanisms of cancer development, particularly those related to glycosylation events, lipid metabolism, and growth factor signaling.
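To make the –log₁₀(p-value) scale used for the ranking concrete, the following short sketch converts between p-values and scores and sorts pathways by score. The p-values in the example are hypothetical placeholders, not the values behind Figure 5, and the helper names are illustrative.

```python
import math


def neg_log10(p_value):
    """Convert a p-value to its -log10 score (larger = more significant)."""
    return -math.log10(p_value)


def p_from_score(score):
    """Invert the transformation: recover the p-value from a -log10 score."""
    return 10 ** (-score)


def rank_pathways(p_values):
    """Order pathway names from the highest to the lowest -log10(p) score."""
    return sorted(p_values, key=lambda name: neg_log10(p_values[name]), reverse=True)


# Hypothetical p-values, used only to illustrate the ranking.
example = {"pathway_x": 0.20, "pathway_y": 0.01, "pathway_z": 0.05}
print(rank_pathways(example))

# A score of 0.6 on this scale corresponds to a p-value of about 0.25.
print(round(p_from_score(0.6), 3))
```

Because the transformation is monotonically decreasing in p, ranking by score is equivalent to ranking by ascending p-value.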
Domain-specific deep learning applications across a variety of biological questions underscore the importance of focused, explainable models. For example, ferroptosis-associated lncRNAs were reported to be strong predictors of prognosis and immunotherapy efficacy in NSCLC67. Similarly, machine learning has been used to reconstruct features for organelle protein classification68 and to diagnose cleft lip and palate through imaging-based ML models69. These studies demonstrate the growing range of applications of interpretable models in clinical diagnostics, which is consistent with the goal of MOGKAN to discover biologically relevant cancer biomarkers using graph-based learning.
Limitations and future work
The proposed MOGKAN framework has several limitations. First, it relies on static PPI network data from the STRING database, which may lack dynamic or condition-specific protein interactions. Incorporating tissue-specific or context-aware interaction networks could strengthen the biological relevance of the constructed network. Second, while the current model integrates mRNA, miRNA, and DNA methylation data, it omits other valuable omics layers such as proteomics, metabolomics, and copy number variation, which could provide complementary biological insights. Third, we used an early integration strategy, where multi-omics data (mRNA, miRNA, and methylation) are concatenated into a single feature vector prior to graph modeling. While this approach simplifies representation learning, it may obscure modality-specific characteristics and interactions.
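The early integration strategy mentioned in the third limitation can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: each modality is assumed to be a dictionary from sample identifier to a list of feature values, and the sample identifiers and values are hypothetical.

```python
def early_integration(mrna, mirna, methylation):
    """Concatenate per-sample features from three omics layers.

    Only samples present in all three modalities are kept, and the
    concatenation order (mRNA, miRNA, methylation) is fixed so that
    feature positions are consistent across samples.
    """
    shared = mrna.keys() & mirna.keys() & methylation.keys()
    return {s: mrna[s] + mirna[s] + methylation[s] for s in sorted(shared)}


# Hypothetical toy data: 2 mRNA, 1 miRNA and 2 methylation features.
mrna = {"sample_1": [2.1, 0.3], "sample_2": [1.8, 0.9]}
mirna = {"sample_1": [5.0], "sample_2": [4.2]}
methylation = {"sample_1": [0.12, 0.80], "sample_3": [0.33, 0.41]}

integrated = early_integration(mrna, mirna, methylation)
print(integrated)  # only sample_1 appears in all three modalities
```

After concatenation the model sees a single vector per sample, so the information about which modality a feature came from is only implicit in its position; this is the loss of modality-specific structure that the limitation refers to.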
For future work, we plan to extend the framework by incorporating a broader range of multi-omics data and utilizing dynamic and context-specific interaction networks to enhance model performance and biological interpretability. In addition, implementing attention mechanisms to weigh the contributions of different omics features may further improve predictive accuracy. We will also explore late integration strategies, such as modality-specific encoders followed by attention-based or gating fusion mechanisms, which may better capture complementary signals across omics layers and improve both performance and interpretability.
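A late integration scheme of the kind outlined above could look like the following minimal sketch. It is one possible design, not a planned implementation: the encoders are plain linear maps, the gate scores are fixed numbers rather than learned parameters, and all names and values are hypothetical.

```python
import math


def encode(features, weights):
    """Modality-specific linear encoder: one output per weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def softmax(scores):
    """Numerically stable softmax, used to turn gate scores into weights."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(embeddings, gate_scores):
    """Fuse same-sized embeddings as a gate-weighted sum."""
    gates = softmax(gate_scores)
    dim = len(embeddings[0])
    fused = [sum(g * emb[d] for g, emb in zip(gates, embeddings)) for d in range(dim)]
    return fused, gates


# Hypothetical inputs: each modality has its own encoder to a 2-d embedding.
mrna_emb = encode([2.0, 1.0], [[0.5, 0.0], [0.0, 0.5]])
mirna_emb = encode([4.0], [[0.25], [0.5]])
meth_emb = encode([0.2, 0.8], [[1.0, 1.0], [1.0, -1.0]])

fused, gates = gated_fusion([mrna_emb, mirna_emb, meth_emb], [0.0, 0.0, 0.0])
print(fused, gates)
```

In a trained model the gate scores would be produced per sample by a small network, and the resulting gate values could be inspected to see how much each omics layer contributed to a prediction.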
Conclusion
This study introduces MOGKAN, a novel deep learning framework for accurate and interpretable cancer classification using multi-omics data. The approach integrates a three-step data preprocessing pipeline, combining DESeq2, LIMMA, and LASSO regression to preserve key biological signals while reducing dimensionality. By fusing DNA methylation, miRNA, and mRNA data with Protein-Protein Interaction (PPI) networks, MOGKAN achieves a classification accuracy of 96.28% across 31 cancer types. Through the application of the Kolmogorov–Arnold theorem, the framework extracts hierarchical features that enhance both predictive performance and biological interpretability. Key biomarkers identified by MOGKAN, including MCL1, GALNT6, and ITGB3, were validated through GO and KEGG pathway analyses, confirming their involvement in critical processes such as PI3K-AKT signaling, lipid metabolism, and immune evasion. These findings demonstrate the capability of the proposed framework to uncover fundamental molecular drivers of cancer and support its potential for clinical application in personalized cancer therapy.
Data availability
The datasets generated and/or analyzed during the current study are publicly available at https://www.idahofallshighered.org/vakanski/Codes_Data/mRNA_miRNA_Meth_integrated.csv
References
Narrandes, S. & Xu, W. Gene expression detection assay for cancer clinical use. J. Cancer. 9, 2249 (2018).
Singh, K. P., Miaskowski, C., Dhruva, A. A., Flowers, E. & Kober, K. M. Mechanisms and measurement of changes in gene expression. Biol. Res. Nurs. 20, 369–382 (2018).
Li, M., Sun, Q. & Wang, X. Transcriptional landscape of human cancers. Oncotarget 8, 34534 (2017).
Heo, Y. J., Hwa, C., Lee, G. H., Park, J. M. & An, J. Y. Integrative multi-omics approaches in cancer research: from biological networks to clinical subtypes. Mol. Cells. 44, 433–443 (2021).
Menyhárt, O. & Győrffy, B. Multi-omics approaches in cancer research with applications in tumor subtyping, prognosis, and diagnosis. Comput. Struct. Biotechnol. J. 19, 949–960 (2021).
Geissler, K. et al. The role of aberrant DNA methylation in cancer initiation and clinical impacts. Ther Adv. Med. Oncol 16, 1–23 (2024).
Ankasha, S. J., Shafiee, M. N., Wahab, N. A., Ali, R. A. & Mokhtar, N. M. Post-transcriptional regulation of MicroRNAs in cancer: from prediction to validation. Oncol Rev 12, 1–6 (2018).
Chai, H. et al. Integrating multi-omics data through deep learning for accurate cancer prognosis prediction. Comput. Biol. Med. 134, 104481 (2021).
Bersanelli, M. et al. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinform. 17, 167 (2016).
Ballard, J. L. et al. Deep learning-based approaches for multi-omics data integration and analysis. BioData Min. 17, 38 (2024).
Way, G. P. & Greene, C. S. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac. Symp. Biocomput. 80–91 (2018).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. Preprint at https://arxiv.org/abs/1609.02907 (2016).
Hamilton, W. L. Graph Representation Learning (Morgan & Claypool, 2020).
Velickovic, P. et al. Graph attention networks. Preprint at https://arxiv.org/abs/1710.10903 (2018).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? Preprint at https://arxiv.org/abs/1810.00826 (2019).
Kiamari, M., Kiamari, M. & Krishnamachari, B. GKAN: Graph Kolmogorov-Arnold Networks. Preprint at https://arxiv.org/abs/2406.06470 (2024).
Zhang, F. & Zhang, X. GKAN: Enhancing Feature Extraction with Graph Kolmogorov Arnold Networks. Preprint at https://arxiv.org/abs/2406.13597 (2024).
Ritchie, M. E. et al. LIMMA: Linear Models for Microarray and RNA-Seq Data Analysis, version 3.58.1. (2024). https://bioconductor.org/packages/limma
Love, M. I., Huber, W. & Anders, S. DESeq2: differential expression analysis based on the negative binomial distribution, version 1.42.0. (2024). https://bioconductor.org/packages/DESeq2
Bresson, R. et al. KAGNNs: Kolmogorov-Arnold Networks Meet Graph Learning. Preprint at https://arxiv.org/abs/2406.18380 (2024).
Carlo, G. D., Mastropietro, A. & Anagnostopoulos, A. Kolmogorov-Arnold Graph Neural Networks. Preprint at https://arxiv.org/abs/2406.18354 (2024).
Ahmed, T. & Sifat, M. H. GraphKAN: Graph Kolmogorov Arnold Network for Small Molecule-Protein Interaction Predictions. In ICML’24 Workshop ML for Life and Material Science: From Theory to Industry Applications (2024).
Li, R., Li, M., Liu, W. & Chen, H. GNN-SKAN: Harnessing the Power of SwallowKAN to Advance Molecular Representation Learning with GNNs. Preprint at https://arxiv.org/abs/2408.01018 (2024).
Yuan, L. et al. ScRGCL: a cell type annotation method for single-cell RNA-seq data using residual graph convolutional neural network with contrastive learning. Brief. Bioinform. 26 (1), bbae662 (2025).
Yuan, L. et al. scMGATGRN: a multiview graph attention network–based method for inferring gene regulatory networks from single-cell transcriptomic data. Brief. Bioinform. 25 (6), bbae526 (2024).
Yuan, L., Xu, Z., Meng, B. & Ye, L. ScAMZI: attention-based deep autoencoder with zero-inflated layer for clustering scRNA-seq data. BMC Genom. 26 (1), 350 (2025).
Yuan, L. et al. iCRBP-LKHA: large convolutional kernel and hybrid channel-spatial attention for identifying circRNA-RBP interaction sites. PLoS Comput. Biol. 20 (8), e1012399 (2024).
Weinstein, J. N. et al. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R Stat. Soc. Ser. B Stat. Methodol. 58, 267–288 (1996).
Rapaport, F. et al. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 14, 1–13 (2013).
Chen, J. et al. An epigenome-wide analysis of socioeconomic position and tumor DNA methylation in breast cancer patients. Clin. Epigenetics. 15, 68 (2023).
Pidsley, R. et al. A data-driven approach to preprocessing illumina 450K methylation array data. BMC Genom. 14, 1–10 (2013).
Picard, M., Scott-Boyer, M. P., Bodein, A., Périn, O. & Droit, A. Integration strategies of multi-omics data for machine learning analysis. Comput. Struct. Biotechnol. J. 19, 3735–3746 (2021).
Liu, Z. et al. KAN: Kolmogorov-Arnold Networks. Preprint at https://arxiv.org/abs/2404.19756 (2024).
Jensen, L. J. et al. STRING 8—a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 37, D412–D416 (2009).
Szklarczyk, D. et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 49, D605–D612 (2021).
Franceschini, A. et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2012).
PyTorch Geometric. Graph Attention Convolution (GATConv) layer, version 2.4.0. (2024). https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GATConv.html
Mostavi, M., Chiu, Y. C., Huang, Y. & Chen, Y. Convolutional neural network models for cancer type prediction based on gene expression. BMC Med. Genomics. 13, 1–13 (2020).
Ramirez, R. et al. Classification of cancer types using graph convolutional neural networks. Front. Phys. 8, 203 (2020).
Kaczmarek, E. et al. Multi-omic graph Transformers for cancer classification and interpretation. Pac. Symp. Biocomput. 2022, 373–384 (2021).
Lee, K. M. et al. MYC and MCL1 cooperatively promote chemotherapy-resistant breast cancer stem cells via regulation of mitochondrial oxidative phosphorylation. Cell. Metab. 26, 633–647 (2017).
Sun, X. et al. GALNT6 promotes bladder cancer malignancy and immune escape by epithelial-mesenchymal transition and CD8+ T cells. Cancer Cell. Int. 24, 308 (2024).
Chen, J. et al. CCL18 from tumor-associated macrophages promotes breast cancer metastasis via PITPNM3. Cancer Cell. 19, 541–555 (2011).
Liu, F. & Wen, C. LINC01410 knockdown suppresses cervical cancer growth and invasion via targeting miR-2467-3p/VOPP1 axis. Cancer Manag Res 12, 855–861 (2020).
Chen, Y., Zhang, X., Li, J. & Zhou, M. Immune-related eight-lncRNA signature for improving prognosis prediction of lung adenocarcinoma. J Clin. Lab. Anal 35, 1–10 (2021).
Yan, Y., Liu, J., Xu, Z., Ye, M. & Li, J. lncRNA PCAT14 is a diagnostic marker for prostate cancer and is associated with immune cell infiltration. Dis. Markers 9494619 (2021).
Guo, W. et al. Single-exosome profiling identifies ITGB3+ and ITGAM+ exosome subpopulations as promising early diagnostic biomarkers and therapeutic targets for colorectal cancer. Res 6, 0041 (2023).
Zhao, Q. et al. Comprehensive profiling of 1015 patients’ exomes reveals genomic-clinical associations in colorectal cancer. Nat. Commun. 13, 2342 (2022).
Na, L. et al. MAML3 contributes to induction of malignant phenotype of gallbladder cancer through morphogenesis signalling under hypoxia. Anticancer Res. 43, 2909–2922 (2023).
Du, J. et al. A novel intergenic gene between SLC8A1 and PKDCC-ALK fusion responds to ALK TKI WX-0593 in lung adenocarcinoma: a case report. Front. Oncol. 12, 898954 (2022).
Fruman, D. A. et al. The PI3K pathway in human disease. Cell 170, 605–635 (2017).
Broadfield, L. A., Pane, A. A., Talebi, A., Swinnen, J. V. & Fendt, S. M. Lipid metabolism in cancer: new perspectives and emerging mechanisms. Dev. Cell. 56, 1363–1393 (2021).
Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672–D677 (2025).
Kanehisa, M. Toward Understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Sies, H. et al. Defining roles of specific reactive oxygen species (ROS) in cell biology and physiology. Nat. Rev. Mol. Cell. Biol. 23, 499–515 (2022).
Brockhausen, I. Mucin-type O‐glycans in human colon and breast cancer: glycodynamics and functions. EMBO Rep. 7, 599–604 (2006).
Hannun, Y. A. & Obeid, L. M. Sphingolipids and their metabolism in physiology and disease. Nat. Rev. Mol. Cell. Biol. 19, 175–191 (2018).
Goffin, V., Binart, N., Touraine, P. & Kelly, P. A. Prolactin: the new biology of an old hormone. Annu. Rev. Physiol. 64, 47–67 (2002).
Gloerich, M. & Bos, J. L. Regulating Rap small G-proteins in time and space. Trends Cell. Biol. 21, 615–623 (2011).
Hoxhaj, G. & Manning, B. D. The PI3K–AKT network at the interface of oncogenic signalling and cancer metabolism. Nat. Rev. Cancer. 20, 74–88 (2020).
Spat, A. & Hunyady, L. Control of aldosterone secretion: a model for convergence in cellular signaling pathways. Physiol. Rev. 84, 489–539 (2004).
Gallagher, E. J. & LeRoith, D. Obesity and diabetes: the increased risk of cancer and cancer-related mortality. Physiol. Rev. 95, 727–748 (2015).
Yuan, L. et al. Identification of ferroptosis-related lncRNAs for predicting prognosis and immunotherapy response in non-small cell lung cancer. Future Gener. Comput. Syst. 159, 204–220 (2024).
Bao, W., Yang, B. & Chen, B. TAPE_selection: organelle proteins classification with TAPE feature selection. IEEE Trans. Comput. Biology Bioinformatics. 99, (2025).
Chen, B., Li, N. & Bao, W. CLPr_in_ML: cleft lip and palate reconstructed features with machine learning. Curr. Bioinform. 20 (2), 179–193 (2025).
Acknowledgements
This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award P20GM104420. We appreciate the assistance of Srikar Chittemsetty with the implementation of the code for the experimental results.
Author information
Contributions
F.A., A.V., and B.Z. developed the concept for the study. F.A., M.E., H.G., and M.M. conducted the data analysis and performed the experimental evaluation. F.A., N.B., B.Z., and A.V. analyzed and validated the results and findings. F.A. and N.B. drafted the first version of the manuscript. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Alharbi, F., Budhiraja, N., Vakanski, A. et al. Interpretable graph Kolmogorov–Arnold networks for multi-cancer classification and biomarker identification using multi-omics data. Sci Rep 15, 27607 (2025). https://doi.org/10.1038/s41598-025-13337-0