Introduction

Genomic data classification plays a pivotal role in understanding complex biological processes and in supporting precision medicine initiatives. However, the high dimensionality, intrinsic noise, and sparsity of genomic datasets make it difficult for traditional learning models to capture meaningful representations. Deep learning (DL) has demonstrated significant promise in biomedical data analytics, but its application to genomic data presents unique challenges, including limited interpretability, high computational cost, and susceptibility to overfitting1. Recent advances in hybrid and interpretable learning frameworks attempt to address these issues by integrating structured attention and convolutional mechanisms. Nevertheless, many of these models still struggle to maintain generalizability and transparency when dealing with large-scale genomic data. Furthermore, the interpretability of learned features remains limited, hindering biological validation. Earlier approaches such as random forests, support vector machines (SVMs), and logistic regression relied heavily on manual feature selection, requiring domain experts to decide which features might be important2,3,4,5. While sometimes effective, this process is time-consuming, not scalable, and may overlook hidden biological interactions. DL models like feed-forward neural networks help by learning directly from raw inputs, but they often act like black boxes, providing little explanation for why a certain prediction was made. In healthcare, such interpretability is crucial for trust and acceptance.

To address these challenges, this research proposes a hybrid model that combines TabNet6 with a convolutional neural network (CNN)7 and an Adaptive Feature Refiner (AFR) layer. The model integrates TabNet’s sparse attention feature selection with CNN-based local refinement to enhance discriminative power while maintaining explainability. TabNet is known for its ability to focus on the most relevant features during training using attention mechanisms. It allows the model to automatically decide which inputs are important for each prediction, reducing the burden of manual feature engineering. The CNN complements this by capturing local patterns among the selected features, making it possible to detect complex biological relationships that might otherwise go unnoticed. The AFR module acts as an important bridge between TabNet and the CNN. After TabNet identifies key features, the refinement module re-weights them based on learned importance scores. This step filters out noise and highlights the most valuable information, helping the CNN build stronger and more meaningful feature representations for accurate classification.

Challenges in high-dimensional genomic data modeling

  • Curse of Dimensionality: When the number of features is much larger than the number of samples, models may easily overfit. Overfitting means the model performs well on training data but poorly on new, unseen data. Traditional machine learning (ML) models struggle to select meaningful patterns when faced with such a large feature space.

  • Noisy and Redundant Features: Not all genomic features contribute equally to a particular disease or trait. Many features are irrelevant or redundant. If models treat all features the same way, they may pick up noise rather than meaningful biological signals, reducing the prediction quality.

  • Complex Feature Interactions: Gene–gene interactions, known as epistasis, often influence biological outcomes. Capturing such non-linear and high-order relationships between features is very difficult using simple models.

  • Need for Interpretability: In healthcare, it is not enough for a model to just be accurate. Doctors and researchers need to understand why a model makes a certain prediction, especially when the decisions impact real human lives. Interpretability is critical for trust, regulatory approval, and clinical deployment.

  • Scalability to Large Datasets: Genomic datasets8,9,10,11,12 are growing rapidly with the advancement of sequencing technologies. Models must be able to handle large volumes of data efficiently without requiring too much memory or computation time.

  • Model Robustness: Models should remain reliable even when data contain missing values, noise, or slight variations. Robustness is crucial because biological data often come from multiple sources with different qualities.

Problem statement

Despite advancements in ML and DL, existing methods often fall short when dealing with high-dimensional genomic data. Most traditional models either lose important information by selecting too few features or become computationally heavy when trying to process everything. DL models can extract complex patterns but often lack interpretability and are sensitive to noise. There is a clear need for a new hybrid framework that can perform accurate classification by automatically selecting important features, capture hidden feature interactions effectively, and offer interpretability without adding significant computational burden. The aim of this research is to design such a framework by combining the dynamic feature selection ability of TabNet with the local pattern extraction strength of CNN, along with an adaptive refinement step to focus on the most informative features.

Motivation

The motivation behind this research comes from the gap between the potential of genomic data and the capabilities of current modeling techniques. Existing solutions either focus heavily on feature engineering, which is time-consuming and biased, or they rely on deep models that act like “black boxes,” offering little insight into decision-making. By bringing together TabNet and CNN, there is an opportunity to build a model that can automatically pick up important features, understand local patterns among genes, and present interpretable results. The addition of an AFR step offers further filtering, helping the model fine-tune its focus during learning. With this hybrid approach, it becomes possible to build models that are not only accurate but also trustworthy and scalable for real-world genomic applications, supporting personalized medicine, drug discovery, and clinical research.

The main contributions of this paper are:

  • Proposed a hybrid model combining TabNet, CNN, and AFR designed specifically for high-dimensional genomic data classification.

  • TabNet is used for selecting important features through attentive sparse selection, and a custom refinement layer adjusts the feature weights based on relevance, improving both performance and stability.

  • CNN is employed after refinement to capture important local interactions between selected features, allowing the model to understand complex feature relationships.

  • By combining attention scores from TabNet with feature maps from CNN, the model provides explanations for its decisions, making it suitable for healthcare applications.

  • The model is built with computational efficiency in mind, achieving better training speed and lower memory usage compared to traditional DL methods for genomic data.

  • Conducted comprehensive experiments on multiple real-world genomic datasets8,9,11, demonstrating that the proposed framework outperforms several state-of-the-art methods in terms of accuracy, F1-score, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), and robustness.

  • The model’s interpretability and performance make it a strong candidate for integration into clinical decision support systems, offering new tools for early diagnosis and treatment planning.

The paper is organized as follows: Section “Literature review” reviews recent works on deep learning, hybrid feature selection, and genomic data classification, highlighting their strengths and limitations in terms of models, datasets, evaluation metrics, interpretability, and privacy. Section “Proposed methodology” presents the proposed hybrid TabNet–CNN model, detailing its design, component interactions, and mathematical formulation. Section “Experimental setup” describes the experimental setup, including datasets such as TCGA8, GEO9, and ENCODE11, along with software tools, hardware configurations, preprocessing steps, and evaluation metrics like accuracy, AUC-ROC, F1-score, interpretability, and privacy trade-offs. Section “Results and discussions” reports the results and analysis, comparing the proposed approach with existing methods through tables and graphical evaluations. Section “Conclusion and future scope” concludes with key findings and outlines future directions, including extensions to multi-omics data, enhanced privacy frameworks, and clinical applications for precision medicine.

Literature review

The field of genomic data analysis has seen rapid growth with the adoption of ML and DL methods. Researchers have introduced innovative frameworks to classify, predict, and extract meaningful patterns from complex biological datasets. These methods have been applied across healthcare, plant science, and cancer genomics, showing great promise but also facing important limitations. Many models achieve strong predictive performance but often lack interpretability, robustness, and scalability when dealing with high-dimensional genomic data. This literature review discusses several recent contributions that aim to improve genomic analysis. For each work, key strengths and existing challenges are outlined to better highlight the need for a more refined and balanced solution, such as the one proposed in this paper.

El-Nabawy et al.13 presented a method that combined clinical, genomic, and tissue-level data to classify breast cancer subtypes. The fusion approach improved accuracy, but the model was sensitive to uneven datasets.

Lu et al.14 proposed BrcaSeg, which connected image features from tissue samples with genomic information for breast cancer research. It revealed meaningful relationships, though broader testing on different datasets was limited.

Wei et al.15 introduced DeepTIS, which worked well for finding translation initiation sites in genomic sequences. However, the model didn’t perform as well when tested on different kinds of data.

Huang et al.16 discussed ML models for genomics in therapy applications. The discussion gave useful insights, but many of the models lacked the ability to give clear reasoning behind their predictions.

Ye et al.17 used image-based DL to classify various cancer types using genomic data. The accuracy was good, but the method had trouble dealing with differences between cancer types.

Ahemad et al.18 built a system for detecting COVID-19 using genomic data. The model worked well on small datasets but needed larger samples for reliable results.

Erfanian et al.19 reviewed DL models for analyzing single-cell data. These models captured fine differences at the cell level but often worked like black boxes, making their predictions hard to understand.

Bazgir and Lu20 designed REFINED-CNN to predict survival using large feature sets. The model handled complex input well, but the results were not easy to explain, which limits its use in areas like healthcare.

Khodaei et al.21 built a model to classify virus genome signals using ML and signal processing. It showed good accuracy but wasn’t tested much under noisy or imperfect data conditions.

Wang et al.22 developed DNNGP, which worked with multi-omics data for plant traits. This method gave better predictions but required extra steps to process complex inputs before training.

Zhu et al.23 created GSRNet by combining CNN and BiGRU with adversarial training to learn from genomic signal patterns. The approach worked well for capturing patterns, though it required more training time and effort.

Abu-Doleh and Al Fahoum24 introduced XgCPred, a hybrid of XGBoost and CNN that used gene expression images for cell classification. It performed well, but the system was hard to interpret and didn’t scale easily.

Mohammed et al.25 applied U-Net to genomic sequences. The model learned good representations, but more testing was needed on different datasets.

Barber and Oueslati26 used pre-trained networks and a custom CNN to identify human exon and intron patterns. While the model improved prediction, it required heavy computation.

Nawaz et al.27 explored sequence modeling for genomics. Their method picked up patterns well but was still hard to use in real medical settings due to limited interpretability.

Abbas et al.28 worked on cancer classification using federated learning to protect private data. Coordination between different systems was one of the key challenges.

Mora-Poblete et al.29 merged genomic and phenomic data to improve predictions in Eucalyptus trees. It worked well in plant research, but the same method hasn’t been used outside plant studies.

Feng et al.30 created AI Breeder to support crop prediction. The model merged genomic and phenotype data effectively, although it needed extra effort when dealing with genetically diverse samples.

Batra et al.31 built an AI system for early lung cancer detection using genomic, clinical, and imaging data. While the method worked well, managing data from different sources raised privacy concerns.

Sangeetha et al.32 proposed a DL model for lung cancer classification using multiple types of data. The model improved predictions but struggled when parts of the data were missing.

Yaqoob, Verma, and Aziz33 introduced a hybrid optimization model that combines the sine cosine algorithm with cuckoo search for gene selection and cancer classification. The study reported that this combination improved the accuracy of selecting relevant genes and enhanced the classification performance compared to single algorithms. The approach reduced computational complexity and offered stable results across different datasets. A limitation observed was that the hybrid method required parameter tuning that could be challenging for large and diverse datasets.

Yaqoob, Verma, Aziz, and Saxena34 proposed a hybrid feature selection method that combines cuckoo search with Harris Hawks optimization for cancer classification. This work showed strong effectiveness in identifying the most relevant gene subsets and provided reliable improvements in prediction accuracy. The hybridization of two metaheuristic algorithms allowed better exploration and exploitation of the search space. However, the approach could still face scalability issues when applied to very high-dimensional datasets with thousands of features.

Yaqoob, Verma, Aziz, and Shah35 introduced a hybrid Random Drift Optimization (RDO)-XGBoost framework for cancer classification. The model used the RDO algorithm for selecting informative features, followed by XGBoost for prediction. The study demonstrated high predictive power and provided useful insights into gene expression patterns linked to cancer. The limitation of the work was that the computational cost was high, and the approach might be less effective when applied to extremely imbalanced datasets.

Yaqoob36 combined the minimum redundancy maximum relevance (mRMR) technique with the Northern Goshawk Algorithm (NGHA) for cancer gene selection. This integration helped in removing redundant features and improved the overall efficiency of cancer classification tasks. The method showed better convergence rates compared to other evolutionary algorithms. A limitation noted was that the performance of NGHA strongly depended on initial parameter settings, which could restrict adaptability across different cancer datasets.

Yaqoob and Verma37 developed a feature selection method for breast cancer gene expression data using a combination of Krill Herd Algorithm Optimization (KAO) and Arithmetic Optimization Algorithm (AOA), followed by classification with SVM. The model provided strong improvements in classification accuracy and reduced the number of irrelevant features. The hybridization of KAO and AOA allowed effective balance between exploration and exploitation of the search process. A limitation identified was that the approach could require significant computational resources, making it difficult to use in real-time applications.

Raja et al. (2025)38 developed an attention-based CNN to predict liver tumors using genomic data. It helped the model focus on key genomic features but depended on large labeled datasets for training.

Lin et al.39 used DL to study traits in tiger pufferfish. The system worked well for one species, but applying it to other organisms might not give the same results.

Wang et al.40 introduced Cropformer for plant genomics. It combined accuracy with explainability, though the model was resource-heavy and harder to use on larger problems.

Wu et al.41 built AutoGP for genomic selection in maize breeding. It helped improve decision-making but required large training data that may not always be available.

Recent work42 presented comprehensive insights into explainable artificial intelligence (XAI) for medical image analysis, emphasizing the need for transparency in biomedical decision systems. Other notable studies43,44 enhanced medical image report generation using a self-boosting multimodal alignment framework, proposing cross-modal attention to improve interpretability in complex healthcare models. Although these methods were developed for imaging, their interpretability principles motivate similar approaches in genomic modeling.

Table 1 Comparative analysis of existing and proposed methods.

Table 1 describes the comparison between different existing models and the proposed hybrid framework with AFR for genomic data classification. It shows the methods used, types of data handled, major strengths, limitations, and key features for each model. Models like GSRNet, REFINED-CNN, and U-Net have made good progress in tasks such as genomic signal prediction, survival analysis, and sequence classification. However, many of these approaches struggle with issues like handling high-dimensional noisy data, generalization to different datasets, or high computational costs. The proposed framework addresses these problems by combining dynamic feature selection and local pattern learning, leading to better accuracy and more reliable interpretation of genomic features. Although careful tuning may be needed for very sparse datasets, the proposed model provides a balanced solution for real-world genomic challenges by improving classification, interpretability, and scalability.

Gaps identified in the literature

Many existing models for genomic data classification show good results in accuracy but struggle with practical and biological limitations. The proposed hybrid framework with AFR responds directly to these concerns by building a more focused, interpretable, and scalable solution. The points below highlight each gap and how the proposed approach deals with it clearly.

  • Manual feature engineering is still common in many methods, requiring expert input and often missing hidden patterns. The proposed model uses TabNet’s attention-based mechanism to automatically identify and prioritize the most relevant features during training, removing the need for hand-crafted feature selection.

  • DL models often lack interpretability, making it difficult to understand why a prediction was made. The proposed framework provides visual attention scores from TabNet and activation maps from CNN, making it easier to explain predictions and gain biological insights.

  • High-dimensional and sparse genomic data often confuse traditional models, leading to overfitting or poor learning. TabNet narrows down the input space to the most useful features, while CNN captures structured relationships among them, improving learning from dense and sparse signals.

  • Existing models struggle to perform well when the data contains noise or missing values, which are common in genomic studies. The AFR module in the proposed model adjusts the importance of selected features, reducing the effect of irrelevant or incomplete data.

  • Many DL frameworks require significant computing power, which limits their use in clinical or resource-limited settings. The design of the proposed method keeps the model light, reducing memory usage and speeding up the training process, making it easier to use in real-world environments.

  • Local interactions among genes or markers are often overlooked, especially when models treat each feature in isolation. The CNN layer in the proposed framework captures these local patterns, improving the biological relevance of the learned representations.

  • Generalization across different datasets is often limited, making models unreliable when applied to new data. By combining attention-driven selection with spatial pattern learning, the proposed method shows better consistency across multiple genomic datasets.

  • Limited capability of existing models to process extremely high-dimensional genomic features.

  • Lack of interpretability and transparency in DL-based genomic analysis.

  • Absence of adaptive mechanisms for noise reduction and imbalance handling in heterogeneous datasets.

The proposed framework brings together focused selection, structural learning, and interpretability, helping the model to perform well across both scientific and clinical applications.

Proposed methodology

This work introduces a hybrid DL framework that addresses the major challenges of high-dimensional genomic data analysis. The proposed model integrates three key components as shown in Figs. 1 and 2: (1) TabNet for attention-based feature selection, (2) an AFR module to strengthen important signals and suppress noise, and (3) a CNN to classify refined genomic data with improved accuracy and interpretability. The design is guided by the need to handle sparse, nonlinear, and complex genomic patterns without requiring manual feature engineering.

Fig. 1. Workflow of proposed method.

The proposed hybrid architecture consists of two major components: a TabNet-based attention-driven feature selection module and a CNN-based refinement network. The TabNet component performs interpretable feature selection using sparse masks that emphasize the most informative genomic features. These selected features are then passed into a convolutional network that captures hierarchical dependencies and refines the learned representation for final classification.

  • TabNet Encoder: Utilizes attentive feature masking, decision steps, and a relaxation factor to prioritize relevant features.

  • CNN Refinement: Employs stacked convolutional layers with 64, 128, and 256 filters, a stride of 1, and ReLU activation to capture spatial correlations.

  • Fusion Layer: Combines attention-weighted and convolutional embeddings for robust representation.

  • Output Layer: A softmax classifier provides probabilistic outputs for genomic class prediction.

Architectural overview and feature processing

This research presents a hybrid learning model designed to improve classification performance on high-dimensional genomic datasets8,9,11. The core idea is to use a combined structure of TabNet and CNN, supported by an AFR layer. The framework was developed with three main goals: selecting the most informative features, capturing meaningful local patterns, and improving prediction interpretability for medical use.

The architecture starts with TabNet, which is known for its ability to perform dynamic feature selection using attention-based decision steps. Genomic data typically consist of thousands of gene or SNP-level features. TabNet works by learning which features are important at different stages of training, assigning more weight to relevant ones. This selective attention reduces noise and improves learning efficiency.

Once TabNet selects the most useful features, these features are passed into a feature refinement layer. This layer re-weights the selected features based on learned importance scores, further enhancing strong signals and reducing the influence of irrelevant ones. Unlike manual feature engineering, this automatic refinement helps the model focus only on critical biological markers that contribute to disease classification or genomic traits.

The refined feature set is then processed by a CNN. CNNs are widely used in fields like computer vision, but in this context, they are applied to detect local and spatial patterns in genomic features. These patterns often indicate hidden interactions between gene groups or pathways. The CNN applies filters across the feature space to extract these patterns, turning refined gene information into more useful feature maps.

Following this, the outputs from the CNN layers are flattened and passed through fully connected layers to complete the classification task. The final output is a probability vector corresponding to the predicted class labels (for example, disease presence or absence). The entire model is trained in an end-to-end manner, where loss is backpropagated across all three modules (TabNet, refinement layer, and CNN) to fine-tune their weights together.

This combination offers several advantages. TabNet handles high-dimensional inputs efficiently by choosing key features at each learning step. The adaptive refinement layer improves signal clarity without human intervention. CNN contributes to learning non-linear and local feature relationships that traditional models may overlook. Altogether, the hybrid framework supports better prediction accuracy, clearer model behavior, and smoother handling of genomic complexity.

Fig. 2. Architecture of the proposed framework for genomic data classification.

Algorithm 1. Hybrid proposed framework for genomic data classification.

Algorithm 1 describes the flow of the hybrid DL architecture. The process begins with a high-dimensional genomic input. Instead of applying traditional preprocessing or manual feature selection, TabNet dynamically selects the most important features using attention masks that learn to prioritize signal-rich dimensions. These selected features represent a condensed form of the original data.

Next, the AFR module receives the TabNet output and applies a layer of learnable weights. These weights help to further filter out noisy components and reinforce biologically significant signals. This refined representation is then passed to a CNN, which detects spatial patterns or interactions among neighboring genomic features. The CNN output is passed through classification layers, and the final prediction is compared with the ground truth labels using cross-entropy loss. The parameters of all components are updated jointly during training using backpropagation. In addition to performance, this framework also supports interpretation through the attention weights of TabNet and activation heatmaps from CNN.
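The following is a compact, self-contained PyTorch sketch of this flow. The TabNet encoder is replaced by a single attentive projection for brevity (the actual implementation uses the pytorch-tabnet package), and the dimensions, module names, and hyperparameters are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridGenomicNet(nn.Module):
    """TabNet-style selection -> AFR re-weighting -> CNN classification (Algorithm 1)."""
    def __init__(self, n_features, n_selected=256, n_classes=2):
        super().__init__()
        # Stand-in for the TabNet encoder (the paper uses the pytorch-tabnet package).
        self.selector = nn.Linear(n_features, n_selected)
        # AFR: one learnable importance logit per selected feature (see Algorithm 2).
        self.afr_logits = nn.Parameter(torch.zeros(n_selected))
        # Compact CNN head; a fuller stack appears in the Algorithm 3 sketch below.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(), nn.Linear(64, n_classes),
        )

    def forward(self, x):
        selected = torch.relu(self.selector(x))                 # feature selection
        refined = selected * F.softmax(self.afr_logits, dim=0)  # adaptive refinement
        return self.cnn(refined.unsqueeze(1))                   # local patterns -> logits

# One joint training step: the loss gradient reaches all three modules.
model = HybridGenomicNet(n_features=20000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 20000)            # dummy batch of genomic feature vectors
y = torch.randint(0, 2, (32,))        # ground-truth class labels
loss = F.cross_entropy(model(x), y)   # cross-entropy against ground truth
opt.zero_grad()
loss.backward()                       # gradients flow through selector, AFR, and CNN
opt.step()                            # joint end-to-end update
```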

Algorithm 2. AFR.

Algorithm 2 describes the adaptive refinement layer that processes the feature matrix selected by TabNet. Each feature is multiplied by a learnable weight that reflects its importance in distinguishing between different classes. To maintain numerical stability and interpretability, these weights are normalized using the softmax function, which scales them to a probability distribution.

This refinement step is simple but powerful. It allows the network to not only select features but also fine-tune their impact before sending them to the CNN. The idea is to filter and re-weight the already selected signals, providing a cleaner and more relevant input to the final classifier. During training, these weights are adjusted automatically to improve classification accuracy. By combining attention from TabNet and this refinement layer, the model builds a more accurate representation of the underlying biological patterns.
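A standalone version of this refinement step might look as follows; the module name and feature dimension are illustrative, but the operations mirror the description above: one learnable weight per selected feature, softmax normalization, and element-wise re-weighting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureRefiner(nn.Module):
    """Re-weights TabNet-selected features with softmax-normalized learnable weights."""
    def __init__(self, n_selected):
        super().__init__()
        # One importance logit per selected feature, updated by backpropagation.
        self.logits = nn.Parameter(torch.zeros(n_selected))

    def forward(self, x):                   # x: (batch, n_selected)
        w = F.softmax(self.logits, dim=0)   # normalize weights to a probability distribution
        return x * w                        # amplify strong signals, damp noisy ones
```

Because the weights are ordinary network parameters, they receive gradients from the classification loss and adapt jointly with TabNet and the CNN during end-to-end training.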

Algorithm 3. CNN-based classification.

Algorithm 3 describes the CNN module that processes the refined features and performs the final classification. The refined matrix is passed through a stack of convolutional layers, which apply multiple filters to detect local spatial patterns across selected genomic features. This mimics the way CNNs operate on images, but adapted to genomic sequences.

Each convolutional layer is followed by a non-linear ReLU activation to enhance feature expressiveness. Then, max-pooling layers reduce the spatial dimensions, highlighting dominant signals while compressing the data. After pooling, the outputs are flattened into a one-dimensional vector and passed to a set of fully connected (dense) layers that perform the final classification.

This step captures short- and long-range relationships between features while keeping the model size manageable. Because the input features have already been selected and refined, the CNN can focus on learning from the most informative signals without getting distracted by irrelevant or noisy inputs. The result is a more focused and efficient classifier suitable for high-dimensional genomic tasks.
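A sketch of such a module is shown below, assuming the refined input is a one-dimensional feature vector. The filter counts follow the [64, 128, 256] configuration mentioned earlier, while the kernel sizes and dense-layer widths are illustrative choices.

```python
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    """Conv -> ReLU -> MaxPool stack, then flatten and dense layers, as in Algorithm 3."""
    def __init__(self, n_refined=256, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        # Each pooling halves the feature length, so the flattened size is 256 * (n_refined / 8).
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * (n_refined // 8), 128), nn.ReLU(),
            nn.Linear(128, n_classes),       # logits; softmax is applied in the loss
        )

    def forward(self, x):                    # x: (batch, n_refined) refined features
        return self.classifier(self.features(x.unsqueeze(1)))
```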

Algorithm 4. Privacy–utility trade-off in proposed framework.

Algorithm 4 outlines the step-by-step method used in the proposed system to balance prediction accuracy with data privacy. The process begins with basic preprocessing, which includes normalization and handling missing values. After the data is prepared, the model uses TabNet’s attention mechanism to estimate the importance of each feature. This helps identify which features are both useful and sensitive.

In the next step, a privacy mechanism is applied in an adaptive way. Instead of applying the same privacy technique to all features, the algorithm treats features differently based on their sensitivity score. Features that carry a high risk of privacy exposure receive stronger noise, while others are masked or slightly dropped to retain important patterns without leaking private information.
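A minimal sketch of this adaptive mechanism is given below, assuming feature-wise sensitivity scores derived from TabNet's attention masks. The threshold, noise scales, and masking probability are illustrative placeholders rather than the calibrated values used in the study.

```python
import numpy as np

def adaptive_privacy_mask(X, sensitivity, high_thresh=0.7, sigma_high=0.5,
                          sigma_low=0.1, drop_prob=0.05, rng=None):
    """Feature-wise privacy: strong Gaussian noise on high-sensitivity features,
    light noise plus occasional masking on the rest. All constants are illustrative."""
    if rng is None:
        rng = np.random.default_rng(0)
    X_priv = X.copy()
    high = sensitivity >= high_thresh                        # high-risk feature columns
    X_priv[:, high] += rng.normal(0.0, sigma_high, X_priv[:, high].shape)
    X_priv[:, ~high] += rng.normal(0.0, sigma_low, X_priv[:, ~high].shape)
    mask = rng.random(X_priv[:, ~high].shape) < drop_prob    # randomly mask a few entries
    X_priv[:, ~high] = np.where(mask, 0.0, X_priv[:, ~high])
    return X_priv
```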

Once the sensitive data is protected, the proposed hybrid model is trained on this modified version of the dataset. After training, the model is evaluated using standard metrics like accuracy, AUC-ROC, and F1-Score. At the same time, the system checks how well it resists privacy risks using a membership inference test, which helps measure whether the model remembers too much about specific training samples. This method keeps the model strong in its predictions while reducing the chance of exposing personal or sensitive details in biomedical datasets. Table 2 lists each symbol used in the three main algorithms.

Table 2 Variables and their descriptions.

Advantages of proposed model

The proposed model was designed with the goal of improving both the accuracy and clarity of predictions when working with complex genomic data. By bringing together feature selection, local pattern detection, and refinement in a single framework, it addresses many limitations found in earlier methods. The following points highlight the main benefits offered by this approach.

  • Combines both attention-based feature selection and DL, which helps in capturing both important features and local patterns from genomic data.

  • Reduces noise and irrelevant information using AFR before classification.

  • Handles high-dimensional genomic data without needing heavy preprocessing or manual feature engineering.

  • Improves classification performance by refining features between TabNet and CNN stages.

  • Makes the model easier to interpret by producing both feature importance (from TabNet) and spatial activation maps (from CNN).

  • Suitable for multi-class genomic classification problems with complex patterns.

  • Balances model accuracy and interpretability, which is often difficult in DL applications.

  • Can be applied to other biological or medical datasets that contain high-dimensional and noisy features.

Experimental setup

The experiments were carried out using both the proposed hybrid framework with AFR and several existing baseline models for comparison. The simulations were executed on a system equipped with an NVIDIA RTX A6000 GPU (48 GB), an Intel Xeon Silver 4310 CPU (2.1 GHz, 16 cores), and 256 GB RAM. The software environment was based on Ubuntu 20.04, Python 3.9, PyTorch 1.13, RStudio, and Scikit-learn. The TabNet implementation was taken from the official pytorch-tabnet package. For CNN development and hybrid model integration, PyTorch’s native APIs were used.

Training configuration

The model was trained using the Adam optimizer with a learning rate of 0.001 for 100 epochs as shown in Table 3. Batch normalization and dropout (rate = 0.3) were employed to enhance generalization. Data preprocessing included normalization and stratified sampling to address imbalance.

Table 3 Model training configuration and hyperparameters.

Training workflow and experimental design

The training process of the proposed hybrid model follows a systematic pipeline to ensure the accuracy, efficiency, and robustness of the classification task on genomic datasets8,9,11. This section describes the steps involved in training the model, the dataset preparation, and the evaluation metrics used to assess its performance.

Dataset preparation

Three publicly available datasets were used to evaluate the model’s performance and generalizability.

TCGA-BRCA dataset from The Cancer Genome Atlas8 contains bulk RNA-seq data for breast cancer. It includes roughly 1,100 samples with over 20,000 gene expression features. Around 62% of the samples are positive (cancer cases) and 38% are negative (normal tissue). This dataset served as a benchmark for evaluating classification performance across known genomic patterns.

GSE72056 dataset from the GEO repository9 consists of single-cell RNA-seq expression profiles derived from melanoma patients. The dataset includes approximately 4,645 samples with around 23,000 gene features. About 35% of the samples are labeled as positive (malignant), while 65% are negative (non-malignant). This dataset was mainly used for training and initial validation.

ENCSR000AED dataset from the ENCODE project11 provides epigenomic signals from ChIP-seq experiments. It includes around 5,300 instances with more than 15,000 features. The dataset is relatively balanced with a slight majority of positive labels. It was used to test the model’s ability to generalize under feature noise and high-dimensional variability.

All datasets were pre-processed by removing missing values and normalizing gene expression levels using min–max scaling. For training, an 80/20 stratified split was used, followed by 5-fold cross-validation to assess model stability and avoid overfitting.
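The preprocessing and splitting described above can be expressed with standard scikit-learn utilities. The helper below is a sketch under those assumptions (for simplicity it fits the scaler on the full matrix, since the text states normalization precedes splitting).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, StratifiedKFold

def prepare(X, y):
    """Drop rows with missing values, min-max scale, 80/20 stratified split, 5-fold CV."""
    keep = ~np.isnan(X).any(axis=1)                  # remove samples with missing values
    X, y = X[keep], y[keep]
    X = MinMaxScaler().fit_transform(X)              # min-max normalization
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    folds = list(StratifiedKFold(n_splits=5, shuffle=True,
                                 random_state=42).split(X_tr, y_tr))
    return X_tr, X_te, y_tr, y_te, folds
```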

To address class imbalance, a combination of Synthetic Minority Oversampling Technique (SMOTE) and random undersampling was applied. This helped to provide balanced class representation during training.
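One common way to implement this combination, continuing from the split above, is with the imbalanced-learn package (an assumption; the paper does not name its resampling library):

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

resampler = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),               # oversample the minority class
    ("under", RandomUnderSampler(random_state=42)),  # then trim the majority class
])
X_bal, y_bal = resampler.fit_resample(X_tr, y_tr)    # balanced training set
```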

To improve model generalization, data augmentation was performed by introducing controlled Gaussian noise to gene expression vectors and performing random gene dropout on a small proportion of features. These enhancement strategies helped the model to learn more stable representations, particularly in high-dimensional and noisy settings.
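A minimal sketch of these two augmentations follows; the noise level and dropout fraction are placeholders, since the text specifies only "controlled" noise and a "small proportion" of features.

```python
import numpy as np

def augment(X, noise_std=0.01, drop_frac=0.02, rng=None):
    """Gaussian noise on expression values plus random dropout of a few genes."""
    if rng is None:
        rng = np.random.default_rng(0)
    X_aug = X + rng.normal(0.0, noise_std, X.shape)    # controlled Gaussian noise
    n_drop = int(drop_frac * X.shape[1])
    drop_idx = rng.choice(X.shape[1], size=n_drop, replace=False)
    X_aug[:, drop_idx] = 0.0                           # random gene dropout
    return X_aug
```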

Training process

The training workflow begins with the input of pre-processed features into the TabNet module. TabNet is trained first to select the most relevant features for the task. During this phase, the model learns which features contribute most to the final classification decision. Once the feature selection is complete, the output of TabNet is passed into the feature refinement layer, where features are re-weighted based on their relevance scores.

Next, the refined features are fed into the CNN for further processing. The CNN layers are trained to capture spatial relationships between the features, which may represent complex biological interactions that traditional models struggle to identify. The CNN learns to recognize patterns, such as gene co-expression or pathway activations, that may indicate disease presence.

The entire model, including TabNet, the refinement layer, and CNN, is trained jointly using backpropagation. The model is optimized using a loss function appropriate for classification tasks, such as categorical cross-entropy. The backpropagation algorithm updates the weights of all components in the model to minimize this loss function, improving the model’s prediction accuracy.

Hyperparameter tuning

Hyperparameter tuning is an essential step in optimizing the performance of the model. Parameters such as learning rate, batch size, number of layers, and the number of attention heads in TabNet are tuned using a grid search or random search strategy. Cross-validation is applied during hyperparameter tuning to prevent overfitting and ensure that the model generalizes well to unseen data.
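As an illustration, a random-search loop over a few of these hyperparameters could look like the sketch below. The search space and the train_and_score helper (which would train the hybrid model on one cross-validation fold and return a validation score) are hypothetical, and `folds` comes from the preparation sketch earlier.

```python
import random

search_space = {
    "lr": [1e-4, 5e-4, 1e-3],        # learning rate
    "batch_size": [32, 64, 128],
    "n_steps": [3, 5, 7],            # TabNet decision steps
}

best_score, best_cfg = -1.0, None
for _ in range(20):                   # 20 random draws from the space
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    # Cross-validated score: train_and_score is a hypothetical helper that trains
    # the hybrid model on one fold and returns its validation metric.
    scores = [train_and_score(cfg, fold) for fold in folds]
    mean_score = sum(scores) / len(scores)
    if mean_score > best_score:
        best_score, best_cfg = mean_score, cfg
```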

Evaluation was conducted using accuracy, AUC-ROC, and F1-score, supported by visual interpretability through attention and activation maps. Gaussian noise injection was applied to simulate perturbations in biological data and to assess the stability of feature selection and prediction performance.

Table 4 Simulation environment and model settings.

Table 4 provides an overview of the simulation environment, model parameters, tools used, and datasets applied in the study. It lists the key hardware components used to run the DL workloads, including GPU, CPU, and memory size. It also highlights the Python packages and environments required for implementing the models. Dataset sources8,9,11 and their intended roles are clearly outlined. The hyperparameters applied to TabNet and CNN modules are reported, along with training configurations. Gaussian noise addition is mentioned under robustness testing, and the evaluation metrics are stated as part of the performance assessment.

Evaluation metrics

The performance of the models was assessed using accuracy, AUC-ROC, and F1-Score, which are defined below using standard notation.

  • Accuracy measures the ratio of correctly predicted samples to the total number of samples (Eq. 1):

    $$\begin{aligned} \text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
    (1)

    where \(TP\), \(TN\), \(FP\), and \(FN\) denote true positives, true negatives, false positives, and false negatives, respectively.

  • AUC-ROC reflects the model’s ability to distinguish between classes. It is computed based on the true positive rate (TPR) and false positive rate (FPR) (Eq. 2):

    $$\begin{aligned} \text {AUC} = \int _{0}^{1} \text {TPR}(t) \, d(\text {FPR}(t)) \end{aligned}$$
    (2)
  • F1-Score is the harmonic mean of precision and recall (Eq. 3):

    $$\begin{aligned} \text {F1-Score} = \frac{2 \cdot \text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}} = \frac{2TP}{2TP + FP + FN} \end{aligned}$$
    (3)
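These three metrics map directly onto scikit-learn utilities; the helper below is a small illustrative wrapper (the function name and argument layout are ours, not from the paper).

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_prob):
    """y_pred: hard class labels; y_prob: predicted positive-class probabilities."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),  # Eq. (1)
        "auc_roc": roc_auc_score(y_true, y_prob),    # Eq. (2)
        "f1": f1_score(y_true, y_pred),              # Eq. (3)
    }
```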

Methods

Code availability

The custom code and algorithms used to implement the proposed hybrid deep learning framework, including the TabNet-based attention mechanism, AFR module, and CNN-based classification architecture, were developed using Python and PyTorch. The implementation builds upon the open-source pytorch-tabnet library and standard ML utilities from Scikit-learn.

To support transparency and reproducibility, the complete source code corresponding to the version used to generate the results reported in this study is publicly available in a version-controlled repository (https://github.com/PREMKUMARCH/Hybrid-Deep-Learning-Framework-for-Accurate-Classification-of-High-Dimensional-Genomic-Data).

To preserve the exact version of the code associated with this publication, the repository has been prepared for archival in an established DOI-minting repository. The archived version will be made publicly accessible upon final acceptance of the manuscript. There are no restrictions on access to the code, which is provided for academic and research use.

Results and discussions

Table 5 describes the performance of different models on three datasets: single-cell RNA-seq data (GSE72056), bulk gene expression (TCGA-BRCA), and epigenomic signals (ENCSR000AED). Baseline methods, such as Lu et al. (2021) and Raja et al. (2025), achieved accuracies in the range of 80–86%, with AUC-ROC values around 0.84–0.89 and F1-scores from 0.77 to 0.84. Their interpretability ratings stayed at low or medium levels, reflecting limited visibility into feature importance. In comparison, the proposed hybrid framework reached 91.4% accuracy (0.95 AUC-ROC, 0.90 F1-score) on GSE72056, 90.8% accuracy (0.94 AUC-ROC, 0.89 F1-score) on TCGA-BRCA, and 89.5% accuracy (0.92 AUC-ROC, 0.88 F1-score) on ENCSR000AED. These gains stem from attention-driven feature selection that highlights key genomic markers, an adaptive refinement layer that amplifies true signals and suppresses noise, and convolutional filters that capture local interactions. High interpretability scores reflect clear attention maps and activation visuals, while stable results under noise confirm the model’s robustness.

Table 5 Performance comparison of different models across genomic datasets.

Table 6 shows the performance gain achieved by the proposed model on key evaluation metrics: accuracy, AUC-ROC, and F1-Score. The proposed model achieves a substantial improvement, with accuracy gaining 7.8%, AUC-ROC increasing by 7.1%, and F1-Score improving by 7.4%. These gains suggest that the model performs better than previous methods in all the key areas measured. The improvements stem from the proposed hybrid architecture, which focuses on both feature selection and abstraction, making it more effective in handling genomic and epigenomic data.

Table 6 Performance gain comparison on different datasets using accuracy, AUC-ROC, and F1-score.

Table 7 presents the time complexity of different models, comparing both training and inference phases. The proposed model has a relatively low complexity, \(O(n \cdot d + k^2)\), where \(n\) is the number of samples, \(d\) is the feature dimension, and \(k\) is the kernel size of the CNN. This complexity is due to the linear time complexity of TabNet and the convolutional layers used for feature extraction. In contrast, some other models, like Lu et al. (2021), exhibit higher complexities due to quadratic dependencies on feature dimensions. The proposed model balances computational efficiency with performance, making it suitable for large-scale datasets.

Table 7 Time complexity comparison across different genomic datasets.

Fig. 3. Accuracy comparison of proposed with existing methods.

Figure 3 shows the accuracy achieved by each model on their respective datasets. The proposed hybrid framework demonstrates clearly higher accuracy across all three datasets (GSE72056, TCGA-BRCA, and ENCSR000AED) compared to existing approaches. This improvement comes from the ability of the proposed model to focus on relevant features through attention mechanisms and combine deep representation learning from CNN with decision-level insights from TabNet.

Fig. 4. ROC comparison of proposed with existing methods.

Fig. 5. F1-score comparison of proposed with existing methods.

Figures 4 and 5 show the comparison of AUC-ROC and F1-Score across different models applied to the selected datasets. The proposed hybrid framework demonstrates a consistent lead in both metrics across all three datasets. In the AUC-ROC graph, the proposed method peaks with scores of 0.95 (GSE72056), 0.94 (TCGA-BRCA), and 0.92 (ENCSR000AED), indicating a strong ability to distinguish between classes. These results suggest that the model maintains stability and discrimination quality, even with diverse data sources. In the F1-Score graph, the model achieves 0.90, 0.89, and 0.88 respectively on the three datasets. This demonstrates that the proposed model balances precision and recall well, especially under varying feature complexity. The improved scores can be attributed to the integration of attention mechanisms in TabNet and spatial feature learning in CNN, allowing better pattern extraction from high-dimensional biomedical data.

Fig. 6. Accuracy comparison with various datasets.

Fig. 7. ROC comparison with various datasets.

Fig. 8. F1-score comparison with various datasets.

Figures 6, 7 and 8 illustrate the comparison of the proposed hybrid model with selected existing models across three datasets (GSE72056 from GEO, TCGA-BRCA from TCGA, and ENCSR000AED from ENCODE) on three key evaluation metrics: accuracy, AUC-ROC, and F1-Score.

  • Accuracy: The proposed model shows a clear lead on all datasets, with over 90% accuracy. Competing models stay below 87%, highlighting better prediction precision with the hybrid framework.

  • AUC-ROC: The proposed method maintains strong performance with AUC-ROC scores above 0.92 across datasets, suggesting consistent class separation. Other models perform well but not as uniformly.

  • F1-Score: The proposed model again stands out with scores above 0.88, pointing to a balanced approach in handling both precision and recall, especially under imbalanced data scenarios.

Fig. 9. Proposed framework against the best-performing existing models.

Figure 9 shows a clear comparison of the proposed hybrid framework against the best-performing existing models across three datasets: GSE72056, TCGA-BRCA, and ENCSR000AED. The three subplots present results for accuracy, AUC-ROC, and F1-Score respectively. In all three evaluation metrics, the proposed model consistently achieves better performance across each dataset. In the Accuracy plot, the proposed model shows a noticeable improvement, reaching over 91% on GSE72056, above 90% on TCGA-BRCA, and close to 90% on ENCSR000AED. This improvement indicates that the model is able to classify more samples correctly than earlier approaches. In the AUC-ROC plot, the model maintains higher area-under-curve values for each dataset, suggesting better performance in distinguishing between classes. The curves are all above 0.92, which highlights strong prediction confidence. In the F1-Score comparison, the proposed model delivers higher scores, especially on GSE72056, indicating that it manages both precision and recall effectively. The balance in these scores across datasets suggests the model performs well even in slightly imbalanced conditions. This consistent performance boost is achieved through attention-based feature handling and better learning from structured data, which are key characteristics of the proposed hybrid design.

Table 8 Privacy–performance trade-off analysis across noise levels in proposed model.

Table 8 shows how the proposed hybrid model behaves when different levels of noise are added for privacy. As the privacy protection gets stronger (higher noise and lower epsilon), the model’s performance gradually drops. At the baseline level without noise, accuracy is 91.4%, and the model shows clear interpretation. With low noise, the drop in accuracy and F1-score is very small, and interpretability stays high. As noise increases further, the model still keeps decent performance, though there is a steady decline. Even under medium to high noise, the scores stay above 84% accuracy and 0.81 F1-score, which suggests the model can still make good predictions under privacy-preserving conditions.

The proposed model handles this trade-off better than typical deep models due to its selective feature handling through TabNet and stable layer-wise representation in CNN. These structures help preserve useful signals even when the data has been slightly distorted to protect privacy. This balance makes the model suitable for sensitive environments like healthcare where both accuracy and privacy matter.

Table 9 Privacy trade-off analysis on GSE72056 (Single-cell RNA-seq).
Table 10 Privacy trade-off analysis on TCGA-BRCA (Gene expression).
Table 11 Privacy trade-off analysis on ENCSR000AED (Epigenomic data).

Table 9 presents the privacy-performance balance on the GSE72056 dataset. While earlier methods achieved decent accuracy, their performance dropped sharply when noise or perturbations were introduced. The proposed model delivered the highest accuracy with the least performance decline, helped by adaptive noise injection and stable feature selection.

Table 10 describes the trade-offs in the TCGA-BRCA dataset. Existing methods showed moderate resilience under noise, but the proposed model kept accuracy above 90% and handled perturbations better, with minimal performance dip. This behavior is supported by the robustness of TabNet’s attention gating and CNN’s local feature encoding.

Table 11 shows results for the ENCODE epigenomic dataset. The proposed model outperformed others, achieving high accuracy while maintaining privacy. Its performance drop under perturbation remained below 3.5%, which reflects the adaptive design of the noise layer and consistent behavior of the hybrid architecture under varied input quality. These results reflect that the proposed approach strikes a better balance between accuracy and privacy, without compromising interpretability or robustness.

Table 12 Privacy–utility trade-off analysis on GSE72056 dataset.
Table 13 Privacy–utility trade-off analysis on TCGA-BRCA dataset.
Table 14 Privacy–utility trade-off analysis on ENCSR000AED dataset.

Table 12 shows the privacy–utility balance achieved by various models on the GSE72056 dataset. While previous works such as those by Zhu et al. and Abbas et al. applied conventional privacy techniques like anonymization and differential privacy (DP), these often led to reduced prediction quality, especially for genes with low expression. In contrast, the proposed framework introduced adaptive DP combined with feature masking, which helped maintain high classification performance without exposing sensitive patient information.

Table 13 presents results on the TCGA-BRCA dataset. Earlier models used encryption, pseudonymization, or simple DP techniques, but these introduced trade-offs such as the loss of interpretability or poor retention of gene-level patterns. The proposed model used saliency-based masking guided by TabNet attention, allowing the system to protect informative features while still keeping prediction performance high.

Table 14 focuses on the ENCSR000AED dataset. Traditional methods like scrambling and obfuscation often degraded the quality of regulatory signal analysis. The proposed model handled these better by combining adaptive noise injection with controlled dropout, leading to better utility preservation under privacy constraints.

Across all datasets, the proposed system handled privacy-sensitive data more effectively than existing models. Its attention-guided masking strategy and adaptable noise schemes allowed strong performance while respecting privacy needs in biomedical applications.

Model interpretability and biological relevance

The proposed framework includes attention mechanisms and convolutional layers, which help to make the decision process easier to understand. Figures 10a, 11a, and 12a show the attention score distributions for the TCGA-BRCA, GSE72056, and ENCSR000AED datasets respectively. In the GSE72056 dataset (Fig. 11a), high attention weights are observed for genes like CD8A, PDCD1, and FOXP3. These markers are commonly involved in immune response, which aligns with the immune-rich nature of the melanoma samples in the dataset. In the ENCSR000AED dataset (Fig. 12a), the top-scoring features include GATA1 and STAT5A, which play important roles in transcription regulation and epigenomic modulation. In the TCGA-BRCA dataset (Fig. 10a), the model focuses strongly on TP53, BRCA1, and HER2, genes that are well studied in the context of breast cancer biology.

Figures 10b, 11b, and 12b show the corresponding CNN activation maps for each dataset. These visualizations highlight spatial activations across convolutional layers that contribute to the final classification output. GSE72056 activations appear dense and centralized, indicating selective responses to localized gene patterns. The ENCSR000AED activation map is more dispersed, which may reflect broader epigenomic diversity. For TCGA-BRCA, the activations show clustered regions, which may correspond to highly discriminative genomic patterns observed in cancer subtypes.

Fig. 10. Interpretability analysis on TCGA-BRCA dataset.

As shown in Fig. 11a, the attention plot for the GSE72056 dataset gives higher importance to genes such as MS4A1, CD3D, and NKG7. These genes are associated with immune response activity and are often highlighted in single-cell RNA-seq analyses. The activation heatmap for the same dataset (Fig. 11b) shows strong activations in clustered regions, indicating the model’s focus on certain structured feature areas.

Fig. 11. Interpretability analysis on GSE72056 dataset.

For the ENCSR000AED dataset, Fig. 12a shows attention scores concentrated around key transcriptional regulators like RAD21, CTCF, and SMC3. These features are essential for genome organization and chromatin looping. The corresponding activation map (Fig. 12b) presents more distributed and intense activations, suggesting deeper involvement of spatial dependencies in the feature representation.

Fig. 12. Interpretability analysis on ENCSR000AED dataset.

These visualizations help explain how the model reacts to each genomic input. Attention plots reflect which features influenced the model’s decision most, while activation maps reveal the internal layer response patterns. This combination adds transparency to the proposed model and supports interpretation in the context of biological relevance.
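As an illustration of how these two views can be extracted, the sketch below assumes a fitted pytorch-tabnet TabNetClassifier (tabnet_clf), the convolutional submodule of the hybrid model (cnn, an nn.Sequential), and a batch of refined features (refined_batch); all three names are placeholders for this example.

```python
import torch

# (1) Attention-based importance from TabNet's sparse masks.
importances = tabnet_clf.feature_importances_   # one aggregated score per input feature
top_genes = importances.argsort()[::-1][:20]    # indices of the 20 top-ranked genes

# (2) CNN activation map captured with a forward hook on a convolutional layer.
activations = {}
def hook(module, inputs, output):
    activations["conv"] = output.detach()       # (batch, filters, positions)

handle = cnn[0].register_forward_hook(hook)     # hook the first conv layer
_ = cnn(refined_batch.unsqueeze(1))             # forward pass on refined features
handle.remove()
heatmap = activations["conv"].mean(dim=1)       # average over filters for plotting
```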

Ablation study

To better understand the role of each part in the proposed framework, an ablation study was conducted. This experiment isolates the impact of individual components by testing different combinations and observing performance changes. Four model configurations were evaluated on all datasets:

  • CNN-only model: The network used only convolutional layers for feature extraction and classification, without TabNet or the AFR module.

  • TabNet-only model: The setup relied solely on the TabNet block for feature selection and decision-making, without convolutional layers or AFR.

  • TabNet + CNN (without AFR): The two main modules worked together, but the AFR layer was excluded.

  • Proposed model (TabNet + AFR + CNN): The complete framework as described in the main architecture, including AFR to refine informative features and suppress noise.

The results are shown in Table 15. For the TCGA-BRCA dataset, the CNN-only model reached 86.2% accuracy, and the TabNet-only variant achieved 87.5%. The TabNet + CNN model without AFR improved performance to 89.1%. When AFR was added, the full model reached 90.8%. Similar gains were observed in GSE72056 and ENCSR000AED, where AFR consistently added 1.5–2% improvement in accuracy.

This comparison shows that the AFR module plays a meaningful role by refining features before classification. The attention maps and activation plots in Figure 10a and b further highlight that AFR strengthens signals from key genomic regions while reducing noise interference, which supports the quantitative improvements observed.

Table 15 Ablation study: accuracy comparison across variants.

Table 16 presents a detailed comparison of the proposed framework with recent DL models. The hybrid TabNet–CNN demonstrates superior performance in all evaluation metrics.

Table 16 Comparative performance with existing models.

Practical implications

The interpretability and stability of the proposed model allow its integration into real-world clinical workflows. The model can identify potential genomic biomarkers, assist in disease subtype differentiation, and provide explainable recommendations for personalized treatment design. Its modular design also supports adaptation to future multi-omics datasets and edge-intelligent healthcare systems.

Dataset limitations and mitigation strategies

Although the model achieves promising results, genomic datasets such as TCGA-BRCA and GSE72056 exhibit class imbalance, redundancy, and noise. To mitigate these issues, stratified sampling and attention-based feature masking were applied during preprocessing. Future research will focus on extending this framework to multi-cohort validation and large-scale genomic repositories to enhance generalizability.

Conclusion and future scope

This study presents a hybrid deep learning framework for disease prediction using multi-omics datasets, including single-cell RNA-seq, gene expression, and epigenomic signals. The framework integrates the feature selection capability of TabNet with the spatial representation learning of convolutional neural networks, enabling effective modeling of high-dimensional biological data. Experimental evaluations conducted on multiple publicly available datasets indicate that the proposed framework achieves competitive performance with respect to accuracy, AUC-ROC, and F1-score when compared to recent state-of-the-art methods.

The use of attention-based feature selection and layer-wise activation analysis provides interpretable insights into model behavior, which may support downstream biological analysis and hypothesis generation. These interpretability mechanisms help identify influential features and interactions across different omics layers, contributing to improved transparency in model predictions.

The results suggest that combining attention-driven tabular modeling with deep spatial feature extraction can enhance predictive performance across heterogeneous genomic data sources. The framework demonstrates stable learning behavior under varying noise levels and data distributions within the evaluated datasets. Its architecture is designed to handle high-dimensional and sparse inputs while reducing reliance on extensive manual feature engineering. However, the observed performance gains are specific to the datasets and experimental settings considered in this study and may vary under different biological or clinical conditions.

Future research directions include optimizing the network architecture to reduce computational overhead and training time while maintaining predictive accuracy. The exploration of lightweight variants may support deployment in resource-constrained environments. Extending the framework to privacy-preserving learning paradigms, such as federated learning, may enable collaborative model training across decentralized biomedical institutions. Additional work may also investigate the integration of complementary clinical modalities, including medical imaging, pathology reports, and longitudinal patient records. Advanced attention fusion strategies could further improve cross-omics interpretability and feature interaction analysis.

Limitations of the proposed framework

Despite its effectiveness, the proposed framework has several limitations. The training process involves substantial computational and memory requirements, particularly during the integration of TabNet-derived representations with convolutional layers. This may limit applicability in low-resource or real-time clinical settings. While the framework demonstrates consistent performance on selected public datasets, its generalization to unseen populations, rare disease cohorts, or datasets with strong batch effects and demographic variability remains uncertain. Performance may be affected when applied to cross-population studies or data collected under heterogeneous experimental conditions. In scenarios involving extremely sparse genomic features or limited sample sizes, such as early-stage clinical studies or rare disease research, the model may be susceptible to reduced stability and potential overfitting. In such cases, additional regularization strategies, transfer learning, or data augmentation techniques may be required to maintain reliable performance. Finally, although the framework provides visually interpretable attention maps and activation patterns, meaningful biological interpretation often requires domain expertise. Attention-based explanations alone may not fully capture complex regulatory mechanisms, particularly in epigenomic analyses, and should be interpreted in conjunction with established biological knowledge.