Introduction

Breast cancer remains the most common malignancy among women worldwide and the second leading cause of cancer-related mortality1. The epidermal growth factor receptor (EGFR), a transmembrane glycoprotein involved in cell proliferation, survival, and differentiation, plays a pivotal role in breast cancer progression2. Figure 1 shows a structural depiction of EGFR.

Fig. 1

Structural model of the epidermal growth factor receptor (EGFR), shown as a ribbon diagram with distinct functional regions highlighted in different colors. The ligand-binding domain (yellow–orange) facilitates interaction with epidermal growth factor, while the kinase domain (blue–cyan) contains the catalytic loop responsible for phosphorylation activity. The juxtamembrane segment (red–green) plays a regulatory role in signal transduction, and the magenta loop represents a key motif involved in conformational changes upon activation. This color coding enables clear visualization of EGFR's structural organization and functional domains. Image adapted from Wikipedia (https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor), licensed under CC BY-SA 4.0

Under physiological conditions, EGFR signaling is tightly regulated; however, in many breast cancer subtypes, EGFR is overexpressed or aberrantly activated, leading to uncontrolled cell growth, increased metastatic potential, and poor prognosis3.

EGFR has emerged as an important therapeutic target, with treatments including small-molecule tyrosine kinase inhibitors (e.g., gefitinib, erlotinib) and monoclonal antibodies (e.g., cetuximab, panitumumab)4. While these therapies have demonstrated clinical benefit, challenges such as drug resistance, tumor heterogeneity, and lack of robust predictive biomarkers limit their long-term efficacy5. Accurate and early identification of EGFR status in breast cancer patients is therefore essential to optimize treatment strategies. Traditional diagnostic techniques such as immunohistochemistry (IHC)6 and fluorescence in situ hybridization (FISH)7, though widely used, are labor-intensive, subjective, and dependent on tissue quality.

In recent years, several computational and machine learning approaches have been applied to EGFR-related research in cancer. For example, studies have employed supervised screening combined with structural biology techniques to identify potent EGFR inhibitors8, while others have integrated machine learning with pharmacogenomics to rank potential breast cancer drugs and associated biomarkers9.

Comparative genomic and expression analyses of other prognostic markers, such as ANLN and KDR, have also provided insights that can complement EGFR-based investigations10. Moreover, stacking ensemble frameworks for predicting drug–drug synergy have demonstrated potential in designing combination therapies for EGFR-positive cancers11.

The application of deep learning in breast cancer diagnostics has also seen rapid growth. Several studies have focused on using multi-modal imaging to improve diagnostic accuracy. Recent work, such as a brief survey on deep learning schemes for multi-image modalities, highlights this trend12. Other research has explored how combining radiomics features from mammography with deep learning models and using networks that fuse features from different modalities can enhance diagnostic precision13,14. These studies collectively highlight the potential of deep learning to analyze complex, multi-source data for improved clinical outcomes.

Sequence-based predictors eliminate the need for costly wet-lab procedures and allow rapid, large-scale screening. Sequence-based predictors using ML and DL have been developed for many protein identification problems, such as druggable proteins15, angiogenic proteins16, and globular proteins17. However, to the best of our knowledge, no sequence-based computational model has been constructed for EGFR identification.

Considering these limitations and gaps, we propose an advanced DL model, namely Ensemble Residual Convolutional Neural Network (ERCNN), that integrates ensemble learning with the principles of residual and convolutional neural networks. The major contribution points are listed below.

  • Developed a novel computational predictor for the fast and accurate identification of EGFR.

  • Constructed a unique primary sequence-based dataset for training and testing, which serves as a valuable resource for advancing cancer research and drug discovery.

  • Trained the model using ERCNN, a new architecture that integrates ensemble learning with residual and convolutional neural networks.

  • The model was trained using a multi-perspective feature set: KSCTD, AmpPseAAC, CDT, and ProtBERT-BFD.

The graphical view of the applied methods is shown in Fig. 2.

Fig. 2

Stepwise workflow of ERCNN-EGFR. The process involves: (1) dataset preparation: raw protein sequences are collected, processed with the CD-HIT tool for similarity reduction, and split into training and testing sets; (2) feature extraction and model development: sequences are tokenized and embedded using ProtBERT, then passed through multiple residual CNN models to form an ensemble; (3) model evaluation and prediction: the ERCNN model is evaluated using performance metrics, and final predictions classify sequences as EGFR or non-EGFR.

Materials and methods

Datasets

To construct a robust dataset for our innovative predictor, we first sourced sequences of both EGFR and non-EGFR from the UniProt database18. After collection, we applied several filtering criteria to ensure high quality. We used the CD-HIT tool19 to remove sequences with more than 25% similarity, preventing redundancy. Additionally, we excluded sequences shorter than 50 residues or those containing unknown characters. This process resulted in a final dataset of 1092 EGFR and 1143 non-EGFR sequences.

The finalized dataset was then randomly partitioned into training and testing subsets to evaluate the model’s performance. The training set, used to train our model, consists of 845 EGFR and 858 non-EGFR sequences. The testing set, which contains 247 EGFR and 285 non-EGFR sequences, was reserved exclusively to assess the model’s ability to generalize to new, unseen data.
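For concreteness, the following is a minimal sketch of the filtering and random-split steps described above, applied after CD-HIT redundancy reduction. The file names, random seed, and use of stratification are illustrative assumptions, so the reported 1703/532 partition is reproduced only approximately.

```python
# Sketch of dataset filtering and train/test partitioning.
# File names and the random seed are illustrative assumptions.
from sklearn.model_selection import train_test_split

VALID = set("ACDEFGHIKLMNPQRSTVWY")  # exclude unknown characters (X, B, Z, ...)

def load_fasta(path):
    """Parse a FASTA file into a list of sequences."""
    seqs, cur = [], []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if cur:
                    seqs.append("".join(cur))
                cur = []
            elif line:
                cur.append(line)
    if cur:
        seqs.append("".join(cur))
    return seqs

def clean(seqs):
    """Drop sequences shorter than 50 residues or with unknown characters."""
    return [s for s in seqs if len(s) >= 50 and set(s) <= VALID]

pos = clean(load_fasta("egfr_cdhit.fasta"))      # 1092 EGFR sequences
neg = clean(load_fasta("non_egfr_cdhit.fasta"))  # 1143 non-EGFR sequences

X = pos + neg
y = [1] * len(pos) + [0] * len(neg)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=532 / 2235, stratify=y, random_state=42)
```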

Feature representative approaches

Feature extraction is the process of converting the primary sequences of proteins into numerical form that can be processed by a machine/deep learning model20. We use four feature extraction methods in our study: KSCTD, AmpPseAAC, CDT, and ProtBERT-BFD.

Amphiphilic Pseudo amino acid composition

AmpPseAAC was developed by Kuo-Chen Chou in 200521. The amphiphilic pseudo amino acid composition (AmpPseAAC) represents protein sequences using a set of numerical descriptors that capture the physicochemical characteristics and amino acid composition of the protein. AmpPseAAC expands on PseAAC22 by adding information about the amino acids' amphiphilicity, i.e., the balance between hydrophobic and hydrophilic character: water repels hydrophobic residues while attracting hydrophilic ones. AmpPseAAC has been shown to be effective in several bioinformatics applications, including identification of enzyme subfamily classes23, prediction of apoptosis proteins24, and classification of human protein subcellular locations25.
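A hedged sketch of the AmpPseAAC computation is shown below. The Kyte–Doolittle hydrophobicity and Hopp–Woods hydrophilicity scales are used here purely for illustration; Chou's original formulation prescribes specific normalized scales, and the λ and weight values are assumptions.

```python
# Sketch of amphiphilic PseAAC: 20 composition terms plus 2*lam
# amphiphilic sequence-order correlation factors. Property scales,
# lam, and w are illustrative assumptions.
import numpy as np

AA = "ARNDCQEGHILKMFPSTWYV"
KD = dict(zip(AA, [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
                   3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]))
HW = dict(zip(AA, [-0.5, 3.0, 0.2, 3.0, -1.0, 0.2, 3.0, 0.0, -0.5, -1.8,
                   -1.8, 3.0, -1.3, -2.5, 0.0, 0.3, -0.4, -3.4, -2.3, -1.5]))

def standardize(scale):
    """Zero-mean, unit-variance normalization over the 20 amino acids."""
    vals = np.array(list(scale.values()))
    return {a: (v - vals.mean()) / vals.std() for a, v in scale.items()}

H1, H2 = standardize(KD), standardize(HW)  # hydrophobicity, hydrophilicity

def amp_pseaac(seq, lam=5, w=0.5):
    """Return the (20 + 2*lam)-dimensional AmpPseAAC feature vector."""
    L = len(seq)
    aac = np.array([seq.count(a) / L for a in AA])
    taus = []
    for j in range(1, lam + 1):  # correlations at sequence gap j
        taus.append(np.mean([H1[seq[i]] * H1[seq[i + j]] for i in range(L - j)]))
        taus.append(np.mean([H2[seq[i]] * H2[seq[i + j]] for i in range(L - j)]))
    taus = np.array(taus)
    return np.concatenate([aac, w * taus]) / (1.0 + w * taus.sum())
```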

K-Spaced conjoint triad

KSCTD is a powerful tool for representing and analyzing biological sequences26. As a member of the sequence-based feature extraction family, it plays a crucial role in describing the functional and structural properties of biomolecules27.

The KSCTD algorithm is rooted in the concept of triads, which are sequences of three consecutive amino acids. These triads represent local structural motifs that are fundamental to protein folding and function28. By considering the spatial arrangement of these triads through the concept of “k-spacing,” KSCTD provides a more detailed and informative representation of sequences compared to traditional methods29. This approach allows for the capture of long-range interactions between amino acids, which are often crucial for protein function30.
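The following sketch shows one plausible formalization of a k-spaced conjoint triad descriptor, using the classical seven-group amino acid partition; the exact spacing convention and group definitions used in this work may differ.

```python
# Sketch of a k-spaced conjoint triad descriptor: amino acids are mapped
# to 7 physicochemical groups, and group triads with gap k between
# consecutive residues are counted for k = 0..k_max.
GROUPS = {**{a: 0 for a in "AGV"}, **{a: 1 for a in "ILFP"},
          **{a: 2 for a in "YMTS"}, **{a: 3 for a in "HNQW"},
          **{a: 4 for a in "RK"}, **{a: 5 for a in "DE"}, "C": 6}

def ksctd(seq, k_max=3):
    """Concatenate 7^3 triad-frequency vectors for gaps k = 0..k_max."""
    feats = []
    for k in range(k_max + 1):
        counts = [0.0] * (7 ** 3)
        step, n = k + 1, 0
        for i in range(len(seq) - 2 * step):
            g1 = GROUPS[seq[i]]
            g2 = GROUPS[seq[i + step]]
            g3 = GROUPS[seq[i + 2 * step]]
            counts[g1 * 49 + g2 * 7 + g3] += 1
            n += 1
        feats.extend(c / max(n, 1) for c in counts)  # normalize per gap
    return feats
```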

Composition distribution transition

The notion of Composition Distribution Transition (CDT) is utilized in the fields of computational biology and bioinformatics31,32,33 and serves as an effective tool for analyzing protein sequences. By analyzing the distribution of compositional changes within a sequence, this statistical technique helps researchers understand sequence patterns, structural traits, and functional implications.

CDT examines how certain elements, such as the amino acids in proteins or the nucleotides in DNA, vary in relative frequency along a sequence. This can provide important details on structural motifs, functional domains, or sequence evolution34. The technique is particularly helpful for pinpointing sequence segments of interest where compositional fluctuations can point to structural or functional alterations. For CDT analysis, a sequence is usually segmented into smaller regions, and the compositional variations between these regions are then quantified, for instance by evaluating differences in amino acid usage. Such compositional variations may indicate biological features such as conserved DNA sections, binding sites, or protein domains35.
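As an illustration, the sketch below computes composition, transition, and distribution descriptors for a single physicochemical property, following the classic CTD scheme; the three-class hydrophobicity grouping is a common convention assumed here for illustration.

```python
# Sketch of C/T/D descriptors for one property (hydrophobicity,
# three classes). The grouping is an assumed, commonly used convention.
GROUP = {**{a: 0 for a in "RKEDQN"},    # polar
         **{a: 1 for a in "GASTPHY"},   # neutral
         **{a: 2 for a in "CLVIMFW"}}   # hydrophobic

def ctd(seq):
    enc = [GROUP[a] for a in seq]
    L = len(enc)
    # Composition: fraction of residues in each class.
    comp = [enc.count(g) / L for g in range(3)]
    # Transition: frequency of adjacent residues switching class.
    pair_idx = {(0, 1): 0, (0, 2): 1, (1, 2): 2}
    trans = [0, 0, 0]
    for a, b in zip(enc, enc[1:]):
        if a != b:
            trans[pair_idx[tuple(sorted((a, b)))]] += 1
    trans = [t / (L - 1) for t in trans]
    # Distribution: relative positions of the first, 25%, 50%, 75%,
    # and last occurrence of each class.
    dist = []
    for g in range(3):
        pos = [i + 1 for i, e in enumerate(enc) if e == g]
        if not pos:
            dist.extend([0.0] * 5)
            continue
        for frac in (0.0, 0.25, 0.50, 0.75, 1.0):
            idx = 0 if frac == 0 else max(0, round(frac * len(pos)) - 1)
            dist.append(pos[idx] / L)
    return comp + trans + dist  # 3 + 3 + 15 = 21 features per property
```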

ProtBERT-BFD

ProtBERT-BFD is a large-scale protein language model built upon the Bidirectional Encoder Representations from Transformers (BERT) architecture, tailored specifically for protein sequence modeling. It is pretrained on the Big Fantastic Database (BFD), which contains over 2.1 billion protein sequences, making it one of the most extensive protein corpora ever used for model development. ProtBERT-BFD aims to produce rich, contextual embeddings that capture the evolutionary, structural, and functional characteristics of proteins directly from their amino acid sequences.

The model is trained using self-supervised learning with the masked language modeling strategy. In this approach, a fixed percentage (typically 15%) of amino acids in each sequence is randomly masked, and the model is tasked with predicting these masked residues from the surrounding context. By using a bidirectional attention mechanism, ProtBERT-BFD simultaneously incorporates information from both upstream and downstream positions, enabling it to learn long-range dependencies that are critical for understanding protein folding, structural domains, and functional sites. Unlike traditional profile-based approaches such as PSSMs, this model learns directly from raw sequences, avoiding the need for computationally intensive multiple sequence alignments.
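As an illustration of how such embeddings can be obtained in practice, the following sketch uses the publicly available Rostlab/prot_bert_bfd checkpoint via Hugging Face Transformers; mean-pooling the residue embeddings into a fixed-length vector is a common choice assumed here and may differ from the exact pooling used in this work.

```python
# Sketch of per-sequence embedding extraction with ProtBERT-BFD.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd",
                                          do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert_bfd").eval()

def embed(seq: str) -> torch.Tensor:
    # ProtBERT expects space-separated residues; rare residues map to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", seq))
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, L+2, 1024)
    # Drop [CLS]/[SEP] tokens and average the residue embeddings.
    return hidden[0, 1:-1].mean(dim=0)  # (1024,)
```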

ProtBERT-BFD offers several advantages over other protein language models, including UniRep and ProtTrans-T5. One major strength lies in its extensive training dataset: the model is pretrained on 2.1 billion protein sequences, providing broad coverage of the protein sequence space and enabling better generalization to rare protein families and uncharacterized domains.

Another advantage is its bidirectional context capture, which allows the model to consider dependencies in both sequence directions simultaneously. This is in contrast to unidirectional approaches, which only model context in one direction, potentially missing important long-range interactions. Additionally, ProtBERT-BFD provides an alignment-free representation, eliminating the need for multiple sequence alignments. This significantly reduces preprocessing time and makes the model more scalable for large-scale bioinformatics studies.

Model training and prediction

During this stage, model training is carried out. Various deep learning frameworks were employed, including BiLSTM, ERCNN, GAN, and GRU. The best performance was secured by the ERCNN-based model, which is elaborated in the following subsection.

Ensemble residual convolutional neural network

ERCNN integrates ensemble learning with the principles of residual and convolutional neural networks36,37. This framework is designed to improve model performance. The Residual Neural Network (ResNet) is a pivotal component of the ensemble residual CNN: ResNet introduces shortcut connections to address the problem of vanishing gradients in CNNs38. By enabling information to move through the network more directly, these shortcut connections reduce the risk that performance deteriorates as the network gets deeper. Ensemble learning is a machine learning method that makes use of several models to increase prediction accuracy39. ERCNN enhances model performance by combining multiple ResNet-CNN models into an ensemble framework. The combination of sub-models helps mitigate overfitting and capture a broader range of features; this improved generalization is particularly advantageous when dealing with small datasets. Combining predictions from multiple sub-models leads to a more robust final decision, leveraging the consensus across several networks for increased reliability40.

Keeping in view the advantages above, we implemented the ensemble residual CNN. This work constructs the final model from two types of residual building block (RBB), i.e., RBB-1 and RBB-2. Both RBBs have three convolutional and batch normalization layers. Before each convolutional layer, a padding layer is added to preserve the spatial dimensions of the input features during the convolution operation; the input data are padded with a layer of zeros around the edges. Each convolutional layer was tested with 32, 64, 128, and 256 filters, and the model achieved strong performance with 64 filters. Similarly, a kernel size of 3 is applied for all convolutional layers, consistent with the zero padding used to preserve spatial dimensions. The shortcut in RBB-1 is the identity mapping \(x\), while the shortcut in RBB-2 contains one convolutional and one batch normalization layer. Both RBBs are represented in Fig. 3. RBB-1 is computed as:

$$y=F\left(x\right)+x$$
(1)

where \(F\) is the nonlinear function for the convolutional path in RBB-1.

RBB-2 is expressed as:

$$y=F\left(x\right)+H\left(x\right)$$
(2)

where \(H\) is the shortcut path.

Dropout layers are incorporated after the fully connected layers to enhance classification performance. The output layer contains a flatten layer and a sigmoid layer. The flatten layer transforms features into a vector shape. The sigmoid layer outputs the probability of each possible outcome. Other hyperparameters that we implemented in our model are listed in Table 1.
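To make the block structure concrete, the following is a minimal Keras sketch of RBB-1 and RBB-2 under the stated configuration (three Conv1D + BatchNormalization stages with zero padding and 64 filters). The ReLU placement and unit stride are assumptions made so that the residual additions in Eqs. (1) and (2) are shape-compatible; this is a sketch, not the exact implementation.

```python
# Sketch of the two residual building blocks. For the identity shortcut
# in RBB-1, the input must already carry 64 channels.
from tensorflow.keras import layers

def conv_bn(x, filters=64):
    x = layers.ZeroPadding1D(padding=1)(x)        # zero-pad the edges
    x = layers.Conv1D(filters, kernel_size=3)(x)  # length-preserving conv
    return layers.BatchNormalization()(x)

def rbb1(x):
    """RBB-1: identity shortcut, y = F(x) + x (Eq. 1)."""
    y = x
    for _ in range(3):
        y = layers.ReLU()(conv_bn(y))
    return layers.Add()([y, x])

def rbb2(x):
    """RBB-2: projection shortcut H(x), y = F(x) + H(x) (Eq. 2)."""
    y = x
    for _ in range(3):
        y = layers.ReLU()(conv_bn(y))
    h = conv_bn(x)  # one convolution + batch normalization on the shortcut
    return layers.Add()([y, h])
```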

Fig. 3

Structural representation of RBB-1 and RBB-2.

Table 1 ERCNN model hyperparameters.

Feature selection approach

Feature selection is a crucial process in machine learning that involves choosing a subset of relevant features to use in model construction. The primary goal is to simplify the model, improve its performance, reduce training time, and make it more interpretable41. This is especially important in high-dimensional datasets where many features may be redundant or irrelevant. By removing noise and focusing on the most informative features, feature selection helps prevent overfitting and enhances the model’s generalization ability. In this work, we used XGBoost-Feature Forward Selection (XGBoost-FFS) approach.

XGBoost-FFS is an advanced wrapper-based feature selection method that leverages the power of the XGBoost (Extreme Gradient Boosting) algorithm. Unlike filter methods that rank features independently of the model, wrapper methods evaluate subsets of features by training and testing a model on them. XGBoost-FFS is a greedy, iterative process that progressively builds an optimal feature set. It starts with an empty set and, in each step, adds the single feature that most improves the XGBoost model's performance. This process continues until a predetermined number of features is selected.

The process begins with Initialization, where an empty set is created to hold the selected features. Next, during the Iteration phase, the algorithm loops through all features that are not yet in the set and temporarily adds each one. For each of these temporary sets, the Evaluation step occurs: an XGBoost model is trained and its performance is measured using a metric like accuracy. After all options are evaluated, the Selection step identifies and permanently adds the single feature that yielded the best performance improvement. This entire process is then repeated until the desired number of features has been chosen. Although this method is computationally demanding because it trains a model in every loop, it is highly effective at identifying the most predictive features specifically for an XGBoost model. In this work, XGBoost-FFS selected the best feature subsets of 116, 18, 21, and 234 features from KSCTD, AmpPseAAC, CDT, and ProtBERT-BFD, respectively. A sketch of this loop is given below.
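The following is a hedged sketch of that greedy loop using xgboost's scikit-learn interface; the cross-validation setup, accuracy metric, and model hyperparameters are assumptions.

```python
# Sketch of XGBoost-based forward feature selection (XGBoost-FFS).
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def xgboost_ffs(X, y, n_select):
    """Greedily grow a feature set of size n_select, one feature at a time."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        best_feat, best_score = None, -np.inf
        for f in remaining:
            cols = selected + [f]
            clf = XGBClassifier(n_estimators=100, eval_metric="logloss")
            # Evaluate the candidate subset with cross-validated accuracy.
            score = cross_val_score(clf, X[:, cols], y, cv=5,
                                    scoring="accuracy").mean()
            if score > best_score:
                best_feat, best_score = f, score
        selected.append(best_feat)   # keep the best-performing addition
        remaining.remove(best_feat)
    return selected
```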

Model assessment

To ensure the reliability of our novel approach, we examined it using standard assessment methods. The most extensively used method in bioinformatics is 10-fold cross-validation42,43,44,45, in which the data are divided into ten folds: the model is trained on nine folds and tested on the remaining one, and this is repeated ten times so that each fold serves once as the test set. The final performance is taken as the mean of the ten outcomes46. Additionally, we assess the model using metrics such as MCC, sensitivity (Sn), specificity (Sp), and accuracy (Acc)47,48. These parameters are formulated from the confusion matrix and computed using the following equations.

$$Acc=1-\frac{{EG}_{-}^{+}+{EG}_{+}^{-}}{{EG}^{+}+{EG}^{-}}$$
(3)
$$Sn=1-\frac{{EG}_{-}^{+}}{{EG}^{+}}$$
(4)
$$Sp=1-\frac{{EG}_{+}^{-}}{{EG}^{-}}$$
(5)
$$MCC=\frac{1-\left(\frac{{EG}_{-}^{+}}{{EG}^{+}}+\frac{{EG}_{+}^{-}}{{EG}^{-}}\right)}{\sqrt{\left(1+\frac{{EG}_{+}^{-}-{EG}_{-}^{+}}{{EG}^{+}}\right)\left(1+\frac{{EG}_{-}^{+}-{EG}_{+}^{-}}{{EG}^{-}}\right)}}$$
(6)

Here, \(EG^{+}\) and \(EG^{-}\) denote the total numbers of EGFR and non-EGFR samples investigated, while \(EG_{-}^{+}\) is the number of EGFR samples mistakenly predicted as non-EGFR and \(EG_{+}^{-}\) the number of non-EGFR samples mistakenly predicted as EGFR.
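For reference, the sketch below computes Eqs. (3)–(6) directly from confusion-matrix counts, with \(EG_{-}^{+}\) and \(EG_{+}^{-}\) playing the roles of false negatives and false positives.

```python
# Sketch: metrics from confusion-matrix counts (tp, tn, fp, fn).
import math

def metrics(tp, tn, fp, fn):
    pos, neg = tp + fn, tn + fp           # EG+ and EG- (class totals)
    acc = 1 - (fn + fp) / (pos + neg)     # Eq. (3)
    sn = 1 - fn / pos                     # Eq. (4)
    sp = 1 - fp / neg                     # Eq. (5)
    mcc = (1 - (fn / pos + fp / neg)) / math.sqrt(
        (1 + (fp - fn) / pos) * (1 + (fn - fp) / neg))  # Eq. (6)
    return acc, sn, sp, mcc
```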

Results and discussion

Results analysis of the DL algorithms using training set

The performance obtained by various DL models prior to feature selection is presented in Table 2. Using CDT, the BiLSTM model achieves 57.40% Acc, 65.48% Sn, and 49.47% Sp, with an MCC of 0.276, indicating moderate predictive ability. The GAN model with CDT performs slightly worse overall, with 53.13% Acc, 67.57% Sn, 38.92% Sp, and an MCC of 0.389. The GRU model demonstrates stronger performance on CDT, reaching 69.40% Acc, 77.48% Sn, and 61.47% Sp, with an MCC of 0.396. The ERCNN model secures the best CDT-based performance, with 72.69% Acc, 88.62% Sn, 57.03% Sp, and an MCC of 0.485, showcasing its effectiveness in learning robust sequence features.

The BiLSTM model on AmpPseAAC attains only 52.38% Acc, with high sensitivity (77.86%) but very low specificity (27.42%), resulting in an MCC of 0.103. GAN improves on this with 54.61% Acc, 68.76% Sn, 40.85% Sp, and an MCC of 0.465. GRU generated stronger outcomes, with 72.46% Acc, 83.40% Sn, 61.70% Sp, and an MCC of 0.467. ERCNN achieves the highest results for AmpPseAAC, with 78.32% Acc, 74.31% Sn, 82.26% Sp, and an MCC of 0.596, confirming its consistent superiority.

With KSCTD features, BiLSTM attains 59.24% Acc, 66.39% Sn, and 52.15% Sp, with an MCC of 0.228. GAN performs slightly better, reaching 62.91% accuracy, but suffers from poor specificity (37.45%) despite high sensitivity (87.86%), resulting in a low MCC of 0.201. GRU with KSCTD attains 85.54% Acc, 88.88% Sn, 82.25% Sp, and an MCC of 0.732. ERCNN further enhances performance, achieving 87.84% Acc, 91.59% Sn, 84.13% Sp, and an MCC of 0.761, underscoring the power of ensemble and residual learning in extracting meaningful sequence-level features.

Table 2 Performance of DL frameworks before feature selection.

ProtBERT-BFD embeddings produce the most promising results overall. BiLSTM secures 75.46% Acc, 76.89% Sn, 74.65% Sp, and an MCC of 0.641. GAN improves slightly, to 77.24% accuracy and an MCC of 0.663. The GRU model further boosts performance with 89.28% Acc, 89.31% Sn, 89.15% Sp, and an MCC of 0.774. ERCNN obtains the best ProtBERT-BFD results, with 91.15% Acc, 90.27% Sn, 88.95% Sp, and an MCC of 0.795, making it the best performer across all descriptors.

Results analysis of the DL algorithms after feature selection approach

The performance of the deep learning frameworks after feature selection is summarized in Table 3. Using CDT, the BiLSTM model achieves an Acc of 59.56%, Sn of 60.76%, and Sp of 57.64%, with an MCC of 0.402, indicating modest classification ability. GAN performs slightly worse, with 56.35% Acc and an MCC of 0.367. GRU shows a progressive improvement, achieving 71.57% Acc, 71.27% Sn, and 71.90% Sp, with an MCC of 0.543. ERCNN secures the best CDT-based results, with 74.68% Acc, 75.76% Sn, 73.37% Sp, and an MCC of 0.575, highlighting its superior learning of discriminative sequence patterns.

BiLSTM using AmpPseAAC shows lower performance, with 54.57% accuracy and an MCC of 0.346. GAN achieves slightly higher values (55.57% accuracy, 55.36% sensitivity, 55.85% specificity, 0.357 MCC). The GRU model significantly outperforms both BiLSTM and GAN, securing 74.68% Acc, 73.35% Sn, 75.33% Sp, and an MCC of 0.576. ERCNN consistently generated the highest results on AmpPseAAC, producing 79.89% Acc, 80.99% Sn, 78.28% Sp, and an MCC of 0.622, confirming its robustness in feature learning.

Table 3 Performance of DL frameworks after feature selection.

When considering KSCTD, BiLSTM achieves 60.36% Acc with balanced Sn (60.42%) and Sp (60.24%), leading to an MCC of 0.415. The GAN model improves to 64.76% Acc and an MCC of 0.451. GRU demonstrates remarkable performance with 87.33% Acc, 88.95% Sn, 87.02% Sp, and an MCC of 0.764. ERCNN further enhances the results, obtaining 88.25% accuracy, 90.13% sensitivity, 87.93% specificity, and the highest MCC for KSCTD (0.772).

ProtBERT-BFD embeddings yield the best performance across all descriptors. BiLSTM achieves 77.12% Acc, 77.71% Sn, 77.34% Sp, and an MCC of 0.664. GAN improves further, with 79.44% accuracy and an MCC of 0.688, while GRU generated 91.67% Acc, 90.87% Sn, 90.89% Sp, and an MCC of 0.793. ERCNN obtains the best performance, with 93.48% accuracy, 94.53% sensitivity, 92.58% specificity, and an MCC of 0.816. These findings confirm that ProtBERT-BFD embeddings, when combined with ERCNN, yield the most generalizable representation for classification.

Results analysis of the ML frameworks on the best feature selection approach

We have extended our experimentation to include several ML classifiers, with the results presented in Table 4. These classifiers were trained on the best feature set selected by XGBoost-FFS. The RF model, trained on these selected features, demonstrated a solid baseline performance, achieving an accuracy of 80.68% and an MCC of 0.684. ERT slightly outperformed RF, yielding higher accuracy (83.65%), specificity (84.27%), and MCC (0.709). AdaBoost showed even better performance, surpassing both RF and ERT with an accuracy of 85.72% and an MCC of 0.735, indicating its superior ability to handle the classification task.

Our proposed model, ERCNN-EGFR, consistently exhibited remarkable performance. With an accuracy of 88.56%, it outperformed all the machine learning classifiers. Notably, the model achieved a high sensitivity of 89.78%, demonstrating its strong capability to correctly identify EGFR-positive instances, and a specificity of 87.29%, confirming robust performance in correctly classifying EGFR-negative instances. The high MCC of 0.773 further validates the model's performance, confirming its efficacy in accurately distinguishing between EGFR and non-EGFR.

Table 4 Performance of ML models on the best feature selection approach.

Performance of the classifiers on the independent test set

We further examined the performance of the baseline classifiers on the independent test set to assess their generalization ability. The BiLSTM model demonstrated reasonable performance, achieving an Acc of 72.24%, Sn of 72.35%, Sp of 72.53%, and an MCC of 0.442. GAN showed lower performance, with an Acc of 66.15%, Sn of 67.26%, Sp of 65.26%, and an MCC of 0.407. GRU performed better than GAN, securing 69.17% Acc, 69.07% Sn, 69.97% Sp, and an MCC of 0.424; however, it remained below BiLSTM.

On the independent test set, the proposed ERCNN-EGFR demonstrated strong generalization, obtaining an accuracy of 82.85%, sensitivity of 81.49%, specificity of 83.04%, and an MCC of 0.652, outperforming the BiLSTM, GAN, and GRU models across all assessment parameters. These results highlight the superior predictive power of ERCNN-EGFR and its ability to capture discriminative features that generalize to unseen data. A thorough comparison of these classifiers is given in Table 5, which further demonstrates how well ERCNN-EGFR performs in predicting EGFR accurately.

Table 5 Results on the independent test Set.

As part of our model evaluation, we present the ROC curves for the ERCNN model on both the training and independent test datasets, as shown in Figs. 4 and 5, respectively. The ROC curve provides a comprehensive assessment of the model’s discriminative ability by illustrating the trade-off between true positive rate and false positive rate across different classification thresholds.

In our results, the ERCNN achieves an AUC of 0.973 on the training set and 0.904 on the independent test set. These strong AUC values indicate that the probability scores generated by the ERCNN model reliably distinguish between positive and negative classes in both training and unseen data. The smooth and consistently elevated curves, well above the diagonal random classifier line, further confirm robust separation and good classification capacity.

In addition to the ROC analysis, we have computed Precision-Recall AUC values for the ERCNN model as presented in Tables 2 through 5 of the manuscript. Precision-Recall AUC is particularly meaningful for imbalanced datasets, as it focuses on the trade-off between precision and recall. The high Precision-Recall AUC values further corroborate the ERCNN model’s strong performance and reliable probability outputs.

Fig. 4

ROC curve of the ERCNN on the training set.

Fig. 5

ROC curve of the ERCNN on the independent test set.

Ablation study analysis of the proposed model

To assess the contribution of different architectural components in ERCNN, we performed an ablation study using ProtBERT-BFD features (Table 6). The results demonstrate that each architectural enhancement meaningfully improves predictive performance.

The baseline Plain CNN achieved 90.80% Acc, 91.90% Sn, 89.60% Sp, a PR-AUC of 95.10%, and an MCC of 0.771. While these results confirm the model's capacity to capture local sequence patterns, they also reveal its limited generalization ability. Incorporating residual connections in the RCNN variant improved overall performance, achieving 91.50% Acc and an MCC of 0.784. This improvement highlights the effectiveness of residual learning in stabilizing deeper networks and mitigating vanishing gradient issues.

Further analysis of individual residual building blocks showed that both RBB-1 and RBB-2 enhanced model capability beyond Plain CNN and RCNN. The RBB-1 variant attained 92.20% Acc, 93.50% Sn, and the highest MCC among the single-block models (0.801), demonstrating its stronger capacity to capture discriminative sequence features. Similarly, the RBB-2 variant achieved 92.00% Acc, 92.80% Sn, and MCC of 0.797, underscoring its complementary role in refining representation learning.

The Full ERCNN, which integrates ensemble learning with both RBB-1 and RBB-2, consistently outperformed all ablated variants. It achieved the best results with 93.48% Acc, 94.53% Sn, 92.58% Sp, PR-AUC of 97.38%, and MCC of 0.816. These findings confirm that the synergistic integration of multiple residual blocks within an ensemble framework allows ERCNN to capture both local and global sequence-level features more effectively, thereby ensuring robust and reliable EGFR prediction.

Table 6 Ablation study of the proposed model.

Computational cost and deployment challenges

ERCNN-EGFR was designed as a lightweight residual ensemble framework, making it computationally more efficient than deeper transformer-based models such as ESM. The model was trained and tested on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB VRAM), 128 GB RAM, and an Intel Xeon Gold processor. Under this configuration, the average training time for ERCNN-EGFR was approximately 20–25 min, indicating feasibility for large-scale screening tasks.

In terms of expected hardware for deployment, the model can be efficiently run on a single modern GPU (e.g., NVIDIA A100/RTX series) or even optimized for CPU-based environments with slightly higher inference times (tens of milliseconds per sequence). This makes it scalable to high-throughput diagnostic pipelines.

However, we also acknowledge that in real-time diagnostic environments, challenges such as latency, throughput, and integration into clinical workflows must be carefully considered. High-throughput clinical pipelines often require efficient batch processing, seamless integration with hospital information systems, and potentially cloud-based deployment for distributed processing. To address this, the model can be further optimized using model compression, pruning, or knowledge distillation to reduce its memory footprint and further improve inference speed.

Conclusion and future directions

In this study, we developed ERCNN-EGFR, a novel deep learning–based predictor for the accurate identification of epidermal growth factor receptor (EGFR) proteins. By integrating ensemble learning with residual convolutional neural networks and leveraging multi-perspective feature representations, including CDT, AmpPseAAC, KSCTD, and ProtBERT-BFD, the proposed model demonstrated superior performance compared to conventional deep learning architectures such as BiLSTM, GRU, and GAN. Notably, ERCNN-EGFR achieved an accuracy of 87.84% on the training dataset and 82.85% on the independent testing dataset, underscoring its strong generalization ability. The ablation analysis further confirmed that the combination of ProtBERT-BFD embeddings and the dual residual building block design (RBB-1 and RBB-2) was central to the model's predictive strength. These architectural innovations enabled ERCNN-EGFR to capture both local and global sequence patterns more effectively than standard CNN ensembles or ResNet-based frameworks.

From a practical perspective, ERCNN-EGFR offers a scalable, cost-effective, and reliable computational tool that can complement existing laboratory-based EGFR detection methods such as IHC and FISH. Unlike conventional approaches or static variant databases, the model has the potential to identify previously uncharacterized EGFR variants, thereby accelerating the discovery of therapeutic targets and supporting precision oncology.

Our model was trained on a curated EGFR dataset, which may carry potential bias related to dataset composition and size. Furthermore, while ERCNN-EGFR achieved strong performance for EGFR identification in breast cancer, its generalization to other cancers or receptor families has not yet been validated. In future work, we plan to extend ERCNN-EGFR by integrating additional sources of biological information, such as structural features, omics-based datasets, and pathway-level annotations, to further enhance predictive performance. Moreover, we aim to evaluate the model on diverse cancer types and receptor families beyond EGFR to assess its generalizability across broader oncogenic contexts. Incorporating explainable AI techniques will also be a priority, enabling the identification of key sequence determinants that drive EGFR prediction and improving biological interpretability. Finally, we envision deploying ERCNN-EGFR as a user-friendly web server or software package to facilitate its accessibility for the wider research community in cancer biology and drug discovery.