Introduction

Oral cancer (OC) is among the most prevalent cancers worldwide and is characterized by late diagnosis, aggressive disease, and high death rates1. Tobacco use in any form and excessive alcohol consumption are critical risk factors for OC. Incidence is highest in South and Southeast Asia, driven largely by the consumption of betel quid, which generally contains slaked lime, areca nut, and betel leaves, often combined with tobacco2. These products are widely sold in inexpensive packets and are popular with the public owing to aggressive advertising strategies. OC is usually associated with delayed presentation, especially in low- and middle-income countries (LMICs), where more than two-thirds of patients present at advanced stages; as a consequence, survival rates are poor3. Treating cancer, particularly at advanced stages, is costly. The lack of awareness of OC among both the public and the medical profession is a significant cause of late recognition. Several imaging processes can capture images of oral lesions using various methods4. A binary-method image processor that incorporates autofluorescence and white-light imaging has been suggested to identify OC. Oral lesions presenting at an advanced stage have a detrimental effect on survival rates, with more than two-thirds of oral lesions being identified at advanced stages5. Lesion care is expensive, especially at the final stage. Late diagnosis of oral lesions is therefore a significant concern for medical staff. To enhance the early diagnosis of OC and reduce the impact of late diagnosis, it is vital to develop an automatic method for recognizing OC with minimal human intervention6.

Machine learning (ML) was established to help improve classifier accuracy in automatic methods. In particular, deep learning (DL) has been demonstrated to reduce the requirement for human contribution in the study of massive databases7. AI might address current challenges in diagnosing and estimating OC by decreasing the workload, difficulty, and exhaustion that complex procedures impose on doctors. It is a technical advance that has gathered significant attention from experts globally, as it emulates human cognitive proficiencies. In dentistry, its application is a comparatively recent development, but the outcomes are encouraging8. Oral squamous cell carcinoma (OSCC) remains a leading global health concern due to its often late diagnosis and high mortality rates. Addressing OSCC efficiently is challenging, specifically in areas with high incidence associated with particular lifestyle factors9. Improvements in early detection and accurate diagnosis are essential for improving patient outcomes. Implementing cutting-edge DL approaches can substantially improve diagnostic accuracy and assist in timely intervention, ultimately mitigating the impact of this disease. Enhancing diagnostic tools for OSCC is important, as early and precise detection can significantly improve survival rates and reduce the burden on healthcare systems, emphasizing the requirement for innovative techniques in cancer detection and assessment10.

This study presents a Squeeze-Excitation with Hybrid Deep Learning for Oral Squamous Cell Carcinoma Recognition (SEHDL-OSCCR) technique on histopathological images (HIs). The presented SEHDL-OSCCR technique mainly concentrates on detecting OC using hybrid DL methods. The bilateral filtering (BF) technique is employed to eliminate noise. Next, the SEHDL-OSCCR technique utilizes the SE-CapsNet model for feature extraction. An improved crayfish optimization algorithm (ICOA) is used to improve the performance of the SE-CapsNet model. Finally, OSCC classification is accomplished by designing a convolutional neural network with bidirectional long short-term memory (CNN-BiLSTM) technique. The simulation outcomes of the SEHDL-OSCCR technique are examined on a benchmark medical image dataset. The key contributions of the SEHDL-OSCCR technique are as follows.

  • The BF approach was employed to efficiently mitigate noise in the images, thus improving the overall quality of the image. This preprocessing step is significant for enhancing subsequent feature extraction and classification processes. The technique also confirms more precise and more dependable data for evaluation.

  • The SE-CapsNet method was implemented for advanced feature extraction and recognition, allowing more accurate detection of relevant image characteristics. This methodology substantially improves the approach’s capability to distinguish between diverse features. By employing the SE-CapsNet model, the evaluation attains greater accuracy in recognizing intrinsic patterns.

  • Incorporating the ICOA model elevated the accomplishment of the SE-CapsNet technique. This optimization method fine-tunes the model’s parameters, improving feature extraction and recognition accuracy. Using the ICOA model substantially enhances the comprehensive efficiency in processing and classifying data.

  • The SEHDL-OSCCR technique implements the CNN-BiLSTM model to accomplish precise OSCC classification. This incorporation employs the CNN model for feature extraction and the BiLSTM technique for capturing temporal dependencies, resulting in improved accuracy. The method efficiently enhances the model’s ability to classify OSCC more accurately.

  • The SEHDL-OSCCR technique presented a novel integrated model by integrating SE-CapsNet with the ICOA and CNN-BiLSTM models. This innovative methodology improves both feature extraction and classification accuracy for OSCC detection. By incorporating these advanced methods, the technique attains greater accomplishment in detecting and classifying oral cancer.

The article is structured as follows: section “Literature review” presents the literature review, section “Modeling of SEHDL-OSCCR technique” outlines the proposed method, section “Result analysis and discussion” details the results evaluation, and section “Conclusion” concludes the study.

Literature review

In11, an intelligent SmSL method is designed to rely upon Self-supervised Pre-training (SP) and Adaptive Threshold (AT), called SPAT_SmSL. At first, the SP and AT models are employed to exploit the unlabeled data, combined into the SPAT_SmSL model for identifying stroma, cancer, and tumour-infiltrating lymphocyte (TIL) areas. Next, pathological variables, including depth of invasion (DOI) and TIL-score, were numerically measured based on the outcomes of image detection. Fati et al.12 used hybrid models based on combined features. The 1st developed model depends on a hybrid technique of CNN methods (ResNet18 and AlexNet) and the support vector machine (SVM) technique. The 2nd projected technique depends upon hybrid features, extracted by CNN models and combined with the texture, colour, and shape properties extracted utilizing the local binary pattern (LBP), fuzzy colour histogram (FCH), discrete wavelet transform (DWT), and grey-level co-occurrence matrix (GLCM) techniques. The principal component analysis (PCA) technique reduced the dimensionality before the features were fed to the artificial neural network (ANN) technique. In13, the classification of OSCC histopathology images is executed utilizing two developed techniques. In the 1st method, transfer learning (TL)-aided deep CNN (DCNN) techniques are evaluated. In the 2nd model, a base DCNN structure, trained from scratch with ten convolutional layers, is projected. In14, an innovative technique is presented for the early classification of OSCC using DL models. The method includes a Cyclic Learning Rate (CLR) approach for dynamic adjustment of the learning rate during model training. ResNet18 benefits from skip connections to improve the gradient flow, whereas the DenseNet and AlexNet structures contribute considerably to accurate image classification.
Begum and Vidyullatha15 aim to automatically identify malignant and benign oral biopsy HIs by applying a DL-based CNN technique for the early diagnosis of OSCC. Four pre-trained DL-CNN techniques, namely InceptionNet, NASNetLarge, DenseNet201, and Xception, are chosen for the TL method in this study. These pre-trained techniques are then adapted with extra layers for effective OSCC recognition.

In16, a CAD technique has been developed. Feature extraction was executed from this database utilizing 4 DL techniques: AlexNet, ResNet50, VGG16, and InceptionV3. Binary PSO (BPSO) was employed to select the best features. Once the best features were extracted and selected, they were classified utilizing XGBoost. In17, an advanced scheme for OC classification is presented. It contains the Feature Fusion DCNN with SGD-based LR for the analysis of OC. The subsequent layers, comprising pooling, fusion, and transformation, are designed to handle hierarchical features across different branches. Lastly, the developed method is trained through LR on the extracted data utilizing the cross-entropy loss, and the optimizer (SGD with weight decay) was used to update the model parameters. Ahmad et al.18 introduced a hybrid mechanism based on combined features. The primary tactic is TL utilizing the Inceptionv3, Xception, NASNetLarge, InceptionResNetV2, and DenseNet201 techniques. Next, it includes a pre-trained CNN model as the feature extractor and an SVM for identification. Notably, features were extracted utilizing numerous pre-trained techniques. The last tactic uses an innovative hybrid feature fusion model, employing a CNN extraction technique. Kadhim and Mohammed19 analyze the current AI methods in kidney cancer diagnosis, compute their efficiency, explore future research areas, and address challenges to enhance patient outcomes and treatment effectiveness. In20, the authors integrate various omics data by implementing the Quantum Cat Swarm Optimization (QCSO) technique for feature selection (FS), incorporated with K-means clustering and the SVM technique, attaining improved performance and interpretability of the model.

Das et al.21 present a framework utilizing a two-phase methodology: TL with CNN methods in the initial phase and constructing an ensemble method with the top-performing CNNs in the subsequent phase. The presented classifier is compared with advanced approaches such as AlexNet, ResNet, InceptionNet, and XceptionNet. Das, Dash, and Mishra22 introduce a convolutional neural network (CNN) technique for the automatic and early detection of OSCC, utilizing histopathological oral cancer images for experimentation. Meer et al.23 propose a fully automated architecture incorporating Self-Attention CNN and Residual Network techniques with fusion and optimization. It comprises augmenting training and testing samples, then training two deep techniques: a Self-Attention MobileNet-V2 and a Self-Attention DarkNet-19, with hyperparameters tuned utilizing the Whale Optimization Algorithm (WOA) technique. Features from both methods are integrated using the Canonical Correlation Analysis (CCA) model, additionally refined with Quantum WOA technique in order to choose relevant features, which are later classified by implementing wide neural network models. Raj and Muneeswari24 introduce the Optimal Archimedes Shooty Tern Deep Network (OASTDN) approach. This method also utilizes a Deep Belief Network (DBN) model with weights optimized by a novel Archimedes Shooty Tern Optimization Algorithm (ASTOA) technique, integrating Archimedes Optimization Algorithm (AOA) and Shooty Tern Optimization (STO) approaches. Shukla, Ajwani, Sharma, and Das25 propose a novel machine vision methodology by implementing conventional supervised DL approaches. The proposed methodology also detects the nucleus in cancerous biopsy images, extracts it employing K-means clustering with thresholding, and implements a new classification technique for final cancer detection.

The limitations of the existing studies include their dependence on self-supervised pre-training and adaptive thresholds, which may restrict generalizability. Another study implements complex hybrid techniques integrating CNNs with SVMs, potentially leading to overfitting issues. Techniques comparing pre-trained methods with those trained from scratch may only partially capture the advantages of TL. Some methodologies concentrate on novel architectures or hybrid mechanisms, but their effectiveness may require more extensive validation, or difficulties with FS and computational complexity could be encountered. Models incorporating diverse omics data or implementing new optimization approaches may suffer from limited interpretability or require additional validation. Furthermore, techniques based on conventional supervised DL might have limited adaptability to diverse datasets or scenarios. Present research on cancer detection thus faces gaps in generalizability across diverse datasets and in the validation of novel optimization models. Moreover, there is a requirement for enhanced model adaptability and interpretability in intricate hybrid techniques and integration models.

Modeling of SEHDL-OSCCR technique

The solution framework

This study presented a new SEHDL-OSCCR approach for HIs. The technique mainly focuses on detecting OC using hybrid DL models. To accomplish this, the SEHDL-OSCCR approach contains distinct processes, namely noise reduction, SE-CapsNet-based feature extraction, ICOA-based parameter tuning, and a classifier selection process. Figure 1 demonstrates the entire flow of the SEHDL-OSCCR method.

Fig. 1
figure 1

Working flow of SEHDL-OSCCR method.

Noise reduction

Initially, the BF technique is used to remove the noise. BF is an adaptable image processing method that intends to smooth images while maintaining edges26. BF was selected for the noise reduction process due to its capability to conserve edges while eliminating noise efficiently, which is significant in maintaining the integrity of crucial image features. Unlike other noise reduction methodologies that may blur or distort the image, BF selectively smooths regions depending on spatial distance and intensity differences, confirming that edges and fine details remain sharp. This makes BF specifically advantageous for processing medical images, where conserving substantial data is essential for precise evaluation. Its adaptability to varying noise levels and its capability to balance noise reduction with edge preservation offer crucial enhancements over other models, giving clearer and more reliable input for subsequent image processing tasks. Figure 2 shows the structure of the BF methodology.

Fig. 2
figure 2

Architecture of BF model.

Unlike classical smoothing filters that employ a weighted average based solely on spatial distance, BF also considers the intensity differences among adjacent pixels. This ensures that pixels with similar intensities are smoothed together, maintaining edge sharpness. It is instrumental in applications where edge preservation and noise reduction are vital, such as HDR tone mapping, computer vision, and image denoising. By adjusting parameters such as spatial distance and intensity similarity, BF provides a flexible tool for manipulating and enhancing digital images with control over edge preservation and smoothing.
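As an illustration, a minimal grayscale bilateral filter can be sketched in NumPy as follows. This is a simplified sketch, not the paper's implementation: the window radius and the two sigma parameters (spatial and range) are illustrative values, and the function name is hypothetical.

```python
import numpy as np

def bilateral_filter(img, radius=1, sigma_s=1.0, sigma_r=25.0):
    """Minimal bilateral filter sketch: each output pixel is a weighted
    average of its neighbours, where the weight combines a spatial Gaussian
    (distance) with a range Gaussian (intensity difference), so smoothing
    stops at strong edges."""
    img = img.astype(np.float64)
    h, w = img.shape
    pad = np.pad(img, radius, mode="edge")
    # Precompute the spatial (domain) kernel once.
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    spatial = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma_s ** 2))
    out = np.empty_like(img)
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # Range kernel: penalise large intensity differences so edges survive.
            rng_w = np.exp(-(patch - img[i, j]) ** 2 / (2 * sigma_r ** 2))
            weights = spatial * rng_w
            out[i, j] = np.sum(weights * patch) / np.sum(weights)
    return out
```

With a small range sigma relative to an edge step, pixels across the edge receive near-zero weight, which is why the edge stays sharp while flat regions are smoothed.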

Architecture of the SE-CapsNet

Next, the SEHDL-OSCCR technique utilizes the SE-CapsNet model for feature extraction27. The SE-CapsNet method was chosen for this task due to its advanced abilities in capturing and processing intricate features in data. Its architecture integrates the merits of capsule networks with squeeze-and-excitation blocks, improving the capability of the technique to learn and represent complex patterns more efficiently. Unlike conventional CNNs, SE-CapsNet handles discrepancies in feature scale and context better, leading to more accurate and robust feature extraction. This makes it suitable for applications needing detailed and precise analysis, such as recognizing subtle features in medical images. Its enhanced performance in capturing spatial hierarchies and relationships gives a crucial advantage over other methods, which may lack depth in feature representation. DL applications in the malware-recognition area, where SE-CapsNet was originally studied, have attained outstanding research results; in particular, CapsNet examines the relations among features, a benefit that also applies to smaller models. This work unites a channel-attention mechanism, named the SE block, with CapsNet to construct the SE-CapsNet method, which mainly contains four layers. Figure 3 illustrates the architecture of the SE-CapsNet approach.

Fig. 3
figure 3

Structure of SE-CapsNet model.

  1. 1)

    CONVOLUTIONAL LAYER

    The convolutional layer is the 1st layer; its main aim is to extract local features utilizing \(\:3\times\:3\) convolutional kernels with a stride of one, in combination with \(\:ReLU\) activation.

  2. 2)

    SE LAYER

    Generally, the SE layer is simple to use and can improve the feature-extraction capability. It comprises two processes: squeeze and excitation. The squeeze process aims to obtain the global feature of a given channel. \(\:{u}_{C}\) denotes the \(\:Cth\) feature map yielded by the convolutional layer. The channel-wise statistic \(\:{z}_{C}\) is obtained by global average pooling. The excitation process computes the channel weights, where \(\:\sigma\:\) signifies the sigmoid function, \(\:\delta\:\) represents ReLU, and \(\:{W}_{1}\) and \(\:{W}_{2}\) denote the dimensionality-decreasing and dimensionality-increasing operations, respectively. Through the excitation process, a non-linear interaction is attained among channels, and the channel weights are obtained. Lastly, in a scaling process, each channel weight multiplies the original feature map to produce the attention feature map as the output of the SE layer. Eqs. (1)-(3) show the mathematical formulations for this layer.

    $$z_{C} = F_{{sq}} \left( {u_{C} } \right) = \frac{1}{{H \times W}}\sum\limits_{{i = 1}}^{H} {\sum\limits_{{j = 1}}^{W} {u_{C} \left( {i,j} \right)} }$$
    (1)
    $$s = F_{{ex}} \left( {z,W} \right) = \sigma \left( {g\left( {z,W} \right)} \right) = \sigma \left( {W_{2} \delta \left( {W_{1} z} \right)} \right)$$
    (2)
    $$\widetilde{{x_{C} }} = F_{{scale}} \left( {u_{C} ,s_{C} } \right) = s_{C} \cdot u_{C}$$
    (3)
  3. 3)

    PRIMARYCAPS LAYER

    Next, each feature map, weighted by its corresponding attention weight, is taken as the PrimaryCaps input. This layer differs from the standard convolutional layer: its outputs are capsules (vectors), which can store more information than scalars.

  4. 4)

    DIGITCAPS LAYER

    This layer holds one capsule per output class (in the original SE-CapsNet, the Ponzi and non-Ponzi classes; here, the OSCC classes). Each output vector signifies the final result. CapsNet employs a squash function: while preserving the vector direction, the output vector length is used as the probability of the presence of the entity. The relations between capsules \(\:i\) and \(\:j\) are shown in Eqs. (4)-(6):

    $$\hat{u}_{{j|i}} = W_{{ij}} u_{i}$$
    (4)
    $$s_{j} = \mathop \sum \limits_{i} c_{{ij}} \hat{u}_{{j|i}} ~$$
    (5)
    $$v_{j} = \frac{{\left\| {s_{j} } \right\|^{2} }}{{1 + \left\| {s_{j} } \right\|^{2} }}\frac{{s_{j} }}{{\left\| {s_{j} } \right\|}}$$
    (6)

Here, \(\:{W}_{ij}\) signifies the weight matrix indicating the association between capsule \(\:i\) and capsule \(\:j\); \(\:{\widehat{u}}_{j|i}\) indicates the prediction that the \(\:ith\) lower-level capsule makes for the \(\:jth\) higher-level capsule; \(\:{c}_{ij}\) refers to the coupling coefficient obtained via dynamic routing. The output \(\:{v}_{j}\) is produced by the final squash function.
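Eqs. (1)-(3) and (6) can be sketched in NumPy as follows. This is an illustrative sketch only: the array shapes and reduction ratio are assumed, the function names are hypothetical, and the dynamic-routing loop that computes the coupling coefficients \(\:{c}_{ij}\) is omitted for brevity.

```python
import numpy as np

def se_block(u, w1, w2):
    """Squeeze-and-Excitation per Eqs. (1)-(3).
    u: feature maps of shape (C, H, W); w1: (C//r, C) reduces dimensionality;
    w2: (C, C//r) restores it (r is the assumed reduction ratio)."""
    z = u.mean(axis=(1, 2))                                   # Eq. (1): squeeze
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(0, w1 @ z))))   # Eq. (2): excitation
    return u * s[:, None, None]                               # Eq. (3): scale

def squash(s):
    """Capsule squash per Eq. (6): shrinks the vector length into [0, 1)
    while preserving its direction, so length can act as a probability."""
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + 1e-9)
```

For a capsule of length 5, for instance, the squashed length is \(25/26 \approx 0.96\), consistent with the length-as-probability interpretation.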

DL fine-tuning process

The ICOA technique is utilized to improve the performance of the SE-CapsNet model. The COA model is a recently presented swarm intelligence (SI) approach28. The ICOA technique was selected for hyperparameter tuning due to its improved efficiency and effectiveness in navigating complex parameter spaces. ICOA integrates enhancements that refine the conventional crayfish optimization model, resulting in more precise and faster convergence on optimal hyperparameters. Its capability to handle non-linear and multi-modal optimization problems makes it appropriate for tuning parameters in DL methods. Compared to other techniques, the ICOA technique performs better in balancing exploration and exploitation, mitigating the likelihood of overfitting, and attaining improved model accuracy. This makes the ICOA approach compelling for optimizing hyperparameters in complex models where conventional techniques might face difficulties. Figure 4 specifies the overall structure of the ICOA model.

Fig. 4
figure 4

Steps involved in the ICOA model.

This method seeks the optimal solution of a problem by mimicking the competitive, heat-avoidance, and foraging strategies of crayfish. It contains two major phases, namely exploration and exploitation. The method employs \(\:X\) to signify the initial population, where \(\:{X}_{ij}\) denotes the position of crayfish (CF) \(\:i\) in dimension \(\:j\). The value of \(\:{X}_{ij}\) is computed utilizing the following formula:

$$X_{{i,j}} = lb_{j} + \left( {ub_{j} - lb_{j} } \right) \times rand$$
(7)

Whereas \(\:l{b}_{j}\) signifies the lower boundary of the \(\:{j}^{th}\) dimension, \(\:u{b}_{j}\) signifies the upper boundary of the \(\:{j}^{th}\) dimension, and \(\:rand\) signifies a random number.

CF thrive in environments with temperatures from \(\:15\) to \(\:3{0}^{o}C\). If the temperature variable “temp” exceeds \(\:3{0}^{o}C\), CF search for refuge in caves. When the count of accessible caves is restricted, cave-scrambling events can occur. \(\:rand<0.5\) is deployed to denote the absence of a cave-scrambling event. The following equation indicates the CF’s entrance into the cave.

$$X_{{i,j}}^{{t + 1}} = X_{{i,j}}^{t} + C_{2} \times rand \times \left( {X_{{shade}} - X_{{i,j}}^{t} } \right)$$
(8)

In which \(\:t\) represents the present iteration number, \(\:{C}_{2}\) implies the lessening curve, and \(\:{X}_{shade}\) stands for the cave position.

CF engage in competition for burrows once the temperature exceeds 30 and the random variable is greater than or equal to 0.5.

$$X_{{i,j}}^{{t + 1}} = X_{{i,j}}^{t} - X_{{z,j}}^{t} + X_{{shade}}$$
(9)

Whereas \(\:z\) denotes the random individual of CF; \(\:z=round\left(rand\times\:\left(N-1\right)\right)+1.\)

If “temp” ≤ 30, CF start feeding. Because of its restricted body size, a CF displays two different feeding behaviours. If the food is large, the CF uses its claws to shred it into manageable pieces before feeding, employing its 2nd and 3rd walking legs in alternating patterns. In contrast, if the food is of a suitable size, the CF feeds on it directly. The foraging behaviours for normal- and large-sized food are defined as:

$$X_{{i,j}}^{{t + 1}} = X_{{ij}}^{t} + X_{{food}} \times p \times \left( {{\text{cos}}\left( {2 \times \pi \times rand} \right) - {\text{sin}}\left( {2 \times \pi \times rand} \right)} \right)$$
(10)
$$X_{{i,j}}^{{t + 1}} = \left( {X_{{i,j}}^{t} - X_{{food}} } \right) \times p + p \times rand \times X_{{i,j}}^{t}$$
(11)

In this case, \(\:{X}_{food}\) signifies the food position, and \(\:p\) denotes the food-intake factor of the CF. The sin and cos functions are deployed to model the alternating feeding behaviour of the CF.

During the distinct phases of the algorithm, the CF approaches the optimal solution as it enters the cave and eats the food. By continually updating the CF positions, the method keeps them near the target, accomplishing its optimization function. The pseudocode for COA is depicted in Algorithm 1.

Algorithm 1
figure a

Pseudocode of crayfish optimization algorithm.
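The initialization of Eq. (7) and the summer-resort/competition updates of Eqs. (8)-(9) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the decreasing-curve form used for \(\:{C}_{2}\) is assumed, and the foraging branch of Eqs. (10)-(11) is omitted for brevity.

```python
import numpy as np

def init_population(N, lb, ub, rng):
    """Population initialization per Eq. (7): uniform in [lb, ub] per dimension."""
    return lb + (ub - lb) * rng.random((N, lb.size))

def coa_summer_step(X, X_shade, t, T_max, rng):
    """One summer-resort update (temp > 30) sketch.
    rand < 0.5  -> Eq. (8): move toward the cave position X_shade;
    otherwise   -> Eq. (9): compete with a random individual z for the cave."""
    N, D = X.shape
    C2 = 2.0 - t / T_max  # assumed form of the decreasing curve C2
    X_new = np.empty_like(X)
    for i in range(N):
        if rng.random() < 0.5:  # Eq. (8): enter the cave (no scrambling)
            X_new[i] = X[i] + C2 * rng.random() * (X_shade - X[i])
        else:                   # Eq. (9): cave competition with individual z
            z = rng.integers(0, N)
            X_new[i] = X[i] - X[z] + X_shade
    return X_new
```

In a full COA loop, this step would alternate with the foraging updates of Eqs. (10)-(11) depending on the simulated temperature.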

The COA population-initialization approach restricts the speed and direction with which the optimal result is found, affecting the overall efficiency of the process. Chaotic mapping is introduced to enhance the population-initialization process and improve the global searching ability. Chaotic mapping is employed because it produces sequences that exhibit ergodicity, randomness, and orbital instability. Commonly utilized chaotic mappings include tent mapping, sine mapping, logistic mapping, circle mapping, and singer mapping. The circle map is recognized for its stability and wide range of chaotic values. In this case, the circle map is deployed for initializing the CF population as:

$$x_{{n + 1}} = ~mod~\left( {x_{n} + 0.2 - \frac{{0.5}}{{2\pi }}{\text{sin}}\left( {2\pi x_{n} } \right),~1} \right)$$
(12)

Here, \(\:n\) denotes the index of the solution in the chaotic sequence.
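The circle-map initialization of Eq. (12) can be sketched as follows. This is an illustrative sketch: the function name, the seed value \(x_0\), and the reshaping of the chaotic sequence into a population grid are assumptions, not details from the paper.

```python
import numpy as np

def circle_map_init(N, D, lb, ub, x0=0.7):
    """Chaotic population initialization via the circle map of Eq. (12):
    iterate x_{n+1} = mod(x_n + 0.2 - (0.5 / 2*pi) * sin(2*pi*x_n), 1)
    to fill an (N, D) grid of values in [0, 1), then scale to the bounds."""
    vals = np.empty(N * D)
    x = x0
    for k in range(N * D):
        x = np.mod(x + 0.2 - (0.5 / (2.0 * np.pi)) * np.sin(2.0 * np.pi * x), 1.0)
        vals[k] = x
    return lb + (ub - lb) * vals.reshape(N, D)
```

Because the map is ergodic, the resulting positions cover the search space more evenly than an ordinary pseudo-random draw, which is the motivation stated above.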

The ICOA is used with a fitness function (FF) to enhance the efficiency of the classifier. The FF assigns a positive number to signify the performance of a candidate solution. In this research, minimizing the classifier error rate is considered as the FF, as given in Eq. (13).

$$\begin{aligned} fitness\left( {x_{i} } \right) & = ClassifierErrorRate\left( {x_{i} } \right) \\ & = \frac{{no.~\;of~\;misclassified~\;instances}}{{Total~\;no.\;~of~\;instances}} \times 100 \\ \end{aligned}$$
(13)
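As a sketch, the error-rate fitness of Eq. (13) can be written as follows; the function and variable names are illustrative.

```python
def classifier_error_rate(y_true, y_pred):
    """Fitness function per Eq. (13): the percentage of misclassified
    instances. Lower is better, so ICOA minimizes this value."""
    mis = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return 100.0 * mis / len(y_true)
```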

Classifier selection

At last, OSCC classification is performed by utilizing the CNN-BiLSTM model29. LSTM is a variant of the RNN structure, presented by S. Hochreiter and J. Schmidhuber in 1997; its internal configuration involves a memory cell and forget, input, and output gates. The CNN-BiLSTM technique was chosen for OSCC classification due to its unique capability to integrate CNNs with BiLSTM networks. The CNN component excels at extracting spatial features from histopathological images, capturing complex patterns and structures. The BiLSTM approach complements this by capturing temporal dependencies and contextual information from sequential data, enhancing the technique’s understanding of intricate patterns. This incorporation allows for more precise classification by employing both spatial and sequential features, making it specifically efficient for the complex evaluation needed in OSCC detection. The integration of CNN and BiLSTM presents a robust approach that can outperform techniques that only partially capture the temporal and spatial aspects of the data. Figure 5 illustrates the structure of the CNN-BiLSTM model.

Fig. 5
figure 5

Architecture of CNN-BiLSTM.

The input, forget, and output gates and the candidate cell state of the LSTM cell are computed as in Eq. (14); integrating these with the forget gate updates the LSTM memory cell, as shown later in Eq. (18).

$$\left\{ {\begin{array}{*{20}l} {i_{t} = \sigma \left( {w_{{ix}} x_{t} + w_{{ih}} h_{{t - 1}} + b_{i} } \right)} \hfill \\ {f_{t} = \sigma \left( {w_{{fx}} x_{t} + w_{{fh}} h_{{t - 1}} + b_{f} } \right)} \hfill \\ {\tilde{c}_{t} = tanh\left( {w_{{cx}} x_{t} + w_{{ch}} h_{{t - 1}} + b_{c} } \right)} \hfill \\ {o_{t} = \sigma \left( {w_{{ox}} x_{t} + w_{{oh}} h_{{t - 1}} + b_{o} } \right)} \hfill \\ \end{array} } \right.$$
(14)

In which \(\:{i}_{t},{\:f}_{t}\), and \(\:{o}_{t}\) refer to the input, forget, and output gates at time step \(\:t\) of the LSTM cell, respectively; \(\:{\tilde{c}}_{t}\) implies the candidate cell state describing the long-term memory at time step \(\:t\); \(\:{h}_{t-1}\) is the output of the preceding time step \(\:t-1\); \(\:{x}_{t}\) is the input at the present time step \(\:t\); \(\:\sigma\:\) denotes the sigmoid activation function; \(\:{b}_{i}\) is the bias term of the input gate; \(\:{w}_{ix}\) is the weight of the input gate applied to \(\:{x}_{t}\); \(\:{w}_{ih}\) is the weight of the input gate applied to the preceding hidden state \(\:{h}_{t-1}\); \(\:{b}_{f}\) is the bias term of the forget gate; \(\:{w}_{fx}\) and \(\:{w}_{fh}\) are the weights of the forget gate applied to \(\:{x}_{t}\) and \(\:{h}_{t-1}\), respectively; \(\:{b}_{c}\) is the cell bias term; \(\:{w}_{cx}\) and \(\:{w}_{ch}\) are the cell weights applied to \(\:{x}_{t}\) and \(\:{h}_{t-1}\), respectively; \(\:{b}_{o}\) is the bias term of the output gate; and \(\:{w}_{ox}\) and \(\:{w}_{oh}\) are the weights of the output gate applied to \(\:{x}_{t}\) and \(\:{h}_{t-1}\), respectively.

Eq. (14) can be written in vector form as follows:

$$\left[ {\begin{array}{*{20}l} {i_{t} } \hfill \\ {f_{t} } \hfill \\ {o_{t} } \hfill \\ \end{array} } \right] = \sigma \left( {\left[ {\begin{array}{*{20}l} {w_{{ix}} w_{{ih}} } \hfill \\ {w_{{fx}} w_{{fh}} } \hfill \\ {w_{{ox}} w_{{oh}} } \hfill \\ \end{array} } \right] \cdot \left[ {\begin{array}{*{20}l} {x_{t} } \hfill \\ {h_{{t - 1}} } \hfill \\ \end{array} } \right] + \left[ {\begin{array}{*{20}l} {b_{i} } \hfill \\ {b_{f} } \hfill \\ {b_{o} } \hfill \\ \end{array} } \right]} \right)$$
(15)
$$\left[ {\tilde{c}_{t} } \right] = tanh\left( {\left[ {w_{{cx}} w_{{ch}} } \right] \cdot \left[ {\begin{array}{*{20}l} {x_{t} } \hfill \\ {h_{{t - 1}} } \hfill \\ \end{array} } \right] + \left[ {b_{c} } \right]} \right)$$
(16)

Eqs. (15) and (16) are expressed as:

$$\left\{ {\begin{array}{*{20}c} {y_{1} = \sigma \left( {w_{1} x_{1} + b_{1} } \right)} \\ {y_{2} = tanh\left( {w_{2} x_{2} + b_{2} } \right)} \\ \end{array} } \right.$$
(17)

whereas \({y}_{1}= [i_{t} \;f_{t} \;o_{t} ]^{T}\), \(w_{1} = \left[ {\begin{array}{*{20}l} {w_{{ix}} \;w_{{ih}} } \hfill \\ {w_{{fx}} \;w_{{fh}} } \hfill \\ {w_{{ox}} \;w_{{oh}} } \hfill \\ \end{array} } \right]\), \(x_{1} = [x_{t} \;h_{{t - 1}} ]^{T}\), \(b_{1} = [b_{i} \;b_{f} \;b_{o} ]^{T}\), \(y_{2} = [\tilde{c}_{t} ]\), \(w_{2} = [w_{{cx}} \;w_{{ch}} ]\), \(x_{2} = [x_{t} \;h_{{t - 1}} ]^{T}\), and \({b}_{2}=\left[{b}_{c}\right]\). The output of the LSTM cell for every time step is

$$\left\{ {\begin{array}{*{20}c} {c_{t} = f_{t} *c_{{t - 1}} + i_{t} *\tilde{c}_{t} } \\ {h_{t} = o_{t} *tanh\left( {c_{t} } \right)} \\ \end{array} } \right.$$
(18)

In this case, \(\:{c}_{t}\) represents the cell layer at time step \(\:t\), and \(\:{c}_{t-1}\) defines the cell layer from the preceding time step \(\:t-1.\)
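One LSTM cell step per Eqs. (14)-(18) can be sketched in NumPy as follows. This is an illustrative sketch: the single stacked weight matrix and the gate ordering within it are implementation assumptions, not details from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step. W (shape (4H, D+H)) stacks the weights of the
    input, forget, candidate, and output gates applied to [x_t, h_prev];
    b stacks the corresponding biases."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])              # input gate,  Eq. (14)
    f = sigmoid(z[H:2 * H])          # forget gate, Eq. (14)
    c_tilde = np.tanh(z[2 * H:3 * H])  # candidate cell state, Eq. (14)
    o = sigmoid(z[3 * H:4 * H])      # output gate, Eq. (14)
    c_t = f * c_prev + i * c_tilde   # cell update,   Eq. (18)
    h_t = o * np.tanh(c_t)           # hidden output, Eq. (18)
    return h_t, c_t
```

With zero weights and biases, each gate evaluates to 0.5 and the candidate to 0, so the cell state simply halves each step, which is a quick sanity check on the update rule.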

In the BiLSTM network, \(\:{x}_{t}\) refers to the input data at time step \(\:t\); \(\:{\overrightarrow{h}}_{t}=\left({\overrightarrow{h}}_{1},{\overrightarrow{h}}_{2},\:\cdots\:,\:{\overrightarrow{h}}_{n}\right)\) denotes the output of the forward LSTM hidden layer; \(\:{\overleftarrow{h}}_{t}=\)\(\:\left({\overleftarrow{h}}_{1}{,\overleftarrow{h}}_{2},\:\cdots\:,\:{\overleftarrow{h}}_{n}\right)\) refers to the output of the backward LSTM hidden layer; and \(\:{y}_{t}=({y}_{1},\:{y}_{2},\:\cdots\:{y}_{n})\) denotes the output of the Bi-LSTM network at time step \(\:t\). The final output vector is an integrated effect of the forward and backward data flows: \(\:{y}_{t}=f\left({\overrightarrow{h}}_{t},{\overleftarrow{h}}_{t}\right)\). The mathematical formulae of the Bi-LSTM are given below.

$$\left\{ {\begin{array}{*{20}c} {\vec{h}_{t} = \sigma \left( {w_{{\vec{h}x}} x_{t} + w_{{\vec{h}\vec{h}}} \vec{h}_{{t - 1}} + b_{{\vec{h}}} } \right)} \\ {\overleftarrow{h}_{t} = \sigma \left( {w_{{\overleftarrow{h}x}} x_{t} + w_{{\overleftarrow{h}\overleftarrow{h}}} \overleftarrow{h}_{{t + 1}} + b_{{\overleftarrow{h}}} } \right)} \\ {y_{t} = w_{{y\vec{h}}} \vec{h}_{t} + w_{{y\overleftarrow{h}}} \overleftarrow{h}_{t} + b_{y} } \\ \end{array} } \right.$$
(19)

The CNN-BiLSTM method integrates the benefits of CNNs and Bi-LSTM. The CNN is deployed to capture the local features of the input, gradually decreasing the size and count of the feature data through a sequence of convolution and pooling layers. Afterwards, the Bi-LSTM network extracts global feature information from the CNN output by considering the entire structure and long-term dependencies. The forward and backward networks of the Bi-LSTM individually process the output features of the CNN, retaining the primary information via the memory unit and gating mechanism to acquire the final feature data. Eventually, the Bi-LSTM output is passed to a fully connected (FC) layer. Combining the local feature-extraction proficiency of the CNN with the global data-processing proficiency of the BiLSTM, the CNN-BiLSTM technique effectively improves accuracy.
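The bidirectional recurrence and the per-step combination of Eq. (19) can be sketched as follows. This is a simplified sketch: a plain tanh recurrence stands in for the full LSTM cell so the block stays self-contained, and all names and shapes are illustrative assumptions. The point is the forward pass over \(t = 1..n\), the backward pass over \(t = n..1\), and the output combination \(y_t\).

```python
import numpy as np

def bilstm_sketch(xs, Wf, Wb, Wy_f, Wy_b, b_y):
    """Bidirectional recurrence sketch for Eq. (19).
    xs: list of n input vectors; Wf/Wb: forward/backward recurrent weights
    applied to [x_t, h]; Wy_f/Wy_b, b_y: output combination parameters."""
    n = len(xs)
    H = Wf.shape[0]
    h_f = np.zeros((n, H)); h_b = np.zeros((n, H))
    h = np.zeros(H)
    for t in range(n):                 # forward direction, t = 1..n
        h = np.tanh(Wf @ np.concatenate([xs[t], h]))
        h_f[t] = h
    h = np.zeros(H)
    for t in range(n - 1, -1, -1):     # backward direction, t = n..1
        h = np.tanh(Wb @ np.concatenate([xs[t], h]))
        h_b[t] = h
    # Last line of Eq. (19): combine both directions at every time step.
    return np.array([Wy_f @ h_f[t] + Wy_b @ h_b[t] + b_y for t in range(n)])
```

In the full model, the combined outputs would then be passed to the FC layer described above for the final OSCC classification.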

Result analysis and discussion

The SEHDL-OSCCR technique’s performance evaluation is examined using the OSCC dataset from the Kaggle repository30,31. The images were captured using a Leica ICC50 HD microscope from H&E-stained tissue slides prepared by medical experts from 230 patients. The dataset comprises 528 samples with two classes, as shown in Table 1. Figure 6 presents sample images. The suggested technique is simulated using Python 3.6.5 on a PC with an i5-8600K CPU, a GeForce 1050 Ti 4GB GPU, 16GB RAM, a 250GB SSD, and a 1TB HDD. The parameter settings are: learning rate 0.01, ReLU activation, 50 epochs, dropout 0.5, and batch size 5.
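Using the hyperparameters reported above (learning rate 0.01, ReLU, 50 epochs, dropout 0.5, batch size 5), a training step might be configured as follows. The optimiser choice (SGD) and the toy model are assumptions for illustration, since the paper does not detail them here:

```python
import torch
import torch.nn as nn

# Hyperparameters as reported for the SEHDL-OSCCR experiments
LR, EPOCHS, DROPOUT, BATCH = 0.01, 50, 0.5, 5

# Toy stand-in model (assumed architecture, for illustration only)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64, 32),
    nn.ReLU(),                 # ReLU activation, as reported
    nn.Dropout(DROPOUT),       # dropout 0.5, as reported
    nn.Linear(32, 2),          # two classes: normal vs. OSCC
)
opt = torch.optim.SGD(model.parameters(), lr=LR)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(BATCH, 64)             # stand-in for one mini-batch of features
y = torch.randint(0, 2, (BATCH,))
for _ in range(3):                     # a few steps in place of the full 50 epochs
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```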

Table 1 Details on dataset.
Fig. 6

Sample images.

Figure 7 demonstrates the confusion matrices produced by the SEHDL-OSCCR method over different epoch counts. The outcomes indicate that the SEHDL-OSCCR approach effectively recognizes samples across all classes.

Fig. 7

Confusion matrices of SEHDL-OSCCR technique (a-f) Epochs 500–3000.

Table 2 and Fig. 8 depict the detection results of the SEHDL-OSCCR technique under several epoch counts. The results signify that the SEHDL-OSCCR technique correctly identified the normal and OSCC samples. With 500 epochs, the SEHDL-OSCCR technique attains an average \(\:acc{u}_{y}\) of 94.89%, \(\:pre{c}_{n}\) of 91.65%, \(\:rec{a}_{l}\) of 89.76%, \(\:F{1}_{score}\) of 90.67%, and MCC of 81.39%. With 1000 epochs, it reaches an average \(\:acc{u}_{y}\) of 96.97%, \(\:pre{c}_{n}\) of 93.93%, \(\:rec{a}_{l}\) of 95.49%, \(\:F{1}_{score}\) of 94.69%, and MCC of 89.41%. With 1500 epochs, it attains an average \(\:acc{u}_{y}\) of 97.16%, \(\:pre{c}_{n}\) of 94.42%, \(\:rec{a}_{l}\) of 95.60%, \(\:F{1}_{score}\) of 95.00%, and MCC of 90.02%. With 2000 epochs, it achieves an average \(\:acc{u}_{y}\) of 98.11%, \(\:pre{c}_{n}\) of 96.25%, \(\:rec{a}_{l}\) of 97.07%, \(\:F{1}_{score}\) of 96.65%, and MCC of 93.31%. Finally, with 2500 epochs, it attains an average \(\:acc{u}_{y}\) of 97.73%, \(\:pre{c}_{n}\) of 95.95%, \(\:rec{a}_{l}\) of 95.95%, \(\:F{1}_{score}\) of 95.95%, and MCC of 91.89%.
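All five reported metrics can be computed directly from a binary confusion matrix. A minimal sketch for the normal-vs-OSCC case, using illustrative counts (not the paper's actual confusion matrix):

```python
import math

# Illustrative counts only, not the paper's results:
tp, fp, fn, tn = 90, 8, 4, 110

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
# Matthews correlation coefficient (MCC): robust under class imbalance
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
```

MCC is the most conservative of the five here, which is why it trails the other metrics in Table 2.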

Table 2 Classifier outcome of SEHDL-OSCCR method under distinct epochs.
Fig. 8

Average outcome of SEHDL-OSCCR technique under distinct epochs.

In Fig. 9, the training and validation accuracy results of the SEHDL-OSCCR technique are presented. The accuracy values are computed over the interval of 0–3000 epochs. The results show that the training and validation accuracy values exhibit a rising trend, indicating the ability of the SEHDL-OSCCR approach to improve performance over numerous iterations. Moreover, the training and validation accuracy remain close across the epochs, which indicates minimal overfitting and reflects the improved generalization of the SEHDL-OSCCR approach, guaranteeing consistent prediction on unseen samples.

Fig. 9

\(\:Acc{u}_{y}\) curve of SEHDL-OSCCR technique (a-f) Epochs 500–3000.

In Fig. 10, the training and validation loss graph of the SEHDL-OSCCR method is demonstrated. The loss values are computed over 0–3000 epochs. The training and validation loss values exhibit a decreasing trend, indicating the capability of the SEHDL-OSCCR method to balance the tradeoff between data fitting and generalization. The continual reduction in loss values also confirms the improved performance of the SEHDL-OSCCR approach and the refinement of its predictions over time.

Fig. 10

Loss curve of SEHDL-OSCCR technique (a-f) Epochs 500–3000.

In Fig. 11, the precision-recall (PR) curve examination of the SEHDL-OSCCR technique under different epoch counts provides an interpretation of its performance by plotting precision against recall for each class. The figure shows that the SEHDL-OSCCR method consistently achieves improved PR values across the class labels, representing its ability to maintain a high proportion of true positives among all positive predictions (precision) while capturing a substantial proportion of the actual positives (recall). The stable growth in PR outcomes for every class signifies the efficacy of the SEHDL-OSCCR methodology in the classification process.

Fig. 11

PR curve of SEHDL-OSCCR technique (a-f) Epochs 500–3000.

In Fig. 12, the ROC curve of the SEHDL-OSCCR model under distinct epoch counts is studied. The outcomes show that the SEHDL-OSCCR method attains high ROC values for every class, demonstrating a strong ability to discriminate between the classes. This consistent trend of improved ROC values across the classes indicates the proficient performance of the SEHDL-OSCCR approach in forecasting class labels, highlighting the robust nature of the identification process.
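PR and ROC curves of the kind shown in Figs. 11 and 12 are typically computed from per-sample classifier scores. A minimal scikit-learn sketch with illustrative labels and scores (1 = OSCC; not real model output):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, auc

# Illustrative ground truth and classifier scores, for demonstration only
y_true  = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.65, 0.9, 0.45, 0.7])

# Precision-recall pairs over all score thresholds (Fig. 11 style)
prec, rec, _ = precision_recall_curve(y_true, y_score)

# False/true positive rates over all thresholds, plus AUC (Fig. 12 style)
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
```

With perfectly separated scores, as in this toy example, the ROC AUC equals 1.0; real curves, such as those in Fig. 12, fall below that ideal.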

Fig. 12

ROC curve of SEHDL-OSCCR technique (a-f) Epochs 500–3000.

Table 3 and Fig. 13 present a comprehensive comparison of the SEHDL-OSCCR technique with recent methods32. The results indicate that the Resnet50-VGG16 model shows ineffective performance. The three kinds of CNN and MobileNetv3-GTO models exhibit slightly improved results, while the Resnet50-Feature fusion, VGG16, Mobile Inception, and Swin-Transformer models demonstrate moderately closer results. However, the SEHDL-OSCCR technique outperforms the other models with an increased \(\:acc{u}_{y}\) of 98.75%, \(\:pre{c}_{n}\) of 96.69%, \(\:rec{a}_{l}\) of 98.75%, and \(\:F{1}_{score}\) of 97.69%.

Table 3 Comparative analysis of SEHDL-OSCCR technique with recent methods.
Fig. 13
figure 13

Comparative analysis of SEHDL-OSCCR technique with recent methods.

The computation time (CT) results of the SEHDL-OSCCR technique are compared with other DL models in Table 4 and Fig. 14. The results show that the SEHDL-OSCCR technique attains a minimal CT of 2.70s. In contrast, the ResNet50-feature fusion, ResNet50-DCNN, ResNet50-VGG16, three kinds of CNN, VGG16 and Mobile Inception, MobilenetV3-GTO, and Swin Transformer models obtain increased CT values of 7.49s, 8.59s, 5.71s, 4.62s, 6.88s, 3.84s, and 4.66s, respectively. Thus, the SEHDL-OSCCR technique is well suited to identifying OC.

Table 4 CT outcome of SEHDL-OSCCR technique with recent models.
Fig. 14
figure 14

CT outcome of SEHDL-OSCCR technique with recent models.

Conclusion

This study developed a novel SEHDL-OSCCR approach for OC detection on histopathological images (HIs). The presented SEHDL-OSCCR technique mainly focuses on detecting OC using hybrid DL models. To accomplish this, the SEHDL-OSCCR approach comprises distinct processes: noise reduction, SE-CapsNet-based feature extraction, ICOA-based parameter tuning, and classification. Initially, the BF technique is used to remove noise. Next, the SEHDL-OSCCR technique applies the SE-CapsNet model for feature extraction, and the ICOA is utilized to boost the performance of the SE-CapsNet model. Finally, the classification of OSCC is performed using the CNN-BiLSTM model. The SEHDL-OSCCR technique was validated on a benchmark medical image dataset, and the experimental results demonstrated a higher accuracy of 98.75% compared with recent approaches. The limitations of the SEHDL-OSCCR approach include challenges in managing diverse data discrepancies and complexities that can affect its overall performance. Moreover, the technique may face scalability issues as the dataset size increases, potentially impacting processing efficiency. There may also be limitations in handling subtle variations between classes, which could affect the precision of the technique in specific scenarios. Future studies should address these limitations by expanding the dataset to include a broader range of conditions and by improving the scalability of the system and its capability to differentiate subtle features. Incorporating advanced techniques and innovative strategies could significantly enhance robustness, accuracy, and effectiveness. Future research should also integrate multi-modal data, namely genomic and clinical information, with the present histopathological images to enable more comprehensive and precise diagnoses.