Introduction

According to the recent World Heart Report, 20.5 million deaths were attributed to cardiovascular disease (CVD) globally1. Heart disease can be caused by reduced blood flow to the heart, infection, atherosclerosis, high blood pressure, or uncontrolled diabetes. The common types are heart failure, heart attack, myocarditis, sudden cardiac arrest, atrial septal defect, atrial fibrillation, coronary heart disease, angina, ventricular tachycardia, and pericarditis. Heart failure impairs the heart's ability to pump blood effectively. A heart attack results from a blockage in the arteries that cuts off the blood supply. Myocarditis is inflammation of the myocardium, usually caused by a viral, bacterial, or fungal infection. Sudden cardiac arrest is distinct from a heart attack: as the name suggests, it is a condition in which the heart suddenly stops beating. An atrial septal defect is a hole in the wall between the atria. Atrial fibrillation is an irregular, often rapid heartbeat. Coronary heart disease arises when the major blood vessels, the coronary arteries, are narrowed. Insufficient oxygen-rich blood may cause chest discomfort known as angina, a type of chest pain. Ventricular tachycardia is a fast heart rhythm originating in the lower chambers of the heart, the ventricles. Pericarditis is inflammation of the pericardium, the thin membrane around the heart. Early detection of heart disease is important to prevent adverse outcomes and reduce the burden on healthcare systems.

The term “Artificial Intelligence” was coined at the Dartmouth conference in 1956. In the 1950s, Alan Turing proposed the Turing Test, a benchmark for machine intelligence. Knowledge-based systems emerged in the 1970s, with MYCIN (designed to identify infection-causing bacteria) as a notable example. Artificial intelligence gained further prominence in the 1990s with the development of neural networks and backpropagation. Advances in computing power in the 2000s enabled techniques such as natural language processing (NLP) and image recognition. Since the 2010s, deep learning has been a dominant force in AI development. Machine learning algorithms, particularly supervised learning, unsupervised learning, and reinforcement learning, have shown great promise in healthcare. Deep learning techniques such as neural networks, the multilayer perceptron (MLP), convolutional neural networks (CNNs), and recurrent neural networks (RNNs) are also being actively used. Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) are variants of the RNN. RNNs and their variants have a wide range of applications beyond healthcare, including natural language processing, time series prediction, video analysis, music generation, and robotics. Deep learning has recently demonstrated significant potential in detecting COVID-192. Explainable AI (XAI) plays a major role in healthcare systems by addressing the transparency and interpretability of models3. It matters most when the decisions made by these models carry high-stakes consequences, and it supports clinical decision making, treatment recommendation, fairness, and bias avoidance. Federated learning is also widely used in healthcare systems4. Research works related to the objective of this paper are grouped into literature on Artificial Intelligence, on the Internet of Things (IoT), and on combinations of the two, and are discussed as follows. With current advancements in neural network architectures and computing power, researchers have used models such as the artificial neural network (ANN) and deep neural network (DNN) for heart disease classification tasks5,6. The integration of IoT with a deep learning-modified neural network (DLMNN) was designed to predict the presence of heart disease. This involves three phases: authentication, encryption, and classification. The DLMNN was trained on the Hungarian heart disease dataset. With 100 nodes, the DLMNN achieved 92% accuracy and a 92.59% F1 score; with 500 nodes, it achieved 96.8% accuracy and a 98.25% F1 score. The IoT-centered DLMNN also achieved 95.82% security during data transfer and exhibited the lowest encryption and decryption times7. Similarly, an IoT-based complex event processing framework for heart failure prediction (CEP4HFP) was presented8. It consists of three modules, namely monitoring, analysis, and visualization. An Arduino MEGA microcontroller and a Raspberry Pi were used for monitoring, and NoSQL and CEP engines were used for data storage and analysis respectively. CEP4HFP achieved 84.75% precision and a 91.74% F1 score. For heart disease-related work, the Cleveland dataset is widely used by researchers9,10,11. A hybrid random forest with a linear model was developed and applied to the Cleveland dataset to predict heart disease12,13. Similarly, another work based on random forest, namely the machine intelligence framework for heart disease diagnosis (MIFH), was proposed in14.
MIFH is a random forest classifier combined with factor analysis of mixed data (FAMD). FAMD was applied to the Cleveland dataset as a feature selection algorithm together with classifiers such as logistic regression (LR), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), decision tree (DT), and random forest (RF). Data mining is a vital step in machine learning, especially for classification tasks: discovering important patterns, insights, and relationships in raw data is essential for building effective machine learning models. Various classifiers were implemented after applying data mining techniques to the Cleveland dataset15,16. Boosting is an ensemble learning technique that combines the predictions of multiple weak learners to create a strong learner17. XGBoost is a well-known member of this family and a widely used boosting technique. Several research works that used boosting have been reported18,19,20,21. Methods such as the decision tree classifier, K-means clustering, and SVM were also considered for heart disease-related research22,23,24,25. Further, to improve the performance of various models, numerous feature selection and oversampling techniques were used26,27. Apart from the Cleveland dataset, researchers have also considered other datasets such as the Z-Alizadeh Sani dataset and the CHD dataset for their experiments28,29. Recent advancements in hybrid optimization and rule-mining techniques have also contributed to the prediction of heart diseases. One such approach is the Grey Wolf Levy Updated-Dragonfly Algorithm (GWU-DA), which integrates Grey Wolf Optimization (GWO) with the Dragonfly Algorithm (DA) for optimized feature selection30. This model leverages weighted coalesce rule generation and hybrid classifiers, combining SVM with Deep Belief Networks (DBN), to effectively predict the presence of heart disease and other conditions such as breast cancer. Furthermore, two-phase parallel frameworks employing weighted coalesce rule mining have been developed to accelerate disease prediction tasks, efficiently handling large datasets31,32. Although several works on heart disease prediction exist, attaining precise heart disease prediction remains a challenge, and in healthcare, high model performance is critical. To address this challenge, this paper presents an extensive experimental analysis of heart disease prediction using artificial intelligence techniques to identify the best model. The contributions of this work are as follows.

  • Comprehensive experimentation: 11 feature selection techniques and 21 classifiers were implemented on the heart disease dataset (Comprehensive)33.

  • Two-phase methodology: The whole experiment was performed in two phases.

    • In the first phase, classifiers were directly implemented on the dataset.

    • In the second phase, the feature selection techniques were employed first, and the classifiers were then implemented on the selected features.

  • Optimized model performance: After hyperparameter tuning and without feature selection, the XGBoost model achieved 97.3% accuracy, the highest performance among all the classifiers.

The rest of the paper is organized as follows: Section 2 describes the methodology, Section 3 presents the results of the feature selection techniques and classifiers, and Section 4 concludes the paper.

Methodology

The proposed research is performed using two approaches, as shown in Fig. 1. In the first approach, 21 classifiers are applied directly to the dataset to predict heart disease and their performance is evaluated; hyperparameter tuning is then performed on the classifiers. In the second approach, 11 feature selection techniques are applied to the dataset and the 21 classifiers are applied to the selected features. The heart disease dataset (Comprehensive) is used for the proposed research33. This dataset consists of 11 features and 1190 instances, and was curated by combining five datasets over their 11 common features: the Cleveland, Hungarian, Switzerland, Long Beach VA, and Statlog heart disease datasets. The 11 features of the dataset are tabulated in Table 1.

Fig. 1
figure 1

Execution flow of the experiment.

Performance measures

Six metrics were considered for evaluating the classifiers: accuracy, precision, sensitivity, specificity, F1 score, and AUC. Accuracy measures overall correctness, i.e., the ratio of correctly predicted instances to the total number of instances in the dataset. Precision is the fraction of predicted positives that are actual positives. Sensitivity gives the proportion of actual positive cases correctly identified by the model, i.e., the true positive rate. Specificity gives the proportion of actual negative cases correctly identified by the model, i.e., the true negative rate. The F1 score is the harmonic mean of precision and sensitivity, balancing false positives and false negatives. AUC, the area under the ROC curve (which plots the true positive rate against the false positive rate), measures the model's ability to distinguish between the classes.
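
For concreteness, the following is a minimal sketch of how these six metrics can be computed with Scikit-Learn; the arrays `y_true`, `y_pred`, and `y_prob` are illustrative placeholders, not values from the experiment.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Placeholder labels, hard predictions, and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("precision:  ", precision_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))    # true positive rate
print("specificity:", tn / (tn + fp))                  # true negative rate
print("F1 score:   ", f1_score(y_true, y_pred))
print("AUC:        ", roc_auc_score(y_true, y_prob))   # needs probabilities
```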

Statistical analysis

The dataset consists of both continuous and categorical features. The baseline characteristics were analyzed to compare two groups: patients with heart disease (group-1) and without heart disease (group-2). For continuous features such as age, resting bp s, cholesterol, max heart rate, and oldpeak, the mean ± standard deviation (SD) was calculated for both groups, and p-values were obtained using an independent two-sample t-test. For categorical features such as sex, chest pain type, fasting blood sugar, resting ecg, exercise angina, and ST slope, the counts were calculated within the two groups, and p-values were obtained using the chi-square test. These p-values determine whether a feature differs significantly between the two groups; a feature with a p-value below 0.05 is considered statistically significant. The p-values of both the continuous and categorical features, tabulated in Tables 2 and 3, are all below 0.05, indicating statistical significance.
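
A minimal sketch of this analysis with SciPy and Pandas is shown below; the file name and the 'target' column label are assumptions, while the feature column names follow Table 1.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("heart_disease_comprehensive.csv")    # hypothetical file name
g1, g2 = df[df["target"] == 1], df[df["target"] == 0]  # with / without disease

# Continuous features: mean ± SD per group and an independent two-sample t-test.
for col in ["age", "resting bp s", "cholesterol", "max heart rate", "oldpeak"]:
    _, p = stats.ttest_ind(g1[col], g2[col])
    print(f"{col}: {g1[col].mean():.1f}±{g1[col].std():.1f} vs "
          f"{g2[col].mean():.1f}±{g2[col].std():.1f}, p={p:.4g}")

# Categorical features: group-wise counts and a chi-square test of independence.
for col in ["sex", "chest pain type", "fasting blood sugar",
            "resting ecg", "exercise angina", "ST slope"]:
    chi2, p, _, _ = stats.chi2_contingency(pd.crosstab(df[col], df["target"]))
    print(f"{col}: p={p:.4g}")
```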

Table 1 Features of the dataset.
Table 2 Significance of continuous features.
Table 3 Significance of categorical features.

Implementation environment

The experiment was carried out using various software and hardware resources. Python 3.10.14 was used for the implementation. The Scikit-Learn library was used to implement the machine learning algorithms, while TensorFlow and Keras were used to build the deep learning classifiers. Numerical computation and pre-processing of the dataset were carried out with the NumPy and Pandas libraries. Statistical analysis was performed using SciPy, and data visualization was conducted with the Matplotlib and Seaborn libraries. The hardware setup included a PC with an Intel® Core™ i5-10300H CPU @ 2.50 GHz, 8 GB RAM, 237 GB storage, and an NVIDIA GeForce GTX 1650 GPU.

Feature selection techniques

For feature selection, 11 techniques spanning filter and embedded methods were chosen and implemented on the dataset: information gain13, the Chi-square test5, Fisher's discriminant analysis (FDA)34, variance threshold35, mean absolute difference (MAD)36, dispersion ratio (DR)37, Relief17, Lasso regularization17, random forest importance35, linear discriminant analysis (LDA)13, and principal component analysis (PCA)38. The key feature selection techniques are described as follows.

Information gain

Information gain measures how much information a feature provides about the target value, i.e., the contribution of the feature in identifying the target. It is the reduction in entropy achieved by conditioning on the feature and thus reflects the relevance of the feature with respect to the target variable. The pseudocode for calculating information gain is given in Algorithm 1.

Algorithm 1
figure a

Pseudocode for calculating information gain
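
In practice, information gain can be approximated with Scikit-Learn's mutual information estimator; the sketch below assumes a hypothetical file name and a binary 'target' column, and applies the 0.1 score threshold used later in the results.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("heart_disease_comprehensive.csv")   # hypothetical file name
X, y = df.drop(columns="target"), df["target"]

# Estimated mutual information (entropy reduction) per feature.
scores = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)

# Keep only features scoring at least 0.1, mirroring the results section.
selected = scores[scores >= 0.1].sort_values(ascending=False)
print(selected)
```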

Chi-square test

The Chi-square test assesses how categorical variables are related by comparing observed values to expected values. First, the chi-square statistic is calculated using Eq. (1) between each feature and the target variable, and then the desired features are selected. The pseudocode for the Chi-square test is given in Algorithm 2.

$$\begin{aligned} \chi ^2=\sum \frac{(\text { Observed value }- \text { Expected value })^2}{\text { Expected value }} \end{aligned}$$
(1)
Algorithm 2
figure b

Pseudocode for Chi-square test
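
A possible Scikit-Learn equivalent is sketched below; chi2 returns both the statistic of Eq. (1) and a p-value per feature, and the 0.001 threshold shown here is one of the p-value thresholds discussed in the results.

```python
import pandas as pd
from sklearn.feature_selection import chi2

df = pd.read_csv("heart_disease_comprehensive.csv")   # hypothetical file name
X, y = df.drop(columns="target"), df["target"]        # chi2 requires non-negative X

chi2_stats, p_values = chi2(X, y)                     # Eq. (1) per feature
threshold = 0.001                                     # one of the thresholds used later
selected = [c for c, p in zip(X.columns, p_values) if p < threshold]
print(selected)
```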

FDA

The FDA is widely used for feature selection in classification problems. It seeks a scalar combination of features that separates two or more classes. The Fisher score is derived from the Fisher ratio, which is calculated using Eq. (2). The pseudocode for FDA is given in Algorithm 3.

$$\begin{aligned} F_i=\frac{\left( \text { Mean }_{\text {Class } 1}-\text { Mean }_{\text {Class } 2}\right) ^2}{\text { Variance }_{\text {Class } 1}+\text { Variance }_{\text {Class } 2}} \end{aligned}$$
(2)

\(\text { Mean }_{\text {Class } 1}\) and \(\text { Mean }_{\text {Class } 2}\) are the means of the feature \(X_i\) in the two classes, and \(\text { Variance }_{\text {Class } 1}\) and \(\text { Variance }_{\text {Class } 2}\) are the corresponding variances.

Algorithm 3
figure c

Pseudocode for FDA
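
Eq. (2) translates directly into a few lines of Pandas; the sketch below assumes a hypothetical file name and a binary 'target' column, and ranks the features by descending Fisher score.

```python
import pandas as pd

def fisher_score(x: pd.Series, y: pd.Series) -> float:
    """Fisher's ratio of one feature for a binary target, per Eq. (2)."""
    x1, x2 = x[y == 1], x[y == 0]
    return (x1.mean() - x2.mean()) ** 2 / (x1.var() + x2.var())

df = pd.read_csv("heart_disease_comprehensive.csv")   # hypothetical file name
y = df["target"]
scores = {c: fisher_score(df[c], y) for c in df.columns if c != "target"}
top7 = sorted(scores, key=scores.get, reverse=True)[:7]  # top 7 feature set
print(top7)
```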

MAD

The MAD is similar to the variance threshold, with the absence of the square being the difference: it is the mean absolute deviation of each value from the feature mean, a scaled variant of the variance. The higher the MAD, the more information the feature carries, i.e., the better its discriminatory power. The mean absolute difference is calculated using Eq. (3). The pseudocode for MAD is given in Algorithm 4.

$$\begin{aligned} \operatorname {MAD}\left( x_i\right) =\frac{1}{n} \sum _{j=1}^n\left| x_{i j}-\operatorname {Mean}\left( x_i\right) \right| \end{aligned}$$
(3)
Algorithm 4
figure d

Pseudocode for the MAD
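
Eq. (3) can be implemented in one line of NumPy; the sketch below applies it column-wise and uses the 1.0 cut-off mentioned in the results, with the file name again an assumption.

```python
import numpy as np
import pandas as pd

def mean_absolute_difference(x: pd.Series) -> float:
    """MAD of a feature, per Eq. (3): mean absolute deviation from the mean."""
    return float(np.mean(np.abs(x - x.mean())))

df = pd.read_csv("heart_disease_comprehensive.csv")   # hypothetical file name
X = df.drop(columns="target")
mad_scores = X.apply(mean_absolute_difference)        # one score per feature
print(mad_scores[mad_scores > 1.0])                   # threshold used in the results
```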

DR

The DR of a given feature is the ratio of its arithmetic mean (AM) to its geometric mean (GM). The higher the dispersion ratio, the more relevant the feature. The AM, GM, and DR are calculated using Eqs. (4), (5), and (6) respectively. The pseudocode for the dispersion ratio is given in Algorithm 5.

$$\begin{aligned} A M_i= & \frac{1}{n} \sum _{j=1}^n x_{i j} \end{aligned}$$
(4)
$$\begin{aligned} G M_i= & \left( \prod _{j=1}^n x_{i j}\right) ^{1 / n} \end{aligned}$$
(5)
$$\begin{aligned} D R= & \frac{A M_i}{G M_i} \end{aligned}$$
(6)

Since \({AM_i}\) \(\ge\) \({GM_i}\) always holds for non-negative values (the AM-GM inequality), the DR lies in the interval \([1, +\infty )\).

Algorithm 5
figure e

Pseudocode for the DR
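
Eqs. (4)-(6) can be computed with NumPy and SciPy as sketched below. Note that the geometric mean becomes zero whenever a feature contains a zero value, so a small epsilon is added as a guard; the epsilon and file name are assumptions, not part of the original method.

```python
import numpy as np
import pandas as pd
from scipy.stats import gmean

def dispersion_ratio(x: np.ndarray, eps: float = 1e-9) -> float:
    """DR per Eqs. (4)-(6): arithmetic mean divided by geometric mean."""
    x = x.astype(float) + eps          # guard: GM is zero if any value is zero
    return x.mean() / gmean(x)

df = pd.read_csv("heart_disease_comprehensive.csv")   # hypothetical file name
for col in df.columns.drop("target"):
    print(col, dispersion_ratio(df[col].to_numpy()))
```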

Relief

In the Relief technique, weights are assigned to the features according to how well they differentiate between instances of the same class and of different classes. Relief and ReliefF are two variants of this feature selection technique: Relief is used for binary classification, while ReliefF handles both binary and multi-class classification. The weights are updated using Eq. (7). The pseudocode for Relief is given in Algorithm 6.

$$\begin{aligned} W_i=W_i-\frac{\left| x_i-x_i^{\prime }\right| }{k}+\frac{\sum _{c=1}^C \left| x_i-x_i^{\prime \prime }(c)\right| }{C \cdot k} \end{aligned}$$
(7)

Where C is the number of classes, k is the number of nearest neighbors considered, \(x_i^{\prime }\) is the value of feature i in the nearest same-class (hit) instance, and \(x_i^{\prime \prime }(c)\) is the value of feature i in the nearest miss instance of class c.

Algorithm 6
figure f

Pseudocode for the Relief
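
Below is a minimal NumPy sketch of binary Relief with a single nearest hit and miss per sampled instance, i.e., the special case of Eq. (7) with one miss class and k = 1; a full ReliefF implementation would average over k neighbors per class. The iteration count and seed are illustrative assumptions.

```python
import numpy as np

def relief(X: np.ndarray, y: np.ndarray, n_iter: int = 100, seed: int = 42):
    """Minimal binary Relief: reward features that differ at the nearest miss
    and penalize features that differ at the nearest hit (Eq. (7), k = 1)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to instance i
        dist[i] = np.inf                      # exclude the instance itself
        hit = np.where(y == y[i], dist, np.inf).argmin()
        miss = np.where(y != y[i], dist, np.inf).argmin()
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter
```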

Random forest importance

Random forest aggregates a predetermined number of decision trees. Features are ranked by how much they improve node purity: nodes with the largest decrease in impurity appear at the beginning of the trees. Trees below a specific node can therefore be pruned to produce a set of the most significant features. The pseudocode for Random Forest Importance is given in Algorithm 7.

Algorithm 7
figure g

Pseudocode for the Random Forest Importance
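
In Scikit-Learn, these impurity-based scores are exposed as feature_importances_ after fitting; the sketch below applies the 0.1 threshold used in the results, with the file name an assumption.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("heart_disease_comprehensive.csv")   # hypothetical file name
X, y = df.drop(columns="target"), df["target"]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Mean decrease in impurity per feature, normalized to sum to one.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances[importances > 0.1].sort_values(ascending=False))
```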

Classifiers

A total of 21 classifiers spanning boosting algorithms, ensemble learning algorithms, tree-based algorithms, and neural networks are applied in this experiment to predict heart disease, namely logistic regression39, decision tree6, random forest12, k-nearest neighbors6, support vector machine12, Gaussian naïve Bayes12, extreme gradient boosting40, adaptive boosting6, stochastic gradient descent41, gradient boosting classifier13, extra tree classifier42, categorical boosting40, light gradient boosting machine29, multi-layer perceptron43, recurrent neural network44, long short-term memory44, gated recurrent unit45, bidirectional long short-term memory44, bidirectional gated recurrent unit45, convolutional neural network36, and a hybrid model. The ‘window_size’ was set to 1 while implementing the LSTM, GRU, Bi-LSTM, Bi-GRU, and CNN. The key techniques are described as follows.

RF

The output of the random forest classifier is the class that is the mode of the classes predicted by the individual trees. The goal of training is to reduce the error across the group of trees. For classification, the majority answer from the trees determines the random forest's prediction; Eq. (8) represents the predicted class. The pseudocode for RF is given in Algorithm 8.

$$\begin{aligned} {\hat{y}}=\arg \max _m\left( \sum _{n=1}^N I\left( y_{m n}=m\right) \right) \end{aligned}$$
(8)

Where \(y_{m n}\) is the predicted class for the \(m^{th}\) sample in the \(n^{th}\) tree, and \(I(\cdot )\) is the indicator function.

Algorithm 8
figure h

Pseudocode for RF
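
A minimal sketch of the classifier on the 80:20 split described later in the results is shown below; the depth and estimator settings follow the values reported there, while the file name is an assumption.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart_disease_comprehensive.csv")   # hypothetical file name
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Majority vote over 100 trees, per Eq. (8); max_depth follows the results section.
rf = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=42)
rf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```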

Extreme gradient boosting (XGBoost)

XGBoost builds the model by combining the predictions of several weak learners, and it optimizes gradient descent for high efficiency and predictive power. The core of XGBoost is its objective function, which is optimized during training. The loss function is typically ‘mean_squared_error’ (MSE) for regression tasks and ‘Logloss’ (binary or multi-class) or another suitable function for classification tasks. The objective function for a binary classification problem is given in Eq. (9). The pseudocode for XGBoost is given in Algorithm 9.

$$\begin{aligned} \operatorname {Obj}(\theta )=\sum _{i=1}^n L\left( y_i, {\hat{y}}_i\right) +\sum _{k=1}^K \Omega \left( f_k\right) \end{aligned}$$
(9)

Where \(\operatorname {Obj}(\theta )\) is the overall objective to be minimized, \(L\left( y_i, {\hat{y}}_i\right)\) is the loss for an individual instance, and \(\Omega \left( f_k\right)\) is the regularization term for each tree.

Algorithm 9
figure i

Pseudocode for XGBoost
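
A sketch with the tuned values reported later in the results (‘n_estimators’=100, ‘random_state’=21, ‘max_depth’=7) is shown below, reusing the 80:20 split from the RF sketch above.

```python
from xgboost import XGBClassifier

# Binary logistic objective minimizes the regularized logloss of Eq. (9);
# X_train, y_train, X_test come from the 80:20 split shown earlier.
xgb = XGBClassifier(objective="binary:logistic", n_estimators=100,
                    max_depth=7, random_state=21, eval_metric="logloss")
xgb.fit(X_train, y_train)
pred = xgb.predict(X_test)
```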

AdaBoost

The ensemble learning algorithm AdaBoost builds its model from weak learners, as XGBoost does. The performance of weak learners is improved by giving more importance, or weight, to the wrongly classified data points in each iteration; AdaBoost adapts by emphasizing the training instances that are difficult to classify correctly. At its core is the formula in Eq. (10), which calculates a weighted error rate for each classifier; a “voting power” is then assigned based on accuracy. The pseudocode for AdaBoost is given in Algorithm 10.

$$\begin{aligned} \varepsilon _t=\sum _{i=1}^N w_t^{(i)} I\left( h_t\left( x^{(i)}\right) \ne y^{(i)}\right) \end{aligned}$$
(10)

Where N is the total count of training instances,

\(w_t^{(i)}\) represents the weight of instance i at iteration t,

\(h_t\left( x^{(i)}\right)\) is the prediction of the weak learner \(h_t\) for instance \(x^{(i)}\),

\(y^{(i)}\) is the true label for instance i,

\(I\left( h_t\left( x^{(i)}\right) \ne y^{(i)}\right)\) is an indicator function that equals one if the prediction is incorrect and zero otherwise.

The voting power of the weak learner at iteration t is calculated using Eq. (11).

$$\begin{aligned} \alpha _t=\frac{1}{2} \ln \left( \frac{1-\varepsilon _t}{\varepsilon _t}\right) \end{aligned}$$
(11)
Algorithm 10
figure j

Pseudocode for AdaBoost
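
The two formulas are simple to evaluate directly; the toy weights below are illustrative only, and the classifier call uses the ‘n_estimators’=50 setting reported as best in the results, reusing the earlier 80:20 split.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Toy illustration of Eqs. (10) and (11) with 8 equally weighted instances,
# two of which the weak learner misclassifies.
w = np.full(8, 1 / 8)                       # instance weights at iteration t
wrong = np.array([0, 1, 0, 0, 1, 0, 0, 0])  # indicator of misclassification
eps = np.sum(w * wrong)                     # Eq. (10): weighted error rate
alpha = 0.5 * np.log((1 - eps) / eps)       # Eq. (11): voting power
print(f"error={eps:.3f}, voting power={alpha:.3f}")

# Fitting on the 80:20 split shown earlier, with the tuned estimator count.
ada = AdaBoostClassifier(n_estimators=50).fit(X_train, y_train)
```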

SGD

The basic principle of SGD is to update the model parameters by calculating the gradient of the loss function with respect to the parameters on a mini-batch of the training set in each iteration. The update rule for the model parameters in each iteration is given in Eq. (12). The pseudocode for SGD is given in Algorithm 11.

$$\begin{aligned} \theta _{t+1}=\theta _t-\eta _t \nabla J_t\left( \theta _t\right) \end{aligned}$$
(12)

Where \(\nabla J_t\left( \theta _t\right)\) is the gradient of the objective function with respect to \(\theta\) at iteration t, and \(\eta _t\) is the learning rate.

The loss function used for binary classification tasks in the SGD classifier is given in Eq. (13).

$$\begin{aligned} L(y, {\hat{y}})=-(y(\log {\hat{y}})+(1-y) \log (1-{\hat{y}})) \end{aligned}$$
(13)

Where y is the true label and \({\hat{y}}\) is the predicted probability.

Algorithm 11
figure k

Pseudocode for SGD
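
A sketch with the tuned settings from the results (logistic loss, regularization strength 0.001, at most 1000 iterations) is shown below; note that recent Scikit-Learn releases name the logistic loss 'log_loss' (older versions used 'log').

```python
from sklearn.linear_model import SGDClassifier

# Each step applies the update rule of Eq. (12); 'log_loss' is the binary
# cross-entropy of Eq. (13), and alpha is the regularization strength.
sgd = SGDClassifier(loss="log_loss", alpha=0.001, max_iter=1000, random_state=42)
sgd.fit(X_train, y_train)   # X_train, y_train from the 80:20 split shown earlier
```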

GB

The gradient-boosting (GB) algorithm builds the model by combining the predictions of several weak learners. The loss function to be minimized is the sum of the individual losses over all instances. Each weak learner (typically a shallow decision tree) predicts a score for each instance, and the class label of a new instance is obtained by summing the predictions from all weak learners, as given in Eq. (14). The pseudocode for GB is given in Algorithm 12.

$$\begin{aligned} {\hat{y}}(x)=\sigma \left( \sum _{t=1}^T \eta h_t(x)\right) \end{aligned}$$
(14)

Where \(h_t(x)\) is the prediction of the weak learner at iteration t, \(\eta\) is the learning rate, and \(\sigma\) is the sigmoid function.

Algorithm 12
figure l

Pseudocode for GB
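
A sketch with the best settings reported later in the results (100 trees of depth 5) follows, again reusing the earlier 80:20 split.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Sums shallow-tree predictions through a sigmoid, per Eq. (14).
gb = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42)
gb.fit(X_train, y_train)    # 80:20 split as before
```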

Extra tree classifier (ETC)

The ETC belongs to the family of decision tree-based classifiers and is very similar to a random forest, but the extra trees introduce additional randomness in the split selection, which reduces variance and overfitting. The final prediction is obtained by aggregating the individual predictions of all trees: a majority vote for classification tasks and an average of the predictions for regression tasks. The pseudocode for ETC is given in Algorithm 13.

Algorithm 13
figure m

Pseudocode for ETC

Categorical boosting (CatBoost)

CatBoost is known for its efficient handling of categorical features, strong performance, and high computational efficiency. Its major advantage is that it can handle categorical features without one-hot encoding. Like other gradient boosting algorithms, CatBoost uses a gradient boosting framework, and it employs ‘ordered boosting’ to deal with categorical variables. The final prediction is given by Eq. (15). The pseudocode for CatBoost is given in Algorithm 14.

$$\begin{aligned} {\hat{y}}(x)=\sigma \left( \sum _{t=1}^T F_t(x)\right) \end{aligned}$$
(15)

Where \(F_t(x)\) is the \(t^{th}\) tree's prediction for input x and \(\sigma\) is the sigmoid function.

Algorithm 14
figure n

Pseudocode for CatBoost
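
A sketch with the final tuned values from the results (100 iterations, depth 7, learning rate 0.1, 'Logloss') is shown below; passing cat_features is an option CatBoost offers for native categorical handling, not something the paper states it used.

```python
from catboost import CatBoostClassifier

cb = CatBoostClassifier(iterations=100, depth=7, learning_rate=0.1,
                        loss_function="Logloss", verbose=False)
# cat_features=[...] could be supplied to skip one-hot encoding entirely.
cb.fit(X_train, y_train)    # 80:20 split as before
```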

Light gradient boosting machine (LightGBM)

LightGBM, developed by Microsoft, is another gradient boosting algorithm and is particularly suitable for large datasets. Its efficiency, speed, and scalability set it apart from other gradient-boosting algorithms. Like CatBoost, LightGBM fits each new tree to the negative gradient of the loss function; the ‘Logloss’ function is used for classification tasks and ‘mean_squared_error’ for regression tasks. The pseudocode for LightGBM is given in Algorithm 15.

Algorithm 15
figure o

Pseudocode for LightGBM
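
A sketch matching the settings reported later in the results ('gbdt' boosting, 'random_state'=42, early stopping after 10 rounds) follows; the callback API shown is that of recent LightGBM versions, and the split is the earlier 80:20 one.

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(boosting_type="gbdt", random_state=42)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],                        # 80:20 split as before
          callbacks=[lgb.early_stopping(stopping_rounds=10)])
```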

MLP

The MLP is an ANN with more than one hidden layer. The model consists of three hidden layers with 150, 100, and 50 neurons respectively, each using the ReLU activation function, and an input layer that adapts to the input dimensions. The final layer employs sigmoid activation with a single neuron, producing probability predictions. The architecture of the MLP is shown in Fig. 2, and the pseudocode for MLP is given in Algorithm 16.

Fig. 2
figure 2

Model architecture of MLP.

Algorithm 16
figure p

Pseudocode for MLP
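
A sketch of this architecture with Scikit-Learn's MLPClassifier, using the layer sizes of Fig. 2 and the 1000-iteration cap mentioned in the results, is shown below; the random seed is an assumption.

```python
from sklearn.neural_network import MLPClassifier

# Three ReLU hidden layers of 150, 100, and 50 neurons, as in Fig. 2; the
# sigmoid output for binary classification is implicit.
mlp = MLPClassifier(hidden_layer_sizes=(150, 100, 50), activation="relu",
                    max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)   # 80:20 split as before
```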

RNN

A recurrent layer in an RNN processes a sequence of inputs sequentially. Time series forecasting is one of the major applications of RNNs, where the full input shape is 3D: batch size, time steps, and the dimensionality of the inputs at each time step. A recurrent layer is composed of a single memory cell, which is used repeatedly to compute the outputs. A memory cell is a small neural network; it can be a simple dense layer or a more complex cell such as an LSTM or GRU cell. The structure and architecture of the RNN are shown in Figs. 3 and 4 respectively. Here, the first layer outputs a sequence that is fed to the second recurrent layer and then to the third layer; in the third layer, ‘return_sequences’ is set to False, so only the output of the final time step is passed to the dense layer.
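
A Keras sketch of this stack, using the layer sizes (128/64/32), optimizer, loss, epochs, and patience reported in the results, is shown below; with 'window_size' = 1 the inputs are reshaped to a single time step, and monitoring the training loss for early stopping is an assumption.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Reshape tabular inputs to 3D (batch, time steps, features); window_size = 1.
n_features = X_train.shape[1]               # 80:20 split as before
X_train_3d = np.asarray(X_train, dtype="float32").reshape(-1, 1, n_features)

model = keras.Sequential([
    keras.Input(shape=(1, n_features)),
    layers.SimpleRNN(128, return_sequences=True),
    layers.SimpleRNN(64, return_sequences=True),
    layers.SimpleRNN(32),                   # return_sequences=False: last step only
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train_3d, np.asarray(y_train), epochs=100, batch_size=32,
          callbacks=[keras.callbacks.EarlyStopping(monitor="loss", patience=10)],
          verbose=0)
```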

LSTM

LSTM consists of three gates and one memory cell. During training, the forget gate gradually learns when to erase part of the long-term memory. For example, if the long-term memory holds information that there is a strong upward trend and the forget gate sees a severe drop in the inputs at the current time step, it will probably learn to erase that part of the long-term memory, since the upward trend is over. In short, the forget gate learns when to forget things and when to preserve them. The input gate learns when it should store information in the memory, and the output gate learns which part of the long-term state vector it should output at each time step. The memory cell thus learns when to forget memories, when to store new ones, and which part of the long-term state vector to output at each time step. LSTM has a much longer memory than a plain RNN. The structure and architecture of LSTM are shown in Figs. 5 and 6 respectively.

GRU

A GRU processes sequential input data like an LSTM. It has a reset gate and an update gate; these gates decide what information to remove and what to keep. The GRU is like the LSTM but does not maintain a separate cell state. The structure and architecture of GRU are shown in Figs. 7 and 8 respectively.

Bi-LSTM

A Bi-LSTM consists of two LSTM models: the first takes the input as-is and learns the sequence, while the second takes the reversed sequence as input. Bi-LSTM is more complex than Bi-GRU, so training takes longer and the model has more parameters. It is known for capturing long-term dependencies. The structure and architecture of Bi-LSTM are shown in Figs. 9 and 10 respectively.

Fig. 3
figure 3

Structure of RNN.

Fig. 4
figure 4

Model architecture of RNN.

Fig. 5
figure 5

Structure of LSTM.

Fig. 6
figure 6

Model architecture of LSTM.

Fig. 7
figure 7

Structure of GRU.

Fig. 8
figure 8

Model architecture of GRU.

Fig. 9
figure 9

Structure of Bi-LSTM.

Fig. 10
figure 10

Model architecture of Bi-LSTM.

Bi-GRU

Bi-GRU has a more straightforward structure since it merges the hidden state and the memory into one state. It uses two gates, the update gate and the reset gate, but lacks the explicit memory cell found in LSTM. Due to its simpler structure, training is somewhat faster and computationally cheaper, making Bi-GRU suitable where a balance between model performance and computational efficiency is required. The structure and architecture of Bi-GRU are shown in Figs. 11 and 12 respectively.

Fig. 11
figure 11

Structure of Bi-GRU.

Fig. 12
figure 12

Model architecture of Bi-GRU.

CNN

A 1D CNN is exactly like a 2D CNN except that it slides filters across one dimension instead of two (typically the width and height of an image). Like an RNN, a 1D CNN can take input of any length, but it has no memory at all: the output is computed from a window of input time steps determined by the kernel size. Instead of using a single layer with a large kernel, it is better to stack multiple CNN layers, each with a small kernel. The architecture of the CNN is shown in Fig. 13.

Fig. 13
figure 13

Model architecture of CNN.

Hybrid model

The hybrid model is built from six neural networks, namely CNN, RNN, LSTM, GRU, Bi-LSTM, and Bi-GRU. Three hidden layers were introduced in each of the six individual models. After concatenating the outputs of the six models, a dense layer with 128 neurons and the ‘ReLU’ activation function was added. The output layer is a dense layer with one neuron and a sigmoid activation function. The model architecture is shown in Fig. 14, and a sketch of the wiring is given after it.

Fig. 14
figure 14

Model architecture of hybrid model.
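
A Keras functional-API sketch of this wiring follows, under stated assumptions: inputs use one time step ('window_size' = 1, as earlier), the CNN branch uses kernel size 1 and 'same' pooling so that a pool size of 2 remains valid on length-1 sequences, and the branch layer widths (64/32/16) follow the description in the results section.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 11
inp = keras.Input(shape=(1, n_features))        # window_size = 1

# CNN branch: 64-filter convolution, pooling, flatten. kernel_size=1 and
# padding="same" in the pooling layer are assumptions for length-1 sequences.
cnn = layers.Conv1D(64, kernel_size=1, activation="relu")(inp)
cnn = layers.MaxPooling1D(pool_size=2, padding="same")(cnn)
cnn = layers.Flatten()(cnn)

def recurrent_branch(cell, bidirectional=False):
    """Three stacked recurrent layers of 64, 32, and 16 units."""
    wrap = layers.Bidirectional if bidirectional else (lambda l: l)
    x = wrap(cell(64, return_sequences=True))(inp)
    x = wrap(cell(32, return_sequences=True))(x)
    return wrap(cell(16))(x)

branches = [cnn,
            recurrent_branch(layers.SimpleRNN),
            recurrent_branch(layers.LSTM),
            recurrent_branch(layers.GRU),
            recurrent_branch(layers.LSTM, bidirectional=True),
            recurrent_branch(layers.GRU, bidirectional=True)]

# Concatenate all branches, then a 128-neuron ReLU layer and sigmoid output.
merged = layers.Dense(128, activation="relu")(layers.concatenate(branches))
out = layers.Dense(1, activation="sigmoid")(merged)

model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```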

Results and discussion

This section is divided into three sub-sections: the results of the feature selection techniques, the results of the classifiers without feature selection, and the results of the classifiers with feature selection.

Results of feature selection techniques

This section discusses how the feature selection is performed, the individual ranks and scores of the features, and how the desired features are selected for implementation.

Information gain

The importance scores of all features are visualized in Fig. 15. Almost half of the features have importance scores above 0.1, while the rest fall below 0.1. The features with importance scores below 0.1 are removed, as they do not contribute much to heart disease prediction; these are age (0.063), sex (0.057), resting bp s (0.040), fasting blood sugar (0.044), and resting ecg (0.034). Prediction is then performed using the features listed in Table 4, where their importance scores are tabulated accordingly.

Fig. 15
figure 15

Feature importance of information gain.

Table 4 Feature importance of information gain.

Chi-square test

The respective feature scores are shown in Fig. 16. The p-values indicate whether the results are significant. First, the chi-square test was performed on the dataset, and the respective scores and p-values were calculated; the desired features were then selected based on the p-values. Three thresholds can be chosen for the p-values: 0.001 is the strong threshold, 0.005 is the medium threshold, and 0.01 is the weak threshold. The features with p-values below the threshold are selected. The medium and weak thresholds yielded the same set of features. The features selected by the strong threshold and by the medium and weak thresholds are tabulated in Tables 5 and 6 respectively.

Fig. 16
figure 16

Chi-square test results of the features.

Table 5 Features selected by strong threshold.
Table 6 Features selected by medium and weak thresholds.

FDA

Fisher’s scores of the features are shown in Fig. 17. The considered heart disease dataset is divided into three sets with the top 7, top 5, and top 3 features based on the descending order of the Fisher’s scores, and then the classifiers are applied to the features for the heart disease prediction. The first set consists of the features resting ecg, fasting blood sugar, chest pain type, oldpeak, exercise angina, max heart rate, and ST slope. The second set consists of the features resting ecg, fasting blood sugar, chest pain type, oldpeak, and exercise angina. The third set consists of the features resting ecg, fasting blood sugar, and chest pain type.

Variance threshold

The variance threshold technique removes features whose variance falls below a set threshold (features with zero variance carry no information at all). Three threshold values can be considered: 0.01 is the small or weak threshold, 0.1 is the medium threshold, and 0.5 is the strong threshold. In this experiment, a threshold of 0.5 is used to obtain better results. From Fig. 18, it is observed that the features sex, fasting blood sugar, exercise angina, and ST slope are marked ‘False’, indicating that these 4 features have variance lower than the threshold, contribute little information, and can be removed. The final feature set consists of the features with variance greater than the threshold, namely age, chest pain type, resting bp s, cholesterol, resting ecg, max heart rate, and oldpeak.

MAD

Only a few of the features shown in Fig. 19 have a MAD above 1.0; the remaining features, namely sex, chest pain type, fasting blood sugar, resting ecg, exercise angina, oldpeak, and ST slope, have weak discriminatory power and are therefore no longer used for predicting the presence of heart disease. The features age, resting bp s, cholesterol, and max heart rate have MADs of 7.581, 14.057, 72.453, and 21.081 respectively. Of these 4 features, cholesterol has the largest MAD, indicating that it carries the most information, i.e., has the strongest discriminatory power. These 4 features also appear, along with others, in the final feature set produced by the variance threshold, which illustrates that only the absence of the square distinguishes the two criteria.

DR

As shown in Fig. 20, only a few features have a dispersion ratio greater than one, namely sex (1.039), fasting blood sugar (1.047), resting ecg (1.132), and exercise angina (1.061). The features with a zero dispersion ratio, namely age, chest pain type, resting bp s, cholesterol, max heart rate, oldpeak, and ST slope, are treated as irrelevant and are not required to predict heart disease.

Relief

The final output of the Relief feature selection is shown in Fig. 21, where only three features, namely chest pain type, exercise angina, and ST slope, have feature scores above 0.1. Two feature sets are used for the prediction of heart disease: the first consists of the top 5 ranked features, namely ST slope, chest pain type, exercise angina, sex, and oldpeak; the second consists of the top 7 ranked features, namely ST slope, chest pain type, exercise angina, sex, oldpeak, max heart rate, and resting ecg.

Lasso regularization

The features sex, chest pain type, fasting blood sugar, exercise angina, oldpeak, and ST slope have non-zero coefficients, as shown in Fig. 22, and these form the final feature set. Some other features, such as age (0.004), resting bp s (0.001), cholesterol (-0.0004), and max heart rate (-0.002), also have non-zero coefficients, but their scores are very close to zero compared with the score of fasting blood sugar (0.052). The coefficient of resting ecg is zero, so its feature score is 0.0 and it can be eliminated directly.

Random forest importance

The feature importance scores and feature ranking produced by the random forest importance technique are shown in Fig. 23. A threshold of 0.1 was set on the importance scores to identify the informative features. The features with importance scores below 0.1, namely age, resting bp s, exercise angina, sex, resting ecg, and fasting blood sugar, are eliminated. The final feature set consists of the features with importance scores above the threshold, namely ST slope, chest pain type, max heart rate, oldpeak, and cholesterol.

LDA

From Fig. 24, it can be observed that features such as sex, chest pain type, fasting blood sugar, resting ecg, exercise angina, oldpeak, and ST slope have importance scores above 0.1, whereas age, resting bp s, cholesterol, and max heart rate are very close to zero, with cholesterol (-0.003) and max heart rate (-0.015) having negative importance scores. Fig. 24 thus ranks the features by their importance scores. A threshold of 0.9 on the importance scores was used to attain better results, and two feature sets were formed accordingly: the first consists of the top 3 features, namely sex, ST slope, and exercise angina; the second consists of the top 5 features, namely sex, ST slope, exercise angina, chest pain type, and fasting blood sugar.

PCA

The explained variance ratio (EVR) of the features is shown in Fig. 25, where age has the highest EVR and ST slope the lowest. The experiment is performed on 2 feature sets. A threshold of 0.1 is chosen for the first feature set, which therefore consists of the features with an EVR above 0.1, namely age, sex, and chest pain type. A threshold of 0.08 is chosen for the second feature set, which consists of the features of the first set along with resting bp s (0.086) and cholesterol (0.081).

Fig. 17
figure 17

Fisher’s scores.

Fig. 18
figure 18

Features selected by variance threshold.

Fig. 19
figure 19

Mean absolute difference of the features.

Fig. 20
figure 20

Dispersion ratio of the features.

Fig. 21
figure 21

Relief feature scores.

Fig. 22
figure 22

Feature scores of Lasso regularization.

Fig. 23
figure 23

Random forest importance feature scores.

Fig. 24
figure 24

Feature importance.

Fig. 25
figure 25

Explained variance ratio.

Results of classifiers without feature selection

This section presents the results obtained by applying the classifiers to the entire dataset without using feature selection techniques. Six metrics are used to evaluate the classifiers: accuracy, precision, sensitivity/recall, specificity, F1-score, and area under the curve (AUC), where the ‘curve’ refers to the receiver operating characteristic (ROC) curve. First, the dataset is divided into training and testing/validation sets in an 80:20 ratio, yielding 952 and 238 rows of data respectively. Training was performed first, followed by testing, and hyperparameter tuning was then carried out to improve the results of the classifiers.

When LR is applied to the dataset, it achieves 80.99% training accuracy and 83.2% testing accuracy. The confusion matrix of LR in Fig. 26(a) shows 86 true positives, 21 false negatives, 19 false positives, and 112 true negatives. The precision, sensitivity, specificity, and F1-score achieved by the LR model are 84%, 86%, 80%, and 85% respectively. The ROC curve in Fig. 26(b) shows that the model achieved an AUC of 91% (0.91), indicating decent performance. For DT, hyperparameters such as ‘max_depth’ and ‘random_state’ were tuned to achieve better results; however, there were no major changes in the results after tuning. The DT model achieved 88.1% training accuracy and 84.9% testing accuracy. As shown in Fig. 27(a), the DT model predicted 80 true positives and 122 true negatives, a decent prediction. Its precision, sensitivity, specificity, and F1-score are 82%, 93%, 75%, and 87% respectively, which shows that the model is good at identifying actual positives but somewhat weak at avoiding false positives. DT scored an AUC of 92% (0.92), as shown in Fig. 27(b). The RF model was initially implemented with a maximum depth of 7 and 100 estimators; the maximum depth was later changed to 5 with the same number of estimators. However, the RF model achieved the same results before and after hyperparameter tuning, as shown in Fig. 28(a). It achieved 90% training and 91.6% testing accuracy, with a decent test performance of 97 true positives, 121 true negatives, 10 false positives, and 10 false negatives. The RF model achieved precision, sensitivity, and F1-score of 92% each and a specificity of 91%, along with an AUC of 94% (0.94), as shown in Fig. 28(b). The KNN classifier was applied with 7, 5, and 3 nearest neighbors; of these, KNN with 5 nearest neighbors gave the best results, achieving 71.9% testing accuracy. From the confusion matrix in Fig. 29(a), the KNN model predicted 76 positives out of 107 and 95 negatives out of 131, with 36 false positives and 31 false negatives. Its precision, sensitivity, specificity, and F1 score are 75%, 73%, 71%, and 74% respectively. It achieved an AUC of 78% (0.78), shown in Fig. 29(b), indicating that the model's performance is poor and should not be considered. The SVM model performed better than the KNN model in terms of true positive and true negative rates, as shown in Fig. 30(a), achieving 84.9% testing accuracy and 82.8% training accuracy. Predicting 88 true positives out of 107 actual positives and 114 true negatives out of 131 actual negatives is reasonable, though stronger predictive power would be desirable. Its precision, sensitivity, specificity, and F1 score are 86%, 87%, 82%, and 86% respectively, and it achieved a promising AUC of 90% (0.9), as shown in Fig. 30(b). The performance of Gaussian Naïve Bayes (GNB) is shown in Fig. 31. From Fig. 31(a), GNB classified 89 true positives out of 107 actual positives, wrongly classified 16 negatives as positives, classified 115 true negatives out of 131 actual negatives, and wrongly classified 18 positives as negatives.
The training and testing accuracies achieved by this model are 83.3% and 85.7% respectively. Its precision, sensitivity, specificity, and F1 score are 86%, 88%, 83%, and 87% respectively, and it achieved an AUC of 91% (0.91), as shown in Fig. 31(b). Before hyperparameter tuning, the XGBoost model achieved a training accuracy of 100%, a testing accuracy of 94.5%, a precision of 94%, a sensitivity of 96%, a specificity of 93%, an F1 score of 95%, and an AUC of 98% (0.98). The results improved considerably after hyperparameter tuning: ‘n_estimators’, ‘random_state’, and ‘max_depth’ were set to 100, 21, and 7 respectively, while ‘min_child_weight’, ‘gamma’, ‘subsample’, ‘colsample_bytree’, ‘reg_alpha’, and ‘reg_lambda’ were left unchanged. Fig. 32(a) shows only 3 false positives and 7 false negatives, a strong prediction that showcases the predictive power of the tuned model. After tuning, XGBoost achieved a training accuracy of 100%, a testing accuracy of 97.3%, a precision of 97%, and a sensitivity, specificity, and F1 score of 98% each. It also achieved an AUC of 98% (0.98), as shown in Fig. 32(b).

With ‘n_estimators’ set to 100, the AdaBoost algorithm classified 87 true positives and 120 true negatives out of 107 actual positives and 131 actual negatives respectively, achieving 87% testing accuracy, 86% precision, 92% sensitivity, 81% specificity, an 89% F1 score, and an AUC of 91% (0.91). After changing ‘n_estimators’ to 50, however, the AdaBoost model predicted 92 true positives and 120 true negatives, as shown in Fig. 33(a); testing accuracy increased to 89%, and precision, specificity, and F1 score increased by 3%, 5%, and 1% respectively. The AUC increased by 3% to 94% (0.94), as shown in Fig. 33(b). The SGD model was first implemented with the ‘modified_huber’ loss function and a ‘random_state’ of 42, achieving a testing accuracy of 75.2% with 61 true positives and 117 true negatives. The hyperparameter-tuned model, using the ‘log’ loss function, a regularization strength of 0.001, and a maximum of 1000 iterations, predicted 65 true positives and 117 true negatives, as shown in Fig. 34(a). Its precision, sensitivity, specificity, and F1 score are 74%, 89%, 61%, and 81% respectively, and it achieved an AUC of 76% (0.76), as shown in Fig. 34(b). Hyperparameter tuning of the GB model involved modifying ‘n_estimators’, ‘max_depth’, and ‘random_state’: runs were executed with ‘n_estimators’ of 50, 100, 150, and 200 and ‘max_depth’ of 3, 5, and 7. The best performance was attained with ‘n_estimators’ of 100, ‘max_depth’ of 5, and ‘random_state’ of 42. The GB model predicted 100 true positives and 127 true negatives, as shown in Fig. 35(a). Its training and testing accuracies are 99.9% and 95.4% respectively; its precision, sensitivity, specificity, and F1 score are 95%, 97%, 94%, and 96% respectively; and it achieved an AUC of 98% (0.98), as shown in Fig. 35(b). The performance of the GB model is close to that of the hyperparameter-tuned XGBoost model. The ETC model was tuned similarly to the GB model, with a ‘random_state’ of 42, ‘n_estimators’ of 100, and ‘max_depth’ of 5. Without any feature selection, the ETC model achieved a training accuracy of 87.4% and a testing accuracy of 87.8%. It predicted 94 true positives while misclassifying 13 actual positives as negatives, and predicted 115 true negatives while misclassifying 16 actual negatives as positives, as shown in Fig. 36(a). Its precision, sensitivity, specificity, and F1 score are 90%, 88%, 89%, and 89% respectively, and it achieved an AUC of 94% (0.94), as shown in Fig. 36(b). For CatBoost, hyperparameter tuning was applied to ‘iterations’, ‘depth’, ‘learning_rate’, and ‘loss_function’, with final values of 100, 7, 0.1, and ‘Logloss’ respectively. The model predicted most of the positives, with only 4 actual positives misclassified as negatives, as shown in Fig. 37(a); however, 34 actual negatives were misclassified as positives, indicating limited predictive power. The model achieved a testing accuracy of 84%; its precision, sensitivity, specificity, and F1 score are 96%, 74%, 96%, and 84% respectively, and it achieved an AUC of 96% (0.96), as shown in Fig. 37(b).
The LightGBM model predicted 102 true positives and 118 true negatives, misclassifying 5 actual positives and 13 actual negatives, as shown in Fig. 38(a). The parameters used were similar to those of the CatBoost model: ‘random_state’ was set to 42, ‘boosting_type’ was ‘gbdt’ (gradient-boosted decision trees), and early stopping was set to 10 rounds. The runtime of LightGBM was much lower than that of the other models, and it achieved a strong testing accuracy of 92.44%. Its precision, sensitivity, specificity, and F1 score are 96%, 90%, 95%, and 93% respectively, and it achieved an AUC of 97% (0.97), as shown in Fig. 38(b). The MLP model, available in the Scikit-Learn library, was built with 3 hidden layers of 150, 100, and 50 neurons, as previously shown in Fig. 2, with the maximum number of iterations set to 1000. The MLP model achieved a training accuracy of 84.7% and a testing accuracy of 85.3%. As can be observed from Fig. 39(a), its predictive power is reasonable: it predicted 87 true positives out of 107 actual positives and 116 true negatives out of 131 actual negatives, missing 20 positives and 15 negatives. Its precision, sensitivity, specificity, and F1 score are 85%, 89%, 81%, and 87% respectively, and it scored an AUC of 89% (0.89), as shown in Fig. 39(b). As previously shown in Fig. 4, the RNN model architecture has an input layer with 11 nodes corresponding to the features of the dataset, 3 simple RNN hidden layers with 128, 64, and 32 neurons respectively, and a dense output layer with a ‘Sigmoid’ activation function that outputs either 0 or 1. Adam and binary cross-entropy were used as the optimizer and loss function respectively; the number of epochs was set to 100 and the patience to 10. The RNN model achieved a testing accuracy of 89.5% and failed to predict 7 positives and 36 negatives, as shown in Fig. 40(a). Its precision, sensitivity, specificity, and F1 score are 94%, 79%, 94%, and 86% respectively, and it achieved an AUC of 82% (0.82), as shown in Fig. 40(b). As previously shown in Fig. 6, the LSTM consists of an input layer with 11 nodes, 3 hidden layers with 128, 64, and 32 neurons, and a dense output layer with a sigmoid activation function. The binary cross-entropy loss function and Adam optimizer were used for model compilation, with the patience set to 10 epochs and a batch size of 32. After execution, the LSTM model achieved a testing accuracy of 87%. As shown in Fig. 41(a), the LSTM model predicted most of the positives but missed many of the negatives. Its precision, sensitivity, specificity, and F1 score are 95%, 66%, 95%, and 78% respectively, and it achieved an AUC of 81% (0.81), as shown in Fig. 41(b). The GRU model was built with an input layer of 11 nodes, 3 hidden layers with 128, 64, and 32 neurons, and a dense output layer, as previously shown in Fig. 8. It predicted most of the positives and negatives, failing on 7 positives and 29 negatives, as shown in Fig. 42(a), and achieved a testing accuracy of 89.1%. Its precision, sensitivity, specificity, and F1 score are 94%, 78%, 94%, and 85% respectively, and it achieved an AUC of 86% (0.86), as shown in Fig. 42(b).
The model architecture of the Bi-LSTM, previously shown in Fig. 10, is similar to that of the LSTM, but the presence of the Bidirectional() wrapper, imported from tensorflow.keras.layers, makes the difference; the model is built with 3 hidden layers of 128, 64, and 32 neurons. The model performed well in predicting the positives and achieved a testing accuracy of 87.8%, failing to predict 6 positives and 38 negatives, as shown in Fig. 43(a). Its precision, sensitivity, specificity, and F1 score are 94%, 71%, 94%, and 81% respectively, and it achieved an AUC of 83% (0.83), as shown in Fig. 43(b). The Bi-GRU model consists of 3 hidden layers with 128, 64, and 32 neurons, as shown in the model architecture in Fig. 12; the patience was set to 10 epochs and the loss function was binary cross-entropy. This model achieved a testing accuracy of 90.3%, with precision, sensitivity, specificity, and F1 score of 92%, 77%, 92%, and 84% respectively. It performed well in predicting the positives but lacked predictive power for the negatives, as shown in Fig. 44(a), and achieved an AUC of 84% (0.84), as shown in Fig. 44(b). As previously shown in Fig. 13, the CNN model consists of 2 convolutional layers with 128 and 64 filters, a dense layer with 64 neurons, and a dense output layer with 1 neuron. The Adam optimizer and binary cross-entropy loss function were used, the patience was set to 10 epochs, and the model was trained for up to 100 epochs. Out of 107 actual positives and 131 actual negatives, 102 positives and 94 negatives were predicted correctly, leaving 5 false negatives and 37 false positives, as shown in Fig. 45(a). This model achieved a testing accuracy of 87.85%; its precision, sensitivity, specificity, and F1 score are 95%, 72%, 95%, and 82% respectively, and it achieved an AUC of 84% (0.84), as shown in Fig. 45(b). As previously shown in Fig. 14, the hybrid model concatenates the outputs of the CNN, RNN, LSTM, GRU, Bi-LSTM, and Bi-GRU branches, which process the input in parallel. The CNN branch consists of a convolutional layer with 64 filters, a max pooling layer with a pool size of 2, and a Flatten() layer; the RNN, LSTM, GRU, Bi-LSTM, and Bi-GRU branches each consist of 3 layers with 64, 32, and 16 neurons respectively. The outputs of all branches were concatenated, and an additional dense layer with 128 neurons and a ‘ReLU’ activation function was added before the output layer. The Adam optimizer and binary cross-entropy loss function were used, with the patience set to 10 epochs. The model performed well in predicting the positives but was somewhat weaker on the negatives: it predicted 102 positives out of 107 actual positives and 94 negatives out of 131 actual negatives, as shown in Fig. 46(a). The hybrid model achieved a testing accuracy of 90.34%; its precision, sensitivity, specificity, and F1 score are 92%, 76%, 92%, and 83% respectively, and it achieved an AUC of 94% (0.94), as shown in Fig. 46(b).

Fig. 26
figure 26

LR without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 27
figure 27

DT without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 28
figure 28

RF without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 29
figure 29

KNN without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 30
figure 30

SVM without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 31
figure 31

GNB without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 32
figure 32

XGBoost without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 33
figure 33

AdaBoost without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 34
figure 34

SGD without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 35
figure 35

GB without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 36
figure 36

ETC without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 37
figure 37

CatBoost without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 38
figure 38

LightGBM without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 39
figure 39

MLP without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 40
figure 40

RNN without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 41
figure 41

LSTM without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 42
figure 42

GRU without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 43
figure 43

Bi-LSTM without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 44
figure 44

Bi-GRU without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 45
figure 45

CNN without feature selection (a) Confusion matrix (b) ROC Curve.

Fig. 46
figure 46

Hybrid model without feature selection (a) Confusion matrix (b) ROC Curve.

The summarized results of all the classifiers without feature selection are given in Table 7.

Table 7 Results of classifiers (in %) without feature selection.

Results of classifiers with feature selection

This section discusses the results of classifiers with feature selection and the evaluation and performance of the classifiers.

Information gain

The classifier performance reported in Table 8 was obtained by training on the features tabulated in Table 4, whose importance scores are greater than 0.1. From Table 8, it is evident that the XGBoost model gave the best results, with an accuracy of 94.1%, precision, sensitivity, and F1 score of 95%, specificity of 94%, and AUC of 96%. However, the GB model can also be considered, as it achieves similar sensitivity and AUC to the XGBoost model. Tree-based classifiers such as DT, RF, and ETC gave almost similar results, and the LightGBM model outperformed both the XGBoost and GB classifiers in AUC, reaching 97%. Deep learning classifiers such as MLP, RNN, LSTM, GRU, Bi-LSTM, Bi-GRU, and the hybrid model did not perform as well as the tree-based and boosting classifiers.

Chi-square test

The classifier performances given in Table 9 were obtained by training on the features selected by the strong, medium, and weak thresholds tabulated in Tables 5 and 6 respectively. From Table 9, it is evident that both the XGBoost and GB classifiers performed well with almost identical evaluation metrics, but GB outperformed XGBoost in AUC, which makes it a model worth considering. RF performed best among the tree-based classifiers, and GB outperformed the other boosting classifiers. The performance of Bi-LSTM and Bi-GRU was almost the same, differing mainly in sensitivity. However, the hybrid model and the CNN outperformed the other deep learning classifiers, with decent figures across all evaluation metrics. RF, SVM, GNB, XGBoost, SGD, GB, ETC, CatBoost, LightGBM, RNN, GRU, and the hybrid model gave their best results with the features selected by the strong threshold of the Chi-square test.

FDA

The classifiers were implemented on the three feature sets derived from Fig. 17, consisting of the top 7, top 5, and top 3 features according to their importance scores. LR, KNN, SVM, and GNB achieved their best results with the top 5 features; the remaining classifiers achieved theirs with the top 7 features. Across the tree-based, boosting, and deep learning classifiers, XGBoost and GB achieved the best performance. XGBoost dominated GB in accuracy, precision, and specificity, while GB outperformed XGBoost in AUC by a small margin. From Table 10, it can be concluded that the XGBoost classifier outperformed the other classifiers.
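One common formulation of FDA-based feature ranking is the Fisher score: the squared distance between class means divided by the sum of class variances, computed per feature. The NumPy sketch below illustrates that formulation; it is an assumption that the scores reported here follow it exactly, and the file and column names are illustrative.

  import numpy as np
  import pandas as pd

  df = pd.read_csv("heart.csv")                  # assumed file path
  X, y = df.drop(columns=["target"]), df["target"].to_numpy()
  X0, X1 = X[y == 0].to_numpy(), X[y == 1].to_numpy()

  # Fisher score per feature: between-class separation over within-class spread.
  num = (X1.mean(axis=0) - X0.mean(axis=0)) ** 2
  den = X1.var(axis=0) + X0.var(axis=0) + 1e-12
  fisher = pd.Series(num / den, index=X.columns).sort_values(ascending=False)

  top7 = fisher.head(7).index.tolist()           # top 7, 5, or 3 sets as above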

Variance threshold

The classifiers were implemented on the features indicated as ‘True’ in Fig. 18, namely age, chest pain type, resting bp s, cholesterol, resting ecg, max heart rate, and oldpeak. In Table 11, the results of the deep learning classifiers are almost identical. The SGD classifier achieved its best performance using the modified Huber loss function instead of log loss. As with the previous techniques, the XGBoost and GB classifiers outperformed the others, including tree-based classifiers such as DT, RF, and ETC, while boosting algorithms such as AdaBoost, CatBoost, and LightGBM could not compete with XGBoost and GB.
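The sketch below shows variance-threshold selection with scikit-learn, together with the SGD loss choice noted above; the file name, label column, and the 0.2 threshold are assumptions for illustration.

  import pandas as pd
  from sklearn.feature_selection import VarianceThreshold
  from sklearn.linear_model import SGDClassifier

  df = pd.read_csv("heart.csv")                  # assumed file path
  X, y = df.drop(columns=["target"]), df["target"]

  # Features whose variance exceeds the threshold are flagged True, as in Fig. 18.
  vt = VarianceThreshold(threshold=0.2).fit(X)
  print(pd.Series(vt.get_support(), index=X.columns))   # True = feature retained
  X_selected = X.loc[:, vt.get_support()]

  # As noted above, SGD performed best here with the modified Huber loss.
  sgd = SGDClassifier(loss="modified_huber", random_state=42).fit(X_selected, y)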

Table 8 Results of classifiers (in %) with Information Gain.
Table 9 Results of classifiers (in %) with Chi-Square Test.
Table 10 Results of classifiers (in %) with FDA.
Table 11 Results of classifiers (in %) with Variance Threshold.

MAD

The classifiers in Table 12 were implemented on the features selected by the MAD technique, namely age, resting bp s, cholesterol, and max heart rate. From Table 12, it can be observed that most of the classifiers performed poorly, particularly SGD and MLP. Many classifiers could not reach even 80% AUC, which indicates that they should not be considered for real-time prediction of heart disease. Only a few classifiers, all of them boosting algorithms, namely XGBoost, GB, CatBoost, and LightGBM, achieved an AUC of more than 80%. Of these four, XGBoost balanced the evaluation metrics most evenly.
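A minimal sketch of MAD scoring follows, using the mean absolute deviation of each feature from its mean, which is one common formulation of this filter (the exact variant used here is not specified); the file and column names are assumptions.

  import pandas as pd

  df = pd.read_csv("heart.csv")                  # assumed file path
  X = df.drop(columns=["target"])

  # Mean absolute deviation from the mean, computed column-wise.
  mad = (X - X.mean()).abs().mean().sort_values(ascending=False)
  print(mad)

  # Keep the highest-scoring features; the text above retains four
  # (age, resting bp s, cholesterol, max heart rate).
  selected = mad.head(4).index.tolist()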

Dispersion ratio

The classifiers were implemented on the relevant features whose importance scores exceed 1.0, namely sex, fasting blood sugar, resting ecg, and exercise angina, and their performances are given in Table 13. From this table, among the tree-based classifiers RF and ETC, and among the boosting algorithms XGBoost and GB, showed decent performance. There was close competition between XGBoost and GB: XGBoost achieved above 90% in all evaluation metrics, whereas GB fell short on some.
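The dispersion ratio is commonly computed as the ratio of a feature's arithmetic mean to its geometric mean; a sketch of that formulation follows, with the 1.0 threshold taken from the text and the file and column names assumed.

  import numpy as np
  import pandas as pd

  df = pd.read_csv("heart.csv")                  # assumed file path
  cols = df.drop(columns=["target"]).columns
  X = df.drop(columns=["target"]).to_numpy(dtype=float)

  # Dispersion ratio = arithmetic mean / geometric mean, per feature.
  # Assumes non-negative features; non-positive entries are replaced so the
  # geometric mean stays defined (a common practical workaround).
  Xs = np.where(X > 0, X, 1e-9)
  am = Xs.mean(axis=0)
  gm = np.exp(np.log(Xs).mean(axis=0))
  ratio = pd.Series(am / gm, index=cols).sort_values(ascending=False)

  selected = ratio[ratio > 1.0].index.tolist()   # threshold from the text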

Relief

The results of the classifiers implemented on two feature sets, consisting of the top 5 and top 7 features by importance score, are given in Table 14. From this table, it can be observed that XGBoost and GB again performed well. The deep learning classifiers also performed well but lacked sensitivity, whereas MLP lacked specificity. Tree-based classifiers such as DT, RF, and ETC showed decent performance, and SGD showed decent, well-balanced metrics. The GB classifier achieved a testing accuracy of 89.9% and scored higher than XGBoost in AUC, but the XGBoost classifier crossed 90% in all evaluation metrics and dominated the remaining classifiers.
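A minimal Relief sketch for the binary case is given below: for each sampled instance, feature weights are increased by the distance to the nearest miss and decreased by the distance to the nearest hit. The sample size, distance metric, file name, and label column are all assumptions for illustration.

  import numpy as np
  import pandas as pd
  from sklearn.preprocessing import MinMaxScaler

  df = pd.read_csv("heart.csv")                  # assumed file path
  features = df.drop(columns=["target"])
  X = MinMaxScaler().fit_transform(features)     # scale so distances are comparable
  y = df["target"].to_numpy()
  n, d = X.shape

  m = 200                                        # sampled instances (assumption)
  rng = np.random.default_rng(42)
  w = np.zeros(d)
  for i in rng.choice(n, size=m, replace=False):
      dist = np.abs(X - X[i]).sum(axis=1)        # Manhattan distance to all rows
      dist[i] = np.inf                           # exclude the instance itself
      hit = np.where(y == y[i], dist, np.inf).argmin()    # nearest same-class row
      miss = np.where(y != y[i], dist, np.inf).argmin()   # nearest other-class row
      w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])

  scores = pd.Series(w / m, index=features.columns).sort_values(ascending=False)
  top7 = scores.head(7).index.tolist()           # top 7 and top 5 sets as above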

Lasso

From the results given in Table 15, it can be observed that Bi-LSTM achieved the highest accuracy but lacked sensitivity and F1 score, whereas MLP managed to cross 80% in all evaluation metrics. Among the tree-based classifiers, RF achieved the highest accuracy but lacked specificity, while the ETC classifier maintained balanced scores across all metrics. Among the boosting classifiers, XGBoost outperformed AdaBoost, CatBoost, and LightGBM. Deep learning classifiers such as RNN, LSTM, GRU, Bi-LSTM, Bi-GRU, CNN, and the Hybrid Model achieved decent accuracy and crossed 90% in precision and specificity, but failed to do the same in sensitivity, F1 score, and AUC.
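Lasso-based selection can be sketched with scikit-learn by keeping the features whose L1-penalized coefficients remain non-zero; the file and column names below are assumptions.

  import pandas as pd
  from sklearn.feature_selection import SelectFromModel
  from sklearn.linear_model import LassoCV
  from sklearn.preprocessing import StandardScaler

  df = pd.read_csv("heart.csv")                  # assumed file path
  X, y = df.drop(columns=["target"]), df["target"]

  # Standardize so the L1 penalty treats features on a comparable scale,
  # then keep the features with non-zero Lasso coefficients.
  Xs = StandardScaler().fit_transform(X)
  lasso = LassoCV(cv=5, random_state=42).fit(Xs, y)
  mask = SelectFromModel(lasso, prefit=True).get_support()
  selected = X.columns[mask].tolist()
  print(selected)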

RF importance

The performance of the classifiers trained on the features selected by RF importance is given in Table 16. From this, it is observed that among the tree-based classifiers, RF performed better than DT and ETC. Deep learning classifiers showed decent precision and specificity but could not score well in sensitivity, F1 score, and AUC. SGD showed its best performance with the log loss function. XGBoost, GB, and LightGBM showed good results; however, XGBoost outperformed every classifier, scoring more than 90% in all evaluation metrics.
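RF importance scores of this kind correspond to the impurity-based importances exposed by scikit-learn's random forest; a minimal sketch follows, with the file name, label column, and the mean-importance cut-off as assumptions.

  import pandas as pd
  from sklearn.ensemble import RandomForestClassifier

  df = pd.read_csv("heart.csv")                  # assumed file path
  X, y = df.drop(columns=["target"]), df["target"]

  rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

  # Impurity-based importances, ranked; keep features above the mean importance.
  imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
  print(imp)
  selected = imp[imp > imp.mean()].index.tolist()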

LDA

The results given in Table 17 are based on the features whose scores exceed 0.9. Most of the classifiers gave their best results with the five-feature set, whereas KNN gave its best with the three-feature set. The highest testing accuracy was achieved by XGBoost and the Hybrid Model. The XGBoost and GB classifiers achieved decent scores in all evaluation metrics, but the Hybrid Model lacked sensitivity and F1 score. Tree-based classifiers also achieved decent performance. Among the boosting algorithms, CatBoost and LightGBM could not show promising results.
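One heuristic for deriving per-feature scores from LDA is to rank features by the magnitude of their discriminant coefficients; the sketch below uses that heuristic with the 0.9 threshold from the text, but it is not necessarily the exact scoring used in this work, and the file and column names are assumptions.

  import numpy as np
  import pandas as pd
  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
  from sklearn.preprocessing import StandardScaler

  df = pd.read_csv("heart.csv")                  # assumed file path
  X, y = df.drop(columns=["target"]), df["target"]

  # Fit LDA on standardized features; for a binary target there is a single
  # discriminant axis, and each feature's coefficient magnitude is its score.
  Xs = StandardScaler().fit_transform(X)
  lda = LinearDiscriminantAnalysis().fit(Xs, y)
  scores = pd.Series(np.abs(lda.scalings_).ravel(), index=X.columns)

  selected = scores[scores > 0.9].index.tolist() # 0.9 threshold from the text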

PCA

From the results given in Table 18, most of the classifiers could not achieve a testing accuracy above 80%, except RF, XGBoost, GB, and Bi-LSTM. The SGD classifier could not reach even 70% testing accuracy but managed an AUC of 81%. The deep learning models achieved accuracy and precision of nearly 80% and 85%, respectively, but lagged in sensitivity, F1 score, and AUC. Among the tree-based classifiers, RF showed good results.
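A minimal PCA-plus-classifier pipeline is sketched below; the number of components, the train/test split, the file name, and the label column are assumptions for illustration.

  import pandas as pd
  from sklearn.decomposition import PCA
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  df = pd.read_csv("heart.csv")                  # assumed file path
  X, y = df.drop(columns=["target"]), df["target"]
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)

  # Standardize, project onto principal components, then classify.
  model = make_pipeline(StandardScaler(), PCA(n_components=7),
                        RandomForestClassifier(random_state=42))
  model.fit(X_tr, y_tr)
  print("test accuracy:", model.score(X_te, y_te))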

Table 12 Results of classifiers (in %) with MAD.
Table 13 Results of classifiers (in %) with Dispersion Ratio.
Table 14 Results of classifiers (in %) with Relief.
Table 15 Results of classifiers (in %) with Lasso.
Table 16 Results of classifiers (in %) with RF Importance.
Table 17 Results of classifiers (in %) with LDA.
Table 18 Results of classifiers (in %) with PCA.

Discussion and insights

The proposed model is compared with existing works, and the comparison across various metrics is given in Table 19; metrics that were not reported are represented with the ‘_’ symbol. The proposed XGBoost model demonstrates superior performance across all evaluation metrics, achieving an accuracy of 97.3%, precision of 97%, sensitivity of 98%, specificity of 98%, F1 score of 98%, and AUC of 98%, significantly outperforming state-of-the-art models. For instance, the DNN model tested on the Cleveland dataset5 reported an accuracy of 93.33% with a sensitivity of 87.8% and an AUC of 94%, while another DNN model evaluated on multiple datasets6 achieved an accuracy of 83.03% with the lowest execution time of 0.49 s, a sensitivity of 90.9%, but a specificity of only 69.27%. The HRFLM model12 showed an accuracy of 88.4%, a sensitivity of 92.8%, a specificity of 82.6%, and an F1 score of 90%, whereas the RF + FAMD model14 achieved a slightly higher accuracy of 93.44% with 89.28% sensitivity, 96.96% specificity, and a 92.5% F1 score. Similarly, the KNN model on the Heart Failure Clinical Records dataset15 achieved an accuracy of 90.789%, performing better than Naïve Bayes, Decision Tree, and Random Forest, and the ETC model with SMOTE26 achieved an accuracy of 92.62% with precision, sensitivity, and F1 score of 93%. Models such as FCMIM + SVM27 and GNB28 also performed well: FCMIM + SVM on the Cleveland dataset achieved an accuracy of 92.37%, 89% sensitivity, and 98% specificity, while GNB on the Z-Alizadeh Sani, Statlog, and Cardiovascular Disease datasets achieved accuracies of 95.43%, 93.3%, and 73.2%, sensitivities of 95.84%, 89.2%, and 69.3%, specificities of 94.44%, 96.7%, and 77%, and F1 scores of 96.77%, 92.1%, and 71.9%, respectively. LightGBM achieved a good AUC of 97.8% compared with the results of the various methods reported in29. However, none of these models matches the balanced and superior performance of the proposed XGBoost model, which demonstrates good generalizability and classification robustness on a comprehensive heart disease dataset. Its consistent metrics make it a strong candidate for heart disease prediction.

Table 19 Performance of proposed model and state-of-the-art models.

Conclusions

In this paper, after conducting a wide range of experiments, the most suitable classification model for heart disease prediction was identified. Although many models were evaluated with various feature selection techniques, the best results were achieved by the XGBoost classifier without feature selection after hyperparameter tuning. The proposed model achieved a test accuracy of 97.3%, a precision of 97%, and a sensitivity, specificity, and F1 score of 98% each. Further, its AUC of 98% outperformed the existing models.
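For completeness, a hedged sketch of training a hyperparameter-tuned XGBoost classifier on the full feature set follows; the grid values, file name, and label column are illustrative assumptions, not the tuned configuration reported above.

  import pandas as pd
  from sklearn.model_selection import GridSearchCV, train_test_split
  from xgboost import XGBClassifier

  df = pd.read_csv("heart.csv")                  # assumed file path
  X, y = df.drop(columns=["target"]), df["target"]
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)

  # Grid values are illustrative; the paper's tuned values are not reproduced here.
  grid = {"n_estimators": [100, 300], "max_depth": [3, 5, 7],
          "learning_rate": [0.05, 0.1, 0.3]}
  search = GridSearchCV(XGBClassifier(eval_metric="logloss", random_state=42),
                        grid, scoring="roc_auc", cv=5)
  search.fit(X_tr, y_tr)
  print(search.best_params_, search.score(X_te, y_te))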

Limitations

A key limitation of the proposed methodology is its requirement for high-end computational facilities when implemented on a large dataset; this research work therefore considered a limited dataset. Working with a small dataset may limit the model’s ability to generalize across diverse populations, so its robustness and generalizability need to be validated on larger, more diverse datasets that better represent real-world scenarios. Real-world validation through clinical trials is essential to assess the model’s performance and usability in practical healthcare environments. Additionally, developing strategies to handle noisy data in real-world applications can further enhance the model’s reliability and robustness.

Practical implications

The proposed model can assist clinicians in early and accurate heart disease prediction, enabling timely interventions and personalized treatment plans. By incorporating such predictive tools into clinical workflows, healthcare professionals can optimize resource allocation, reduce diagnostic errors, and improve patient outcomes. Furthermore, this model can be integrated into hospital information systems to provide real-time decision support, especially in resource-constrained environments where expertise might be limited.

Future scope

The model’s performance can be further enhanced with more data. Although a decent number of datasets are available on the web, the amount of data remains limited. Several future directions for addressing this are given below.

  • Generative Adversarial Networks (GANs) can be applied to the dataset to increase its size by generating samples that closely resemble the original data.

  • IoT devices can be integrated to continuously monitor patients and extract relevant data, increasing the quantity of available data.

  • Future research on Explainable AI (XAI) techniques for heart disease prediction can provide interpretable explanations of the model’s predictions.

  • Combining federated learning with privacy-preserving techniques such as differential privacy can help maintain the confidentiality of patients’ data.

  • A small language model trained on data covering every existing disease could be deployed as an Android application to provide suggestions that help patients prevent the occurrence of disease.