Introduction

ADHD is one of the most prevalent neurodevelopmental disorders worldwide, with a prevalence of 5–7% in children and 2–5% in adults1. It is currently estimated that over 365 million people globally are affected by ADHD, and the condition is diagnosed most frequently in the US, Canada, and parts of Western Europe2. Children aged 4–17 years are the most vulnerable population, and boys are nearly three times more likely to be diagnosed than girls3. ADHD symptoms often call for early intervention, and best practices for minimizing risk factors are important for prevention4. Sedentary lifestyles driven by prolonged screen time, irregular schedules, and stressful environments tend to worsen ADHD symptoms, which means that children at risk should be given adequate attention and support5.

ADHD is characterized by deficits in attention, hyperactivity, and impulsivity that affect both cognitive and social functioning6. Its causes are not entirely understood; it is thought to arise from an interplay of genetic, environmental, and neurological factors. Genetic predisposition plays a major role, and environmental factors such as prenatal exposure to toxins, low birth weight, and childhood trauma may further increase the risk7,8. ADHD often interferes with educational attainment, work, and social interactions, thereby reducing quality of life. Untreated ADHD is associated with comorbid conditions such as anxiety, depression, and substance abuse, so early diagnosis and intervention are fundamentally important9.

Identifying ADHD is important for its management, since early intervention and support can substantially improve patients' lives10. Diagnosis, in turn, is not very accurate because ADHD's symptoms are non-specific and can co-occur with other disorders. Age, gender, cultural beliefs, and the perception of symptoms differ between males and females, and the assessment methods and criteria used also affect detection11,12. Neuroimaging and Machine Learning (ML) are two fields that promise to increase the specificity of ADHD detection, which in turn will lead to better treatment and prognosis13.

There are numerous diagnostic methods for ADHD, each with its own pros and cons. The main procedures are traditional behavioral evaluations based on questionnaires and semi-structured interviews. However, these commonly used methods often lack objectivity and introduce systematic errors, since they depend on information provided by the patients’ parents, teachers, or the patients themselves14. This risks over-diagnosis or missed diagnosis, since some symptoms overlap with those of other conditions. Neuropsychological testing, comprising assessments of attention and impulse regulation, provides more standardized measurements but is time-consuming, expensive, and resource-intensive, and therefore not easily accessible. Neuroimaging techniques such as fMRI and EEG are promising for identifying patterns of brain activity related to ADHD, but they remain expensive and impractical for routine clinical use from both a practical and a technical standpoint15,16. Due to these limitations, a novel detection technique is introduced to bridge the gaps in ADHD diagnosis.

The following highlights the research’s primary contribution:

  • To utilize two databases, an EEG database and a diagnostic database, for comprehensive, accurate, and reliable ADHD detection.

  • To introduce a novel NeuroDCT-ICA for EEG data preprocessing which combines Improved Discrete Cosine Transform (I-DCT) with Independent Component Analysis (ICA) to eliminate noise and extract meaningful features from raw EEG signals, ensuring high-quality data preprocessing.

  • To utilize Multi-Modal Feature Fusion using an attention network to integrate EEG features and diagnostic data, ensuring robust feature representation for ADHD detection.

  • To develop a new Rhino Fish Optimization (RFO) algorithm for feature selection, a hybrid that blends Bitterling Fish Optimization (BFO) and Rhinopithecus Swarm Optimization (RSO) for optimal feature selection, enhancing processing stability and reducing computational complexity.

  • To introduce ADHD-AttentionNet, a novel DL model designed for accurate and efficient ADHD identification, offering superior performance with minimal computational cost.

The paper is organized as follows: Sect “Introduction” introduces the topic, and Sect “Related work” reviews recent literature. The suggested approach is described in Sect “Proposed methodology”. The results and discussion of the suggested model are given in Sect “Result and discussion”. Finally, the work is concluded in Sect “Conclusion”.

Related work

In 2022, Liu et al.17 presented a DL model for the diagnosis of ADHD through rs-fMRI. For spatial feature extraction, the model uses a Nested Residual Convolutional Denoising Autoencoder (NRCDAE), and for temporal feature extraction, a 3D Convolutional Gated Recurrent Unit (GRU) is used. These features are then classified using a sigmoid classifier. The method demonstrated enhanced classification accuracy in distinguishing ADHD from control subjects and generalized to other sites more effectively than other models.

In 2023, Chen et al.18 suggested an attention auto-encoding network (Att-AENet) under the binary hypothesis testing (BHT) paradigm. This model takes brain functional connectivities (FCs) as input and dynamically computes FC weights using the attention mechanism to quantify the importance of FCs with respect to ADHD. The Att-AENet consists of two attention subnetworks, one with a dense structure and the other with two-dimensional convolutional structures, and their performance on the classification of ADHD is compared. The dense-based subnetwork is employed for biomarker detection analysis because of its higher classification accuracy.

In 2023, Liu et al.19 suggested integrating radiomics features from resting-state functional magnetic resonance imaging (rs-fMRI). From four kinds of preprocessed rs-fMRI data (ReHo, ALFF, VMHC, and DC), 93 features of brain areas were obtained. The most appropriate features were then selected, and an SVM model was developed, achieving an accuracy of 76.3% in training and 77.0% in testing. The results indicate that rs-fMRI-based radiomics features can serve as useful neuroimaging biomarkers for differentiating ADHD from healthy controls (HC).

In 2022, Hamedi et al.20 presented a brain connectomics model based on eyes-open resting state magnetoencephalography (rs-MEG) for the diagnosis of ADHD from HC. It involves computing Coherence (COH) between MEG sensors for different frequency bands, picking the best features for COH using Neighborhood Component Analysis (NCA) and using three classifiers, Support Vector Machines (SVMs), k-Nearest Neighbors (KNN) and Decision Tree for classification.

In 2023, Chandra and Rana21 presented an automatic diagnostic tool for ADHD based on structural MRI and PC data for classification. It includes feature extraction from GM volumes and CT; feature ranking using mRMR and EFS; and classifier training with KNN, Logistic Regression (LR), linear SVM, RBSVM, and Random Forest (RF). System performance is measured by accuracy, recall, and specificity using ten-fold cross-validation, with a maximum accuracy of 75% achieved using CT and PC features with RBSVM and SVM.

In 2022, Uyulan et al.22 proposed an automatic system for detecting ADHD through transfer learning with a ResNet-50-type 2D Convolutional Neural Network (CNN). This method resolves issues of data-acquisition variation and class distribution. The system attained a classification accuracy of 93.45% via 10-fold cross-validation. The results are analyzed using the Class Activation Map (CAM), which shows the regions of the frontal, parietal, and temporal lobes where children with ADHD differ from normal controls.

In 2024, Gülşah and Özmen23 introduced a new DL model in the form of a 3D CNN architecture for classifying people with ADHD using fMRI. The model employed fractional amplitude of low-frequency fluctuations (fALFF) and regional homogeneity (ReHo) data and balanced the datasets through five-fold cross-validation. The proposed 3D CNN is compared with a fully connected neural network (FCNN) and achieves clearly higher accuracy in classifying ADHD, especially on fALFF data. The proposed 3D CNNs are found to be useful for the diagnosis of ADHD and outperform the FCNN models in this regard.

In 2022, Maniruzzaman et al.24 presented an ML algorithm for classifying ADHD versus normal children based on EEG signals. It uses morphological and time-domain features from the EEG signals, with t-tests and LASSO for feature selection. The selected features are then classified using four ML algorithms: SVM, KNN, MLP, and LR. Comparing the results, the combination of LASSO and SVM achieved the highest accuracy of 94.2%, indicating the tool’s potential for the early diagnosis of ADHD.

In 2021, Zhou et al.25 proposed a novel ML approach based on Boruta feature selection and Multiple Kernel Learning (MKL) to incorporate structural and functional MRI and DTI data for diagnosing early adolescent ADHD. Overall, the framework synthesizes macrostructural characteristics, microstructural parameters, and functional connection density at the kernel network level with an SVM classifier for differentiating ADHD from non-ADHD children. The experimental results demonstrated that kernel-level fusion of multimodal features yields a classification accuracy of 64.3% and an AUC of 0.698, higher than those of unimodal and early feature fusion methods.

In 2021, Khullar et al.26 presented an ML approach for discriminating ADHD from healthy controls based on resting-state functional magnetic resonance imaging (rs-fMRI) data. The proposed method uses a 2D CNN and a 2D CNN-LSTM model as the base network architecture. The performance of the system is compared with existing methods in terms of accuracy, specificity, sensitivity, F1-score, and AUC, and the results demonstrate the enhanced performance of the proposed method in differentiating ADHD from healthy control subjects.

In 2022, Barua et al.33 suggested a ternary motif pattern (TMP) for ADHD detection using EEG signals. The Tunable Q Wavelet Transform (TQWT) was employed to create wavelet subbands. The TMP and statistical moments were then used to generate meaningful features from both the raw EEG signal and the TQWT wavelet subbands. In total, features were produced from the original EEG signal and 18 subbands, hence the model was named TMP19. Neighborhood component analysis (NCA) was used to choose the most informative features, and a kNN classifier was used to classify the chosen features.

In 2022, TaghiBeyglou et al.6 suggested a novel CNN structure in combination with traditional ML models to diagnose attention deficit hyperactivity disorder (ADHD) in children. The raw electroencephalography (EEG) output was used as the input. Neither transformation nor artifact rejection procedures were needed for the EEG-based method. To diagnose ADHD, this model first trains a CNN using raw EEG data. Then, several traditional classifiers, including SVM, LR, RF, etc., were trained using the feature maps that were derived from the various layers of the learned CNN.

In 2023, Maniruzzaman et al.34 proposed two distinct methods, SVMs and t-tests, to choose the best channels separately. A hybrid channel selection strategy was then suggested that combined the two methods to choose the best channels. The key features from the chosen channels were then selected using a LASSO LR-based model. Lastly, six ML-based classifiers, including LR, multilayer perceptron, KNN, RF, and Gaussian process classification (GPC), were used to identify children with ADHD. Accuracy and area under the curve (AUC) were used to assess each classifier’s performance. Table 1 presents a comparison of related works on ADHD detection.

Problem statement

Table 1 Comparison of related works.

Proposed methodology

The proposed enhanced DL framework for ADHD detection aims at improving the accuracy of early-stage detection to enable effective handling of the condition. The approach utilizes two key databases, namely an EEG database and a diagnostic database. For the EEG database, raw data are pre-processed with NeuroDCT-ICA, a combination of ICA and I-DCT. This new kind of preprocessing filters noise and extracts useful features from the EEG signals. Several feature extraction techniques are then applied to the pre-processed EEG signals, covering time-domain, frequency-domain, and connectivity features. The diagnostic database, on the other hand, is pre-processed by one-hot encoding, outlier detection and correction, and mean imputation. Subsequently, feature extraction is performed using the Pearson correlation coefficient and statistical measures, which capture useful information from the clinical data. These two data modalities are combined through multi-modal feature fusion using an attention-based network to improve the system’s reliability. One of the new developments in this work is the RFO algorithm, a combination of BFO and RSO, which enhances the feature-selection process and thus improves the efficiency of data processing and the stability of the system. Lastly, the newly proposed ADHD-AttentionNet is used for ADHD detection with high accuracy and low computational complexity. This all-encompassing approach represents a major step forward in the diagnosis of ADHD, providing a practical, effective, and accurate system that assists clinicians in enhancing the diagnostic process and, subsequently, patient care. Figure 1 presents the overall architecture of the suggested approach.

Fig. 1
figure 1

Architecture of the suggested ADHD detection approach.

ADHD is often associated with abnormal brain activity, particularly in areas related to attention, working memory, and impulse control. These irregularities manifest in complex brain signals, making them hard to interpret directly. NeuroDCT-ICA optimizes the separation of brain activity into independent sources, helping to identify ADHD-related features, such as abnormal neural oscillations, more accurately. The combination of DCT and ICA allows the model both to capture high-level features (via DCT) and to separate the underlying sources of brain activity (via ICA). This ensures that only the most relevant and noise-free features are considered when diagnosing ADHD, improving the accuracy of detection. The framework also incorporates DL components: an attention mechanism is used in the feature-level fusion of the pre-extracted features, so the overall system is a hybrid that still aligns with conventional ML.

Data collection

The proposed approach is evaluated on two databases. The first is the Diagnostic database (https://www.kaggle.com/datasets/arashnic/adhd-diagnosis-data), a public dataset containing health, activity, and heart rate data from adult patients diagnosed with ADHD. It includes the results and sum scores of many diagnostic assessment tools, time series of heart rate and motor activity, the output of a neuropsychological computer test, and each participant’s sex, age, and medications. Sex is coded as zero for females and one for males. Participants’ ages are grouped into four categories: 1 represents those aged 17–29, 2 those aged 30–39, 3 those aged 40–49, and 4 those aged 50–67. Of the 85 patients whose motor activity was recorded, 23 were in age group 1, 26 in age group 2, 24 in age group 3, and 12 in age group 4. The majority of the individuals did not use any drugs. Among those with an ADHD diagnosis who recorded their motor activity, 73% did not take medication, and only one person received a stimulant prescription.

The dataset consists of data collected from 51 patients with ADHD and 52 clinical controls. The second is the EEG database (https://www.kaggle.com/datasets/inancigdem/eeg-data-for-mental-attention-state-detection), a collection of 34 experiments monitoring the attention state of human individuals using a passive EEG BCI. Each file contains the data collected from an EMOTIV device during a single experiment. The raw data are stored in o.data, an array of size {number-of-samples}×25, so o.data(:,i) is a single data channel. The sampling frequency is 128 Hz.
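The sketch below shows one way such a recording could be loaded in Python, assuming each file is a MATLAB .mat export whose struct o holds the data field described above; the file name and channel index are illustrative only, not values from the dataset documentation.

```python
import numpy as np
from scipy.io import loadmat

FS = 128  # sampling frequency stated in the dataset description (Hz)

# "eeg_record1.mat" is a hypothetical file name for one of the 34 experiments.
mat = loadmat("eeg_record1.mat", squeeze_me=True, struct_as_record=False)
o = mat["o"]                      # MATLAB struct holding the recording
raw = np.asarray(o.data)          # shape: (number_of_samples, 25)
channel = raw[:, 2]               # equivalent of o.data(:, 3) in MATLAB indexing
print(raw.shape, FS)
```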

Pre-processing

EEG data

The NeuroDCT-ICA blends ICA with the I-DCT, allowing for better preprocessing of EEG signals. This method adds a new dimension of data enhancement by addressing both data cleaning and transformation to the frequency domain, which increases the quality of the obtained EEG data. The technique prepares the data for further analysis (classification and feature extraction) by successfully removing recordings of eye blinks, muscle movements, electrical noise, and other artifacts while preserving the desired brain activity. Rather than applying filtering or ICA in a single stage, as in many existing solutions, NeuroDCT-ICA applies a structured multi-step sequence of operations.

Mathematical model

Step 1: Savitzky-Golay filters.

High-frequency noise is reduced using the Savitzky-Golay filter27 while the essential properties of the EEG signal are preserved. The filter operates by fitting a polynomial to the points within a sliding window of the signal; in other words, it smooths the signal without strongly altering the frequencies it contains. This method is applied to the EEG before the subsequent cleaning steps, providing cleaner data for artifact removal and further analysis. Mathematically, this is given in Eq. (1).

$$\:s\left(t\right)=\frac{1}{N}{\sum\:}_{k=-N}^{N}\:{\psi\:}_{k}.r(t+k)$$
(1)

Where \(r(t+k)\) and \(s\left(t\right)\) denote the raw and smoothed EEG signals, respectively. In addition, the weight function is denoted as \({\psi}_{k}\) and the window size as \(N\).

Step 2: Normalization.

To address participant variability, the smoothed EEG data are normalized across individuals. Techniques such as Z-score normalization28 are applied, rescaling the data to a standard range or a zero-centered normal distribution. The mathematical expression is given in Eq. (2).

$$\:z\left(t\right)=\frac{s\left(t\right)-{\zeta\:}_{r}}{{\beta\:}_{r}}$$
(2)

Where \(z\left(t\right)\) represents the normalized EEG signal, and \({\zeta}_{r}\) and \({\beta}_{r}\) denote the mean and standard deviation of the smoothed signal \(s\left(t\right)\).
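As a minimal illustration of Steps 1–2, the sketch below applies SciPy’s Savitzky-Golay filter and Z-score normalization to a synthetic signal; the window length, polynomial order, and signal itself are illustrative assumptions, not values from this work.

```python
import numpy as np
from scipy.signal import savgol_filter

fs = 128                                          # sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
raw = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(t.size)   # noisy stand-in for r(t)

smoothed = savgol_filter(raw, window_length=15, polyorder=3)        # s(t), Eq. (1)
normalized = (smoothed - smoothed.mean()) / smoothed.std()          # z(t), Eq. (2)
```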

Step 3: Artifact removal using ICA.

One of the crucial procedures in EEG preprocessing is the removal of artifacts, i.e., undesired elements such as eye blinks, muscle movements, and electrical noise, which can affect the evaluation of brain activity. ICA is a computational technique used to separate mixed signals into independent components. In brain signal processing, this means identifying distinct sources of brain activity from complex, overlapping signals such as EEG (electroencephalography) or fMRI (functional magnetic resonance imaging) recordings. ICA helps extract relevant features from these signals, which is crucial for detecting patterns related to mental health conditions like ADHD. It is used here to decompose the EEG data into spatially independent components, so that these artifacts can be identified and excluded from analysis. The mathematical representation is given in Eq. (3).

$$\:I\left(t\right)=A.Q$$
(3)

where \(I\left(t\right)\) denotes the decomposed EEG data, \(A\) represents the mixing matrix, and \(Q\) represents the independent sources obtained after artifact removal from \(z\left(t\right)\). The dot (\(\cdot\)) represents matrix multiplication.
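The sketch below is a minimal version of this step, using scikit-learn’s FastICA as a stand-in for the ICA stage; the channel count and the indices of the components treated as artifacts are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

z = np.random.randn(128 * 60, 14)            # normalized EEG z(t): (samples, channels)

ica = FastICA(n_components=14, random_state=0)
Q = ica.fit_transform(z)                     # independent sources Q
A = ica.mixing_                              # mixing matrix A

artifact_idx = [0, 3]                        # hypothetical eye-blink/muscle components
Q[:, artifact_idx] = 0.0                     # discard artifact sources
I_t = Q @ A.T + ica.mean_                    # reconstructed, artifact-free signal I(t)
```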

Step 4: Signal segmentation via epoching.

The cleaned EEG signal \(I\left(t\right)\) is segmented into epochs \(E\left(t\right)\) related to particular events or actions. Epoching splits the continuous EEG signal into brief time-locked segments aligned with particular psychological events or stimuli. This facilitates analysis of the signal at various time instants, enhancing its use in time-series analysis and classification. Epoch lengths are usually between 1 and 5 s, depending on the depth of analysis needed.
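A minimal sketch of the epoching step follows: the cleaned signal is cut into fixed-length, non-overlapping segments. The 2 s epoch length is an illustrative choice within the 1–5 s range mentioned above.

```python
import numpy as np

fs = 128
epoch_len = 2 * fs                                   # samples per 2 s epoch
I_t = np.random.randn(128 * 60, 14)                  # stand-in for cleaned I(t)

n_epochs = I_t.shape[0] // epoch_len
E_t = I_t[: n_epochs * epoch_len].reshape(n_epochs, epoch_len, -1)   # epochs E(t)
```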

Step 5: Frequency domain transformation using I-DCT.

NeuroDCT refers to a neuro-domain-specific transform that applies domain-specific signal processing techniques, such as the Discrete Cosine Transform (DCT), to further refine the feature extraction process. The DCT converts a signal into the frequency domain, making it easier to capture specific patterns of brain activity that are difficult to detect in the time domain. The Improved DCT is performed on the cleaned EEG signal in each segmented ROI to transform it from the time domain to the frequency domain. This transformation emphasizes the low- and high-frequency components of the signal, which are important for the subsequent analysis. Compared with the standard DCT, the I-DCT gives higher importance to the temporal trajectories of the signal. Here, a 2D I-DCT is applied to operate on the entire 2D data structure, such as matrices representing spatial relationships. As a result, the system manages to differentiate frequency components clearly: low frequencies representing delta and theta, the slow brain-wave patterns, are captured well, together with compact high-frequency features characterized by sharp transients or rapid brain activity. A vital step in the processing of the EEG signals is the use of the I-DCT, which enables feature extraction, classification, and pattern recognition. By shifting the signal into the frequency domain, the method overcomes the limitations of analyzing the raw signal in the time domain alone. This improvement allows higher-quality analysis of brain activity, as subtle changes in weak brain activity become visible that are not apparent in a time-domain analysis.

In an \(8\times 8\) DCT matrix, low frequencies reside in the upper left, mid-frequencies in the middle, and high frequencies in the lower right. This transformation is applied to each epoch \(E\left(t\right)\), treated as a 2D array \(f(x,y)\) of size \(m\times n\), using an equation that rearranges the values to highlight different frequency components, a step also central to compression and processing, as per Eqs. (4) and (5).

$$f\left( {u,v} \right) = \frac{2}{{\sqrt {mn} }}c\left( u \right)c\left( v \right)\mathop \sum \limits_{{x = 0}}^{{n - 1}} \mathop \sum \limits_{{y = 0}}^{{m - 1}} f\left( {x,y} \right) \times w\left( {x,y} \right) \times \cos \left[ {\frac{{\pi \left( {2x + 1} \right)u}}{{2n}}} \right]\cos \left[ {\frac{{\pi \left( {2y + 1} \right)v}}{{2m}}} \right]$$
(4)

for \(u~ = ~0,~1,~ \ldots ,~n - 1~and~v~ = ~0,~1,~ \ldots ,~m - 1,~~\)

$$c\left(a\right)=\begin{cases}\frac{1}{\sqrt{2}}, & a=0\\ 1, & \text{otherwise}\end{cases}$$
(5)

Inverse DCT reverts transformed \(\:E\left(t\right)\), vital for signal reconstruction as per Eq. (6).

$$f\left( {x,y} \right) = \frac{2}{{\sqrt {mn} }}\mathop \sum \limits_{{u = 0}}^{{n - 1}} \mathop \sum \limits_{{v = 0}}^{{m - 1}} c\left( u \right)c\left( v \right)f\left( {u,v} \right) \times w\left( {x,y} \right) \times \cos \left[ {\frac{{\pi \left( {2x + 1} \right)u}}{{2n}}} \right]\cos \left[ {\frac{{\pi \left( {2y + 1} \right)v}}{{2m}}} \right]$$
(6)

Here, \(f\left(u,v\right)\) represents the frequency-domain transform of the epoch. When processing EEG signals, the coefficient \(F\left(0,0\right)\) is known to carry a major portion of the signal energy and thus depicts the average strength of the signal over a period of time; this coefficient plays a critical role in characterizing the overall background activity of the brain. The high-frequency coefficients, usually found in the lower-right section of the transformed matrix, carry information about short time scales and detailed neural activity such as rapid spikes, oscillations, or activity associated with advanced cognitive processes. The resulting frequency-domain (pre-processed) EEG signal is denoted as \({E}_{i}^{pre}\).
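The sketch below uses SciPy’s standard 2D DCT/IDCT as a stand-in for the paper’s I-DCT; the weighting term \(w(x,y)\) of Eq. (4) is not reproduced, and the epoch dimensions are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

epoch = np.random.randn(256, 14)                       # one epoch E(t): (samples, channels)

coeffs = dctn(epoch, type=2, norm="ortho")             # frequency-domain representation
low_freq_energy = float(np.abs(coeffs[:8, :8]).sum())  # upper-left block: slow rhythms
reconstructed = idctn(coeffs, type=2, norm="ortho")    # inverse transform, cf. Eq. (6)
```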

Diagnostic data preprocessing

The steps listed below describe the methods used to preprocess the clinical data so that it is standardized, clean, and fit for ML-based analysis.

Step 1: mean imputation for Missing Data Handling.

In order to ensure a complete dataset and prevent the loss of essential information, handling missing data is vital29. This step is especially important in clinical datasets, where missing values can arise from incomplete patient records, equipment malfunctions, or human error. Mean imputation maintains the dataset’s overall distribution and enables seamless analysis, as shown in Eq. (7).

$$\:{Y}_{i}=\frac{1}{m}{\sum\:}_{i=1}^{m}\:{X}_{i}$$
(7)

where \({Y}_{i}\) represents the imputed value for a missing observation, \({X}_{i}\) denotes the observed values, and the count of observations (samples) is denoted as \(m\).

Step 2: One-Hot Encoding is a technique for converting categorical variables (such as gender, symptoms, and severity) into a numeric form that ML models can process30. Let the categorical values within the categories of \({Y}_{i}\) be represented as \({C}_{i}\). One-hot encoding is applied to each instance as in Eq. (8).

$$OneHot\left({C}_{i}\right)=\left[0,\dots,1,\dots,0\right]$$
(8)

Where, 1 denotes the active category and all other positions are 0.

Step 3: Outlier Detection and Correction.

After one-hot encoding, the dataset comprises only numerical dimensions (continuous and binary features). For the continuous dimensions, Z-score normalization is carried out, and the Z-score is also employed for outlier detection: it quantifies an element’s distance from the mean of its feature in terms of standard deviations. If a Z-score is greater than 3 or less than −3, the corresponding measurement is considered an anomaly and discarded.
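A minimal pandas sketch of the three diagnostic preprocessing steps (mean imputation, one-hot encoding, and Z-score-based outlier removal) follows; the column names and values are illustrative assumptions, not fields from the actual dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "age_group": [1, 2, None, 4],
    "sex": ["F", "M", "M", "F"],
    "score": [12.0, 15.0, 90.0, 14.0],
})

df["age_group"] = df["age_group"].fillna(df["age_group"].mean())   # Step 1, Eq. (7)
df = pd.get_dummies(df, columns=["sex"])                            # Step 2, Eq. (8)

z = (df["score"] - df["score"].mean()) / df["score"].std()          # Step 3: Z-scores
df = df[z.abs() <= 3]                                               # drop |z| > 3 outliers
```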

Feature extraction

EEG feature extraction

Time-domain features

For EEG signal processing, feature extraction is an important step in representing the raw data in a more useful form. Time-domain features, which are taken directly from the time-series data without any transformation into the frequency domain, are among the most common; they include the mean, variance, and peak amplitude. As one of the feature extraction approaches most commonly employed in EEG analysis, they offer a simple way to describe the characteristics of the signal that reflect the activity of the nervous system. Table 2 provides the time features and their descriptions.

Table 2 Time features and their description.
Frequency-domain features - wavelet packet decomposition (WPD) for detailed frequency bands

With Wavelet Packet Decomposition (WPD), an advanced frequency-domain feature extraction technique, EEG data are separated into multiple frequency bands. While the standard wavelet transform only analyzes the low-frequency (approximation) components, WPD analyzes the low-frequency and high-frequency components at all levels. Specifically, it is helpful for studying the various brain rhythms (delta, theta, alpha, beta, and gamma) associated with different neurological and cognitive activities since it provides a more detailed representation of the EEG signal.

Key steps in WPD for EEG analysis

  1. Decomposition Process: To enable complete analysis of an EEG signal’s \({E}_{i}^{pre}\) frequency components, WPD progressively splits the signal into smaller sub-bands.

  2. At each stage of decomposition, the signal has two components:

  • Approximation (low-frequency) components, which signify general signal patterns in the Eq. (9).

$$A_{i + 1} \left[ m \right] = \mathop \sum \limits_{k} E_{i}^{pre} \left[ k \right]\cdot h\left[ {2m - k} \right]$$
(9)

Where \({A}_{i+1}\left[m\right]\) and \({D}_{i+1}\left[m\right]\) represent the approximation and detail coefficients at level \(i+1\), respectively. In addition, \(h\left(k\right)\) and \(g\left(k\right)\) denote the low-pass and high-pass filters, respectively.

  • Detail (high-frequency) components, which capture finer oscillations and transient details in the Eq. (10).

$$D_{i + 1} \left[ m \right] = \mathop \sum \limits_{k} E_{i}^{pre} \left[ k \right]\cdot g\left[ {2m - k} \right]$$
(10)

After several levels of decomposition, WPD offers a complete depiction of the signal across multiple frequency bands, allowing the analysis of both low-frequency rhythms and higher-frequency events.

  3. Frequency Bands: WPD selects the right number of decomposition stages \(D\) to isolate particular frequency bands in the EEG data. By concentrating on frequency components associated with standard EEG rhythms, this method allows a complete examination of the data. The following are the key frequency bands that WPD identifies:

  • Delta (0.5–4 Hz): Deep sleep, unconscious states.

  • Theta (4–8 Hz): Light sleep, relaxation, meditative states.

  • Alpha (8–13 Hz): Relaxed wakefulness, creativity, meditation.

  • Beta (13–30 Hz): Active thinking, problem-solving, focus.

  • Gamma (30–100 Hz): High-level cognitive functions, memory recall, learning.

Each deconstructed component inside these bands is then analyzed for features relevant to monitoring brain states and identifying anomalies, providing valuable data on neural and cognitive activity.

  4. Power spectral density (PSD): First, the PSD is calculated for each EEG signal using Fourier transforms; it is a signal feature that reflects the frequency characteristics of the signal. For \(a\left(t\right)\) and \(b\left(t\right)\), the PSDs are represented as \({P}_{aa}\left(f\right)\) and \({P}_{bb}\left(f\right)\), respectively. At the \({f}^{th}\) frequency, the cross-power between \(a\left(t\right)\) and \(b\left(t\right)\) is represented as \({P}_{ab}\left(f\right)\). The coherence function \({E}_{ab}\left(f\right)\) is calculated in Eq. (11).

$$\:{E}_{ab}\left(f\right)=\frac{{\left|{P}_{ab}\left(f\right)\right|}^{2}}{{P}_{aa}\left(f\right).{P}_{bb}\left(f\right)}$$
(11)

Where the value of \({E}_{ab}\left(f\right)\) ranges between [0, 1]: 0 denotes no coherence and 1 represents perfect coherence.

  5. Feature extraction: After decomposing the signal into frequency bands, various features can be calculated from each band. Frequency features and their descriptions are given in Table 3.

Table 3 Frequency features and their description.

WPD-based feature extraction from EEG signals in the frequency domain goes beyond the limitations of conventional EEG bands. It successfully captures both sustained patterns of activity, such as somnolent states, and discrete short-lived high-frequency events, such as cognitive operations or seizures.
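As a minimal sketch of the decomposition described above, the snippet below uses PyWavelets’ wavelet packets in place of the paper’s exact WPD configuration; the wavelet family, decomposition level, and band-power feature are illustrative choices, not settings from this work.

```python
import numpy as np
import pywt

signal = np.random.randn(256)                        # one pre-processed EEG channel epoch

wp = pywt.WaveletPacket(data=signal, wavelet="db4", mode="symmetric", maxlevel=4)
subbands = {node.path: node.data for node in wp.get_level(4, order="freq")}

# Per-subband features, e.g. band power, in the spirit of Table 3
band_power = {path: float(np.mean(np.square(data))) for path, data in subbands.items()}
```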

Connectivity features

Measure inter-regional connectivity using coherence and cross-correlation.

  1. Coherence: Coherence is an index of the frequency-domain linear correlation between two EEG signals. It measures the degree of overlap in power between the signals at given frequencies, indicative of the interaction between different regions of the brain.

  2. Cross-correlation:

Cross-correlation measures the similarity between two EEG signals as a function of the time-lag τ. This can indicate synchronous activity or temporal relationships between different brain regions. At \(\:{\tau\:}^{th}\) lag, the Cross-Correlation Function \(\:{R}_{ab}\left(\tau\:\right)\) can be given as per Eq. (12).

$$\:{R}_{ab}\left(\tau\:\right)={\sum\:}_{t}^{\:}\:a\left(t\right).b\left(t+\tau\:\right)$$
(12)
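A minimal sketch of both connectivity measures follows, using SciPy’s magnitude-squared coherence and NumPy’s cross-correlation; the channel pair, signal duration, and Welch segment length are illustrative assumptions.

```python
import numpy as np
from scipy.signal import coherence

fs = 128
a = np.random.randn(fs * 10)                         # channel a(t)
b = np.random.randn(fs * 10)                         # channel b(t)

freqs, coh = coherence(a, b, fs=fs, nperseg=256)     # E_ab(f) in [0, 1], cf. Eq. (11)

xcorr = np.correlate(a - a.mean(), b - b.mean(), mode="full")   # R_ab(tau), cf. Eq. (12)
lags = np.arange(-a.size + 1, a.size)
peak_lag = lags[np.argmax(np.abs(xcorr))]            # lag of strongest coupling
```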

Diagnostic data feature extraction

Statistical features

Pattern recognition in the data is enhanced by statistical features such as the mean, variance, and correlations between symptoms or scores. These features summarize the important statistics of the data, i.e., its central tendency, dispersion, and distribution, and are useful for further processing, classification, and diagnostics. Statistical features and their descriptions are provided in Table 4.

Table 4 Statistical features and their description.
Correlation analysis- Pearson correlation coefficient

This analysis identifies relationships between demographic data (e.g., age, gender) and clinical variables (e.g., blood pressure, cholesterol levels). The Pearson correlation coefficient \(\left({s}_{YZ}\right)\), a statistical measure of the degree of linear correlation between two variables \(Y\) and \(Z\), is named after the English mathematician and biostatistician Karl Pearson. It is defined as shown in Eq. (13):

$$\:{s}_{YZ}=\frac{cov(Y,Z)}{{\sigma\:}_{Y}\cdot\:{\sigma\:}_{Z}}$$
(13)

Where \(cov(Y,Z)\) is the covariance between \(Y\) and \(Z\), whose mathematical expression is shown in Eq. (14).

$$\:cov(Y,Z)=\frac{1}{o-1}\cdot\:{\sum\:}_{j=1}^{o}\:\left({Y}_{j}-\underset{\_}{Y}\right)\cdot\:\left({Z}_{j}-\underset{\_}{Z}\right)\:$$
(14)

\({\sigma}_{Y}\) and \({\sigma}_{Z}\) are the standard deviations of \(Y\) and \(Z\). The corresponding population form of the covariance is shown in Eq. (15).

$$\:cov\left(Y,Z\right)=\frac{1}{o}\cdot\:{\sum\:}_{j=1}^{o}\:\left({Y}_{j}-\underset{\_}{Y}\right)\cdot\:\left({Z}_{j}-\underset{\_}{Z}\right)\:$$
(15)

By substituting Eqs. (14) and (15) into Eq. (13), the Pearson correlation coefficient can be written as Eq. (16):

$${S_{yz}}=\frac{{o \cdot \mathop \sum \nolimits_{{j=1}}^{o} ~{Y_j} \cdot {z_j} - \left( {\mathop \sum \nolimits_{{j=1}}^{o} ~{Y_j}} \right) \cdot \left( {\mathop \sum \nolimits_{{j=1}}^{o} ~{z_j}} \right)}}{{\sqrt {o \cdot \mathop \sum \nolimits_{{j=1}}^{o} ~~Y_{j}^{2} - {{\left( {\mathop \sum \nolimits_{{j=1}}^{o} ~{Y_j}} \right)}^2}} \cdot \sqrt {o \cdot \mathop \sum \nolimits_{{j=1}}^{o} ~z_{j}^{2} - {{\left( {\mathop \sum \nolimits_{{j=1}}^{o} {z_j}} \right)}^2}} }}$$
(16)

Several sources refer to the Pearson correlation coefficient as Pearson’s \(s\) rather than \({s}_{YZ}\). The Greek letter \(\rho\), written \({\rho}_{YZ}\), denotes the Pearson correlation coefficient when it is applied to the entire population rather than a sample.
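As a quick illustration of Eq. (16), the sketch below computes the sample Pearson correlation between two hypothetical diagnostic variables; the variable names and values are invented for the example.

```python
import numpy as np

age_group = np.array([1, 2, 2, 3, 4, 1, 3])
symptom_score = np.array([18.0, 22.0, 25.0, 30.0, 41.0, 15.0, 33.0])

s_yz = np.corrcoef(age_group, symptom_score)[0, 1]   # sample Pearson correlation
```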

Multi-modal feature fusion with attention mechanisms

In order to distinguish the most useful elements from each source for improved diagnosis or prognosis, multi-modal fusion with attention mechanisms allows the incorporation of both diagnostic and EEG features. Equation (17) shows the result of concatenating these features into a single, cohesive feature vector.

$${F}_{concat}=\left[{f}_{diagnostic},\ {f}_{EEG}\right]$$
(17)

This concatenation combines all extracted features (diagnostic and EEG data features) into one cohesive representation, making them available for processing by the attention mechanism.

Attention score calculation: A learnable weight matrix \(W_{att}\) is used to compute attention scores for each feature in the concatenated feature set. The attention score \({e}_{i}\) for each feature is calculated as a dot product between the feature vector \({F}_{concat}\) and the attention weight matrix \(W_{att}\).

  • Weighted feature vector: Once the attention weights \({\alpha}_{i}\) are calculated, they are used to re-weight the features in the concatenated vector. The weighted features are then used for the next step in the fusion process, as per Eqs. (18) and (19).

$$\:{F}_{att}={\sum\:}_{i}^{\:}\:{\alpha\:}_{i}{F}_{concat,\:i}$$
(18)
$$\:{\alpha\:}_{i}=\frac{exp\left({e}_{i}\right)}{{\sum\:}_{j}^{}\:exp\left({e}_{j}\right)}$$
(19)

Where, \(\:{F}_{att}\) represents the weighted feature vector after attention is applied.

In this step, the weighted features from both the diagnostic and EEG data are combined. The attention mechanism has already guaranteed that the most relevant features from both modalities are given higher weight. The final multi-modal fusion vector is obtained by adding the attention-weighted EEG and diagnostic features together, as per Eq. (20).

$$\:{F}_{final}={F}_{att}^{EEG}+\:{F}_{att}^{diagnostic}$$
(20)

Where \({F}_{final}\) denotes the final multi-modal feature set, which captures the essential information from both modalities, \({F}_{att}^{diagnostic}\) denotes the attention-weighted diagnostic features, and \({F}_{att}^{EEG}\) denotes the attention-weighted EEG features.
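A minimal NumPy sketch of Eqs. (17)–(20) follows; the feature dimensions are illustrative, the attention weight vector is a random stand-in for the learned \(W_{att}\) rather than a trained parameter, and equal per-modality dimensions are assumed so that the sum in Eq. (20) is well defined.

```python
import numpy as np

def soft_attention(features, w_att):
    scores = w_att * features                          # attention scores e_i
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                               # Eq. (19)
    return alpha * features                            # Eq. (18)

f_diag = np.random.randn(32)                 # diagnostic features
f_eeg = np.random.randn(32)                  # EEG features
f_concat = np.concatenate([f_diag, f_eeg])   # Eq. (17)

w_att = np.random.randn(f_concat.size)       # stand-in for learned W_att
weighted = soft_attention(f_concat, w_att)

f_att_diag, f_att_eeg = weighted[:32], weighted[32:]
f_final = f_att_eeg + f_att_diag             # Eq. (20)
```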

Feature selection

The proposed RFO is a nature-inspired optimization algorithm based on the movement patterns of the Rhino Fish. It is part of a family of algorithms that mimic the behaviors of natural systems to solve complex optimization problems, especially those involving multi-dimensional data or high variability. RFO works by simulating the searching behavior of the Rhino Fish in a multi-dimensional environment, exploring the search space for optimal solutions. The algorithm balances exploration (searching new areas of the solution space) and exploitation (focusing on promising areas identified earlier), ensuring that the search for optimal solutions is both thorough and efficient.

ADHD detection typically involves processing vast amounts of data from multiple sources (EEG, fMRI, behavioral data). This data can be noisy, high-dimensional, and often requires feature selection to determine the most informative inputs for accurate classification. RFO is used to optimize the feature selection process. It intelligently selects which features (or patterns in brain signals) should be prioritized for classification, refining the data inputs and improving the performance of ADHD detection models. By optimizing the classification process, RFO helps ensure that the most relevant brain signals and behavioral patterns are used for ADHD diagnosis, improving both accuracy and efficiency.

At this point, the optimal feature set is picked from the fused features \({F}_{final}\) using the proposed RFO, which combines the aquatic behavior of BFO with the survival strategy of RSO.

Proposed RFO: In the proposed RFO, the cooperative, adaptive search patterns of fish and the social, collaborative behaviors of Rhinopithecus monkeys are used to provide effective exploration and exploitation of the solution space and stable convergence. The proposed RFO begins by defining a population of solutions, where each solution corresponds to a set of selected features from the dataset, as referred to in Eq. (21), where \({\chi}_{i}\) denotes the \({i}^{th}\) solution (feature subset) in the population, \({x}_{ij}\) signifies whether feature \(j\) is selected, and \({x}_{in}\) is its last element, with \(n\) the number of group members.

$$\:{\chi\:}_{i}=\left[{x}_{i1},{x}_{i2},\dots\:,{x}_{ij},\dots\:,{x}_{in}\right]$$
(21)

The proposed Rhino Fish employs the aquatic nature of the Bitterling fish and the survival strategy of the Rhinopithecus. The position of each Rhino Fish is initialized in Eq. (22), where \({\chi}_{i,j}\left(t\right)\) denotes the Rhino Fish’s position at iteration \(t\), \(U\) and \(L\) represent the upper and lower limits, \(n\) denotes the population size, \(r\) indicates an arbitrary number in \(\left(0,1\right)\), and \(d\) denotes the dimension.

$$\:{\chi\:}_{i,j}\left(t\right)=L+r\times\:\left(U-L\right),\:i=\text{1,2},\dots\:,n\:and\:j=\text{1,2},\dots\:,d$$
(22)

Each Rhino Fish specifies a candidate solution, and the fitness of every solution is estimated via Eq. (23), in which \({\chi}_{i}\) denotes the set of selected features and the objective is to reach maximum accuracy.

$$\:f\left({\chi\:}_{i}\right)=\left(Accurac{y}_{i}\right)\:$$
(23)

In BFO, the Bitterling fish searches for the best mating oysters, whereas the Rhinopithecus aims to choose the best leader within the swarm. Using these strategies, the proposed RFO selects optimal features by combining the search pattern of best-mating oysters with leader selection. The mathematical model of this design is defined as follows. A parameter \(S\) indicates the step rate at which a Rhino Fish approaches the best feature; it decreases over the iterations so that the algorithm moves from global search to local search, as represented in Eq. (24), where \(S\left(1\right)\) and \(S\left(t\right)\) denote each Rhino Fish’s step rate at the beginning and at iteration \(t\), \(t\) and \({M}_{t}\) denote the present and maximum iterations, and \(R\left(t\right)\) is an arbitrary function producing arbitrary sequences, as given in Eq. (25).

$$\:S\left(t\right)=\left(S\left(1\right)-\frac{s\left(1\right)\cdot\:t}{{M}_{t}}\right)\cdot\:R\left(t\right)$$
(24)
$$R\left(t+1\right)=\cos\left(t\times R\left(t\right)\right)$$
(25)

Assume \(R\left(1\right)=1\). Combining Eq. (24) with Eq. (25) gives Eq. (26). This is performed so that the parameter \(M\), which is used to decide the position update of the Rhino Fish, decreases over the iterations.

$$S\left(t+1\right)=\left(S\left(1\right)-\frac{S\left(1\right)\cdot t}{{M}_{t}}\right)\cdot \cos\left(t\times R\left(t\right)\right)$$
(26)

Equation (27) gives the estimation of the parameter \(M\), where \(a\) denotes the reduction parameter and \(a=\left\{0.1,\ 0.5,\ \text{or}\ 0.9\right\}\). The lower the value of \(a\), the lower the parameter \(M\).

$$\:M=\left|1-\frac{t}{\sqrt{1+{t}^{2}}}\right|+\frac{r}{{t}^{a}}$$
(27)

Based on these parameters, Eq. (28) computes the position update of the Rhino Fish, where \({\chi}_{i}\left(t\right)\) and \({\chi}_{i}\left(t+1\right)\) refer to the Rhino Fish’s present and new locations at iterations \(t\) and \(t+1\), \(D\) signifies an arbitrary number in \(\left(0,1\right)\), and \({\chi}^{+}\) stands for the best selected features.

$$\:{\chi\:}_{i}\left(t+1\right)=\:S\cdot\:{\chi\:}_{i}\left(t\right)+\left({\chi\:}^{+}-S\cdot\:{\chi\:}_{i}\left(t\right)\right)\cdot\:D$$
(28)

On the other hand, the Rhino Fish chooses the best survival position based on the king \(\left(K\right)\), mature \(\left(M\right)\), and adolescent \(\left(A\right)\) Rhino Fish. Each Rhino Fish updates its position based on the king \(\left(K\right)\) Rhino Fish using Eq. (29), in which \(Gaussian\) denotes a Gaussian distribution with expectation \(a\) and variance \(b\).

$$\:{\chi\:}_{i}\left(t+1\right)=Gaussian\left(a,b\right)$$
(29)

Based on \(\:M\), the Rhino Fish location update is performed using Eq. (30).

$${\chi}_{i}\left(t+1\right)=\begin{cases}S\cdot {\chi}_{i}\left(t\right)+\left({\chi}^{+}-S\cdot {\chi}_{i}\left(t\right)\right)\cdot D, & \gamma>M\\ Gaussian\left(a,b\right), & \gamma\le M\end{cases}$$
(30)

Equations (31) and (32) gives the estimation of \(\:a\) and \(\:b\), in which \(\:a,b\in\:\left(\text{0,2}\right)\), \(\:{K}_{\alpha\:}\) and \(\:{M}_{\beta\:}\) signify king and mature Rhino Fish locations.

$$\:a=\frac{{{K}_{\alpha\:}+M}_{\beta\:}}{2}$$
(31)
$$\:b={{K}_{\alpha\:}-M}_{\beta\:}$$
(32)

In the exploitation phase, the proposed RFO updates its location using an escaping strategy as defined in Eq. (33).

$$\:{\chi\:}_{i}\left(t+1\right)=S\cdot\:{\chi\:}_{i}\left(t\right)+\left({\chi\:}^{*}-S\cdot\:\rho\:\right)\cdot\:D$$
(33)

Here, \(\:\rho\:\) indicates population gravity point and is estimated in Eq. (34).

$$\rho =\mathop \sum \limits_{{i=1}}^{n} \frac{{{\chi _i}\left( t \right)}}{n}$$
(34)

Equation (35) describes the location update of a Rhino Fish based on adolescent Rhino Fish learning from mature ones, where \(\left(c,e\right)\) define the expectations and \(\left(d,f\right)\) the variances of the Gaussian distributions.

$$\:{\chi\:}_{i}\left(t+1\right)=\frac{Gaussian\left(c,d\right)+Gaussian\left(e,f\right)}{2}$$
(35)
$$\:c=\frac{{{K}_{\alpha\:}+A}_{\delta\:}}{2}$$
(36)
$$\:e=\frac{{M}_{\beta\:}+{A}_{\delta\:}}{2}$$
(37)
$$\:d=\left|{K}_{\alpha\:}-{A}_{\delta\:}\right|$$
(38)
$$\:f=\left|{M}_{\beta\:}-{A}_{\delta\:}\right|$$
(39)

Based on parameter \(\:M\), the location update of Rhino Fish takes place as defined in Eq. (40).

$${\chi}_{i}\left(t+1\right)=\begin{cases}\frac{Gaussian\left(c,d\right)+Gaussian\left(e,f\right)}{2}, & \gamma\ge M\\ S\cdot {\chi}_{i}\left(t\right)+\left({\chi}^{*}-S\cdot \rho\right)\cdot D, & \gamma<M\end{cases}$$
(40)

After \(\:{M}_{t}\) iterations, the solution \(\:{\chi\:}_{i}\) converges to the optimal set \(\:{F}_{opt\:}\)of refined features by minimizing the classification error and maximizing accuracy. Algorithm 1 demonstrates the pseudocode of proposed RFO.

Algorithm 1
figure a

Pseudocode of proposed RFO.
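As a companion to Algorithm 1, the sketch below gives a highly simplified Python version of the RFO search loop, covering only the initialization of Eq. (22), the step-rate updates of Eqs. (25)–(26), and the position update of Eq. (28); the Gaussian king/mature/adolescent updates are omitted, and the fitness of Eq. (23) (classifier accuracy) is replaced by a toy objective for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):                              # toy stand-in for Eq. (23) (accuracy)
    return -np.sum((x - 0.5) ** 2)

n, d, max_iter = 20, 10, 100                 # population size, dimension, iterations
L, U = 0.0, 1.0                              # lower and upper limits
X = L + rng.random((n, d)) * (U - L)         # Eq. (22): initial positions
S1, R = 1.0, 1.0                             # initial step rate S(1) and R(1) = 1

best = X[np.argmax([fitness(x) for x in X])].copy()

for t in range(1, max_iter + 1):
    R = np.cos(t * R)                        # Eq. (25)
    S = (S1 - S1 * t / max_iter) * R         # Eq. (26)
    for i in range(n):
        D = rng.random()
        candidate = np.clip(S * X[i] + (best - S * X[i]) * D, L, U)   # Eq. (28)
        if fitness(candidate) > fitness(X[i]):
            X[i] = candidate
            if fitness(candidate) > fitness(best):
                best = candidate.copy()

selected_mask = best > 0.5                   # threshold to a binary feature subset
```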

ADHD-AttentionNet based ADHD detection

NeuroDCT-ICA is responsible for extracting clean, domain-specific features from complex brain signals. It enhances the signal processing and ensures that only the most relevant information is fed into the detection model. RFO works alongside NeuroDCT-ICA to fine-tune the feature selection process. It automatically identifies which features are most important for detecting ADHD and helps optimize the model’s performance by ensuring that these features are correctly weighted in the final classification.

In this phase, early ADHD detection is carried out via the proposed model, which integrates an attention-based DenseNet and a deep RCABP with soft attention mechanisms, together with GoogleNet and SqueezeNet components. This model is trained with the optimal features \({F}_{opt}\).

Proposed Channel-Aware DeepNet: In ADHD, differentiating various stages of disease is a complex task. In order to attain enhanced accuracy and minimize false positives, this research introduces Channel-Aware DeepNet which carefully analyzes the features using combined deep architecture. It uses DenseNet architecture with 3 dense blocks and 3 transition layers. Each Dense Block is made up of several convolutional layers, where each layer receives inputs from all previous layers in the block. This promotes feature reuse and helps mitigate the vanishing gradient problem. Each Dense Block consists of \(\:L\) convolutional layers where each layer \(\:l\) has \(\:k\) filters. The output of each layer is concatenated with the inputs to subsequent layers as defined in Eq. (41), in which \(\:{\chi\:}_{0}\) denotes input to the block (optimal features), and \(\:{H}_{l}=Conv\left({X}_{l-1}\right)\).

$$\:{X}_{dense}=\left[{\chi\:}_{0},{H}_{1},{H}_{2},\dots\:,{H}_{L}\right]$$
(41)

Transition blocks are used to reduce the spatial dimensions (downsampling) and the number of feature maps. Each comprises a convolution followed by a pooling layer. A transition block \({X}_{tl}\) reduces the feature-map count \(C\) and the spatial dimensions \(H\times W\) using a \(1\times 1\) convolution followed by a \(2\times 2\) average pooling layer, as shown in Eq. (42), in which \(Conv\) refers to the \(1\times 1\) convolution.

$$\:{X}_{tl}=Pooling\left(Conv\left({X}_{dense}\right)\right)$$
(42)

At this point, the RCABP blocks enhance the feature representation by focusing on the most informative channels using an attention mechanism while allowing residual connections for gradient flow. The RCABP enhances the feature representation using channel attention and spatial attention. Let \(X\) be the input to the RCABP. The attention weights for the channels are computed in Eq. (43), where \({W}_{c}\) refers to the learned weights and \(\sigma\) represents the sigmoid activation.

$${A_c}=\sigma \left( {{W_c} \cdot ~X} \right)$$
(43)

Equation (44) expresses the output of the RCABP, where \(\odot\) represents element-wise multiplication (channel-wise scaling).

$$Y=X+{A_c} \odot X$$
(44)

Next, a soft attention mechanism highlights important spatial regions in the feature maps, further refining the feature representation before the final classification. Soft attention refines the feature maps by generating attention weights based on spatial features, as computed in Eq. (45), where \({W}_{s}\) denotes the learned weights for spatial attention.

$${A_s}=Sigmoid\left( {{W_s} \cdot ~X} \right)$$
(45)

Following soft attention, a GAP layer is used. GAP reduces the spatial dimensions of the feature maps to a single value per feature map, efficiently summarizing the information. GAP converts the feature maps \(X\) into a vector by averaging each feature map, as represented in Eq. (46), in which \(H\times W\) denotes the spatial dimensions of the feature maps.

$${X_{gap}}=\frac{1}{{H \times W}}\mathop \sum \limits_{{i=1}}^{H} ~\mathop \sum \limits_{{j=1}}^{W} ~X\left( {i,j} \right)$$
(46)

Finally, the output from GAP is fed into an FC layer with softmax activation for classification that produces the final class probabilities as shown in Eq. (47), in which \(\:{W}_{fc}\) and \(\:{b}_{fc}\) signify weights and biases of the FC layer.

$${Y_{output}}=Sigmoid\left( {{W_{fc}} \cdot ~{X_{gap}}+{b_{fc}}} \right)$$
(47)
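A minimal NumPy sketch of the RCABP channel attention (Eqs. 43–44), GAP (Eq. 46), and the final dense layer (Eq. 47) follows; the feature-map size and all weights are random illustrative stand-ins rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

C, H, W = 8, 16, 16
X = rng.standard_normal((C, H, W))                    # input feature maps

w_c = rng.standard_normal(C)
A_c = sigmoid(w_c * X.mean(axis=(1, 2)))              # Eq. (43): per-channel attention
Y = X + A_c[:, None, None] * X                        # Eq. (44): residual channel scaling

x_gap = Y.mean(axis=(1, 2))                           # Eq. (46): global average pooling

W_fc = rng.standard_normal((2, C))                    # two classes: ADHD / control
b_fc = rng.standard_normal(2)
class_scores = sigmoid(W_fc @ x_gap + b_fc)           # Eq. (47)
```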

Furthermore, a custom loss function is used, combining Dice and focal loss for effective classification. The Dice loss measures the overlap between the predicted and true class labels, making it well suited to classification tasks. This approach ensures that each class, especially minority classes, contributes to the loss function, as given in Eq. (48), where \(C\) denotes the number of classes, \({\rho}_{i,c}\) refers to the predicted probability for instance \(i\) in class \(c\), \({\tau}_{i,c}\) stands for the true class label (1 if instance \(i\) belongs to class \(c\), 0 otherwise), and \(N\) signifies the total number of instances.

$$\:{L}_{dice}=1-\frac{2{\sum\:}_{c=1}^{C}\:{\sum\:}_{i=1}^{N}\:{\rho\:}_{i,c}\cdot\:{\tau\:}_{i,c}\:\:\:\:}{{\sum\:}_{c=1}^{C}\:{\sum\:}_{i=1}^{N}\:{\rho\:}_{i,c}\:+{\sum\:}_{c=1}^{C}\:{\sum\:}_{i=1}^{N}\:{\tau\:}_{i,c}\:\:\:}$$
(48)

On the other hand, the focal loss in a multi-class setup reduces the influence of easily classified samples and focuses on those that are harder to classify. For each class, it applies a scaling factor to down-weight well-classified examples, making it especially effective in handling class imbalance in a multi-class context, as defined in Eq. (49), in which \({\rho}_{t,c}\) denotes the predicted probability of the true class for class \(c\), \({\alpha}_{c}\) is the balancing factor for each class, and \(\gamma\) is the focusing parameter.

$${L}_{focal}=-{\sum}_{c=1}^{C}{\alpha}_{c}\left(1-{\rho}_{t,c}\right)^{\gamma}\log\left({\rho}_{t,c}\right)$$
(49)

The combined loss function integrates the multi-class Dice and focal losses to ensure the model pays attention both to class overlap and to hard-to-classify examples. This approach is especially valuable in medical applications where each class requires distinct treatment. Equation (50) shows the combined loss function, in which \({\lambda}_{1}\) and \({\lambda}_{2}\) are weights controlling the contribution of each loss term. Figure 2 presents the architecture of the suggested Channel-Aware DeepNet.

$${L_{combined}}={\lambda _1} \times {L_{dice}}+{\lambda _2} \times {L_{focal}}$$
(50)
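The sketch below gives a minimal NumPy version of the combined loss of Eqs. (48)–(50) on illustrative one-hot labels and predicted probabilities; the \(\lambda\), \(\alpha\), and \(\gamma\) values are assumptions, not the settings used in this work.

```python
import numpy as np

def dice_loss(p, t, eps=1e-7):                          # Eq. (48)
    return 1.0 - (2.0 * np.sum(p * t)) / (np.sum(p) + np.sum(t) + eps)

def focal_loss(p, t, alpha=0.25, gamma=2.0, eps=1e-7):  # Eq. (49)
    p_t = np.sum(p * t, axis=1)                         # probability of the true class
    return float(-np.mean(alpha * (1.0 - p_t) ** gamma * np.log(p_t + eps)))

p = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])      # predicted class probabilities
t = np.array([[1, 0], [0, 1], [1, 0]])                  # one-hot true labels

lam1, lam2 = 0.5, 0.5
combined = lam1 * dice_loss(p, t) + lam2 * focal_loss(p, t)   # Eq. (50)
```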
Fig. 2
figure 2

Architecture of the proposed Channel-Aware DeepNet.

Performance metrics

Several performance metrics are used in this research to verify the competence of the proposed approach in detecting ADHD, as shown in Table 5.

Table 5 Performance metrics.

Result and discussion

Experimental setup

The proposed approach has been implemented in Python and validated on two databases, namely the ADHD diagnosis database (https://www.kaggle.com/datasets/arashnic/adhd-diagnosis-data) and the EEG database (https://www.kaggle.com/datasets/inancigdem/eeg-data-for-mental-attention-state-detection). The simultaneous input of EEG data and diagnostic data corresponding to the same patients and controls is vital for ensuring that the model can effectively identify and compare the unique patterns associated with ADHD. This alignment enables the model to leverage multimodal data in a way that maximizes its diagnostic capabilities, offering a more comprehensive and accurate method for ADHD detection. On both databases, the proposed approach and existing approaches such as Att-AENet18, SVM19, 2D-CNN22, and FCNN23 are compared with respect to precision, NPV, FNR, recall, FPR, accuracy, F-score, MCC, and specificity to evaluate their performance in ADHD detection.

Performance analysis based on ADHD diagnosis database

The performance of the proposed method compared with other techniques such as Att-AENet18, SVM19, 2D-CNN22, and FCNN23 on the ADHD diagnosis database is presented in Table 6. Precision, NPV, FNR, recall, FPR, accuracy, F-score, MCC, and specificity are the evaluation metrics used in this analysis. The accuracy results show that the proposed approach outperforms the existing ones, largely because of the newly developed RFO. The accuracy of the proposed model is 98.52%, higher than existing techniques such as FCNN23 with 96.97% and Att-AENet18 with 96.91%. The suggested method also outshines the existing methods in the recall and MCC metrics: it obtained a recall of 98.05%, and for MCC the proposed model attained 96.32%, higher than the 93.96% of FCNN23 and the 93.52% of 2D-CNN22. Furthermore, the suggested approach attained an FNR of 0.0195, which is lower than the others. This analysis also shows that the proposed method outperforms the previous methods for detecting ADHD owing to the introduction of the novel ADHD-AttentionNet.

Table 6 Performance analysis based on ADHD diagnosis database.

Performance analysis based on EEG database

In Table 7, the performance of the stated and current methods such as Att-AENet18, SVM19, 2D-CNN22, and FCNN23 is evaluated on the EEG database. The evaluation measures used include precision, NPV, FNR, recall, FPR, accuracy, F-score, MCC, and specificity. The accuracy results reinforce the advantage of the proposed approach, which attained an accuracy of 97.89%, with DCNN16 having the nearest accuracy of 95.83%. The proposed model also performs well on the precision metric, with the highest precision of 97.84% while the other approaches do not reach 95%; this major performance difference is due to the new NeuroDCT-ICA. In terms of specificity, the stated approach achieved the highest value of 97.83%, compared with 94.48% for SVM19. Furthermore, the proposed approach provided the lowest FPR of 0.0217, while FCNN23 had the highest FPR of 0.0595. These results demonstrate the superiority of the proposed model over the previous methods for ADHD detection. The comparison between the suggested and current approaches is graphically represented in Fig. 3.

Table 7 Performance analysis based on EEG database.
Fig. 3
figure 3

Graphic representation of (a) Accuracy, (b) Precision, (c) Recall, (d) F1-score, (e) Specificity, (f) MCC, (g) NPV, (h) FPR, (i) FNR for proposed and existing models based on the two databases.

K-fold cross validation

K-fold cross-validation divides the dataset into ‘K’ equal subsets, or folds, in order to evaluate the performance of an ML model. The model is trained on ‘K−1’ folds and tested on the remaining fold, and the procedure is repeated ‘K’ times, each time using a different fold as the test set. The final performance metric is averaged over all ‘K’ iterations. This approach offers a more accurate measure of model performance while helping to reduce overfitting.
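A minimal scikit-learn sketch of the 5-fold procedure described above follows; the random features and labels and the logistic-regression placeholder stand in for the fused features and the proposed ADHD-AttentionNet, which are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.randn(100, 20)                 # illustrative fused feature vectors
y = np.random.randint(0, 2, size=100)        # illustrative ADHD / control labels

accs = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print("mean accuracy over 5 folds:", np.mean(accs))
```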

Table 8 Performance analysis on K-fold cross-validation (5 folds) for different datasets.

The performance study of several methods employing K-fold cross-validation with five folds for two distinct datasets is shown in Table 8. The proposed approach outperforms the other models, including Att-AENet (mean accuracy 0.9691), SVM (0.9686), 2D-Convolutional CNN (0.9675), and FCNN (0.9697), with the greatest mean accuracy of 0.9852 for Dataset 1. With a mean accuracy of 0.9792, the proposed approach also consistently outperforms Att-AENet (0.9592), SVM (0.9542), 2D-Convolutional CNN (0.949), and FCNN (0.9478) for Dataset 2. These outcomes demonstrate the proposed method’s efficacy in comparison to other state-of-the-art techniques by highlighting its consistent and superior performance across both datasets.

Another comparative analysis with K-fold cross-validation on the two datasets compares the proposed approach with several other existing models, namely the variational mode decomposition-Hilbert transform (VMD-HT), empirical mode decomposition-discrete wavelet transform (EMD-DWT), and CNN-Autoencoder (CNN-AE). The analysis using K-fold cross-validation is presented in Tables 9 and 10.

Table 9 Performance analysis on K-fold cross-validation (5 folds) for dataset 1 on another existing models.

The performance study of several methods utilizing K-fold cross-validation with five folds on Dataset 1 is shown in Table 9. The accuracies of the proposed approach, VMD-HT, EMD-DWT, and CNN-AE are compared over the five folds. The accuracy of the proposed approach is consistently high; Fold 5 has the greatest accuracy of 0.987, with a mean accuracy of 0.9852. The EMD-DWT approach yields a mean accuracy of 0.9686, and the VMD-HT method a mean accuracy of 0.9691. The CNN-AE approach has a mean accuracy of 0.9675 and consistently performs well. A mean accuracy of 0.9697 indicates that the 3D-CNN model performs well, surpassing both VMD-HT and EMD-DWT. According to these findings, CNN-AE and EMD-DWT show somewhat lower accuracies, whereas the proposed technique performs best overall across all folds. The data provided demonstrate that the suggested model offers the best accuracy and stability across the dataset’s folds.

Table 10 Performance analysis on K-fold cross-validation (5 folds) for dataset 2 on another existing models.

The performance study of various methods utilizing K-fold cross-validation with five folds on Dataset 2 is shown in Table 10. All folds show consistently high accuracy using the proposed approach, with Fold 5 producing the highest value of 0.9794, for a mean accuracy of 0.9792. The VMD-HT approach reaches a mean accuracy of 0.9592, while the EMD-DWT method has a mean accuracy of 0.9542, suggesting slightly weaker performance. The CNN-AE method performs more modestly with a mean accuracy of 0.949. These results reveal that the proposed technique performs better than the other models on Dataset 2 in terms of overall accuracy and consistency, while the other methods, including VMD-HT and EMD-DWT, perform similarly but comparatively poorly. Table 11 shows the processing time and memory usage for both datasets.

Table 11 Processing time and memory usage.

According to Table 11, the proposed method demonstrates the shortest processing time and lowest memory usage across both datasets, with times of 0.8 s (ADHD) and 1.0 s (EEG) and memory usage of 110 MB (ADHD) and 120 MB (EEG). In contrast, Att-AENet uses considerably more memory (250 MB and 210 MB, respectively) and takes considerably longer (2.0 s for ADHD and 2.3 s for EEG). SVM uses comparatively little memory (150 MB for EEG) but has the longest processing time (2.5 s) on the ADHD dataset. Additionally, compared with the suggested technique, the 2D-Convolutional CNN and FCNN exhibit longer processing times and higher memory usage. These outcomes demonstrate how efficiently the suggested model uses time and memory.

Conclusion

In this research, a DL-based ADHD detection model has been introduced with the primary aims of higher accuracy, effective data handling, and minimal complexity. To attain these goals, the work presents a new module called NeuroDCT-ICA, aimed at preprocessing raw EEG data by filtering out noise and extracting the necessary features. It also includes a new RFO optimization approach for feature selection to improve data-analysis speed and system reliability. At the center of the framework is ADHD-AttentionNet, a DL model that enhances the detection accuracy of the approach. The proposed approach was implemented in Python, and the validation results confirmed its outperformance, with an accuracy of 98.52%, precision of 98.27%, and recall of 98.04%. The future of ADHD detection lies in the use of wearables for continuous tracking, the creation of individualized, multi-modal diagnostic models, and the use of DL techniques to further increase detection accuracy. Culturally adapted models, improved detection tools, and Explainable AI will advance the availability, clinical utilization, and dependability of ADHD diagnosis.