Introduction

Climate change affects soil health, plant growth, and agricultural ecosystems, altering their productivity through shifts in temperature and precipitation patterns. Droughts, mudslides, and floods are among the severe natural disasters that can result from periods of exceptionally high or low temperatures and rainfall1. Crop yields are declining as a result of these changes. Climate change is also a major driver of year-to-year variation in tree growth: tree growth declines as temperatures rise, while relative humidity increases with higher daily maximum temperatures2,3. Compared with values observed between 1850 and 1900, the global mean annual air temperature in 2023 was 1.45 ± 0.12 °C higher. World Meteorological Organization data show that the average air temperature over 2014–2023 was 1.20 °C (± 0.12 °C) above the 1850–1900 level. The Paris Agreement seeks to keep the rise below 2 °C relative to pre-industrial levels, with limiting the increase to 1.5 °C as one of its primary goals. Avoiding the worst climate disasters is still within reach if greenhouse gas emissions are cut drastically and progress toward carbon neutrality is accelerated4. The development of policies for controlling climate change relies on accurate long-term weather forecasts5,6. Timely decisions in areas such as agriculture, urban planning, energy, and reservoir management benefit from accurate predictive information. The societal relevance of predicting monthly averages is far higher than that of seasonal forecasting7. However, current climate model predictions remain unsatisfactory, so reliable climate prediction is a top priority for the scientific community8. The subseasonal time scale, spanning two weeks to three months, is often described as a “predictability desert” because of its low forecasting accuracy.

AI models have been utilized extensively across disciplines, and research has introduced them into climate modeling, enabling time-series analysis in this context9,10. Machine learning (ML) algorithms provide a fresh perspective on weather prediction. For instance, outdoor air temperature for four European towns has been predicted using ANNs and RNNs, the latter being an adaptation of the former for sequential data. ANNs have also been used for precipitation estimation over the Brazilian Legal Amazon11. Neural network training has been validated by comparing predicted and measured air temperatures, although plain ANNs offer limited generalizability and predictive capability. Deep learning algorithms outperform multi-model approaches for predicting the El Niño-Southern Oscillation (ENSO) six months in advance12. For statistical downscaling, deep learning models such as LSTM are employed, as they can capture nonlinear interactions with long-term dependencies. A convolutional LSTM has been trained for precipitation nowcasting. Compared with dynamical prediction schemes, CNNs show superior correlation skill for the Niño3.4 index. CNNs also excel at predicting tropical cyclone wind radii, precipitation likelihood, and storm surge heights by extracting spatial information from high-resolution data13. The MTL-NET model predicts the IOD more effectively than most climate dynamical models, which fail to forecast the IOD seven months in advance14. Novel approaches, such as the Precipitation Nowcasting Network via Hypergraph Neural Networks (PN-HGNN) and the DEUCE framework combining Bayesian neural networks and U-Net, improve precipitation forecasts. Climate-invariant ML methods and latent deep operator networks (L-DeepONet) have been proposed for weather and climate prediction15. Additionally, visibility forecasts have been enhanced using the GCN-GRU architecture. However, establishing robust AI models for climate prediction remains challenging due to the complex nature of climate data.

Accurate climate prediction supports informed decision-making and readiness for weather-related occurrences, benefiting urban planning, disaster management, and agriculture16. Traditional approaches rely on computationally intensive physical models that often fail to capture regional weather patterns effectively17. This issue is pronounced in geographically diverse and dynamic regions like Delhi, where predictions suffer from poor scalability and reliability. To address these challenges, a scalable prediction model leveraging advanced deep learning techniques is essential. Such a model must incorporate spatial-temporal dynamics, handle diverse and large-scale data, and optimize feature selection for accurate and reliable predictions with minimal computational costs. This research aims to design a deep learning framework using Transformers and a Hierarchical Graph Representation and Analysis module (HGRA) to predict climate variables.

  • The proposed hybrid feature selection method, HWOA-TTA, combines the Whale Optimization Algorithm (WOA) with the Tiki-Taka Algorithm (TTA) to minimize computational costs, enhance model efficiency, and ensure optimal feature selection.

  • The Dynamic Temporal Graph Attention Mechanism (DT-GAM) and HGRA modules dynamically model interdependencies across time, localized interactions, and global patterns in past climate data to improve prediction precision across geographical and temporal scales.

  • Robust preprocessing procedures such as data cleaning, normalization, and sequence construction ensure high-quality inputs for the predictive model.

  • Performance comparisons with conventional methods using multiple metrics will determine whether the proposed model improves prediction capabilities, supports climate-related decision-making, and addresses climate variability and preparedness for weather-related events.

The rest of the paper is organized as follows: Sect. 2 reviews pertinent literature, Sect. 3 outlines the methodology, Sect. 4 discusses the results, and Sect. 5 concludes the work.

Related work

Three case studies were used by Pasche et al.18 to compare ML weather prediction models with ECMWF’s high-resolution forecasts (HRES): the 2021 Pacific Northwest heatwave, the 2023 South Asian humid heatwave, and the 2021 North American winter storm. During the Pacific Northwest’s record-breaking heatwave, ML weather prediction models outperformed HRES when applied locally but performed poorly when exposed to variables that varied across time and space. Their prediction of the compound winter storm, however, was far more accurate. For a thorough evaluation of the health hazards associated with the 2023 humid heatwave, the ML projections were found to lack key variables. The spatial patterns of prediction errors revealed that ML models underestimated the maximum risk levels over Bangladesh when a potential substitute variable was used. The consensus is that impact-centric evaluations guided by case studies can supplement current research, boost public trust, and facilitate the creation of trustworthy ML weather prediction models.

To predict potential daily rainfall situations for the next nine days, Tsoi et al.19 utilized an Analogue Forecast System (AFS) for precipitation. This system identifies historical cases with similar weather patterns to the latest output. Thanks to advancements in machine learning, more complex models can be trained with historical data, enabling better representation of severe weather patterns. Consequently, an autoencoder deep learning technique was used to produce an improved AFS. By comparing the datasets of the fifth-generation ECMWF Reanalysis (ERA5) with those of the present AFS, it was found that ERA5 offers more meteorological features at higher horizontal, vertical, and temporal resolutions. The daily rainfall class forecasts generated by the enhanced AFS followed four steps: (1) preprocessing gridded ERA5 and ECMWF model forecasts, (2) feature extraction using a pretrained autoencoder, (3) case identification, and (4) calculation of a weighted ensemble of top simulations. During the 2019–2022 verification period, the upgraded AFS consistently outperformed the previous AFS, especially in capturing occurrences of heavy rain. The study details the improved AFS and discusses its benefits and drawbacks as a tool for Hong Kong precipitation forecasting.

Zhang et al.20 developed a hybrid ML model (RF-LSTM-XGBoost) to forecast permafrost tables. Initially, the Spearman correlation coefficient method was used to identify key influencing factors along the Tuotuo River section of the Qinghai-Xizang Highway by evaluating climatic and ground temperature data at various depths and locations. Artificial permafrost tables were predicted using models like Extreme Gradient Boosting (XGBoost). Grid search and cross-validation were utilized to tune the hyperparameters of each model. Once the models were merged using a linear weighted combination approach based on the smallest cumulative absolute error, their performance was evaluated against individual RF, LSTM, and XGBoost models. Key variables influencing the model were identified, including subgrade surface ground heat flux, daily average air temperature, and artificial permafrost table fluctuations. Surface ground temperature was found to be the least significant predictor, while flux and daily atmospheric temperature were more impactful. The combined model outperformed others in prediction accuracy, achieving R² values of 0.989, with MAE, RMSE, MSE, and RMSE values of 0.003, 0.052, 0.0085, and 0.029, respectively. Subgrade stability studies in challenging permafrost conditions can benefit from this model’s quick, accurate, and dependable permafrost table predictions.

Price et al.21 observed that the reliability and precision of MLWP were significantly diminished compared to recent NWP ensemble estimates. In terms of speed and accuracy, the probabilistic weather model GenCast outperformed the state-of-the-art operational medium-range weather forecast, ENS. Trained using decades of reanalysis data, GenCast generates 15-day global projections for 80 surface and air variables within eight minutes, achieving a resolution of 0.25° latitude-longitude and 12-hour intervals. GenCast excelled in severe weather, tropical storm routes, and wind power generation, outperforming ENS on 97.2% of the 1,320 tested objectives. This advancement has enhanced operational weather forecasting and improved decision-making based on weather data.

Zhao et al.22 tested how well an architecture predicted the interior environment of a real-life office building 24 h in advance, comparing it with previous prediction models. Tests were conducted using both the full training set and a shortened set. Regardless of the scenario, the study’s multi-task prediction architecture outperformed or matched prior models. Using multi-task prediction algorithms to forecast interior building conditions proved highly beneficial.

Rampal et al.23 employed a conditional GAN (cGAN) to downscale daily precipitation as an RCM emulation. A deterministic deep learning algorithm generated predictable expectation states, while the cGAN added credibility by producing reliable residuals. The performance of cGANs is highly dependent on the loss hyperparameter. By applying a suitable loss function, consistent results were achieved across performance metrics. cGANs better captured the spatial characteristics and distributions of extreme occurrences, making them adaptable to various contexts and periods.

Valipour et al.24 used AI models like WPSOANFIS, WGMDH, and WLSTM to make short-term daily precipitation predictions at 28 sites in the contiguous US from 1995 to 2019. The WLSTM model outperformed others with mean absolute errors of 0.65 mm/d and an R² value of 0.91, demonstrating deep learning techniques’ superior accuracy for short-term forecasts.

Guo et al.25 proposed AI models like CNN-LSTM for predicting the climatic characteristics of Jinan, China, showcasing significant error reduction compared to conventional methods. Gerges et al.26 introduced DeepSI, a Bayesian deep learning framework for predicting daily solar irradiance. Du et al.27 validated machine learning models for ET estimation, highlighting their variability across climate zones. Youssef et al.28 confirmed ML methods’ accuracy in ETo predictions, emphasizing their efficiency. Wi & Steinschneider29 demonstrated the effectiveness of LSTM models in long-term hydrological forecasts.

Research gap

Despite significant advancements in machine learning and deep learning for climate prediction, several challenges remain. Many existing models fail to generalize across diverse spatial and temporal scales, leading to inaccuracies during extreme weather events. Recent research has explored graph-based approaches for climate forecasting, yet limitations persist.

Unlike the approach in25, which employed a single-layer Graph Neural Network (GNN) for climate forecasting, our model leverages a hierarchical graph-based structure that captures both localized and global climate patterns. While the work in26 focused on CNN-LSTM hybrid models, such approaches lack structured graph representations, limiting their ability to model complex climate interdependencies. Additionally, the models in30,31 integrated an attention-based Graph Neural Network but struggled with computational efficiency and scalability when handling high-resolution climate datasets.

To address these limitations, our research introduces a Hierarchical Graph Representation and Analysis (HGRA) module, combined with a Dynamic Temporal Graph Attention Mechanism (DT-GAM) within a Transformer-based framework. This approach enhances climate prediction by:

  • Capturing spatial-temporal dependencies across multiple hierarchical levels.

  • Improving scalability and computational efficiency using hybrid optimization (HWOA-TTA).

  • Enhancing accuracy compared to traditional graph-based models and CNN-LSTM architectures.

By integrating these innovations, our model bridges the gap between traditional deep learning models and graph-based climate forecasting, providing a scalable, interpretable, and efficient solution for regional climate prediction.

Proposed methodology

To identify climate change, an optimizer-based feature selection model with a deep learning component was employed. The next section provides a concise explanation of the components of this model, as shown in Fig. 1.

Fig. 1. Workflow of research model.

Data collection and preprocessing

To perform the experiment, daily climate data for Delhi from 2013 to 2017 were obtained from https://www.wunderground.com/history/monthly/in/new-delhi32. This dataset covers Delhi, India, from January 1, 2013, to April 24, 2017, and contains four key variables: mean temperature, humidity, wind speed, and mean pressure. The following preparation procedures were carried out to prepare the data for the proposed model:

  • Data Cleaning: Interpolation was applied to handle missing values, and statistical thresholds were used to identify and remove outliers.

  • Feature Selection: Correlation analysis and domain expertise were utilized to select pertinent features influencing daily climate conditions.

  • Normalization: To facilitate faster convergence during training, the data was normalized using Min-Max scaling, which ensured all features were on a similar scale.

  • Time Series Transformation: The data was transformed into a time-series (sequence) format for analysis; a minimal preprocessing sketch is given after this list.
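As a hedged illustration of these steps, the sketch below shows min-max scaling and sliding-window sequence construction in Python; the column names, window length, and next-day mean-temperature target are assumptions for illustration rather than the authors' exact pipeline.

```python
# Illustrative sketch (not the authors' code): interpolation, outlier capping,
# min-max normalization, and sliding-window construction for daily climate data.
import numpy as np
import pandas as pd

def preprocess(df, feature_cols=("meantemp", "humidity", "wind_speed", "meanpressure"),
               window=30):
    data = df[list(feature_cols)].astype(float)
    data = data.interpolate(limit_direction="both")            # fill missing values
    lo, hi = data.quantile(0.01), data.quantile(0.99)          # simple statistical outlier thresholds
    data = data.clip(lower=lo, upper=hi, axis=1)               # cap extreme values column-wise
    scaled = (data - data.min()) / (data.max() - data.min())   # min-max scaling to [0, 1]

    X, y = [], []
    values = scaled.to_numpy()
    for t in range(len(values) - window):
        X.append(values[t:t + window])     # past `window` days as one input sequence
        y.append(values[t + window, 0])    # assumed target: next-day mean temperature
    return np.asarray(X), np.asarray(y)
```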

The visualization of temperature, humidity, wind speed, and pressure is presented in Figs. 2, 3, 4, and 5.

Fig. 2. Temperature analysis.

Fig. 3. Humidity study.

Fig. 4. Wind speed analysis.

Fig. 5. Pressure analysis.

Feature selection using HWOA-TTA

The hybrid optimizer33 is used in this study to select the best features for detecting climate change from the preprocessed input data.

Inspiration of Whale optimization algorithm (WOA)

For problems requiring continuous optimization, the swarm intelligence technique known as the Whale Optimization Algorithm (WOA) was proposed. It has been demonstrated that this algorithm outperforms several other currently used algorithmic strategies. The hunting behaviors of humpback whales inspired WOA. In this algorithm, every solution is considered a whale, with the best-performing solution representing the group’s leader. The approach involves a whale attempting to reach a new position within the search space. Whales undergo two distinct stages during their hunt: finding prey and attacking. The first stage involves encircling the prey, while the second stage creates bubble nets. Accordingly, whales explore the area for prey and utilize the information gathered to execute their attacks. WOA replicates these social and cooperative behaviors of whales, particularly their bubble-net hunting strategy34.

Human emotion, thought, and social interaction are governed by specialized neurons, which differentiate humans from other animals. Interestingly, whales possess twice as many spindle-shaped cells as adult humans, making them significantly intelligent. Although their IQ is lower than that of humans, whales demonstrate similar capacities for thought and emotion. Killer whales, in particular, have shown evidence of developing a language. The social behavior of whales is extraordinary; they may form small groups or remain solitary. However, they are often observed in groups. For instance, killer whales stay with their families throughout their lives. The humpback whale (Megaptera novaeangliae), the largest baleen whale, can grow to the size of a school bus. Notably, only humpback whales engage in the unique behavior of bubble-net feeding34.

Swarming the prey

A humpback whale’s hunting strategy involves circling its prey after locating it. Since the exact position of the optimal solution in the search space cannot be determined in advance, the Whale Optimization Algorithm (WOA) operates under the assumption that the current best solution is either the target or very close to it. Once the top search agent is identified, the remaining search agents attempt to align their positions with the top search agent. This behavior can be mathematically expressed through the following equations:

$$\:\overrightarrow{D}=\left|\overrightarrow{C}.{\overrightarrow{X}}^{*}\left(t\right)-{\overrightarrow{X}}_{t}\right|$$
(1)
$$\:\overrightarrow{X}\left(t+1\right)={\overrightarrow{X}}^{*}\left(t\right)-\overrightarrow{A}.\overrightarrow{D}.$$
(2)

where \(t\) is the current iteration, \(\overrightarrow{A}\) and \(\overrightarrow{C}\) are coefficient vectors, \({\overrightarrow{X}}^{*}\) is the position vector of the best solution obtained so far, \(\overrightarrow{X}\) is the position vector of the current solution, (.) denotes element-wise multiplication, and | | is the absolute value. Note that \({\overrightarrow{X}}^{*}\) should be updated whenever a better solution becomes available in an iteration.

The \(\overrightarrow{A}\) and \(\overrightarrow{C}\) vectors are computed as follows:

$$\:\overrightarrow{A}=2.\overrightarrow{a}.\overrightarrow{r}-\overrightarrow{a}$$
(3)
$$\:\overrightarrow{C}=2.\overrightarrow{r}.$$
(4)

where \(\overrightarrow{a}\) is linearly decreased from 2 to 0 over the course of the iterations and \(\overrightarrow{r}\) is a random vector in the interval [0, 1].

Applying the same concept in an n-dimensional space means that search agents move within hypercubes centred on the best solution obtained so far. As mentioned above, humpback whales also hunt using a bubble-net technique, which can be expressed mathematically as follows.

Attacking with a bubble net (exploitation phase)

Humpback whales’ bubble-net behaviour can be described quantitatively in two ways. The first is the shrinking encircling mechanism, obtained by decreasing the value of a in Eq. (3); note that decreasing a also narrows the range of variation of A, so A takes random values in [−a, a] as a is reduced from 2 to 0. With random values of A in [−1, 1], the new position of a search agent can lie anywhere between its original position and the position of the current best agent.

To mimic the helix-like movement of humpback whales, a spiral equation linking the prey’s position to the whale’s position is devised:

$$\:\overrightarrow{X}\left(t+1\right)={\overrightarrow{D}}^{{\prime\:}}.{e}^{bl}.\text{cos}\left(2\pi\:l\right)+{\overrightarrow{X}}^{*}\left(t\right).$$
(5)

where b is a constant defining the shape of the logarithmic spiral, l is a random number in [−1, 1], (.) denotes element-wise multiplication, and \({\overrightarrow{D}}^{{\prime}}=\left|{\overrightarrow{X}}^{*}\left(t\right)-\overrightarrow{X}\left(t\right)\right|\) is the distance of the i-th whale from the prey (the best solution found so far).

To simulate this simultaneous behaviour, the whales’ positions are updated during optimization using either the shrinking encircling mechanism or the spiral model, each chosen with equal probability. The mathematical model is expressed as follows.

$$\overrightarrow{X}\left(t+1\right)=\left\{\begin{array}{ll}{\overrightarrow{X}}^{*}\left(t\right)-\overrightarrow{A}\cdot\overrightarrow{D}, & if\ p<0.5\\ {\overrightarrow{D}}^{{\prime}}\cdot{e}^{bl}\cdot\text{cos}\left(2\pi l\right)+{\overrightarrow{X}}^{*}\left(t\right), & if\ p\ge 0.5\end{array}\right..$$
(6)

where p is a random number drawn uniformly from [0, 1].

In addition to the bubble-net manoeuvre, humpback whales also search for prey in a more random fashion; the mathematical model of this search is given below.

Prey detection (exploration phase)

The same \(\overrightarrow{A}\) vector, with its values varied, is used to drive the exploration for prey. Rather than searching systematically, humpback whales search randomly with respect to one another. To push a search agent away from a reference whale, random values of A with |A| > 1 are used. During the exploration phase, the search agent is therefore updated using a randomly selected search agent rather than the best one. This mechanism, together with the condition |A| > 1, enables the WOA algorithm to perform an effective global search. The mathematical model of this behaviour is expressed by the following equations:

$$\:D=\left|C.\overrightarrow{{X}_{rand}}-\overrightarrow{X}\right|$$
(7)
$$\:\overrightarrow{X}\left(t+1\right)=\overrightarrow{{X}_{rand}}-\overrightarrow{A}.\overrightarrow{D}.$$
(8)

where \({X}_{\text{rand}}\) is a random position vector (a whale chosen at random) from the current population. This balance between exploration and exploitation allows WOA to act as a global optimizer, while the proposed hypercube mechanism restricts the search to a region near the best solution so that other search agents can exploit the current best information.
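A minimal sketch of one WOA position update, combining Eqs. (1)–(8), is given below; the population layout, the spiral constant b, and the use of the vector norm of A to switch between exploitation and exploration are assumptions for illustration.

```python
# Illustrative sketch of a single WOA position update (Eqs. 1-8).
import numpy as np

def woa_step(positions, best, a, b=1.0, rng=np.random.default_rng()):
    """positions: (n, d) whale positions; best: (d,) best solution so far; a is decreased 2 -> 0."""
    n, d = positions.shape
    new_pos = np.empty_like(positions)
    for i in range(n):
        A = 2 * a * rng.random(d) - a                # Eq. (3)
        C = 2 * rng.random(d)                        # Eq. (4)
        if rng.random() < 0.5:
            if np.linalg.norm(A) < 1:                # exploitation: encircle the best solution
                D = np.abs(C * best - positions[i])            # Eq. (1)
                new_pos[i] = best - A * D                      # Eq. (2)
            else:                                    # exploration: move relative to a random whale
                rand_whale = positions[rng.integers(n)]
                D = np.abs(C * rand_whale - positions[i])      # Eq. (7)
                new_pos[i] = rand_whale - A * D                # Eq. (8)
        else:                                        # spiral bubble-net attack
            l = rng.uniform(-1, 1)
            D_prime = np.abs(best - positions[i])
            new_pos[i] = D_prime * np.exp(b * l) * np.cos(2 * np.pi * l) + best  # Eq. (5)
    return new_pos
```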

Inspiration of the Tiki-Taka algorithm (TTA)

Barcelona Football Club (BCF) and the Spanish national team are renowned for their tiki-taka style of play. The defining characteristics of this approach include quick passing, constant movement, and maintaining control of the ball. A football team can gradually build up offensive plays from a defensive position using the tiki-taka technique. This strategy was developed by Johan Cruyff and popularized by Pep Guardiola. Between 2008 and 2012, BCF utilized this approach and won 14 of their 19 titles. Additionally, the tiki-taka style influenced Luis Aragones’s 2008 UEFA European Football Championship-winning team and Vicente del Bosque’s 2010 FIFA World Cup-winning team for Spain.

Physical football, which emphasizes an opponent’s strength, speed, and ability to man-mark, contrasts sharply with the tiki-taka style. Only a few astute players are required to master the tiki-taka game if they can move quickly and position teammates effectively. These players become the key members of the team, setting the match’s tempo. Xavi Hernandez, Andrés Iniesta, and Lionel Messi were notable players for BCF who excelled in this style. To develop offensive plays, players constantly seek opportunities to pass the ball to crucial teammates. Tiki-taka tactics aim to outmaneuver opponents by maintaining ball possession through tactical dominance and agility.

Mathematical representation

The Tiki-Taka Algorithm (TTA) draws inspiration from two primary elements of the tiki-taka tactic: concise passing and dynamic player mobility. During an actual game, participants form a triangular formation of three players who continuously exchange the ball. As soon as opponents move to intercept the current triangle, the players strategically reconfigure into a new triangular formation by occupying more advantageous positions. The TTA employs a short passing method, wherein the player (representing a possible solution) transfers the ball to a neighboring player. Subsequently, the player seeks a more advantageous position based on the current situation.

Key characteristics of the tiki-taka football style include short passing and possession, which utilize tactical dominance to control and outmaneuver opponents. Similarly, the TTA aims to optimize short passing and player placement.

The TTA mirrors the genuine tiki-taka approach by incorporating critical player positions into its organizational structure. It uses multiple important players (leaders) to enhance solution diversity and prevent the algorithm from being trapped in local optima35. Consider a football squad consisting of n players. Player positions, represented in d dimensions, are randomly generated within defined bounds to depict potential solutions. The number of crucial players, denoted as nk, is at least three or approximately 10% of all participants. The concept of multiple important players (leading solutions) in the TTA mimics the authentic tiki-taka strategy, maintaining diversity as different leaders influence candidate solutions.

A matrix B is constructed to represent the ball’s location. The vector that directs a player’s movement during the exploitation phase is represented by the ball position. For the initial solution, B = P.

$$B=\left(\begin{array}{cccc}{b}_{1,1} & {b}_{1,2} & \cdots & {b}_{1,d}\\ {b}_{2,1} & {b}_{2,2} & \cdots & {b}_{2,d}\\ \vdots & \vdots & \ddots & \vdots\\ {b}_{n,1} & {b}_{n,2} & \cdots & {b}_{n,d}\end{array}\right).$$
(9)

The objective function is used to assess the initial player positions P. The top \({n}_{\text{k}}\) players are stored in the key-player archive, which is refreshed with the most up-to-date values at every cycle.

Update the position of the ball

The main characteristic of tiki-taka is the use of short passing, and the proposed algorithm fully embraces this concept: the ball is passed from one player to the next. Even though tiki-taka passes are usually successful, possession can still be lost to the opposing team. In this study, the probability of losing possession varies between 10% and 30% of the total passes. The updated ball position is given by

$$\:{b}_{i}^{{\prime\:}}=\left\{\begin{array}{c}rand\left({b}_{i}-{b}_{i+1}\right)+{b}_{i},\:\:\:\:\:\:\:\:\:{r}_{p}>{prob}_{lose}\\\:{b}_{i}-\left({c}_{1}+rand\right)\left({b}_{i}-{b}_{i+1}\right)\:\:\:\:{r}_{p}\le\:{prob}_{lose}\end{array}\right..$$
(10)

where \({r}_{p}\) is a random number in (0, 1). For a successful pass (\({r}_{p}>{prob}_{lose}\)), the ball is passed at random toward the next player. As seen in Eq. (10), if a pass is unsuccessful, the ball is intercepted and sent behind the player; the coefficient c1 controls how far the ball is reflected in an unsuccessful pass. The term (\({b}_{\text{i}}\) − \({b}_{\text{i}+1}\)) represents the distance between the i-th ball position and the (i+1)-th ball position; for the last ball, \({b}_{\text{i}+1}\) is replaced by \({b}_{1}\).

Modifying players’ roles

The receiver of the ball must then adjust his position within the formation for optimal passing. The movement of tiki-taka players is affected by the ball, key players, and the positioning of opponents. As described in35, the update of ball and player positions in this algorithm considers the ball’s position and the key players (h). Because several key players take part in the update and are selected at random, player mobility does not depend on the position of a single best player35.

The player update follows the formula:

$${p}_{i}^{{\prime}}={p}_{i}+rand*{c}_{2}*\left({b}_{i}^{{\prime}}-{p}_{i}\right)+rand*{c}_{3}*(h-{p}_{i}).$$
(11)

where h denotes a key player’s position, and \({c}_{2}\) and \({c}_{3}\) weight the movement relative to the ball and to the key player, respectively. The new position is evaluated, and the key-player positions are recorded in the archive.
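A minimal sketch of the short-passing and player-repositioning updates in Eqs. (10)–(11) follows; the coefficients c1–c3, the possession-loss probability, and the random choice of one key player are illustrative assumptions.

```python
# Illustrative sketch of the TTA ball and player updates (Eqs. 10-11).
import numpy as np

def tta_step(players, balls, key_players, prob_lose=0.2,
             c1=0.3, c2=1.0, c3=1.0, rng=np.random.default_rng()):
    n, d = players.shape
    new_balls = balls.copy()
    for i in range(n):
        nxt = balls[(i + 1) % n]                     # b_{i+1}, wrapping to b_1 for the last ball
        if rng.random() > prob_lose:                 # successful short pass
            new_balls[i] = rng.random(d) * (balls[i] - nxt) + balls[i]
        else:                                        # possession lost: ball reflected behind the player
            new_balls[i] = balls[i] - (c1 + rng.random(d)) * (balls[i] - nxt)

    h = key_players[rng.integers(len(key_players))]  # one of several key players (leaders)
    new_players = (players
                   + rng.random((n, d)) * c2 * (new_balls - players)   # move relative to the ball
                   + rng.random((n, d)) * c3 * (h - players))          # move relative to the key player
    return new_players, new_balls
```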

The short-passing mechanism used for solution exploitation makes the proposed method distinctive. The choice of the next player to receive the ball and the presence of several key players shape the TTA’s solution update by directing the search path35.

$${p}_{ij}^{t+1}=\theta*\left(\omega {p}_{ij}^{t}+{c}_{1}{rand}_{1}\left({b}_{i}^{{\prime}}-{x}^{t}\right)+{c}_{2}{rand}_{2}\left({Whale\ position}^{t}-{x}^{t}\right)\right).$$
(12)

Hybrid WOA-TTA method

The WOA algorithm is a relatively recent optimization technique that has succeeded in solving a number of optimization problems. The native technique generates candidate solutions with a blind operator, without considering their fitness values. Here, this operator is replaced by a local search that starts from a solution, improves it, and then replaces it, and the global-search algorithm (WOA) is combined with the local-search algorithm (TTA). The hybrid brings together the best of both WOA and TTA. The constant inertia weight of TTA limits its search region, which is a drawback when dealing with higher-order or complex design problems; this is addressed by combining the strongest elements of WOA and TTA into a hybrid WOA-TTA.

During the exploration phase, the whale optimizer is used because it covers a larger region of the uncertain search space. Throughout the run, from the first to the final iteration, the space is treated as uncertain because both methods are randomized, and in this phase the algorithm can evaluate many potential solutions. Because the whale’s position is very effective at moving a solution toward the optimum, it replaces the particle position responsible for finding the optimal solution to the complicated nonlinear problem; by reducing computation time, WOA accelerates the directional optimization of the particles. Combining WOA exploration with TTA exploitation ensures that the best possible solution is obtained while avoiding local stagnation or local optima; TTA is well known for exploiting promising regions of the search space. The hybrid WOA-TTA approach therefore seeks the solution by combining the strengths of TTA and WOA during the exploration phase, where \(Whale\ position_{t}={X}^{*}\) (the best search agent). To analyse convergence behaviour, the constriction factor is defined as follows:

$$\:\theta\:=\frac{2}{\left|2-\phi\:-\sqrt{{\phi\:}^{2}-4\phi\:}\right|}.$$
(13)

where φ must be greater than 4.0; it is chosen here so that \(\theta =0.729\).

Despite its simplicity, robustness, and ease of implementation, the TTA approach can become stuck in local minima when faced with significant constraints. In contrast, WOA maintains a healthy balance between exploration and exploitation, which helps it escape local traps. The hybrid therefore combines these two outstanding features of TTA and WOA. A minimal sketch of the hybrid update is given below.
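The sketch below illustrates the hybrid update of Eqs. (12)–(13), where a TTA-style player move is guided by both the passed ball and the current WOA best agent and damped by the constriction factor θ; the inertia weight ω and acceleration coefficients are assumptions for illustration.

```python
# Illustrative sketch of the hybrid WOA-TTA update (Eqs. 12-13).
import numpy as np

def constriction_factor(phi=4.1):
    # Eq. (13); phi must exceed 4.0, giving theta ~= 0.729 for phi = 4.1
    return 2.0 / abs(2.0 - phi - np.sqrt(phi**2 - 4.0 * phi))

def hybrid_update(player, ball, whale_best, omega=0.7, c1=2.0, c2=2.0,
                  rng=np.random.default_rng()):
    theta = constriction_factor()
    r1, r2 = rng.random(player.shape), rng.random(player.shape)
    return theta * (omega * player
                    + c1 * r1 * (ball - player)          # attraction to the passed ball (TTA)
                    + c2 * r2 * (whale_best - player))   # attraction to the WOA best agent
```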

Classification of climate change prediction

Background of deep learning model

A powerful alternative to traditional machine learning approaches is the use of deep learning models. In recent years, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have gained significant popularity. RNNs and LSTMs excel at modeling temporal relationships due to their ability to handle time-series data as sequences, while CNNs are adept at capturing spatial patterns in time-series data by treating multichannel signals as images or matrices.

Recent studies have explored hybrid models that combine the strengths of both approaches to learn spatial and temporal features simultaneously. One such model is the CNN-LSTM architecture. While these technologies outperform traditional machine learning methods, they are not without challenges. A primary limitation is their inability to fully capture the complex spatiotemporal relationships inherent in certain datasets. Additionally, training these models often requires large amounts of labeled data, which can be difficult to obtain in real-world operational contexts. Furthermore, despite their sophistication, deep learning models often function as “black boxes,” providing little transparency into their decision-making processes.

Graph-based approaches

The ability of graph-based methods to represent the intricate network structure of spatial systems has led to their rapid rise in popularity. Traditional approaches, and even some deep learning techniques, struggle to capture the spatial dependencies and interactions between regions, whereas graph-based methods excel in this regard. These techniques can dynamically model the interactions of multiple regions over time, offering deeper insight into the complex dependencies that drive the quantity being monitored.

A significant advantage of graph-based methods is their interpretability: they can pinpoint the specific regions or connections that are crucial for a given task, providing valuable insight for research and applications. However, conventional graph-based approaches typically handle static representations, which makes it difficult to integrate spatial graph structures with temporal information. Recent advances have begun addressing this issue by incorporating temporal dynamics into graph models. With their high accuracy and interpretability, graph-based methods show immense potential for climate monitoring, and although further work is required to realize their full capabilities, they represent a promising avenue for this study.

Let \(D=\{({X}_{i},{y}_{i})\}_{i=1}^{N}\) represent a dataset of climate records, where \({X}_{i}\in {R}^{C\times T}\) denotes a multivariate record with C channels and T time steps, and \({y}_{i}\in \{1,...,K\}\) (where K is the total number of classes) is the associated climate-condition label. The main learning objective is to find a function that maps the data to the correct weather condition:

$$\:f:{R}^{C\times\:T}\to\:\left\{1,\dots\:,K\right\}.$$
(14)

To facilitate the learning process, the raw data \({X}_{\text{i}}\) is cleaned of noise and artifacts, yielding the preprocessed signal \({\stackrel{\sim}{X}}_{i}\). This preparation stage comprises interpolation of missing values, outlier removal, and normalization. The preprocessed signal \({\stackrel{\sim}{X}}_{i}\) is then segmented into windows of a specified size, each representing a smaller time frame of the data:

$${\stackrel{\sim}{X}}_{i}^{j}\in {R}^{C\times W},$$
(15)

which denotes the j-th window of the i-th record, where W is the window size.

The model processes each window \({\stackrel{\sim}{X}}_{i}^{j}\) independently to capture the data’s spatial and temporal interdependence through a sequence of transformations. A spatio-temporal attention mechanism is used, which can be expressed mathematically as

$$\:{A}_{i}^{j}=softmax\left(\frac{\left({Q}_{i}^{j}\right){\left({K}_{i}^{j}\right)}^{\text{{\rm\:T}}}}{\sqrt{{d}_{k}}}\right){V}_{i}^{j}.$$
(16)

Here, \({Q}_{i}^{j}\), \({K}_{i}^{j}\), and \({V}_{i}^{j}\) are the query, key, and value matrices obtained from window \({\stackrel{\sim}{X}}_{i}^{j}\), and \(d_k\) is the dimensionality of the keys. The attention mechanism computes a weighted combination of the values \({V}_{i}^{j}\), with weights determined by the similarity between queries and keys. A minimal sketch of this computation is given below.
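The following sketch implements the scaled dot-product attention of Eq. (16) for a single window; the window length, number of channels, and projection size are illustrative assumptions.

```python
# Illustrative sketch of Eq. (16): scaled dot-product attention over one window.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # queries, keys, values from the window
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # similarity of queries and keys
    return softmax(scores, axis=-1) @ V            # attention-weighted values

# Example: an assumed 30-step window with 4 climate channels projected to 16 dimensions
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
Wq, Wk, Wv = (rng.standard_normal((4, 16)) for _ in range(3))
A = window_attention(X, Wq, Wk, Wv)                # shape (30, 16)
```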

Afterwards, a complete picture of the whole signal is created by aggregating the attention outputs of all windows of the meteorological data. This representation is then fed into a graph neural network (GNN), which models the connections among the spatial regions (channels) of the data. The graph is denoted G = (V, E), where V is the set of nodes (regions) and E is the set of edges (connections) between them. At each GNN layer, the graph convolution operation is written as

$$\:{H}^{(l+1)}=\sigma\:\left({D}^{-\frac{1}{2}}{AD}^{-\frac{1}{2}}{H}^{\left(l\right)}{W}^{\left(l\right)}\right).$$
(17)

where \({H}^{\left(l\right)}\) denotes the node features at layer l, A is the graph adjacency matrix, D is the degree matrix, \({W}^{\left(l\right)}\) is the trainable weight matrix, and σ(·) is a non-linear activation function. Combining the GNN’s final output with the Transformer’s temporal feature extraction yields a representation of the spatial dependencies between the different regions.

The probability distribution over the classes of weather conditions is then generated by feeding the combined features into a fully connected layer followed by a softmax function.

$$\:{\widehat{y}}_{i}=softmax({W}_{f}{h}_{i}+{b}_{f}).$$
(18)

where \({W}_{\text{f}}\) and \({b}_{\text{f}}\) are the weights and bias of the output layer, and \({h}_{\text{i}}\) is the combined feature vector. The model is trained by minimizing the cross-entropy loss between the predicted labels \({\widehat{y}}_{i}\) and the true labels \({y}_{\text{i}}\):

$$\mathcal{L}\left(\theta \right)=-\frac{1}{N}\sum _{i=1}^{N}\sum _{k=1}^{K}{y}_{i,k}\,\text{log}\,{\widehat{y}}_{i,k}.$$
(19)

where θ denotes all trainable parameters of the model, and \({y}_{i,k}\) is a binary indicator (0 or 1) of whether class k is the correct label for sample i. The following sections build on this formalization and describe the proposed architecture and its components in detail. A minimal sketch of the graph convolution, classification head, and loss is given below.
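This sketch illustrates Eqs. (17)–(19): a symmetrically normalized graph convolution, a softmax classification head, and the cross-entropy loss; using ReLU as σ and adding self-loops to the adjacency matrix are assumptions for illustration.

```python
# Illustrative sketch of Eqs. (17)-(19).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gcn_layer(H, A, W):
    A_hat = A + np.eye(A.shape[0])                     # assumed self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))  # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # Eq. (17) with ReLU

def classify(h, W_f, b_f):
    return softmax(W_f @ h + b_f)                      # Eq. (18)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    return -np.mean(np.sum(y_onehot * np.log(y_hat + eps), axis=-1))   # Eq. (19)
```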

Hybrid optimization with WOA-TTA framework

The proposed model embeds climate signals within a Transformer-based framework, a significant first for climate monitoring. It is designed to exploit both the spatial and temporal properties of the data. To overcome the limitations of conventional methods, the model integrates recent advances in graph-based deep learning and multimodal spatiotemporal attention mechanisms, offering a scalable, interpretable, and flexible solution that is well suited to operational and practical applications, especially in real-time environments.

This Transformer-based technique excels at capturing complex patterns and long-range relationships in data, addressing challenges posed by signal unpredictability and the growing demand for personalized models. The model focuses on the most critical elements during training by employing spatiotemporal attention mechanisms. Additionally, by incorporating graph neural networks, it provides deeper insights into the interactions between different components, enhancing the accuracy of inferences.

With its reliable and adaptable approach to early detection and continuous assessment, this development is poised to have a significant impact on the field.

Modules of the proposed architecture

Dynamic Temporal graph attention mechanism (DT-GAM): The DT-GAM module was built to capture the data’s ever-changing temporal dependencies. Unlike typical temporal modeling approaches, DT-GAM employs a graph attention mechanism to prioritize important aspects within specific time periods, adaptively adjusting the relationships between temporal nodes. This architecture improves the model’s capacity to capture temporal information, leading to more accurate predictions of the various climate conditions.

Hierarchical graph representation and analysis module (HGRA): A multi-level graph structure is built using the proposed HGRA module to better mimic the intricate connections between different spatial regions. By aggregating information across several hierarchical levels, HGRA captures both local and global spatial interdependence. This not only makes it easier to see the relevance of individual regions in monitoring but also improves the model’s ability to understand the underlying spatial structure.

Spatial-Temporal fusion module (STFM): The STFM module is introduced to allow the seamless combination of spatial and temporal information. By thoroughly integrating spatial and temporal data, it creates a complete representation of the signal. Incorporating STFM gives the model a fuller grasp of the intricate changes linked to different states, greatly enhancing its depth of interpretation compared with conventional methods that rely only on spatial or temporal aspects.

Integrated interpretability and adaptability design: The modular design not only offers improved interpretability and extensibility but also significantly improves classification performance. The method is flexible enough to accommodate various datasets and scenarios, and it provides visual explanations for every module, making the model well suited to individualized remote monitoring.

  1. A)

    Dynamic temporal graph attention mechanism (DT-GAM).

    Central to the proposed model is the Dynamic Temporal Graph Attention Mechanism (DT-GAM), which allows it to capture the data’s complicated temporal connections. Accurate monitoring relies on this mechanism’s capacity to sharpen the model’s focus on the most relevant temporal features. In DT-GAM, the climate data is represented as a graph across time, with each node standing for a channel and the edges showing the interactions between these channels at different time intervals. Because the mechanism is dynamic, it can change its structure as patterns evolve, so the most important periods are prioritized. The temporal attention mechanism can be expressed as:

    $$Temporal\text{-}Attention({Q}_{t},{K}_{t},{V}_{t})=softmax\left(\frac{{Q}_{t}{K}_{t}^{\text{T}}}{\sqrt{{d}_{t}}}\right){V}_{t}$$
    (20)

    where \({Q}_{t},{K}_{t},{V}_{t}\) are the temporal query, key, and value matrices. The dimension \({d}_{\text{t}}\) acts as a scaling factor that helps stabilize the gradients during training. Through this attention mechanism, the model learns to prioritize the temporal patterns that are most predictive of particular climate conditions by dynamically weighting the importance of distinct time steps.

    In DT-GAM, the temporal dependencies are modelled as a graph \({G}_{t}=({V}_{t},{E}_{t})\), where each node \(v\in {V}_{t}\) corresponds to a time step and each edge \(e\in {E}_{t}\) encodes the temporal relationship between readings taken at different times. The mechanism revises the weights of these links using:

    $${A}_{t}=softmax\left(\frac{{Q}_{t}{K}_{t}^{\text{T}}}{\sqrt{{d}_{t}}}\right)$$
    (21)

    where \({A}_{t}\) is the temporal attention matrix that modulates the influence of each time step according to its relevance to the task at hand. By adjusting these scores, the model adaptively focuses on the most informative parts of the data by modulating the interaction between nodes (time steps).

    The temporal graph convolutional layer takes the output of the mechanism and uses it to improve the embeddings by combining data from the appropriate time steps.

    $$\:{H}_{t}^{(l+1)}=\sigma\:\left({A}_{t}{H}_{t}^{\left(l\right)}{W}_{t}^{\left(l\right)}\right)$$
    (22)

    Here, \({H}_{t}^{\left(l\right)}\) denotes the node embeddings at layer l, and \({W}_{t}^{\left(l\right)}\) is the learnable weight matrix of the temporal graph convolution layer. This process is repeated across several layers of the model, allowing it to grasp increasingly complex temporal relationships within the data.

    Next, the refined temporal features from DT-GAM are integrated with the spatial features using a fusion technique, and the combined output is processed by a fully connected layer.

    $$\:{h}_{f}=ReLU({W}_{f}\left[{h}_{t};{h}_{s}\right]+{b}_{f})$$
    (23)

    where \({h}_{t}\) and \({h}_{s}\) are the temporal and spatial feature vectors, respectively, and [·; ·] denotes their concatenation. The fusion layer merges the spatial structure with the temporal dynamics captured by DT-GAM to create a complete picture of the data.

    The probability distribution over the classes of weather conditions is then generated by passing the fused representation through the output layer.

    $$\:{\widehat{y}}_{i}=softmax({W}_{o}{h}_{f}+{b}_{o})$$
    (24)

    where \({W}_{o}\) and \({b}_{o}\) are the weights and bias of the output layer.

    As a result, the proposed model can use DT-GAM to focus on the most important temporal aspects in real time, which improves prediction accuracy and sheds light on how conditions evolve over time. A minimal sketch of the DT-GAM update is given below.
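    The following hedged sketch illustrates the DT-GAM computations of Eqs. (20)–(22); the projection matrices and the use of ReLU as the activation are illustrative assumptions.

```python
# Illustrative sketch of the DT-GAM temporal attention and graph convolution (Eqs. 20-22).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dt_gam_layer(H_t, Wq, Wk, Wv, W_l):
    """H_t: (T, d) node embeddings, one node per time step."""
    Q, K, V = H_t @ Wq, H_t @ Wk, H_t @ Wv
    A_t = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)   # Eq. (21): temporal attention weights
    attended = A_t @ V                                        # Eq. (20): attention-weighted values
    H_next = np.maximum(A_t @ H_t @ W_l, 0.0)                 # Eq. (22): temporal graph convolution
    return H_next, attended
```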

  2. B)

    Hierarchical Graph Representation and Analysis (HGRA).

    To accurately represent the data’s multi-scale dependencies, the proposed model uses the Hierarchical Graph Representation and Analysis module. This module is built to exploit the grouping of spatial regions and their interactions, so that the model can learn both local and global patterns linked with climate conditions.

    At the core of the HGRA module is a hierarchy of graphs \(G=\{{G}_{1},{G}_{2},...,{G}_{L}\}\), where each graph \({G}_{l}=({V}_{l},{E}_{l})\) corresponds to a different level of abstraction. In the graph \({G}_{1}\), individual data channels are represented as nodes, with edges indicating connections among them. Higher levels \({G}_{2},...,{G}_{L}\) aggregate these channels into larger groups or regions, capturing more abstract connections between them.

    The node embeddings \({H}_{l}^{\left(0\right)}\) at each level l are initialized from the data features, with finer-grained features at the lower levels and more abstract representations at the higher levels. At each level, the graph convolution operation is defined as:

    $$\:{H}_{l}^{(k+1)}=\sigma\:\left({D}_{l}^{-\frac{1}{2}}{A}_{l}{{D}_{l}}^{-\frac{1}{2}}{H}_{l}^{\left(k\right)}{W}_{l}^{\left(k\right)}\right)$$
    (25)

    where \({A}_{l}\) is the adjacency matrix of graph \({G}_{\text{l}}\), \({D}_{\text{l}}\) is its degree matrix, \({W}_{l}^{\left(k\right)}\) is the weight matrix for level l, and σ(·) is a non-linear activation function. This operation iteratively refines the node embeddings by aggregating information from neighbouring nodes, allowing the model to grasp the hierarchical dependencies in the data.

    The HGRA module employs a pooling technique to combine nodes from lower levels and pass them to higher levels, enabling the integration of information across multiple levels of the hierarchy. One way to formalize this pooling procedure is as follows:

    $$\:{H}_{l+1}^{\left(0\right)}=Pool\left({H}_{l}^{\left({K}_{l}\right)}\right)$$
    (26)

    where \({H}_{l+1}^{\left(0\right)}\) represents the initial embeddings at level l+1, \({H}_{l}^{\left({K}_{l}\right)}\) denotes the final embeddings at level l, and Pool(·) aggregates information from the lower-level nodes. Common pooling choices include max and average pooling, as well as more complex attention-based approaches that weight the contribution of each node.

    Principal aggregation and distribution (PAD) layer: A PAD layer is also included in the HGRA module to improve the hierarchical graph structure. To enable a global representation, the PAD layer aggregates information from all levels, incorporating data from different scales. This aggregated representation is then transmitted back to each level to reinforce the local context, refining the model’s capacity to capture both local and global relationships in the data.

    In the PAD layer, the aggregation operation is described as:

    $$\:{H}_{global}=Aggregate\left(\bigcup\:_{l=1}^{L}{H}_{l}^{\left({K}_{l}\right)}\right)$$
    (27)

    where \({H}_{global}\) is the global representation combined across all levels \(l=1,...,L\), and \({H}_{l}^{\left({K}_{l}\right)}\) is the final node embedding at level l. The Aggregate(·) function can be realized as a sum, mean, or attention-based aggregation over the hierarchical levels.

    The global representation \({H}_{global}\) is then transmitted back to each level to enrich the local embeddings with global context. This can be expressed as:

    $$\:{H}_{l}^{enhanced}={H}_{l}^{\left({K}_{l}\right)}+{W}_{PAD}{H}_{global}$$
    (28)

    where \({H}_{l}^{enhanced}\) denotes the enhanced embeddings at level l, and \({W}_{PAD}\) is a trainable projection that blends the global representation into the local representation at each level.

    These operations allow the PAD layer to generate a representation that strengthens the model’s capacity to grasp both local and global data relationships. The final top-level embeddings of \({G}_{\text{L}}\), which capture both local and global information, are used to generate the final prediction:

    $$\:{h}_{f}=ReLU({W}_{h}\left[{h}_{L};{h}_{t}\right]+{b}_{h})$$
    (29)

    where \({h}_{L}\) is the final embedding produced by the HGRA module, \({h}_{\text{t}}\) is the temporal feature vector, and \({W}_{h}\) and \({b}_{h}\) are the weights and bias of the layer.

    The output from this layer is then passed through a softmax function to construct the probability distribution over the climatic condition classes:

    $$\:{\widehat{y}}_{i}=softmax({W}_{o}{h}_{f}+{b}_{o})$$
    (30)

    where \({W}_{o}\) and \({b}_{o}\) are the weight matrix and bias of the output layer, respectively.

    This organized representation mirrors the natural grouping of the spatial regions and improves the model’s capacity to capture the complex structure of the data. Including this knowledge greatly enhances the representation’s ability to distinguish among different conditions, making the proposal a powerful monitoring tool. Incorporating the PAD layer into the Hierarchical Graph module guarantees that the model can make good use of multi-scale information, which is essential for representing the distributed, complicated character of the underlying processes. Incorporating prior knowledge through a structured graph-based approach is a major step forward in monitoring with climate data.

    To make the model more interpretable, DT-GAM dynamically highlights specific temporal segments of the data, allowing the model to prioritize and emphasize important temporal events such as changes in wave patterns. Because DT-GAM identifies these critical temporal dependencies, analysts can learn which periods in the signals are most indicative of particular weather conditions, shedding light on the temporal dynamics that may correspond to specific states. The HGRA module improves interpretability in the same way by modelling the hierarchical organization of the spatial regions; it captures both localized and global dependencies, allowing the model to reveal significant spatial relationships among regions. Research and operational applications can benefit greatly from HGRA’s hierarchical approach, which can pinpoint the regions or links most affected by specific conditions. Both DT-GAM and HGRA are built to scale, so they can adapt to operational contexts. The model’s modular design makes it easy to modify the attention mechanisms and graph layers, which makes it more practical to use. The robustness and scalability of the model, together with the interpretability provided by DT-GAM and HGRA, position it for widespread use in monitoring.

    To theoretically investigate the learned connection networks, network dynamics and regional interactions are used as a basis. A connectivity matrix representing the reference functional network is defined as

    $$\:{C}_{ij}=corr({A}_{i},{A}_{j})$$
    (31)

    where \({C}_{ij}\) denotes the connectivity between regions i and j, and \({A}_{i}\) and \({A}_{j}\) characterize the activation (observation) patterns in these regions. Similarly, the HGRA module mimics such interactions using a learned weight matrix:

    $$\:{W}_{ij}=f\left({x}_{i},{x}_{j},\theta\:\right),$$
    (32)

    where \({x}_{i}\) and \({x}_{j}\) are the features that nodes i and j provide as input, f is the learned interaction function, and θ denotes the module parameters.

    To statistically evaluate how closely the HGRA module resembles the reference networks, a graph similarity metric is introduced.

    $$\:S=\frac{{\sum\:}_{i,j}\left({C}_{ij}.{W}_{ij}\right)}{\sqrt{\sum\:_{i,j}{C}_{ij}^{2}}.\sqrt{\sum\:_{i,j}{W}_{ij}^{2}}}$$
    (33)

    where S ∈ [0, 1] is a normalized similarity measure; the topologies of the two networks converge as S approaches 1.

    To mimic the operation of the reference network, the HGRA module adjusts its parameters. The dynamic learning process minimizes a loss function that reduces the discrepancy between the learned weight matrix W and the reference connectivity matrix C:

    $$\:{\mathcal{L}} = \left\| {C - W} \right\|_{F}^{2}$$
    (34)

    This optimization procedure guarantees that the HGRA unit adapts to the architecture of the reference network. The modularity of the networks can also be evaluated quantitatively with the modularity score:

    $$\:Q=\frac{{\sum\:}_{i,j}\left[{W}_{ij}-\frac{{k}_{i}{k}_{j}}{2m}\right]\delta\:\left({g}_{i},{g}_{j}\right)}{2m}$$
    (35)

    where \({k}_{\text{i}}\) and \({k}_{\text{j}}\) denote the degrees of nodes i and j, m is the total number of edges, and \(\delta \left({g}_{i},{g}_{j}\right)\) indicates whether nodes i and j belong to the same module.

    Theoretical analysis shows that the HGRA component appropriately models multi-scale structures that follow the dynamic connections observed in the reference networks. Using the modularity metric Q and the graph similarity index S, the learned networks can be verified quantitatively. This framework establishes a firm foundation for understanding the HGRA module and enhances its interpretability. A minimal sketch of the similarity and modularity computations is given below.
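    The sketch below computes the graph similarity index of Eq. (33) and the modularity score of Eq. (35) for a learned weight matrix W against a reference connectivity matrix C; the module assignment vector g is assumed to be given.

```python
# Illustrative sketch of Eq. (33) and Eq. (35).
import numpy as np

def graph_similarity(C, W):
    # Eq. (33): normalized inner product of the two connectivity matrices
    return np.sum(C * W) / (np.sqrt(np.sum(C**2)) * np.sqrt(np.sum(W**2)))

def modularity(W, g):
    # Eq. (35): modularity of weight matrix W under the partition g
    k = W.sum(axis=1)                    # node degrees
    m = W.sum() / 2.0                    # total edge weight
    same = (g[:, None] == g[None, :])    # delta(g_i, g_j)
    return np.sum((W - np.outer(k, k) / (2 * m)) * same) / (2 * m)
```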

  3. C)

    Spatial-Temporal Fusion Module (STFM).

    The proposed model integrates and harmonizes the spatial and temporal features collected from the data using the Spatial-Temporal Fusion Module. This module must capture the intricate relationships across regions over time for the model to recognize conditions accurately. The spatial properties of the data, which include the connections and linkages between different regions, are expressed by the connectivity between channels; the temporal features, on the other hand, document the changing patterns of activity over time. The STFM combines spatial and temporal data to generate a comprehensive representation that exploits both types of information. It takes as input the output of the HGRA module, which provides a detailed spatial picture, and the output of the Dynamic Temporal Graph Attention Mechanism, which provides the temporal characteristics: the temporal feature vector \({h}_{t}\) and the spatial feature vector \({h}_{s}\). A concatenation-based approach followed by a fully connected layer is used by the STFM to combine these features. The concatenation operation is:

    $$\:{h}_{st}=\left[{h}_{t};{h}_{s}\right]$$
    (36)

    where \(\left[{h}_{t};{h}_{s}\right]\) denotes the concatenation of the temporal feature vector \({h}_{t}\) and the spatial feature vector \({h}_{s}\). To further combine the temporal and spatial information and build a fused representation suitable for classification, this concatenated vector is fed into a fully connected layer that jointly models dynamic temporal patterns and spatial linkages between regions. The procedure is defined as:

    $$\:{h}_{f}=ReLU({W}_{f}{h}_{st}+{b}_{f})$$
    (37)

    Here, \({W}_{f}\) and \({b}_{f}\) are the trainable weights and bias of the fusion layer, and the rectified linear unit ReLU(·) introduces non-linearity. The output \({h}_{f}\) combines the spatial and temporal dimensions of the data into a single feature vector. This fused representation allows the model to account for temporal dynamics shaped by the spatial structure and for changes in spatial configurations over time, which is essential for capturing the underlying complex, non-linear relationships. Finally, a softmax output layer is applied to the fused feature vector \({h}_{f}\):

    $$\widehat{y}_{i}=\mathrm{softmax}\left(W_{o}h_{f}+b_{o}\right)$$
    (38)

    where \(W_{o}\) and \(b_{o}\) are the weight matrix and bias vector of the output layer, and \(\widehat{y}_{i}\) is the predicted probability distribution over the output classes. Hence, the STFM is fundamental to the proposed model, ensuring that the classifier effectively exploits both spatial and temporal information. This fusion not only makes the model’s predictions more interpretable but also provides insight into how the data are shaped by temporal interactions between regions. The STFM-based design is therefore well suited to applications that require an understanding of the complex, dynamic behavior underlying time-series states. A minimal numerical sketch of Eqs. (36)–(38) follows below.
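    As a complement to Eqs. (36)–(38), the following MATLAB sketch shows the concatenation, fused fully connected layer, and softmax output for a single sample. The feature dimensions, number of classes, and randomly initialized parameters are assumptions for illustration, not the trained values.

    ```matlab
    % Minimal numerical sketch of the STFM fusion in Eqs. (36)-(38);
    % sizes and parameters below are hypothetical.

    d_t = 64;  d_s = 64;  d_f = 128;  n_classes = 4;   % assumed sizes

    h_t = randn(d_t, 1);                     % temporal features (from DT-GAM)
    h_s = randn(d_s, 1);                     % spatial features (from HGRA)

    W_f = 0.01 * randn(d_f, d_t + d_s);  b_f = zeros(d_f, 1);       % fusion layer
    W_o = 0.01 * randn(n_classes, d_f);  b_o = zeros(n_classes, 1); % output layer

    h_st  = [h_t; h_s];                      % Eq. (36): concatenation
    h_f   = max(W_f * h_st + b_f, 0);        % Eq. (37): fully connected + ReLU
    z     = W_o * h_f + b_o;                 % output logits
    y_hat = exp(z - max(z)) ./ sum(exp(z - max(z)));   % Eq. (38): softmax

    fprintf('Predicted class: %d\n', find(y_hat == max(y_hat), 1));
    ```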

Results and discussion

The research was conducted using MATLAB’s Deep Learning Toolbox, and the proposed model was developed on MATLAB version 2014a. An NVIDIA Quadro P4000 GPU with 8 GB RAM was utilized for testing and training purposes. The assessment process employed a 10-fold cross-validation technique, which involves splitting the benchmark datasets into training and test sets. This approach allows for a comprehensive evaluation of the proposed model’s performance.
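To make the evaluation protocol concrete, the sketch below shows one way the 10-fold cross-validation could be set up in MATLAB. The synthetic data and the simple majority-class stand-in for the proposed network are placeholders, not the actual experimental pipeline; cvpartition requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch of a 10-fold cross-validation loop with placeholder data and a
% trivial stand-in predictor; illustrative only.

X = randn(100, 12);   y = randi(3, 100, 1);   % synthetic placeholder data
cv  = cvpartition(numel(y), 'KFold', 10);     % 10-fold split
acc = zeros(cv.NumTestSets, 1);

for k = 1:cv.NumTestSets
    trIdx = training(cv, k);                  % logical training indices
    teIdx = test(cv, k);                      % logical test indices
    yPred = mode(y(trIdx)) * ones(sum(teIdx), 1);  % majority-class stand-in
    acc(k) = mean(yPred == y(teIdx));         % per-fold accuracy
end
fprintf('Mean 10-fold accuracy: %.4f\n', mean(acc));
```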

Several hyperparameters were configured to optimize the architecture during the prediction process. These hyperparameters include epochs, learning rate, dropout, and batch size. Proper tuning of these parameters is essential to achieve optimal performance from the proposed architecture.
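As an illustration of how such hyperparameters are specified in MATLAB’s Deep Learning Toolbox, the snippet below configures a set of training options; the specific values are assumptions and not the tuned settings used for the reported results.

```matlab
% Illustrative hyperparameter configuration; values are assumed.
opts = trainingOptions('adam', ...
    'MaxEpochs',        100, ...      % number of training epochs
    'InitialLearnRate', 1e-3, ...     % learning rate
    'MiniBatchSize',    32, ...       % batch size
    'Shuffle',          'every-epoch', ...
    'Verbose',          false);
% Dropout is applied inside the network definition, e.g. dropoutLayer(0.3).
```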

Validation analysis of the proposed model in terms of year-wise classification

In this section, the performance of the proposed model is compared with models such as RF-LSTM-XGBoost20, cGAN23, CNN + LSTM25, and MC-LSTM29 using various metrics through a year-wise analysis. Figures 6, 7 and 8 provide a comparative analysis of these models; the reported values are averaged results, since the baseline models were re-implemented in this work on the considered dataset.

Fig. 6
figure 6

Accuracy and precision analysis of various models (year-wise).

Fig. 7
figure 7

Recall and F1-score comparison of the proposed model with existing models (year-wise).

Fig. 8
figure 8

Timing analysis of the proposed model compared with existing techniques.

Figure 6 presents the performance of various models in terms of accuracy and precision across the years 2014, 2015, and 2016. The models evaluated include RF-LSTM-XGBoost, cGAN, CNN + LSTM, MC-LSTM, and the Proposed model. In 2014, the RF-LSTM-XGBoost model achieved an accuracy of 0.8398 and a precision of 0.8064, while the Proposed model demonstrated superior performance with an accuracy of 0.9494 and a precision of 0.9385. This trend continues into 2015, where the Proposed model outperforms the others, achieving an accuracy of 0.9975 and a precision of 0.9888, significantly surpassing the next best model, MC-LSTM, which achieved an accuracy of 0.9753 and a precision of 0.9663. Similarly, in 2016, the Proposed model maintains the highest performance with an accuracy of 0.9814 and a precision of 0.9839, followed by MC-LSTM with an accuracy of 0.9406 and a precision of 0.9711. This analysis underscores the consistent superiority of the Proposed model over the years in both accuracy and precision metrics.

Figure 7 displays the recall and F1 score performance of various models across the years 2014, 2015, and 2016. The models evaluated include RF-LSTM-XGBoost, cGAN, CNN + LSTM, MC-LSTM, and the Proposed model. In 2014, the RF-LSTM-XGBoost model achieved a recall of 0.7865 and an F1 score of 0.8263, while the Proposed model delivered superior results with a recall of 0.9081 and an F1 score of 0.9018. This trend continued in 2015, where the Proposed model once again outperformed the others, achieving a recall of 0.8969 and an F1 score of 0.9395, surpassing MC-LSTM, which achieved a recall of 0.8711 and an F1 score of 0.9121. Similarly, in 2016, the Proposed model maintained its lead with a recall of 0.9798 and an F1 score of 0.9868, significantly outperforming other models, including MC-LSTM, which had a recall score of 0.9655. This analysis highlights the consistent superiority of the Proposed model across all years in terms of recall and F1 score metrics.

Figure 8 presents the training and testing times for different models across the years 2014, 2015, and 2016. For the RF-LSTM-XGBoost model, the training time decreased from 63.33 s in 2014 to 48.88 s in 2016, while the testing time also improved, dropping from 14.47 s in 2014 to 9.55 s in 2016. Similarly, the cGAN model showed a reduction in both training and testing times over the years, with the training time decreasing from 67.70 s in 2014 to 46.92 s in 2016 and the testing time from 13.41 s in 2014 to 3.90 s in 2016. The CNN + LSTM model followed a similar trend, with training time decreasing from 68.10 s in 2014 to 57.39 s in 2016 and testing time dropping from 12.18 s in 2014 to 3.05 s in 2016. The MC-LSTM model also showed reductions in both training and testing times. However, the Proposed model consistently outperformed all other models across the years. The Proposed model achieved the lowest training times, starting at 59.83 s in 2014 and decreasing to 44.68 s in 2016, while its testing time reduced from 10.10 s in 2014 to 4.90 s in 2016. These results demonstrate the efficiency of the Proposed model in both training and testing compared to other models.

To ensure the robustness of our model’s performance improvements, we conducted statistical significance tests to compare the proposed model with existing methods. Specifically, we applied the paired t-test and Wilcoxon signed-rank test to the accuracy, precision, recall, and F1-score across different years (2014, 2015, and 2016). The results are summarized in Table 1. The p-values obtained from both tests were found to be below the standard threshold (p < 0.05) for all performance metrics, indicating that the performance differences between the proposed model and baseline models are statistically significant.
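The sketch below indicates how the paired t-test and Wilcoxon signed-rank test can be run in MATLAB on year-wise metric values. The proposed model’s accuracies are taken from the values reported for Fig. 6, while the baseline accuracies are hypothetical placeholders rather than the figures underlying Table 1; both tests require the Statistics and Machine Learning Toolbox.

```matlab
% Paired significance tests on year-wise accuracy (2014-2016).
acc_proposed = [0.9494 0.9975 0.9814];     % proposed model (from Fig. 6)
acc_baseline = [0.8398 0.9000 0.9100];     % hypothetical baseline accuracies

[~, p_ttest] = ttest(acc_proposed, acc_baseline);      % paired t-test
p_wilcoxon   = signrank(acc_proposed, acc_baseline);   % Wilcoxon signed-rank test

fprintf('Paired t-test p = %.4f, Wilcoxon p = %.4f\n', p_ttest, p_wilcoxon);
```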

Table 1 Statistical significance analysis.

From the results, the statistical tests validate that the improvements achieved by our proposed model are significant across all performance metrics.

Analysis of proposed model on training and testing data ratio

Figures 9, 10 and 11 present the experimental investigation of the proposed model compared with existing techniques using different training-testing ratios (60%-40%, 70%-30%, and 80%-20%). This analysis aims to demonstrate the influence of the amount of training data on the detection of climatic changes; a sketch of the corresponding hold-out splits is given below.
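For reference, the snippet below shows how the three training-testing ratios can be generated with hold-out partitions in MATLAB; the data are synthetic placeholders and the loop only reports the resulting split sizes. cvpartition requires the Statistics and Machine Learning Toolbox.

```matlab
% Hold-out splits for 60%-40%, 70%-30%, and 80%-20% evaluation.
X = randn(100, 12);   y = randi(3, 100, 1);  % synthetic placeholder data
testFractions = [0.40 0.30 0.20];            % fraction of samples held out

for f = testFractions
    cv  = cvpartition(numel(y), 'HoldOut', f);
    Xtr = X(training(cv), :);   ytr = y(training(cv));
    Xte = X(test(cv), :);       yte = y(test(cv));
    fprintf('Train %d%% / Test %d%%: %d training and %d test samples\n', ...
            round(100 * (1 - f)), round(100 * f), numel(ytr), numel(yte));
end
```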

Fig. 9
figure 9

Accuracy and precision analysis of various models across training-testing ratios.

Fig. 10
figure 10

Comparative analysis of the proposed model with existing techniques (recall and F1 score).

Fig. 11
figure 11

Timing analysis.

Figure 9 illustrates the accuracy and precision metrics for different models at varying training and testing splits. The RF-LSTM-XGBoost model achieved an accuracy of 0.8133 and a precision of 0.7899 at a 60%-40% split, with its accuracy and precision increasing to 0.9697 and 0.9032, respectively, at an 80%-20% split. The cGAN model performed with an accuracy of 0.8278 and a precision of 0.8133 at a 60%-40% split, achieving a maximum accuracy of 0.9118 and a precision of 0.9199 at an 80%-20% split.

The CNN + LSTM model showed consistent performance across all splits, with its accuracy and precision increasing from 0.9053 and 0.8983 at a 60%-40% split to 0.9703 and 0.9747, respectively, at an 80%-20% split. Similarly, the MC-LSTM model demonstrated an improvement in performance, achieving an accuracy of 0.8986 and a precision of 0.8899 at a 60%-40% split, which rose to 0.9797 and 0.9839, respectively, at an 80%-20% split.

The Proposed model consistently outperformed all others, achieving the highest accuracy and precision values across all splits: 0.9079 and 0.8918 at a 60%-40% split, 0.9708 and 0.9876 at a 70%-30% split, and 0.9812 and 0.9910 at an 80%-20% split.

Figure 10 displays the recall and F1 scores for different models at three distinct data splits: 60%-40%, 70%-30%, and 80%-20%. The RF-LSTM-XGBoost model starts with a recall of 0.7799 and an F1 score of 0.7848 at the 60%-40% split, improving to a recall of 0.8233 and an F1 score of 0.8622 at the 80%-20% split. The cGAN model shows a recall of 0.7915 and an F1 score of 0.7786 at the 60%-40% split, reaching a maximum recall of 0.8337 and an F1 score of 0.8522 at the 80%-20% split. The CNN + LSTM model starts with a recall of 0.8019 and an F1 score of 0.7917 at the 60%-40% split, with performance improving to a recall of 0.8599 and an F1 score of 0.8764 at the 80%-20% split. The MC-LSTM model shows consistent improvement across all splits, achieving a recall of 0.8132 and an F1 score of 0.8399 at the 60%-40% split and increasing to a recall of 0.8637 and an F1 score of 0.8998 at the 80%-20% split. The Proposed model consistently outperforms the others, achieving a recall of 0.8340 and an F1 score of 0.8536 at the 60%-40% split, a recall of 0.8643 and an F1 score of 0.9060 at the 70%-30% split, and a recall of 0.8971 and an F1 score of 0.9271 at the 80%-20% split.

Training and testing times for different models across three data splits—60%-40%, 70%-30%, and 80%-20%—are shown in Fig. 11. For the RF-LSTM-XGBoost model, training takes 61.74 s and testing takes 13.90 s at the 60%-40% split. At the 80%-20% split, training takes 54.80 s, and testing takes 15.90 s. For the cGAN model, training takes 67.20 s and testing takes 12.90 s at the 60%-40% split. At the 80%-20% split, training takes 56.64 s, and testing takes 13.82 s. For the CNN + LSTM model, training takes 68.90 s and testing takes 14.37 s at the 60%-40% split. At the 80%-20% split, training takes 56.10 s, and testing takes 11.89 s. For the MC-LSTM model, training takes 69.30 s and testing takes 13.98 s at the 60%-40% split. At the 80%-20% split, training takes 58.39 s and testing takes 9.97 s. The proposed model is the most time-efficient. Training takes 58.45 s and testing takes 11.35 s at the 60%-40% split. At the 80%-20% split, training takes 51.27 s, and testing takes only 4.44 s, significantly less than the other models. This analysis demonstrates that the proposed model is more efficient than the other models in terms of both training and testing times.

Comparative analysis of model performance

Recent studies have explored Graph Neural Networks (GNNs) for climate forecasting, leveraging spatial dependencies in climate data. For example, the study in2 proposed a GNN-based model that improves regional weather forecasting but lacks temporal attention mechanisms, limiting its ability to adapt to evolving climate patterns. Similarly, the work in3 introduced an Attention-GCN model, which enhances spatial representations but faces scalability challenges when processing long-term climate trends. Unlike these approaches, our model integrates a Hierarchical Graph Representation and Analysis (HGRA) module, which effectively captures multi-scale spatial dependencies, while the Dynamic Temporal Graph Attention Mechanism (DT-GAM) dynamically models temporal variations, resulting in higher accuracy and improved generalization, as presented in Table 2.

Table 2 Performance comparison with GNN-based models.

The results clearly indicate that our hierarchical graph-based Transformer model outperforms existing GNN-based approaches by achieving higher accuracy, improved recall, and better scalability while effectively handling spatial and temporal dependencies. The integration of HGRA and DT-GAM allows the model to better adapt to climate variations compared to static or single-layer graph-based models.

Discussion

The results demonstrate the superior performance of the proposed model across various metrics, including accuracy, precision, recall, and F1 score, when compared to existing models such as RF-LSTM-XGBoost, cGAN, CNN + LSTM, and MC-LSTM. The model’s ability to efficiently integrate spatiotemporal features using its hybrid WOA-TTA optimization framework and Transformer-based architecture contributed significantly to its effectiveness. Additionally, the proposed model achieved the lowest training and testing times across all data splits, highlighting its computational efficiency and suitability for real-time applications.

The consistent improvements observed in accuracy and precision with increasing training data emphasize the model’s scalability and adaptability to larger datasets. Furthermore, the ability to balance global and local search through WOA-TTA ensured robust performance and avoided local optima. These findings indicate that the proposed model is well-suited for addressing the challenges of climatic change detection and other time-sensitive predictive tasks.

Potential applications & societal impact

The proposed model has significant implications for disaster preparedness, agriculture, and energy management. By accurately forecasting temperature trends, rainfall intensity, and extreme weather events, the model aids in flood prediction, drought monitoring, and heatwave preparedness. For instance, early flood forecasting requires precise precipitation and temperature predictions to model river inflows and reservoir levels. Our model can generate 7–10-day flood risk projections, enabling authorities to issue timely warnings, optimize water management, and coordinate emergency response efforts. This is particularly critical for coastal and flood-prone regions, where advance warnings can mitigate economic losses and save lives. Additionally, improved climate predictions support precision agriculture, helping farmers optimize irrigation schedules and reduce crop losses due to unpredictable weather patterns.

Limitations

While the proposed Transformer-based hierarchical graph model has demonstrated high accuracy and computational efficiency in climate forecasting, certain limitations must be acknowledged. The model relies on high-quality, artifact-free climate data, which may not always be available in real-world scenarios. Noisy, incomplete, or biased datasets can negatively impact prediction accuracy, limiting the model’s applicability in regions with sparse or unreliable data. Future work will focus on integrating adaptive noise-handling techniques and data augmentation strategies to enhance the model’s resilience to real-world variations.

Additionally, while our model achieves computational efficiency improvements over existing methods, large-scale deployment on multi-region, high-resolution climate datasets remains computationally intensive due to the complexity of hierarchical graph processing and attention-based feature extraction. To address this, future research will explore model pruning, quantization, and distributed training on cloud-based platforms, ensuring improved scalability and real-time inference capabilities. Furthermore, the model’s regional training scope (Delhi dataset) may limit generalizability to diverse climatic regions. Future studies will extend the dataset to coastal, arid, and mountainous regions to validate the model’s adaptability across different climate patterns.

Conclusion and future work

This study demonstrated that the Transformer-based model, integrating Dynamic Temporal Graph Attention Mechanism (DT-GAM), Hierarchical Graph Representation and Analysis (HGRA), and Spatial-Temporal Fusion Module (STFM), effectively predicts daily climate variables, particularly temperature, using historical climate data from Delhi. The hybrid feature selection method (HWOA-TTA), which combines the Tiki-Taka Algorithm (TTA) and Whale Optimization Algorithm (WOA), was employed to optimize input features, ensuring improved computational efficiency. The rigorous preprocessing stages, including data cleaning, normalization, and sequence construction, contributed to the reliability and precision of the model’s predictions. Experimental evaluations demonstrated that the proposed approach consistently outperformed baseline models across multiple metrics, underscoring its efficacy in climate forecasting.

Despite these promising results, a notable limitation of this study is the use of data from a single geographical region (Delhi). Climate patterns vary significantly across different locations due to factors such as topography, humidity, and atmospheric circulation. Consequently, models trained on a single-region dataset may not fully generalize to other geographical areas with diverse climatic behaviors. Future work will address this limitation by expanding the study to multiple regions, including coastal, arid, and mountainous areas, to assess the model’s adaptability and robustness across varying climate conditions. Also, the model’s performance is dependent on high-quality, artifact-free data, which may not always be available in real-world scenarios.

Future research will focus on enhancing model resilience by integrating adaptive layers and data augmentation techniques to mitigate the impact of noise and inconsistencies in climate datasets. These improvements will allow the model to handle diverse and complex climate data more effectively, bridging the gap between theoretical advancements and practical applications. By addressing these challenges, this research aims to further strengthen the role of deep learning in climate science, contributing to enhanced disaster preparedness, more reliable climate predictions, and informed decision-making in climate-sensitive sectors. Future work will also focus on improving the scalability and deployment feasibility of the proposed model. This includes testing on larger and multi-region datasets, optimizing computational efficiency for real-time predictions, and implementing techniques such as distributed training and cloud-based deployment to ensure practical applicability.