Introduction

With the rapid acceleration of urbanization, intelligent transportation systems (ITS) play a crucial role in alleviating traffic congestion, optimizing resource allocation, and enhancing travel efficiency. As one of the core technologies of ITS, traffic flow prediction leverages historical traffic data to accurately forecast future traffic volumes, thereby providing essential support for traffic planning and management1. Currently, the primary challenge in traffic flow prediction lies in effectively extracting spatial-temporal features so as to improve prediction accuracy and better support traffic management and planning.

Early studies predominantly employed time series analysis models, such as the Autoregressive Integrated Moving Average (ARIMA)2 and Vector Autoregressive (VAR) models3, to forecast traffic flow. However, these models perform poorly in complex traffic flow forecasting due to their inability to capture nonlinear characteristics and leverage the spatial properties of urban networks. With advances in deep learning, Graph Neural Networks (GNNs) and self-attention mechanisms have gradually emerged as mainstream methods for traffic flow prediction4,5,6. GNN-based methods typically focus on local structures to capture dependencies between neighboring nodes. Nevertheless, in urban traffic systems, local modeling often fails to fully reflect the dynamics of overall traffic flow7. In contrast, the self-attention mechanism supports global modeling through fully connected interactions that dynamically adjust inter-node relationship weights to capture dependencies across the entire network. However, due to the spatial heterogeneity of transportation networks, this fully connected approach may fail to capture regional differences effectively. For example, nearby node pairs may exhibit different traffic trends due to variations in functional areas, while nodes farther apart but sharing similar functions may exhibit similar traffic patterns. As illustrated in Fig. 1a, sensors A, B, and C are deployed along the road, with sensors A and B positioned adjacently, while sensor C is situated at a non-adjacent location. From a spatial correlation perspective, sensors A and B would be expected to exhibit similar traffic flow trends, whereas A and C might differ. However, due to urban zoning regulations, sensors A and C, both located in non-residential areas, exhibit more similar traffic fluctuations, whereas sensor B, situated in a residential zone, displays markedly different traffic patterns from A.

From a temporal perspective, traffic flow data typically exhibits pronounced multi-scale characteristics. As shown in Fig. 1b, the traffic flow sequence from sensor D involves multiple complex temporal patterns, including prominent long-term trends and significant short-term fluctuations5. Long-term trends generally reflect daily, weekly, and seasonal periodicities in travel behaviors, whereas short-term fluctuations result mainly from unexpected events or sudden congestion. However, existing traffic flow forecasting methods usually focus merely on single-scale temporal features8,9,10,11,12. Although stacked convolutional networks13,14 can capture both long-term and short-term temporal dependencies to some extent, they rely on a single temporal scale and fail to differentiate between multi-scale temporal characteristics. This limitation restricts the model’s ability to effectively learn multi-scale temporal features.

Fig. 1 The findings about traffic data.

To address the above issues, this paper proposes a Multi-Scale Spatial-Temporal Transformer (MSSTFormer) for traffic flow prediction. In the data embedding layer, we introduce a gating mechanism to filter redundant information and enhance the quality of the input data. Moreover, we design a two-stage spatial attention module, which integrates global spatial modeling with a key-node enhancement strategy. Additionally, a frequency dual-channel attention module is proposed, which decouples low-frequency and high-frequency temporal features and independently models long-term trends and short-term fluctuations, enhancing the model’s capability to capture complex temporal dependencies. The key contributions of this study are summarized as follows:

  • This paper proposes the MSSTFormer model, which integrates a spatial-temporal self-attention mechanism for traffic prediction. It effectively addresses the challenges of spatial heterogeneity and multi-scale temporal dynamics.

  • We introduce a gating mechanism into the data embedding stage to suppress redundant information and improve input quality.

  • A two-stage spatial attention module is designed to capture global spatial dependencies during the spatial feature extraction phase and reinforce interactions among strongly correlated nodes via a key mask matrix, thereby capturing the complex relationships between global and local spatial features.

  • A frequency dual-channel temporal attention module is proposed, decoupling traffic flow time series into low-frequency and high-frequency components to enhance the model’s ability to learn temporal features, enabling more effective differentiation between long-term trends and short-term fluctuations.

  • MSSTFormer has been extensively evaluated on four real-world road traffic datasets, demonstrating superior performance compared to most state-of-the-art baseline models.

Related works

Traditional traffic forecasting

Traditional traffic flow prediction methods can be categorized into statistical methods, machine learning methods, and deep learning methods. These approaches analyze traffic flow variations to support traffic management and planning.

Statistical methods were dominant in early traffic flow prediction research. They treated traffic flow as a linear problem and used fixed theoretical models for forecasting: regression functions were predefined, their parameters were estimated from raw data, and predictions were made with the fitted functions, as in time series models15 and Kalman filter models16. Although these models are computationally simple, they cannot handle the nonlinear dynamics of traffic flow and lack robustness when faced with sudden events. With the increasing complexity and nonlinearity of traffic flow data, researchers have gradually shifted their focus toward machine learning techniques, which overcome the limitations of statistical models by not requiring strict assumptions about the data distribution and instead using data-driven techniques to uncover underlying patterns. Typical methods include Bayesian networks, Support Vector Machines (SVM)17, and k-nearest neighbors (KNN)18. These methods perform better in complex traffic environments, but they still require substantial manual effort for feature extraction.

With the rapid development of deep learning, many new methods have started to explore how to extract richer features from time-series data and effectively model spatial-temporal dependencies. However, despite significant progress in spatial-temporal modeling, most methods still rely on traditional time-domain modeling (such as LSTM and GRU). While these methods can capture long-term dependencies in time series, they typically learn direct relationships between time steps, lacking explicit modeling of frequency components. For instance, Huang et al. proposed a method combining deep belief networks to automatically extract high-level features from traffic data19, thus eliminating the need for manually defined features inherent in traditional methods. Lv et al. applied sparse autoencoders to traffic flow prediction20, which automatically extract key features while maintaining model sparsity. With the advancement of computing hardware, models like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) have been widely adopted for modeling temporal dependencies21. However, LSTM and GRU models cannot handle predictions with unequal input and output lengths; to address this, Sutskever et al. introduced the Seq2Seq model22. Convolutional Neural Networks (CNNs) capture dependencies in Euclidean space using convolution kernels, and the Temporal Convolutional Network (TCN) enhances temporal feature extraction23. As research progresses, multi-model fusion has become more prominent. Wu et al. proposed a deep learning framework that combines CNN and LSTM24. Yu et al. introduced the Spatial-Temporal Recursive Convolutional Network (SRCN), which integrates the advantages of deep convolutional neural networks (DCNNs) and LSTM networks and is capable of simultaneously predicting both long-term and short-term traffic conditions25.
Despite their effectiveness in capturing spatial-temporal characteristics, most methods are designed for Euclidean data and face challenges in adapting to complex topological relationships in traffic networks. At the same time, researchers have gradually recognized that time series data inherently possess multi-scale features, and learning multi-scale temporal features can significantly enhance the model’s ability to capture the dynamics of time series. For example, Chen et al. proposed a multi-scale convolutional network (MS-CNN) to capture temporal information at different scales, which was successfully applied to time series classification tasks26. Chen et al. also proposed a time-aware multi-scale recurrent neural network (MS-RNN), which can adaptively select the most important temporal scale features27. Zhong et al. introduced the multi-scale decomposition MLP-Mixer, which explicitly decomposes the input time series into different components for analysis. Although these methods effectively capture multi-scale features, they are typically restricted to time-domain modeling and rely on predefined scales, which constrains them when handling dynamically changing multi-scale time series data.

Graph convolutional networks

Graph Convolutional Networks effectively model the spatial structure of traffic network data by sharing information between nodes via graph convolution operations, improving traffic flow prediction accuracy. Zhao et al. proposed Temporal Graph Convolutional Networks (T-GCN)9, combining GCN with GRU to capture spatial and temporal dependencies. Bai et al. introduced an Adaptive Graph Convolutional Recurrent Network (AGCRN)28. This model learns node-specific patterns and infers dependencies in traffic sequences automatically. Li et al. proposed trajectory-based Graph Neural Networks (TrGNN)29, which integrate vehicle trajectory data and environmental information into the road graph network. Although these studies recognize the importance of capturing both spatial and temporal features, most focus on spatial dependencies and overlook dynamic dependencies that evolve over time.

In recent years, with increasing focus on dynamic spatial-temporal features, several new methods have emerged. Wu et al. introduced Graph WaveNet30, which dynamically learns the adjacency matrix to capture real spatial dependencies without relying on a fixed graph structure. Peng et al. proposed Dynamic Graph Convolutional Networks (DGCN)31, which dynamically update the graph structure. This captures spatial dependencies at different time points and under varying traffic conditions. Huang et al. presented Time-Varying Graph Convolutional Networks (TVGCN)32, which capture spatial-temporal features at multiple scales and dynamically update the graph structure. With the rise of multi-view learning, more researchers have applied this approach to traffic flow prediction. For example, Bai et al. proposed Multi-Task Synchronous Graph Neural Networks (MTSGNN)33, which learn dynamic spatial dependencies across different tasks simultaneously. This enhances the prediction of transitions between regions. Wang et al. introduced Multi-View Bidirectional Spatial-Temporal Graph Neural Networks (BiSTGN)34, which construct three spatial-temporal graph sequences from different time-related perspectives, enhancing the model’s ability to capture multi-dimensional temporal information. Huang et al. further developed Multi-View Dynamic Graph Convolutional Networks (MV-DGCN)35, combining multiple views and dynamic graph convolutional networks (DGCN) to capture spatial-temporal features and adapt to dynamic changes in traffic flow. This significantly improves the representation of spatial-temporal features and traffic flow prediction accuracy.

Attention mechanism

The attention mechanism has gained widespread application in recent studies. It has an inherent global receptive field and can capture dynamic dependencies in traffic data effectively. Initially proposed by Vaswani et al.36, the attention mechanism has been widely applied in natural language processing, computer vision, time series forecasting, and other tasks. For instance, Cheng et al. proposed a Graph Multi-Attention Network (GMAN)6, which combines graph neural networks with a multi-head attention mechanism. This model enhances traffic flow prediction accuracy by jointly modeling spatial and temporal dependencies with a multi-scale attention mechanism. Huang et al. introduced a Multi-Relation Synchronous Graph Attention Network (MS-GAT)37, which learns spatial, temporal, and channel interactions in a unified and synchronized manner for traffic coupling. Feng et al. developed an Adaptive Graph Spatial-Temporal Transformer Network (ASTTN)38, which jointly models spatial-temporal correlations and local spatial-temporal attention. This network employs an adaptive graph structure and multi-layer spatial-temporal attention stacking to capture spatial-temporal dependencies. Jiang et al. proposed PDFormer39, introducing geographical and semantic spatial masks into the attention mechanism to capture both short-range and long-range dynamic spatial dependencies. Additionally, it employs a delay-perception feature transformation module to account for propagation delays in real traffic roads. Recently, Li et al. developed DDGFormer40, a model combining a self-attention module with distance and direction awareness. It also employs a dynamic adaptive graph convolution module, enabling more effective capture of dynamic spatial-temporal dependencies.

However, to the best of our knowledge, the aforementioned models do not capture multi-scale temporal dynamics from a frequency-domain perspective. Moreover, these methods have limited ability to capture spatial heterogeneity. To this end, this paper proposes the MSSTFormer model, which separates high and low frequencies from a frequency-domain perspective to capture multi-scale temporal dynamics. Additionally, a two-stage spatial attention mechanism is introduced to better capture dynamic spatial-temporal correlations.

Problem definition

Traffic flow prediction aims to forecast the traffic volume of a transportation system at a future time point, based on historical data. In this study, we represent the road network as a graph \(G = (V, E, A)\), where \(V = \{v_1, \dots , v_N\}\) is a set of N nodes, \(E \subseteq V \times V\) represents the set of edges, and A is the adjacency matrix of the network. In traffic flow modeling, nodes generally represent sensors or monitoring points along the roads. Edges represent the topological connections between roads. Based on this graph, we describe the dynamic traffic flow across the entire road network using a traffic flow tensor. Here \(X_t \in \mathbb {R}^{N \times C}\) represents the traffic flow state of all N nodes at time t. C denotes the feature dimension of the traffic flow. Since this study focuses solely on traffic flow prediction, the feature dimension \(C = 1\).

Given the observed traffic flow tensor X of the transportation system, the goal of traffic flow prediction is to learn a mapping function f. This function maps traffic flow observations from the past T time steps to the future \(T'\) time steps:

$$\begin{aligned} {[}X_{t-T+1}, \dots , X_t; G{]} \xrightarrow {f} [X_{t+1}, \dots , X_{t+T'}] \end{aligned}$$
(1)
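The shapes involved in the mapping \(f\) can be illustrated with a minimal NumPy sketch; the node count below matches PeMS04 but is only a placeholder, and the mapping itself is a naive last-observation baseline rather than the proposed model:

```python
import numpy as np

N, C = 307, 1      # sensors and feature dimension (flow only, so C = 1)
T, T_out = 12, 12  # past window T and forecast horizon T'

X_hist = np.random.rand(T, N, C)   # observed window [X_{t-T+1}, ..., X_t]
A = np.eye(N)                      # adjacency matrix of the road graph G

def f(x_hist, adj):
    # Placeholder mapping: repeat the last observation over the horizon.
    # A real model would condition on the graph structure `adj`.
    return np.repeat(x_hist[-1:], T_out, axis=0)

X_pred = f(X_hist, A)
print(X_pred.shape)  # (12, 307, 1)
```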

Methodology

Figure 2 illustrates the framework of MSSTFormer, comprising a data embedding layer, stacked L spatial-temporal encoder layers, and an output layer. Each module is described in detail in the following sections.

Fig. 2 The framework of MSSTFormer.

Data embedding layer

In traffic flow prediction tasks, the data embedding layer integrates multi-dimensional features to generate a comprehensive representation. To optimize feature fusion, a gating mechanism is introduced to suppress redundant information and highlight key features, thereby improving the quality of the input data. Initially, the raw input X is transformed into a high-dimensional feature \(E_f \in \mathbb {R}^{T \times N \times d}\) via a fully connected layer, where d represents the embedding dimension.

In the temporal dimension, traffic flow exhibits distinct daily and weekly periodic patterns. To capture these patterns, we model the daily and weekly time segments separately: daily periodicity d(t) divides the day into 1440 minute-long time slots, and weekly periodicity w(t) divides the week into 7 days. Additionally, we incorporate traffic flow data from the previous day, which is subsequently transformed via a linear layer to generate the embedding representation of these periodic features, denoted as \(E_t \in \mathbb {R}^{T \times N \times d}\).

Simultaneously, to capture spatial correlations among sensors in the road network, structural information is incorporated. The graph Laplacian eigenvector method is employed to generate a spatial embedding matrix \(E_s \in \mathbb {R}^{N \times d}\). This matrix maps graph node relationships into Euclidean space, thereby capturing spatial dependencies among sensors and preserving the global topology of the road network. Furthermore, a time position encoding \(E_p \in \mathbb {R}^{T \times d}\) is introduced to maintain the invariance of the temporal positional information. By stacking these embeddings, \(X_E\) is obtained:

$$\begin{aligned} X_E = E_f + E_t + E_s + E_p \end{aligned}$$
(2)

A gating mechanism is introduced with the aim of mitigating redundancy arising from feature fusion. The computation is defined as follows:

$$\begin{aligned} X' = W_c \left( W_a(X_E) \odot \text {silu}(W_b(X_E)) \right) \end{aligned}$$
(3)

where \(W_a\) , \(W_b\) , and \(W_c \in \mathbb {R}^{d \times d}\). \(X'\) is subsequently fed into the spatial-temporal encoder. For convenience, we use X to represent \(X'\) in the following sections.
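Eqs. (2)–(3) can be sketched in NumPy as follows. This is a minimal illustration: the four embeddings are filled with random values (in the model they come from the linear, periodic, Laplacian, and positional encoders described above), and the small weight scale is only to keep activations moderate:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU activation: x * sigmoid(x)

def gated_embedding(X_E, W_a, W_b, W_c):
    """Eq. (3): X' = W_c(W_a(X_E) ⊙ silu(W_b(X_E))), with X_E: (T, N, d)."""
    return (X_E @ W_a * silu(X_E @ W_b)) @ W_c

rng = np.random.default_rng(0)
T, N, d = 12, 4, 8
# Eq. (2): stack the feature, temporal, spatial, and positional embeddings
E_f = rng.standard_normal((T, N, d))
E_t = rng.standard_normal((T, N, d))
E_s = rng.standard_normal((N, d))      # broadcast over the T axis
E_p = rng.standard_normal((T, 1, d))   # broadcast over the N axis
X_E = E_f + E_t + E_s + E_p

W_a, W_b, W_c = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
X_prime = gated_embedding(X_E, W_a, W_b, W_c)
print(X_prime.shape)  # (12, 4, 8)
```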

Spatial-temporal encoder layer

The spatial-temporal encoder layer consists of two main components: a two-stage spatial attention module for spatial feature extraction and a frequency dual-channel attention module for temporal feature extraction, both implemented within a multi-head attention framework.

Two-stage spatial attention module

Traditional spatial self-attention mechanisms assume that each node interacts equally with all other nodes, treating the spatial graph as fully connected. However, in real-world traffic networks, spatial relationships between nodes are heterogeneous. In certain cases, traffic flow patterns between adjacent nodes may differ significantly, while traffic states at remote nodes may exhibit greater similarity. Therefore, focusing exclusively on geographically adjacent nodes might overlook potential remote spatial dependencies. To tackle this challenge, a two-stage spatial attention module is proposed. In the first stage, a global self-attention mechanism is employed to model the entire traffic network, capturing spatial-temporal dependencies across the entire network. In the second stage, a key mask matrix is constructed to reinforce the interactions between node pairs with strong correlations, while further revealing the complex dependency structure between global and local spatial domains.

In the first stage, a global self-attention mechanism is employed to model spatial dependencies within the traffic network, thereby revealing the latent dynamic spatial structures among nodes. At time \(t\), the traffic flow observations are multiplied by the corresponding weight matrices to obtain the query matrix \(Q\), the key matrices \(K\) and \(K'\) (used in the first and second stages, respectively), and the value matrix \(V\):

$$\begin{aligned} \begin{aligned} Q_t^{(S)} = X_{t::} W_Q^S \quad K_t^{(S)} = X_{t::} W_K^S \quad {K'}_t^{(S)} = X_{t::} W_{K'}^S \quad V_t^{(S)} = X_{t::} W_V^S \end{aligned} \end{aligned}$$
(4)

where \(W_Q^S\), \(W_K^S\), \(W_{K'}^S\), and \(W_V^S \in \mathbb {R}^{d \times d'}\) denote learnable parameters, and \(d'\) represents the dimensionality of the query, key, and value matrices.

The dot product between the query and key matrices is computed, followed by normalization. The spatial dependencies, denoted as \(A_t^{(S)}\) (i.e., the attention scores) among all nodes at time \(t\) are computed:

$$\begin{aligned} A_t^{(S)} = \frac{Q_t^{(S)} (K_t^{(S)})^\top }{\sqrt{d'}} \end{aligned}$$
(5)

The computed attention scores are then multiplied by the value matrix to produce the output of the global spatial self-attention module \(G_t^{(S)}\):

$$\begin{aligned} G_t^{(S)} = \text {softmax}(A_t^{(S)}) V_t^{(S)} \end{aligned}$$
(6)

In the second stage, owing to spatial heterogeneity, only a subset of node pairs in traffic networks exhibit significant interactions. Consequently, we propose a key mask matrix that selectively reinforces interactions between geographically proximate and semantically similar nodes.

For each node pair \((i, j)\), we first evaluate their spatial proximity using the road network topology. If the physical distance between them is less than a predefined threshold \(\lambda\) , they are considered spatially connected, and the corresponding mask entry is assigned \(M_{ij} = 1\). Yet spatial dependencies in traffic systems are not solely governed by physical distance. Some distant nodes may still exhibit semantic similarity due to analogous traffic dynamics. For such pairs, we measure traffic similarity using Dynamic Time Warping (DTW)41. We select the \(K\) nodes with the highest similarity scores \(\text {Sim}(i,j)\) for each node as its semantic neighbors, and assign \(M_{ij} = 1\) for these selected pairs. Otherwise, \(M_{ij} = 0\).
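The mask construction above can be sketched as follows. This is a minimal illustration assuming a precomputed road-distance matrix; the plain \(O(T^2)\) DTW below is the textbook recurrence, whereas a practical implementation would likely use a faster variant, and the toy data are placeholders:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(T^2) dynamic time warping distance between two 1-D series."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def build_key_mask(road_dist, series, lam, K):
    """M_ij = 1 if node j is within road distance lam of node i, or if j is
    one of i's K most DTW-similar nodes (semantic neighbor); otherwise 0."""
    N = road_dist.shape[0]
    M = (road_dist < lam).astype(float)
    for i in range(N):
        d = np.array([dtw_distance(series[i], series[j]) for j in range(N)])
        d[i] = np.inf  # a node is not counted as its own semantic neighbor
        for j in np.argsort(d)[:K]:
            M[i, j] = 1.0
    return M

# Toy example: 4 nodes, 12-step series
rng = np.random.default_rng(1)
dist = rng.uniform(0, 10, (4, 4))
series = rng.standard_normal((4, 12))
M = build_key_mask(dist, series, lam=3.0, K=1)
print(M.shape)  # (4, 4)
```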

With the constructed key mask matrix \(M \in \mathbb {R}^{N \times N}\), the spatial attention scores \(A_t'^{(S)}\) are computed as:

$$\begin{aligned} A_t'^{(S)} = \frac{Q_t^{(S)} ({K'}_t^{(S)})^\top }{\sqrt{d'}} \odot M \end{aligned}$$
(7)

where \(\odot\) denotes element-wise multiplication.

Finally, our approach effectively captures the intricate spatial dependencies by merging global information with local details, yielding the output of the two-stage spatial attention module, denoted as \(Z_{s} \in \mathbb {R}^{T \times N \times d_s}\), formulated as:

$$\begin{aligned} Z_s = \text {softmax}(A_t'^{(S)}) G_t^{(S)} \end{aligned}$$
(8)
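Eqs. (4)–(8) can be summarized in a single-head, single-time-step NumPy sketch; the weight shapes and toy inputs are placeholders, and a full implementation would run this over all time steps and attention heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stage_spatial_attention(X_t, M, W_Q, W_K, W_K2, W_V):
    """Sketch of Eqs. (4)-(8) for one time step t.
    X_t: (N, d) node features; M: (N, N) key mask; W_*: (d, d')."""
    Q, K, K2, V = X_t @ W_Q, X_t @ W_K, X_t @ W_K2, X_t @ W_V
    d_p = Q.shape[-1]
    # Stage 1: global attention over all node pairs (Eqs. 5-6)
    G = softmax(Q @ K.T / np.sqrt(d_p)) @ V
    # Stage 2: mask-weighted scores reinforce key node pairs (Eqs. 7-8)
    A_masked = (Q @ K2.T / np.sqrt(d_p)) * M
    return softmax(A_masked) @ G

rng = np.random.default_rng(2)
N, d, d_p = 5, 8, 8
X_t = rng.standard_normal((N, d))
M = (rng.uniform(size=(N, N)) > 0.5).astype(float)  # toy key mask
Ws = [rng.standard_normal((d, d_p)) * 0.1 for _ in range(4)]
Z_s = two_stage_spatial_attention(X_t, M, *Ws)
print(Z_s.shape)  # (5, 8)
```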

Frequency dual-channel attention module

Traffic flow data exhibit complex multi-scale temporal dynamics, encompassing long-term trends (e.g., daily, weekly, and seasonal periodicities) and short-term fluctuations (e.g., unexpected events or abrupt congestion). However, most existing time series modeling methods rely on single-scale temporal features, neglecting the interactions between different frequency domains8,10,11.

To better capture such interdependencies, this paper proposes a frequency dual-channel attention module that utilizes the Fourier transform to decompose time series into low-frequency components representing stable periodic trends and high-frequency components corresponding to sudden events. The method adaptively models long-term trends and short-term fluctuations by decoupling the time series data.

For node n, a discrete Fourier transform converts the time series into its frequency-domain representation \(X_n^{(T)}(f)\) as follows:

$$\begin{aligned} X_n^{(T)}(f) = \sum _{t=0}^{T-1} x_n(t) \cdot e^{-i2\pi ft/T}, f \in \{0,1,...,T-1\} \end{aligned}$$
(9)

where \(x_n(t)\) denotes the traffic flow observation at node n at time t, \(X_n^{(T)}(f)\) represents the Fourier-transformed signal in complex form, and T is the length of the time window.

A low-pass filter is applied to extract low-frequency components, preserving periodic elements and long-term trends while removing high-frequency noise:

$$\begin{aligned} X_n^{(T,l)}(f) = {\left\{ \begin{array}{ll} X_n^{(T)}(f),& |f| \le f_c \\ 0,& |f| > f_c \end{array}\right. } \end{aligned}$$
(10)

where \(f_c\) is the cutoff frequency of the low-pass filter, controlling the range of retained low-frequency signals. High-frequency components are obtained by subtracting the low-frequency components from the original signal:

$$\begin{aligned} X_n^{(T,h)}(f) = X_n^{(T)}(f) - X_n^{(T,l)}(f) \end{aligned}$$
(11)
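The decomposition in Eqs. (9)–(11) can be sketched with NumPy's FFT routines; the synthetic series and cutoff below are illustrative, and by construction the low- and high-frequency components sum back to the original signal:

```python
import numpy as np

def frequency_split(x, f_c):
    """Eqs. (9)-(11): decompose a length-T series into low- and
    high-frequency components with an ideal low-pass filter at cutoff f_c."""
    T = len(x)
    X_f = np.fft.fft(x)                           # Eq. (9)
    f = np.fft.fftfreq(T) * T                     # integer frequency indices
    X_low = np.where(np.abs(f) <= f_c, X_f, 0.0)  # Eq. (10): keep |f| <= f_c
    x_low = np.fft.ifft(X_low).real               # back to the time domain
    return x_low, x - x_low                       # Eq. (11)

# A slow daily-like trend plus short-term noise
t = np.arange(288)                        # one day of 5-minute steps
x = np.sin(2 * np.pi * t / 288) + 0.3 * np.random.randn(288)
x_low, x_high = frequency_split(x, f_c=5)
```

The low channel retains the smooth periodic trend while the high channel absorbs the abrupt fluctuations, matching the intended roles of the two attention paths.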

Prior to applying the attention mechanism, the query (Q) and key (K) matrices are computed for both low-frequency and high-frequency features, thereby facilitating effective modeling in the subsequent self-attention mechanism. The low-frequency features for node n are calculated as follows:

$$\begin{aligned} \begin{aligned} Q_n^{(T,l)} = X_{:n:}^{(T,l)} W_Q^l \quad K_n^{(T,l)} = X_{:n:}^{(T,l)} W_K^l\\ \end{aligned} \end{aligned}$$
(12)

Similarly, for the high-frequency components:

$$\begin{aligned} \begin{aligned} Q_n^{(T,h)} = X_{:n:}^{(T,h)} W_Q^h \quad K_n^{(T,h)} = X_{:n:}^{(T,h)} W_K^h \\ \end{aligned} \end{aligned}$$
(13)

where \(W_Q^l, W_K^l, W_Q^h, W_K^h\in \mathbb {R}^{d \times d'}\) denote learnable parameter matrices, and \(X_n^{(T,l)}\) and \(X_n^{(T,h)}\) represent the low-frequency and high-frequency features, respectively.

Following frequency decomposition, a dual-channel self-attention mechanism is applied to model low-frequency and high-frequency features separately. Here, V denotes the value matrix extracted from the time-series data. In the low-frequency path, long-term trends are emphasized by computing the attention matrix to extract low-frequency features:

$$\begin{aligned} A_n^{(T,l)}= \text {Softmax}\left( \frac{Q_n^{(T,l)} K_n^{(T,l)\top }}{\sqrt{d'}}\right) V_n^{(T)} \end{aligned}$$
(14)

In the high-frequency path, short-term fluctuations in the traffic flow are captured:

$$\begin{aligned} A_n^{(T,h)} = \text {Softmax}\left( \frac{Q_n^{(T,h)} K_n^{(T,h)\top }}{\sqrt{d'}}\right) V_n^{(T)} \end{aligned}$$
(15)

Furthermore, a learnable adaptive weighting parameter is introduced into the proposed module to dynamically adjust the contributions of low-frequency and high-frequency features to traffic flow prediction, thereby improving the adaptability of the model across multiple temporal scales. The final output of the frequency dual-channel attention module, denoted as \(Z_{t} \in \mathbb {R}^{T \times N \times d_t}\), is formulated as follows:

$$\begin{aligned} {Z_t} = \lambda \cdot A_n^{(T,l)} + (1-\lambda ) \cdot A_n^{(T,h)} \end{aligned}$$
(16)
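Eqs. (12)–(16) can be condensed into a single-node NumPy sketch; the decomposed inputs, the shared value matrix, and the weight shapes are placeholders, and in the model the fusion weight \(\lambda\) is learned rather than fixed:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_channel_attention(X_low, X_high, V, W_Ql, W_Kl, W_Qh, W_Kh, lam):
    """Sketch of Eqs. (12)-(16) for one node: separate attention over the
    low- and high-frequency channels, fused by the weight lam (Eq. 16)."""
    d_p = W_Ql.shape[1]
    A_low = softmax((X_low @ W_Ql) @ (X_low @ W_Kl).T / np.sqrt(d_p)) @ V
    A_high = softmax((X_high @ W_Qh) @ (X_high @ W_Kh).T / np.sqrt(d_p)) @ V
    return lam * A_low + (1.0 - lam) * A_high

rng = np.random.default_rng(3)
T, d, d_p = 12, 8, 8
X_low, X_high, V = (rng.standard_normal((T, d)) for _ in range(3))
Ws = [rng.standard_normal((d, d_p)) * 0.1 for _ in range(4)]
Z_t = dual_channel_attention(X_low, X_high, V, *Ws, lam=0.6)
print(Z_t.shape)  # (12, 8)
```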

Output layer

Each spatial-temporal encoder module incorporates a residual connection and a \(1 \times 1\) convolution to project the intermediate output \(X_o\) into a residual representation \(X_{sk} \in \mathbb {R}^{T \times N \times d_{sk}}\), where \(d_{sk}\) denotes the residual feature dimension. Then, we obtain the final hidden state \(X_{hid} \in \mathbb {R}^{T \times N \times d_{sk}}\) by summing the outputs of each residual connection layer. For multi-step forecasting, \(X_{hid}\) is passed through an output module composed of two successive \(1 \times 1\) convolutional layers, which refine and map the hidden features to the prediction space:

$$\begin{aligned} \hat{X} = \text {Conv2}(\text {Conv1}(X_{hid})) \end{aligned}$$
(17)

where \(\hat{X} \in \mathbb {R}^{T' \times N \times C}\) denotes the predicted output for \(T'\) future time steps.
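One plausible reading of the output layer can be sketched as follows; since the text does not specify how the two \(1 \times 1\) convolutions map the window length T to the horizon \(T'\), folding the time axis into the second projection is an assumption of this sketch, as are the ReLU and all shapes:

```python
import numpy as np

def output_layer(skip_outputs, W1, b1, W2, b2, T_out):
    """Sketch of Eq. (17): sum the residual outputs, then apply two 1x1
    convolutions. Over a (T, N, d) tensor a 1x1 convolution is a
    per-position linear map; here Conv2 also folds the time axis into T'."""
    X_hid = sum(skip_outputs)                    # (T, N, d_sk)
    H = np.maximum(X_hid @ W1 + b1, 0.0)         # Conv1 with ReLU (assumed)
    T, N, d = H.shape
    out = H.transpose(1, 0, 2).reshape(N, T * d) @ W2 + b2   # Conv2
    return out.reshape(N, T_out, 1).transpose(1, 0, 2)       # (T', N, C)

rng = np.random.default_rng(4)
L, T, T_out, N, d_sk = 3, 12, 12, 5, 16
skips = [rng.standard_normal((T, N, d_sk)) for _ in range(L)]
W1 = rng.standard_normal((d_sk, d_sk)) * 0.1
b1 = np.zeros(d_sk)
W2 = rng.standard_normal((T * d_sk, T_out)) * 0.1
b2 = np.zeros(T_out)
X_hat = output_layer(skips, W1, b1, W2, b2, T_out)
print(X_hat.shape)  # (12, 5, 1)
```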

Experiments

Datasets

The datasets used in this study were obtained from four publicly available datasets provided by the California Department of Transportation’s Highway Performance Monitoring System (PeMS): PeMS03, PeMS04, PeMS07, and PeMS08. These datasets were collected from more than 39,000 traffic detectors deployed along major highways in California’s metropolitan areas. The data is recorded in real time at 30-second intervals and subsequently aggregated into 5-minute intervals, encompassing metrics such as traffic flow, average speed, and lane occupancy. Table 1 presents the details of each dataset.

Table 1 The detailed information of datasets.

Baseline models

The performance of our MSSTFormer model was compared with several baseline models, including time series forecasting models, graph neural network (GNN) models, and attention-based models.

  • HA42: A historical average model that forecasts future traffic as the average of observations from the corresponding historical periods.

  • ARIMA2: A method that models traffic flow as a seasonal ARIMA process, capturing periodic fluctuations and seasonal variations by incorporating seasonal differencing and parameters.

  • VAR3: A model that integrates Granger causality with a vector autoregression framework to analyze causal relationships between nodes and capture the temporal dynamics of network traffic.

  • STGCN5: Utilizes graph structures to model traffic flow by employing graph convolutional networks to capture spatial dependencies and temporal dynamics.

  • ASTGCN10: Incorporates spatial and temporal attention mechanisms to dynamically adjust the weights assigned to nodes and time steps.

  • STSGCN43: Simultaneously captures localized spatial-temporal dependencies through a spatial-temporal synchronous modeling mechanism built on localized spatial-temporal graphs.

  • MTGNN44: Captures spatial-temporal dependencies in multivariate time series through the construction of graph structures using GNNs.

  • STFGNN45: Models spatial dependencies and temporal dynamics separately, sharing information across layers via inter-layer fusion and cross-layer information propagation mechanisms.

  • STGODE46: Utilizes Ordinary Differential Equations (ODEs) to capture spatial-temporal dynamics.

  • DSTAGNN47: Introduces a spatial-temporal-aware mechanism to adaptively model both spatial and temporal dependencies.

  • GDGCN48: GDGCN treats multiple historical time periods as nodes in a graph and employs a dynamic graph builder to model time-varying spatial and temporal relationships.

  • STIDGCN49: Combines dynamic graph convolutional networks with multi-perspective modeling to capture spatial-temporal heterogeneity and reveal dynamic dependencies between nodes through an interactive learning framework.

  • GMAN6: Dynamically adjusts the graph structure and adjacency relationships to capture nonlinear interactions between nodes, employing multi-layer attention mechanisms to model spatial-temporal features.

  • ASTGNN50: This model combines self-attention mechanisms to capture spatial-temporal dynamics, periodicity, and spatial heterogeneity. It effectively models the spatial-temporal relationships in traffic flow through dynamic graph convolution and embedding modules.

  • STID51: Attaches simple spatial and temporal identity embeddings to a multilayer perceptron, showing that distinguishing samples in space and time is key to accurate prediction.

  • PDFormer39: Introduces geographical and semantic spatial attention masks together with a propagation delay-aware feature transformation module to capture dynamic spatial dependencies.

  • STAEformer52: Enhances the performance of the Vanilla Transformer model by integrating sequence encoding, spatial encoding, and adaptive spatial-temporal encoding via spatial-temporal embeddings.

  • DDGFormer40: Combines attention mechanisms with graph convolutional networks, utilizing directional and distance information to enhance traffic flow modeling capabilities.

Hyperparameter settings

All experiments are conducted on a machine equipped with an NVIDIA GeForce RTX 4090 GPU, running Ubuntu 18.04 with PyTorch 1.10.1 and Python 3.9.7. A 5-minute time step is employed, with both the input sequence length \(T\) and the output sequence length \(T'\) set to 12. The hidden dimension \(d\) is searched over \(\{16, 32, 64, 128\}\) and set to 64, and the depth of the encoder layers \(L\) is explored within \(\{2, 4, 6, 8\}\) and set to 6 based on performance considerations. The feature dimensions of the temporal and spatial modules, \(d_t\) and \(d_s\), are both set to 32. A batch size of 16 is used for the PeMS03, PeMS04, and PeMS08 datasets, and 6 for PeMS07. The model is trained for 200 epochs using the Adam optimizer with a learning rate of 0.001. Evaluation metrics include mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE):

$$MAE = \frac{1}{n} \sum _{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
(18)
$$MAPE = \frac{1}{n} \sum _{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \times 100\%$$
(19)
$$RMSE = \sqrt{\frac{1}{n} \sum _{i=1}^{n} (\hat{y}_i - y_i)^2}$$
(20)

Here, \(\hat{y}_i\) and \(y_i\) represent the predicted and actual traffic flow values, respectively.
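For concreteness, the three metrics can be computed directly with NumPy; a minimal sketch, where the function names and toy values are our own:

```python
import numpy as np

def mae(y_hat, y):
    """Mean absolute error, Eq. (18)."""
    return np.mean(np.abs(y_hat - y))

def mape(y_hat, y, eps=1e-8):
    """Mean absolute percentage error, Eq. (19); eps guards against zero flow."""
    return np.mean(np.abs((y_hat - y) / (y + eps))) * 100.0

def rmse(y_hat, y):
    """Root mean square error, Eq. (20)."""
    return np.sqrt(np.mean((y_hat - y) ** 2))

y_true = np.array([100.0, 200.0, 300.0])   # actual flows
y_pred = np.array([110.0, 190.0, 330.0])   # predicted flows
print(mae(y_pred, y_true))    # ≈ 16.67
print(rmse(y_pred, y_true))   # ≈ 19.15
print(mape(y_pred, y_true))   # ≈ 8.33 (%)
```

The `eps` safeguard (or masking of zero-flow intervals) matters in practice for MAPE, since detector readings can be zero during quiet periods.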

Table 2 Comparison Table of Model Performance.

Performance comparison

Table 2 reports the average multi-step prediction performance of MSSTFormer and the baseline models across the four datasets; the best results are highlighted in bold and the second-best are underlined. MSSTFormer consistently outperforms all baselines on most evaluation metrics; for example, it achieves a 2.12% improvement in MAE on the PeMS03 dataset, with the frequency dual-channel attention module contributing most to this gain. Comparative analysis yields several important findings: (1) Compared to traditional time series models such as HA, ARIMA, and VAR, our model incorporates the spatial dependencies that these methods overlook. (2) Compared to graph neural network (GNN)-based models such as STGCN, STSGCN, and GDGCN, attention-based approaches, including PDFormer and STAEformer, exhibit superior predictive performance. However, spatial modeling frameworks built on self-attention, such as STAEformer, do not explicitly differentiate the spatial dependencies between geographically adjacent and distant nodes. In contrast, MSSTFormer designs a two-stage spatial attention module that captures global spatial dependencies while enhancing interactions among critical nodes, effectively revealing the complex relationships between global and local spatial features. (3) Among self-attention-based temporal modules, DDGFormer is one of the most competitive baselines, capturing temporal dependencies by combining self-attention with positional encoding. However, its single-scale temporal modeling limits its ability to capture multi-scale temporal features in traffic flow data.
In contrast, MSSTFormer proposes a frequency dual-channel attention module that decomposes temporal features into low-frequency and high-frequency components using the Fourier transform, modeling long-term trends and short-term fluctuations separately. This enables the model to handle variations across different time scales, enhancing its predictive capability for complex traffic flow data.
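The frequency split underlying this module can be illustrated with a real FFT: spectral bins below a cutoff form the low-frequency (trend) channel, and the remaining bins form the high-frequency (fluctuation) channel. A minimal NumPy sketch, where the cutoff and function name are our own illustration rather than the paper's exact design:

```python
import numpy as np

def frequency_split(x, cutoff):
    """Split a 1-D signal into low- and high-frequency parts via the
    real FFT; by linearity the two parts sum back to the original."""
    spec = np.fft.rfft(x)
    low_spec = np.zeros_like(spec)
    low_spec[:cutoff] = spec[:cutoff]        # keep the slow trend bins
    high_spec = spec - low_spec              # everything faster
    low = np.fft.irfft(low_spec, n=len(x))
    high = np.fft.irfft(high_spec, n=len(x))
    return low, high

t = np.arange(12)                                 # 12 steps = 1 h at 5-min resolution
x = 50 + 2 * t + 5 * np.sin(2 * np.pi * t / 3)    # trend + short-term wave
low, high = frequency_split(x, cutoff=2)
# The decomposition is lossless: low + high reconstructs x.
```

Each channel can then be fed to its own attention branch, so long-term trends and short-term fluctuations are attended to separately.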

Ablation study

The MSSTFormer model consists of three primary modules: the embedding layer, the two-stage spatial attention module, and the frequency dual-channel attention module. To validate the effectiveness of each module, we conducted an ablation study on the PeMS03, PeMS04, PeMS07, and PeMS08 datasets with the following variants:

  • MSSTFormer w/o gate embedding: multidimensional features are fused by weighted summation without the gating mechanism.

  • MSSTFormer w/o two-stage spatial attention: the two-stage spatial attention is removed, and global attention is applied to extract spatial features directly.

  • MSSTFormer w/o frequency dual-channel attention: low-frequency and high-frequency features are not separated; the attention module from the Vanilla Transformer36 is employed for temporal modeling.

The experimental results are shown in Table 3. The findings are as follows: (1) Removing the gate embedding (i.e., w/o gate embedding) leads to a significant increase in MAE, RMSE, and MAPE across all datasets, indicating that the gate embedding effectively filters unnecessary information and plays a critical role in reducing redundancy. (2) When global spatial attention replaces the two-stage spatial attention (i.e., w/o two-stage spatial attention), the model’s performance declines markedly, particularly on the PeMS04 and PeMS07 datasets. This demonstrates that, compared to single-view spatial modeling, the two-stage spatial attention more effectively captures complex spatial dependencies, revealing the intricate spatial dynamics between global and local contexts. (3) Removing the frequency dual-channel attention (i.e., w/o frequency dual-channel attention) results in a noticeable inability to capture short-term fluctuations and long-term trends, especially on the PeMS07 dataset. This shows that modeling temporal correlations with standard self-attention alone is insufficient to capture multi-scale temporal features, and the frequency dual-channel attention module plays a crucial role in handling multi-scale temporal features in traffic flow data.
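The contribution of the gate embedding in finding (1) can be pictured as a sigmoid-gated fusion of two feature streams, in contrast to the plain weighted summation used in the ablated variant. A minimal NumPy sketch; the weights, dimensions, and names are illustrative, not the paper's exact layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(h_a, h_b, W_a, W_b, b):
    """Fuse two embeddings with an input-dependent gate g in (0, 1):
    out = g * h_a + (1 - g) * h_b, so redundant components of either
    stream can be suppressed element-wise."""
    g = sigmoid(h_a @ W_a + h_b @ W_b + b)
    return g * h_a + (1.0 - g) * h_b

rng = np.random.default_rng(0)
d = 8                                         # illustrative feature width
h_a, h_b = rng.normal(size=(2, d))            # two embedding streams
W_a, W_b = 0.1 * rng.normal(size=(2, d, d))   # gate parameters
out = gated_fusion(h_a, h_b, W_a, W_b, b=np.zeros(d))
```

Because the gate forms an element-wise convex combination, every output component lies between the corresponding components of the two inputs, which a free weighted sum does not guarantee.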

Figure 3 evaluates the impact of each module in MSSTFormer on single-step (one step every 5 min) prediction metrics on the PeMS08 dataset. We observe that: (1) Incorporating the dynamics of traffic flow with the gate embedding significantly reduces prediction errors in both short-term and long-term forecasts. (2) The two-stage spatial attention mechanism improves the model’s ability to capture spatial dependencies, yielding better performance than w/o two-stage spatial attention in both short-term and long-term predictions. (3) The frequency dual-channel attention allows the model to effectively separate short-term fluctuations from long-term trends, outperforming the variant w/o frequency dual-channel attention.

Table 3 Comparison of ablation experiments.
Fig. 3 Comparison of single-step prediction performance on the PeMS08 dataset.

Parameter sensitivity analysis

To further explore the impact of various parameters, we perform a sensitivity analysis of the spatial mask used in MSSTFormer. Specifically, we explore each hyperparameter within a predefined search space: \(\{6, 7, 8, 9\}\) for the number of nearest neighbors \(K\) selected based on DTW similarity, and \(\{4, 5, 6, 7\}\) for the distance threshold \(\lambda\). This analysis allows us to evaluate the impact of different configurations on the performance of MSSTFormer. The results are shown in Fig. 4.

From the figure, we have the following observations: (1) Increasing the number of nearest neighbors \(K\) based on DTW similarity improves model performance up to \(K = 7\), beyond which further increases offer minimal benefit. A value of \(K = 7\) effectively captures both short-term fluctuations and long-term trends, while lower values (e.g., \(K = 6\)) introduce noise, and higher values (e.g., \(K = 9\)) reduce sensitivity to sudden changes. (2) The distance threshold \(\lambda = 5\) best preserves spatial dependencies, allowing the model to capture relevant relationships without introducing unnecessary complexity. Smaller values (e.g., \(\lambda = 4\)) overemphasize local dependencies, while larger values (e.g., \(\lambda = 7\)) lead to overfitting and reduced efficiency by including irrelevant connections.
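The two hyperparameters interact through the spatial mask: a pair of nodes can attend to each other if they are among each other's \(K\) most DTW-similar neighbors or lie within road distance \(\lambda\). Below is a sketch of one way such a mask might be built; the exact construction in MSSTFormer may differ, and the helper names and toy data are ours:

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic-programming DTW distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def spatial_mask(series, road_dist, K, lam):
    """Allow attention between node pairs that are among each node's K most
    DTW-similar neighbors or within road-distance threshold lam."""
    N = len(series)
    sim = np.array([[dtw(series[i], series[j]) for j in range(N)]
                    for i in range(N)])
    mask = np.zeros((N, N), dtype=bool)
    for i in range(N):
        mask[i, np.argsort(sim[i])[:K + 1]] = True   # K neighbors + self
    return mask | (road_dist <= lam)

# Three toy sensors: nodes 0 and 1 share a flow pattern, node 2 differs.
series = [np.array([1., 2., 3.]), np.array([1., 2., 3.]), np.array([9., 9., 9.])]
road_dist = np.array([[0., 1., 10.], [1., 0., 10.], [10., 10., 0.]])
mask = spatial_mask(series, road_dist, K=1, lam=2.0)
```

Raising \(K\) or \(\lambda\) densifies the mask, which is why overly large values admit weakly related node pairs and dilute attention.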

Fig. 4 Experimental results of the hyperparameter study on the PeMS08 dataset.

Case study

To verify the effectiveness of MSSTFormer in decomposing traffic flow data into distinct temporal patterns, we selected 12 consecutive time steps at 5-minute intervals from node 42 of the PeMS04 dataset, spanning 08:30 to 09:30 on January 8, 2018. Figure 5 illustrates the attention weight distributions of the high-frequency and low-frequency channels for these time steps. The color gradient from blue to red in the heatmap represents increasing attention weights, with red corresponding to higher attention and blue to lower attention. As shown in Fig. 5a, the attention weights in the low-frequency channel exhibit a smooth, progressively increasing pattern over time, reflecting the steady growth of traffic flow during off-peak hours. This indicates that MSSTFormer prioritizes long-term traffic flow trends over short-term fluctuations. In contrast, Fig. 5b reveals significant fluctuations in the attention weights of the high-frequency channel, with distinct peaks around time steps 6–8. These peaks indicate sudden changes in traffic flow near sensor 42, likely due to congestion or incidents. This highlights MSSTFormer’s ability to capture transient events, using high-frequency attention to focus on short-term disruptions and improving the model’s responsiveness to dynamic traffic conditions.

By processing these two distinct temporal features separately, MSSTFormer dynamically adjusts their contributions to traffic flow predictions. During peak periods, the model allocates greater attention to high-frequency features, enhancing its sensitivity to short-term fluctuations; during stable periods, it prioritizes low-frequency features, ensuring accurate predictions of long-term traffic patterns. Consequently, MSSTFormer effectively distinguishes between short-term fluctuations and long-term trends, significantly improving prediction accuracy.

Fig. 5 Visual analysis of the frequency dual-channel attention module.

Conclusions

In this study, we propose a novel model named MSSTFormer. A two-stage spatial attention module is integrated, which effectively addresses the complex relationships between global and local dependencies while enhancing the interactions among strongly correlated key nodes, thereby improving the model’s ability to capture intricate spatial features. Additionally, a new frequency dual-channel attention module is incorporated, which decouples high-frequency and low-frequency components to separately model long-term trends and short-term fluctuations, further enhancing the model’s ability to capture complex temporal dynamics. Moreover, a gating mechanism is embedded in the data embedding layer to reduce information redundancy. Extensive experiments conducted on four real-world datasets validate the superior performance of MSSTFormer. The results demonstrate its advantage over state-of-the-art models in traffic flow prediction tasks and highlight the strong potential of MSSTFormer for real-time traffic flow prediction, particularly in environments with dynamic and unpredictable conditions. Nonetheless, the model is not yet lightweight, and deploying it in large-scale urban transportation systems may introduce considerable computational and memory overhead. In future work, we plan to enhance the model’s computational efficiency and scalability to facilitate large-scale, real-time deployment. We will also focus on optimizing the model’s architecture to achieve a better balance between prediction accuracy and resource consumption, ensuring its practical applicability in intelligent transportation systems.