Introduction

The rapid development of Internet of Vehicles (IoV) technology has led to the widespread use of connected and autonomous vehicles globally. Intelligent connected vehicles (ICVs) integrate electronic control units (ECUs) and vehicle-to-everything (V2X) technology to provide advanced features such as autonomous driving, collision warning, and automatic parking assistance, which significantly enhance the driving experience. However, as vehicle connectivity and system complexity continue to grow, in-vehicle networks (IVNs) have become potential targets for attackers, significantly increasing the security risks faced by ICVs1.

The lack of effective authentication mechanisms, the unsecured broadcast transmission protocol, and insufficient encryption within the in-vehicle network make the Controller Area Network (CAN) bus vulnerable to message injection attacks2. Furthermore, the expansive architecture of external vehicular networks introduces potential attack entry points at each node, exposing these networks to risks including distributed denial-of-service (DDoS) and spoofing attacks3. A notable example is the 2015 Jeep Cherokee remote hacking incident, where attackers exploited CAN bus vulnerabilities to manipulate critical vehicle functions, resulting in catastrophic safety failures4.

To address the growing vulnerabilities and potential threats in in-vehicle networks, some researchers have proposed security enhancement measures based on authentication and encryption methods to improve the security of the CAN protocol. However, these approaches generally rely on modifications to the existing protocol, which is impractical for already deployed vehicles5. Currently, intrusion detection systems (IDS) based on traditional machine learning (ML) and deep learning (DL) have been demonstrated as effective protective mechanisms6. Traditional IDSs often focus on extracting either spatial or temporal features alone, while generally ignoring their joint extraction. However, the essential characteristics of complex attack behaviors typically encompass both spatial correlation and temporal dynamics simultaneously. A single feature dimension is insufficient to fully characterize attack patterns, and thus traditional methods exhibit suboptimal performance in detecting complex attacks.

In recent years, graph neural networks (GNNs) have demonstrated significant advantages in modeling spatial correlations within vehicular data, while Transformers excel at capturing long-range global temporal dependencies7. Therefore, increasing research efforts have focused on constructing IDSs based on GNNs or Transformers to enhance the modeling and detection capabilities for complex attack behaviors. However, existing GNN-based IDSs primarily rely on node-level analysis and fail to leverage global structural information within graph data to improve detection performance. Meanwhile, Transformer-based IDSs, although effective in modeling temporal dependencies, often struggle to fully integrate the spatial structural features inherent in graph data. More critically, most existing IDSs focus solely on either intra-vehicle or external network attacks, leaving hybrid IDSs capable of detecting both attack types largely underexplored8.

Given these challenges, there is an urgent need for a unified framework that can jointly model spatial topological structures and temporal evolution patterns while detecting hybrid attacks in both in-vehicle and external vehicular networks. To this end, this paper proposes GCN-2-Former, a hybrid graph convolutional network (GCN) and Transformer-based intrusion detection model for vehicular networks. The model converts vehicle network data into a graph structure using a graph construction module, extracts spatial features via a graph convolutional encoder, and introduces a multi-layer Transformer architecture to capture global temporal dependencies. Finally, a graph-level feature fusion strategy integrates spatial and temporal features, enabling global detection of vehicle network attacks. Specifically, the GCN models the feature correlation topology of traffic samples within each window, which captures the abnormal traffic aggregation or dispersion patterns of short-term injection attacks inside the vehicle. The multi-layer Transformer models the temporal dependencies of network traffic within the window, enabling identification of abnormal evolutionary trends in traffic intensity and frequency during long-term external attacks. Through this collaborative modeling of spatial topology and temporal dynamics, the proposed model can perceive both short-term injection attacks in in-vehicle networks and long-term threat evolution in external networks, thereby achieving unified intrusion detection for multi-scenario, multi-type attacks inside and outside the vehicle. Experiments show that the proposed model achieves 100% detection accuracy on the Car Hacking dataset9 and 99.98% on the CICIDS2017 dataset10, significantly outperforming existing methods.

The main contributions of this paper are summarized as follows:

This paper proposes GCN-2-Former, an innovative intrusion detection model for vehicular networks that integrates spatial and temporal features via a graph-temporal encoder combining GCN and Transformer architectures. Leveraging feature fusion to enhance representation, the model achieves superior performance in both intra-vehicle and external intrusion detection tasks.

This paper proposes a dynamic graph construction method that converts network flows into temporal graph sequences using a sliding window mechanism and similarity-based metrics. This method assigns category labels to graph instances based on node attributes, reconstructs the graph structure in real time through sliding windows, adapts to the spatial-temporal dynamics of IoV traffic, and effectively captures attack patterns’ spatial-temporal correlations.

This paper systematically evaluates the proposed model on both the Car Hacking dataset and the CICIDS2017 dataset. The experimental results show that the proposed model achieves 100% accuracy and F1 score on the Car Hacking dataset and 99.98% accuracy and F1 score on the CICIDS2017 dataset, significantly outperforming existing methods.

The remainder of this paper is organized as follows: Sect. 2 reviews related work; Sect. 3 presents the design and implementation of the GCN-2-Former model in detail; Sect. 4 evaluates and analyzes the model’s performance; Sect. 5 concludes the paper and discusses future research directions.

Related works

Intrusion detection solution based on classic ML

In recent years, due to the excellent performance of traditional ML in classification tasks, many researchers have focused on developing IDSs based on traditional ML. Aswal et al.11 evaluated the effectiveness of six classical ML algorithms in detecting Bot attacks in the IoV. They used Bot attack samples from the CICIDS2017 dataset as representatives of botnet attacks in vehicular networks, but did not consider other types of attacks. Gu and Lu12 proposed an intrusion detection framework based on SVM with naive Bayes feature embedding (NB-SVM), which achieves an accuracy of 98.92% on the CICIDS2017 dataset, but considers only the binary intrusion detection case. Song et al.13 constructed an intrusion detection system called SIDiLDNG, which combines an improved Levenshtein distance algorithm with N-gram analysis to efficiently extract global features and local properties of message sequences, improving the accuracy and efficiency of anomaly detection in CAN networks. Ye et al.14 proposed GDT-IDS based on graph theory and decision trees, which improved detection accuracy by introducing new graph-based features.

To further improve the performance of IDS detection, some researchers have focused on constructing an IDS based on ensemble learning and federated learning. El-Gayer et al.15 proposed a Dynamic Forest Structure Ensemble model (DFSENet). The model combines SMOTE and PCA to deal with data imbalance and high dimensionality. It uses multi-level ensemble learning to improve detection accuracy. Yang et al.16 proposed a multi-layer hybrid intrusion detection system (MTH-IDS), which integrates signature detection and anomaly detection. It adopts DT, RF, and XGBoost to detect known attacks, identifies zero-day attacks through clustering-labeled k-means and bias classifiers, and combines Bayesian optimization for parameter tuning. Ullah et al.17 proposed a hybrid ML model using ensemble learning techniques to detect various attacks in IoV external networks. Its detection accuracy on the CICIDS2017 dataset reached 99.75%. Mokbal et al.18 proposed an XGBoost-based ensemble framework. It selects optimal features through tree-based feature importance, constructs a model containing 300 decision trees, and achieves multi-classification and binary classification accuracies of 99.90% and 99.86% respectively on the CICIDS2017 dataset, which optimizes the problems of false positives and false negatives. Mirzaee et al.19 proposed a layered federated learning (CHFL)-based IDS, achieving a detection accuracy of 99.10% on the CICIDS2017 dataset. Driss et al.20 proposed an attack detection framework based on federated learning (FL). The framework trains local models on edge devices and uses a federated learning server to aggregate model parameters. It effectively solves the privacy and resource limitation problems of traditional centralized machine learning.

Intrusion detection solution based on DL

With the outstanding performance and rapid development of DL in fields like computer vision and natural language processing, researchers have started applying DL to IDSs. Ding et al.21 proposed the DeepSecDrive framework for detecting attacks in IVNs. In tests on the Car Hacking dataset, it achieved an accuracy and F1-score of 98.30%. Wang et al.22 evaluated ten deep learning-based intrusion detection methods using the Car Hacking dataset. The results showed that existing algorithms had good detection performance for DoS and RPM attacks, but their performance was generally weak for fuzzing attacks. In addition, IDSs based on Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Recurrent Neural Networks (RNN) have been proven effective. Yang and Shami23 proposed an IDS that combines CNN with transfer learning and ensemble learning techniques. This system showed strong performance in attack detection. Desta et al.24 proposed a deep convolutional neural network (DCNN) method called Reduced Inception-ResNet for detecting intra-vehicle attacks. It was trained using CAN ID sequences identified on the bus and achieved high detection performance on the Car Hacking dataset. Yu et al.25 proposed an intrusion detection model based on LSTM neural networks. The model predicts incoming message IDs by using the periodicity of IVN message ID sequences. In response to the optimization requirements for feature correlation modeling and spatiotemporal feature fusion, researchers have further proposed improved models. Lei et al.26 proposed the HNN model, which reconstructs feature correlations through contribution-based feature selection and Triangular Area Maps (TAM). It extracts spatiotemporal features in parallel using CNN and LSTM, and optimizes performance through attention mechanisms or feature concatenation. On datasets such as UNSW-NB15 and CICIDS2017, the accuracy is improved by 3.78–1.13% compared with the baseline. Cao et al.27 proposed the IoVST method, which characterizes the spatiotemporal relationships of messages through Message Attribute Graph Modeling (MAGM). It combines a Transformer to aggregate local features and extract global features. On the extended VeReMi dataset, the F1 score and accuracy are improved by 1.68% and 1.92%, respectively, compared with the optimal baseline, and it is adaptable to multi-network scenarios.

To overcome the gradient vanishing problem of traditional RNNs and LSTMs, researchers have introduced the attention mechanism into IDSs. Long et al.28 proposed a network intrusion detection algorithm based on the Transformer model. This algorithm effectively captures temporal dependencies in network intrusion data, improving detection accuracy. However, the model's performance in detecting attacks such as spoofing and data tampering is not ideal, leading to an overall imbalance in detection results. Liu and Wu29 proposed an intrusion detection model based on an enhanced Transformer, which improves classification performance by refining position encoding. However, the model still has room for improvement in detection accuracy on the NSL-KDD dataset. Wang et al.30 proposed a lightweight intrusion detection model based on an improved Vision Transformer (IVN-ViT). The model captures global dependencies using attention mechanisms, avoids recursive structures, and enables parallel computation, making feature extraction faster and more efficient. However, its generalization ability in complex attack scenarios has not been verified.

In recent years, the application of pre-trained models and large language models (LLMs) in the field of intrusion detection has continued to deepen. The UNILM model proposed by Li et al.31 constructs a multi-task unified pre-training framework by sharing the Transformer architecture and self-attention masks. Its ability to capture cross-domain dependencies provides new ideas for network traffic modeling. BERT-based large language models have shown significant advantages in network security threat identification tasks. Fu et al.32 proposed the IoV-BERT-IDS framework for the Internet of Vehicles (IoV). By converting in-vehicle traffic through a semantic extractor and combining the Masked Byte Word Model (MBWM) with Next Byte Sequence Prediction (NBSP) to learn bidirectional features, they verified the practicality of hybrid intrusion detection for vehicles on the CICIDS and Car-Hacking datasets. SecurityBERT proposed by Ferrag et al.33 focuses on lightweight and privacy protection. It processes traffic using PPFLE encoding and byte-level BPE tokenization, achieving an accuracy of 98.2% on the Edge-IIoTset dataset. With a model size of only 16.7 MB and inference time less than 0.15 s, it provides a feasible solution for deployment on resource-constrained devices.

Compared to deep learning methods, graph-based approaches detect intrusions in in-vehicle networks by analyzing data distribution patterns. Existing research shows that intrusion detection based on graph theory is still in the early exploration stage, with limited literature, and most studies focus only on node-level analysis34. He et al.35 proposed an IDS based on Arbitration and Data graphs. The system uses a two-layer GCN architecture and effectively improves detection accuracy in vehicular CAN networks. However, the model was tested only on a single dataset, so its generalization ability has not been fully verified. Xiao et al.36 proposed a CAN-GAT model aimed at detecting anomalies in the CAN bus and demonstrated a detection framework based on graph convolution, graph attention, and the CAN-GAT network. Song et al.37 proposed a graph-based intrusion detection system (DGIDS), which demonstrated outstanding performance on datasets like Car Hacking. However, the system has certain limitations. It relies only on identifiers (IDs) and the sequence relationship between messages, making it difficult to detect attacks that change message content but not message order, such as DDoS and protocol spoofing attacks.

The above literature shows that current IDSs mostly focus on either intra-vehicle network attacks14,22,35,37 or external network attacks12,13,19,20,29, with few studies considering both intra-vehicle and external network attacks. Additionally, many studies have neglected the integration of temporal and spatial features when constructing IDSs. Although some literature has considered spatial-temporal features38, their detection accuracy still needs improvement. Most existing GNN-based IDSs focus on node-level intrusion detection35,37, often ignoring the vital information embedded in the overall graph structure. To address these issues, this paper proposes a model that integrates GCN and Transformer, aiming to achieve graph-level intrusion detection for known attack types in both intra-vehicle and external networks by fully leveraging spatial-temporal features.

The proposed GCN-2-Former model

Model definition

The proposed model integrates the GCN and Transformer architectures, combining dynamic graph construction with a spatial-temporal feature fusion strategy to achieve efficient and accurate detection of complex network attacks. The input to the model is heterogeneous network flows from the IoV, represented mathematically as \(D=\{(X_i, y_i)\}_{i=1}^{Z}\). Here, Z is the total number of samples in the dataset, \(X_i \in \mathbb{R}^{w \times d}\) is the spatial-temporal feature matrix of the i-th sample (w is the time step length and d is the feature dimension), and \(y_i \in \{0,1\}^{C}\) is the corresponding C-class label vector.

To capture the spatial-temporal evolution characteristics of network flows, the model uses a sliding window mechanism to divide the original data stream into a sequence of dynamic graphs. Given the window size w and step length s, M dynamic graph structures are generated: \(G=\{G_1, G_2, \cdots, G_M\}\), where \(G_t=(V_t, E_t, W_t, y_t)\) is the graph built from the t-th window, with node set \(V_t\), edge set \(E_t\), edge weights \(W_t\), and graph label \(y_t\). Common symbols used in this paper and their corresponding interpretations are shown in Table 1.
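The sliding-window segmentation can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function name `sliding_windows` is hypothetical, and the sketch assumes that a trailing window with fewer than w samples is dropped, so that M = floor((Z - w) / s) + 1 whenever Z is at least w.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(samples: Sequence[T], w: int, s: int) -> List[Sequence[T]]:
    """Split an ordered stream into windows of w samples, advancing by s.

    Assumption (not stated in the text): an incomplete trailing window is dropped.
    """
    if w <= 0 or s <= 0:
        raise ValueError("w and s must be positive")
    last_start = len(samples) - w
    return [samples[start:start + w] for start in range(0, last_start + 1, s)]


# Toy stream of 10 records standing in for flow records or CAN messages.
stream = list(range(10))
windows = sliding_windows(stream, w=4, s=2)
M = len(windows)  # each window later becomes one graph G_t
```

With overlapping windows (s smaller than w), consecutive graphs share nodes, which is what lets the graph sequence follow gradual changes in the traffic.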

Table 1 Notations and the corresponding interpretation.

Overall framework

As shown in Fig. 1, the proposed model consists of four core modules: the data preprocessing module, the graph construction module, the Graph-Temporal Encoder module, and the integrated classifier module. The data preprocessing module handles missing values, detects outliers, performs feature normalization, and addresses class imbalance to ensure high-quality and stable input data. The graph construction module builds dynamic graph structures based on a sliding window mechanism, where the graph attributes are determined by the attributes of its nodes. The Graph-Temporal Encoder module includes a graph convolution encoder and a temporal encoder. The former captures spatial relationships between nodes, while the latter learns long-term dependencies in the time series. Finally, the integrated classifier module performs the final classification through feature fusion and classification layers.

Fig. 1 Architecture of the GCN-2-Former.

Data preprocessing module

The data preprocessing module proposed in this paper consists of three steps: cleaning, balancing, and standardization. The raw data is converted into structured features directly usable by the model through systematic processing.

Data cleaning: A high-quality dataset is built by removing redundancy, filling in missing values, and eliminating non-informative features. Specifically, redundancy elimination is performed on the CICIDS2017 and Car Hacking datasets to remove duplicate samples. In the processing of missing values and outliers, the abnormal labels of CAN messages in the Car Hacking dataset are mapped to 256, and the infinite values and missing values in CICIDS2017 are replaced with 0. For feature selection, the 11-dimensional core features of Car Hacking and the 78-dimensional features of CICIDS2017 are retained.

Data balancing: This paper adopts a general data balancing strategy to address the differences in class distribution among different datasets. For datasets with class imbalance (such as CICIDS2017), the proportion of majority classes is adjusted through controlled downsampling38, and the Synthetic Minority Oversampling Technique (SMOTE)39 is applied to expand minority class attack samples. For datasets with relatively balanced class distribution (such as Car Hacking), no class balancing is performed.

Data standardization: Integer encoding is applied to classification labels to fit the multi-class task. Numerical features are standardized using Z-score normalization to eliminate scale differences and improve the stability of gradient optimization.

After the above-mentioned processing, the Car Hacking and CICIDS2017 datasets are converted into 11-dimensional and 78-dimensional numerical feature vectors, respectively, forming structured inputs that can be directly processed by the model.
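The cleaning and standardization steps can be illustrated with a short NumPy sketch. It is an assumption-laden example, not the authors' pipeline: the function name `clean_and_standardize` is hypothetical, non-finite values are zeroed before duplicates are removed so that row comparison is well defined, constant columns are left unscaled, and the balancing step (downsampling and SMOTE) is omitted.

```python
import numpy as np


def clean_and_standardize(X: np.ndarray) -> np.ndarray:
    """Replace inf/NaN with 0, drop duplicate rows, then apply Z-score scaling."""
    X = np.array(X, dtype=float)             # work on a copy
    X[~np.isfinite(X)] = 0.0                  # infinite and missing values -> 0
    # Keep the first occurrence of each row while preserving temporal order.
    _, first_idx = np.unique(X, axis=0, return_index=True)
    X = X[np.sort(first_idx)]
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0                     # assumption: leave constant columns as zeros
    return (X - mean) / std


raw = np.array([
    [1.0, 10.0],
    [1.0, 10.0],       # duplicate sample
    [3.0, np.inf],     # infinite value
    [5.0, np.nan],     # missing value
])
Z = clean_and_standardize(raw)
```

Preserving row order matters here because the later sliding windows treat consecutive rows as consecutive time steps.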

Graph construction module

Node definition

In this paper, the traffic dataset is divided into fixed-length time series segments through a sliding window mechanism. Each data sample (that is, a single record in the network traffic dataset, such as a network flow record of CICIDS2017 or a CAN bus message of Car Hacking) is mapped to a node in the graph structure. The window size w equals the number of nodes N in the graph; that is, each sliding window contains N consecutive data samples, corresponding to N nodes. A parameter sensitivity analysis of the window size is presented in the experimental results section to verify its impact on model performance. The node set is formalized as \(V=\{v_1, v_2, \cdots, v_N\}\), where \(v_i \in \mathbb{R}^d\) is the preprocessed feature vector of the i-th data sample in the window, and d is the dimension of the original feature space.

Edge construction

The core goal of the edge construction module is to establish the topological connection relationship of dynamic graphs by quantifying the feature correlations between nodes, providing a structured foundation for subsequent feature extraction. The specific design is as follows:

To quantify the strength of feature correlations between nodes, the squared Euclidean distance between node pairs \(v_i\) and \(v_j\) is used as the similarity measure, so a smaller value indicates a more similar pair. The similarity calculation formula is:

$$similarity({v_i},{v_j})=\parallel {v_i} - {v_j}{\parallel ^2}$$
(1)

To avoid excessive redundancy in the graph structure and focus on key correlations, the selection of edges must satisfy two constraints. First, only node pairs that satisfy \(similarity(v_i,v_j)<\tau\) are retained, which filters out weakly correlated nodes. Second, each node is connected only to its k most similar nodes within the window (those with the smallest value of Eq. (1)), which caps the number of connections per node. Based on these constraints, the construction rule of the edge set E can be expressed as:

$$E=\left\{ {(i,j):similarity({v_i},{v_j})<\tau } \right\}$$
(2)

To further distinguish the correlation strength of edges, the weight of each edge is given by one minus the similarity value. The specific calculation formula is:

$${\omega _{ij}}=1 - similarity({v_i},{v_j})$$
(3)

This rule ensures that the more similar two nodes are (that is, the smaller their value in Eq. (1)), the greater the weight of the corresponding edge, which emphasizes the key connections in the graph structure.

Finally, the edge set constructed through the above steps is uniformly represented as \(E=\{e_1, e_2, \cdots, e_S\}\), where S is the total number of edges. Each edge connects a pair of nodes \(v_i\) and \(v_j\); together, the edges describe the topological structure of the dynamic graph and provide a structured input for the subsequent graph convolution operations.
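The edge construction rules in Eqs. (1) to (3) can be sketched as below. The function name `build_edges` and the toy values of τ and k are hypothetical, and the sketch assumes directed edges without self-loops, which the text does not specify.

```python
import numpy as np


def build_edges(V: np.ndarray, tau: float, k: int):
    """Return (edges, weights) for one window of node features V with shape (N, d)."""
    V = np.asarray(V, dtype=float)
    diff = V[:, None, :] - V[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)          # Eq. (1): squared Euclidean distance
    np.fill_diagonal(dist, np.inf)             # assumption: no self-loops at this stage
    edges, weights = [], []
    for i in range(V.shape[0]):
        nearest = np.argsort(dist[i])[:k]      # top-k most similar nodes for node i
        for j in nearest:
            if dist[i, j] < tau:               # Eq. (2): threshold constraint
                edges.append((i, int(j)))
                weights.append(float(1.0 - dist[i, j]))   # Eq. (3)
    return edges, weights


nodes = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0]])
edges, weights = build_edges(nodes, tau=0.5, k=2)
```

Because the weight is 1 minus the distance and only pairs with distance below τ are kept, the weights are guaranteed to stay positive only when τ is at most 1. The isolated fourth node in the toy data receives no edges.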

Graph label assignment

In this paper, a label is assigned to each graph based on the traffic types within the window. If the window contains only normal traffic, the graph is labeled as normal. On the other hand, if any attack traffic is present, the label is determined by the most frequent attack type in the window.

Denote the label set in the window as \(L=\{l_1, l_2, \cdots, l_N\}\), where \(l_i\) is the label of node \(v_i\). The graph label y is defined as:

$$y=\left\{ {\begin{array}{*{20}{c}} {\text{Normal}}&{\text{if } \forall {l_i}=\text{Normal}} \\ {\operatorname{argmax}(\operatorname{bincount}(L))}&{\text{if } \exists {l_i} \ne \text{Normal}} \end{array}} \right.$$
(4)

where bincount counts the number of occurrences of each attack label in L, and argmax selects the most frequent attack label.
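The labeling rule in Eq. (4) can be sketched in a few lines. The function name `graph_label` and the string labels are illustrative assumptions; ties between attack types are broken in favor of the type seen first, a choice the text leaves open.

```python
from collections import Counter
from typing import Sequence

NORMAL = "Normal"


def graph_label(node_labels: Sequence[str], normal: str = NORMAL) -> str:
    """Label a window as normal only if every node is normal; otherwise
    return the most frequent attack type among its nodes."""
    attacks = [label for label in node_labels if label != normal]
    if not attacks:
        return normal
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(attacks).most_common(1)[0][0]


clean_window = ["Normal"] * 5
mixed_window = ["Normal", "DoS", "Fuzzy", "DoS", "Normal", "Normal"]
```

Note that a single attack sample is enough to give the whole window an attack label, even when normal traffic dominates the window.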

Graph-temporal encoder module

As the core architecture of the proposed model, the graph temporal encoder module aims to effectively capture complex spatiotemporal correlation patterns in in-vehicle network traffic data by integrating the capabilities of spatial feature extraction and temporal dependency modeling. This module is composed of two key sub-modules: the graph convolution encoder and the temporal encoder. The former focuses on mining the spatial topological features of network data, while the latter concentrates on capturing long-distance temporal dependencies in time series. Through a close logical connection, the two form a complete spatiotemporal feature learning framework.

Fig. 2 Convolutional encoder module.

GCN, as a well-established GNN, was first proposed to handle graph-structured data and solve semi-supervised learning problems. A multi-layer GCN can be formulated as:

$${H^{(l+1)}}=\sigma ({\tilde {D}^{ - \frac{1}{2}}}\tilde {A}{\tilde {D}^{ - \frac{1}{2}}}{H^{(l)}}{W^{(l)}})$$
(5)

where \(\tilde{A}=A+I_N\) denotes the adjacency matrix with self-connections, \(I_N\) denotes the identity matrix, \(\tilde{D}\) denotes the diagonal degree matrix of \(\tilde{A}\), \(W^{(l)}\) denotes the trainable weight matrix of layer l, σ(·) denotes the nonlinear activation function, and \(H^{(l)} \in \mathbb{R}^{N \times D}\) denotes the activation matrix in layer l.
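The propagation rule in Eq. (5) can be written directly in NumPy. This is a minimal single-graph sketch with ReLU as the activation; the function name `gcn_layer` and the toy path graph are illustrative assumptions.

```python
import numpy as np


def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])           # add self-connections
    deg = A_tilde.sum(axis=1)                  # diagonal of the degree matrix
    D_inv_sqrt = np.diag(deg ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU activation


# Path graph 0 - 1 - 2 with one-hot node features and an all-ones weight matrix.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H0 = np.eye(3)
W0 = np.ones((3, 2))
H1 = gcn_layer(A, H0, W0)
```

The self-connections guarantee that every degree is at least 1 for non-negative edge weights, so the inverse square root is always defined.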

As shown in Fig. 2, the proposed graph convolution encoder consists of two consecutive GCN layers. The first GCN layer aggregates the feature information of each node and its neighboring nodes, extracting more representative features that reflect the local spatial structure. Based on the output of the first layer, the second GCN layer further extends the aggregation range, mining deeper relationships to generate higher-level feature representations. In both GCN layers, we use ReLU as the activation function to enhance the model's nonlinear expression capability and introduce a Dropout mechanism to alleviate overfitting. Through the collaborative effect of these two GCN layers, the model integrates neighborhood information and transforms the original node input into a high-dimensional representation containing rich local spatial features, providing a solid foundation for subsequent analysis. After passing through the two GCN encoder layers and batch dimension processing, the node embedding representation \(H_G \in \mathbb{R}^{B \times N \times d_G}\) is obtained.
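A two-layer encoder of this kind might be stacked as in the following sketch, which processes a single graph (no batch dimension). The names `normalize_adj` and `gcn_encoder`, the hidden sizes, and the use of inverted dropout are assumptions made for illustration.

```python
import numpy as np


def normalize_adj(A: np.ndarray) -> np.ndarray:
    """Symmetrically normalized adjacency with self-connections."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1) ** -0.5
    return A_tilde * d[:, None] * d[None, :]


def gcn_encoder(A, X, W1, W2, p=0.0, rng=None):
    """Two GCN layers, each followed by ReLU and (training-time) dropout."""
    rng = np.random.default_rng(0) if rng is None else rng
    A_hat = normalize_adj(A)
    H = X
    for W in (W1, W2):
        H = np.maximum(A_hat @ H @ W, 0.0)         # aggregate neighbors, then ReLU
        if p > 0.0:                                 # inverted dropout (assumption)
            keep = rng.random(H.shape) >= p
            H = H * keep / (1.0 - p)
    return H


rng = np.random.default_rng(42)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))                         # N = 4 nodes, d = 3 features
W1 = rng.normal(size=(3, 8))
W2 = rng.normal(size=(8, 5))                        # d_G = 5 in this toy setup
H_G = gcn_encoder(A, X, W1, W2, p=0.0)
```

After two layers each node embedding depends on its two-hop neighborhood, which is the extended aggregation range the paragraph describes.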

Fig. 3 Temporal encoder module.

As shown in Fig. 3, the proposed temporal encoder module consists of a stack of multiple Transformer encoder layers. Each layer mainly includes two sub-layers: Multi-Head Attention (MHA) and Feed Forward Network (FFN). Residual connections and Layer Normalization techniques are used to enhance stability and accelerate training convergence. The temporal encoder receives the output tensor HG from the graph convolution encoder. To capture the sequential relationships of nodes in the time series, positional encoding is incorporated into the input. The positional encoding is generated using sine and cosine functions, as follows:

$$P{E_{t,i}}=\left\{ {\begin{array}{*{20}{c}} {\sin \left( {\frac{t}{{{{10000}^{\frac{i}{{{d_G}}}}}}}} \right)}&{i=2k} \\ {\cos \left( {\frac{t}{{{{10000}^{\frac{{i - 1}}{{{d_G}}}}}}}} \right)}&{i=2k+1} \end{array}} \right.$$
(6)

Here, t denotes the time step position in the input sequence, and i denotes the index value in the embedding. Subsequently, the position encoding matrix PE is expanded to the same batch size and sequence length via a broadcasting mechanism, the encodings of the first N positions are extracted, and the result is added to \(H_G\). The resulting tensor \(H_{PE} \in \mathbb{R}^{B \times N \times d_G}\) contains the embedding data with spatial-temporal information and serves as the input to MHA in the temporal encoder.
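Eq. (6) and the broadcast addition can be sketched as follows. The function name `positional_encoding`, the table length `max_len`, and the toy tensor sizes are illustrative assumptions.

```python
import numpy as np


def positional_encoding(n_pos: int, d_model: int) -> np.ndarray:
    """Sinusoidal table of shape (n_pos, d_model) following Eq. (6)."""
    t = np.arange(n_pos)[:, None]                  # time step positions
    i = np.arange(d_model)[None, :]                # embedding indices
    exponent = (i - (i % 2)) / d_model             # i/d for even i, (i-1)/d for odd i
    angle = t / np.power(10000.0, exponent)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))


B, N, d_G = 2, 5, 8
max_len = 64                                       # hypothetical table length
pe_table = positional_encoding(max_len, d_G)
H_G = np.zeros((B, N, d_G))                        # stand-in for the GCN encoder output
H_PE = H_G + pe_table[None, :N, :]                 # first N positions, broadcast over the batch
```

Precomputing the table once and slicing its first N rows avoids recomputing the encoding for every batch.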

MHA receives the tensor HPE and performs feature association modeling on it. MHA first maps HPE through linear transformation, generating the query matrix Q, key matrix K, and value matrix V, respectively. The calculation formulas are as follows:

$$\begin{gathered} Q={H_{{\text{PE}}}}\cdot {W^Q} \hfill \\ K={H_{{\text{PE}}}}\cdot {W^K} \hfill \\ V={H_{{\text{PE}}}}\cdot {W^V} \hfill \\ \end{gathered}$$
(7)

where \(W^Q\), \(W^K\), and \(W^V \in \mathbb{R}^{d_G \times d_G}\) are the learnable weight matrices.

Q, K, and V are then partitioned into multiple heads, and the attention scores for each head are calculated separately:

$$hea{d_i}=Attention({Q_i},{K_i},{V_i})=Softmax\left(\frac{{{Q_i}K_{i}^{T}}}{{\sqrt {{d_k}} }}\right){V_i}$$
(8)

Where dk is the dimension of the key vector, and Qi, Ki, and Vi are the partition matrices of the corresponding heads of Q, K, and V, respectively.

Finally, the output of each head is spliced and linearly transformed to obtain the final multi-head attention output:

$$MultiHead(Q,K,V)=Concat(hea{d_1}, \cdots ,hea{d_h}){W^O}$$
(9)

where \(W^O \in \mathbb{R}^{hd_k \times d_G}\) is the output weight matrix.
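Eqs. (7) to (9) can be combined into one short NumPy sketch. The function names and toy sizes are assumptions, and the sketch assumes that \(d_G\) is divisible by the number of heads h so that \(d_k = d_G / h\).

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(H, Wq, Wk, Wv, Wo, h):
    """Multi-head self-attention over H with shape (B, N, d)."""
    B, N, d = H.shape
    d_k = d // h

    def split(X):                                   # (B, N, d) -> (B, h, N, d_k)
        return X.reshape(B, N, h, d_k).transpose(0, 2, 1, 3)

    Q, K, V = split(H @ Wq), split(H @ Wk), split(H @ Wv)        # Eq. (7), then head split
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)          # (B, h, N, N)
    heads = softmax(scores) @ V                                  # Eq. (8)
    concat = heads.transpose(0, 2, 1, 3).reshape(B, N, h * d_k)
    return concat @ Wo                                           # Eq. (9)


rng = np.random.default_rng(0)
H_PE = rng.normal(size=(2, 5, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(H_PE, Wq, Wk, Wv, Wo, h=2)

# Sanity check: zero query/key projections give uniform attention,
# so every position receives the mean of the values.
eye, zero = np.eye(8), np.zeros((8, 8))
uniform = multi_head_attention(H_PE, zero, zero, eye, eye, h=2)
```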

The output of the MHA goes through a feed-forward neural network, which consists of two linear layers and a GELU activation function:

$$\begin{gathered} FF{N_1}(X)=GELU(X{W_1}+{b_1}) \hfill \\ FF{N_2}(X)=FF{N_1}(X){W_2}+{b_2} \hfill \\ \end{gathered}$$
(10)

where \(W_1 \in \mathbb{R}^{d_G \times d_{FFN}}\) and \(W_2 \in \mathbb{R}^{d_{FFN} \times d_G}\) are the weight matrices, \(b_1\) and \(b_2\) are the bias vectors, and \(Z_{layer}\) denotes the layer output obtained from the feed-forward network (Eq. (12)).
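A sketch of Eq. (10) follows. Note one swapped-in detail: it uses the common tanh approximation of GELU rather than the exact erf-based definition, since the text does not say which variant is used. The names `gelu` and `ffn` and the hidden width of 32 are illustrative.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU (assumption: the exact variant is unspecified)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: GELU(X W1 + b1) W2 + b2."""
    return gelu(X @ W1 + b1) @ W2 + b2


rng = np.random.default_rng(1)
d_G, d_FFN = 8, 32
X = rng.normal(size=(2, 5, d_G))
W1, b1 = rng.normal(size=(d_G, d_FFN)), np.zeros(d_FFN)
W2, b2 = rng.normal(size=(d_FFN, d_G)), np.zeros(d_G)
Y = ffn(X, W1, b1, W2, b2)
```

The shapes make the dimension requirement explicit: the first matrix expands from \(d_G\) to the hidden width and the second projects back, so the output can be added to the residual stream.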

Layer normalization, combined with a residual connection, is applied after both the multi-head self-attention mechanism and the feed-forward neural network to stabilize training. For the output of the multi-head self-attention mechanism, with \(H_{PE}\) as the input of the first layer:

$${Z_{attn}}=LayerNorm({H_{PE}}+MultiHead(Q,K,V))$$
(11)

For the output of the feed-forward neural network:

$${Z_{layer}}=LayerNorm({Z_{attn}}+FF{N_2}({Z_{attn}}))$$
(12)
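The residual and normalization pattern of Eqs. (11) and (12) can be sketched as below. For brevity the layer normalization omits the learnable gain and bias, and the attention and feed-forward sub-layers are passed in as callables; both simplifications, and the names `layer_norm` and `encoder_layer`, are assumptions.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize over the feature axis (no learnable scale/shift in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def encoder_layer(H_in, attention, ffn):
    """Post-norm Transformer encoder layer."""
    Z_attn = layer_norm(H_in + attention(H_in))     # Eq. (11)
    return layer_norm(Z_attn + ffn(Z_attn))         # Eq. (12)


rng = np.random.default_rng(2)
H_in = rng.normal(size=(2, 5, 8))
# Placeholder sub-layers so the sketch runs on its own.
Z_layer = encoder_layer(H_in, attention=lambda x: 0.0 * x, ffn=lambda x: x)
```

Stacking several such layers, each taking the previous layer's output as its input, gives the multi-layer temporal encoder.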

The feed-forward neural network consists of two fully connected layers with a GELU activation function in between, which nonlinearly transforms the output of the multi-head attention to further enhance the expressive power of the model.

The temporal encoder module contains multiple encoder layers, and the output of each layer is used as the input of the next layer. After the final encoder layer, the output \(H_T \in \mathbb{R}^{B \times N \times d_T}\) is obtained.

Integrated classifier module

The integrated classifier module is a core component of the model, responsible for fusing the features from the graph convolutional encoder and the temporal encoder and performing the final classification through a multilayer fully connected (FC) network and a Softmax layer. The module receives the outputs of the two encoders and applies Global Average Pooling (GAP) to each, yielding the global graph feature vector \(h_G \in \mathbb{R}^{B \times d_G}\) and the global temporal feature vector \(h_T \in \mathbb{R}^{B \times d_T}\). Subsequently, \(h_G\) is concatenated with \(h_T\) to obtain the fusion feature vector \(H_F \in \mathbb{R}^{B \times (d_G+d_T)}\):

$${H_F}=Concat({h_G},{h_T})$$
(13)

The fusion feature vector HF is fed into the FC layer to extract higher-level feature representations, and then passes through the ReLU activation function and a Dropout layer to enhance nonlinear expressiveness and generalization performance. The computation of the FC layer can be represented as follows:

$$\begin{gathered} {H_{FC}}=\operatorname{ReLU}({W_{FC}} \cdot {H_F}+{b_{FC}}) \hfill \\ {H_{Dropout}}=Dropout({H_{FC}},p) \hfill \\ \end{gathered}$$
(14)

Where WFC\(\in {{\mathbb{R}}^{{d_{FC}} \times ({d_G}+{d_T})}}\) is the weight matrix of the fully connected layer, bFC\(\in {{\mathbb{R}}^{{d_{FC}}}}\) is the bias term, and p is the dropout rate, i.e., the probability of each neuron being dropped.
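The pooling, concatenation, and FC steps of Eqs. (13) and (14) can be sketched for a single sample (B = 1) as follows. This is an illustration under stated assumptions rather than the authors' code: the dimensions and weights are hypothetical, and dropout is written in the common "inverted" form, which rescales surviving units by 1/(1 - p) during training and is the identity at inference.

```python
import random

def global_average_pool(features):
    """Average an N x d list of node (or time-step) vectors into one d-dimensional vector."""
    n, d = len(features), len(features[0])
    return [sum(row[j] for row in features) / n for j in range(d)]

def fc_relu(x, W, b):
    """First line of Eq. (14): ReLU(W_FC . H_F + b_FC), with W of shape (d_FC, len(x))."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def dropout(x, p, training, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]

# Toy example with N = 3 nodes, d_G = 2, d_T = 2, d_FC = 3 (all hypothetical).
graph_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]       # graph encoder output
temporal_out = [[0.0, 1.0], [0.0, 1.0], [3.0, 1.0]]    # temporal encoder output
h_G = global_average_pool(graph_out)                    # global graph feature vector
h_T = global_average_pool(temporal_out)                 # global temporal feature vector
H_F = h_G + h_T                                         # list concatenation = Concat(h_G, h_T)

W_FC = [[0.1, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
b_FC = [0.0, 0.0, 0.5]
H_FC = fc_relu(H_F, W_FC, b_FC)
H_drop_eval = dropout(H_FC, p=0.5, training=False, rng=random.Random(0))
```

Concatenation keeps the graph and temporal features in separate coordinates, so the FC layer is free to weight the two sources differently.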

The output of the Dropout layer is fed into the Softmax layer to get the probability distribution for each category. The Softmax layer is computed as follows:

$$P=Softmax({W_S} \cdot {H_{Dropout}}+{b_S})$$
(15)

Where WS\(\in {{\mathbb{R}}^{C \times {d_{FC}}}}\) is the weight matrix of the Softmax layer, bS is the bias term, and C is the number of categories. The Softmax function is defined as follows:

$$Softmax({z_i})=\frac{{{e^{{z_i}}}}}{{\sum\nolimits_{{j=1}}^{C} {{e^{{z_j}}}} }}$$
(16)

Where zi denotes the score of the i-th category. The model selects the category with the highest probability as the prediction result. The model is trained using the cross-entropy loss function. Each input batch contains B graph samples with true labels y = {y1,y2,…,yB}. The model outputs the category probability distributions ŷ = { ŷ1, ŷ2,…, ŷB}, where ŷi\(\in {{\mathbb{R}}^C}\). The mathematical expression of the cross-entropy loss function is:

$$\mathcal{L}= - \frac{1}{B}\sum\limits_{{i=1}}^{B} {\sum\limits_{{c=1}}^{C} {y_{i}^{{(c)}}} } \log \hat {y}_{i}^{{(c)}}$$
(17)

Where \(y_{i}^{{(c)}}\) is the true label of the i-th sample for class c (1 if the sample belongs to class c and 0 otherwise), and \(\hat {y}_{i}^{{(c)}}\) is the probability that the i-th sample belongs to class c as predicted by the model. By minimizing this loss function, the model can optimize the parameters of the classification layer and improve its ability to discriminate between various types of attacks and normal traffic.
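As a small worked example of Eqs. (16) and (17), the sketch below computes the Softmax probabilities, the predicted class, and the batch cross-entropy for two toy samples. The logits and labels are made up for illustration. Subtracting the maximum logit and clipping probabilities at a small eps are common numerical safeguards assumed here, not details taken from the paper.

```python
import math

def softmax(z):
    """Eq. (16), with the maximum subtracted first for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs_batch, labels, eps=1e-12):
    """Eq. (17) with one-hot labels: mean of -log(probability of the true class)."""
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs_batch, labels)]
    return sum(losses) / len(losses)

# Two toy samples, three classes (hypothetical values).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 2]                     # index of the true class of each sample
probs = [softmax(z) for z in logits]
predictions = [max(range(len(p)), key=p.__getitem__) for p in probs]
loss = cross_entropy(probs, labels)
```

With one-hot labels the inner sum over classes in Eq. (17) keeps only the true-class term, which is why the code indexes `p[y]` directly.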

Experiments and performance evaluation

Experimental equipment

The model proposed in this paper has been trained and tested in a Python 3.8.19 environment using the PyTorch framework. The device is equipped with an Intel Core i5-12600KF processor (3.70 GHz) and 32 GB of RAM.

Data description

To evaluate the performance of the model in vehicular network intrusion detection tasks, this paper selects two typical datasets in the field of IoV intrusion detection: the Car Hacking dataset and the CICIDS2017 dataset.

As a typical in-vehicle network dataset, the Car Hacking dataset strictly adheres to the CAN bus protocol specifications and contains 11 core features: timestamp, CAN ID, Data Length Code (DLC), and the eight data bytes of each CAN message (DATA[0]–DATA[7]). Timestamps reveal the interference of DoS attacks with the communication rhythm of the bus through abnormal fluctuations in message time intervals. Abnormal CAN ID patterns (such as unauthorized IDs) serve as a key basis for detecting spoofing and message injection attacks. Malformed DLC values (such as lengths outside the 0–8 byte range) can be associated with fuzzing attack behaviors. The data bytes can reveal content tampering attacks, such as rotational speed tampering and gear forgery, through deviations of their values from the normal range. Because these features effectively characterize in-vehicle attack patterns, this paper selects all 11 features for model development. Compared to other publicly available CAN intrusion datasets, this dataset has significant advantages in terms of coverage and representativeness. Given that some categories in this dataset have sufficient sample sizes, this paper did not perform class imbalance handling.
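To make the 11-dimensional feature vector concrete, the sketch below converts one CAN record into the order timestamp, CAN ID, DLC, DATA[0] to DATA[7]. The function name, the hexadecimal string inputs, and the zero-padding of frames with fewer than eight data bytes are assumptions for illustration. The paper does not describe its parsing code, and the example record is invented.

```python
def can_record_to_features(timestamp, can_id_hex, dlc, data_hex_bytes):
    """Return [timestamp, CAN ID, DLC, DATA[0..7]] as an 11-element list of floats."""
    data = [int(b, 16) for b in data_hex_bytes[:8]]   # at most eight payload bytes
    data += [0] * (8 - len(data))                     # assumption: pad short frames with zeros
    # The DLC is kept exactly as recorded so that malformed values remain visible to the model.
    return [float(timestamp), float(int(can_id_hex, 16)), float(dlc)] + [float(b) for b in data]

# Invented record: ID 0x02A0 carrying two payload bytes.
features = can_record_to_features(0.25, "02A0", 2, ["FF", "10"])
```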

The CICIDS2017 dataset, an internationally recognized benchmark for network attacks, covers a wide range of modern network attack types, such as DDoS and port scanning, with attack patterns that closely match the characteristics of external network attacks. Its original 78-dimensional feature vector depicts network behaviors from multiple dimensions, providing comprehensive support for intrusion detection. Temporal features (such as flow duration) can capture the sudden growth and short-cycle repetition patterns of traffic in DDoS attacks, and realize attack detection by identifying abnormal deviations in the temporal distribution of traffic. Statistical features can effectively distinguish normal traffic from abnormal scanning behaviors. Protocol state features can accurately locate protocol-layer attacks such as SYN flooding. Application-layer features provide a basis for identifying application-layer attacks such as SQL injection, XSS, and brute-force cracking. The above 78-dimensional features completely cover the core behavior patterns of external network attacks; therefore, this study adopts its full set of features to construct the intrusion detection model. However, this dataset suffers from a serious class imbalance issue. To address this, this paper uses the SMOTE oversampling technique to enhance the minority class samples and applies controlled undersampling to the normal traffic. This improves the balance of the sample distribution and reduces the risk of class bias during model training. The class distribution and sample statistics of the two preprocessed datasets are shown in Tables 2 and 3.
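The rebalancing idea can be illustrated with the simplified sketch below. It follows the core SMOTE step of interpolating between a minority sample and one of its k nearest minority neighbours, and pairs it with random undersampling of the majority class. It is not the authors' pipeline or a library implementation. The function names, k, the seeds, and the toy data are all hypothetical.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority samples and near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])            # one of the k nearest neighbours
        lam = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def undersample(majority, n_keep, seed=0):
    """Randomly keep n_keep majority samples without replacement."""
    return random.Random(seed).sample(majority, n_keep)

# Toy data: 4 minority points on the unit square, 20 majority points elsewhere.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority = [[float(i), 5.0] for i in range(20)]
new_points = smote_like_oversample(minority, n_new=6)
kept = undersample(majority, n_keep=10)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region already occupied by that class. Resampling of this kind is normally applied to the training split only, so that test samples are not synthetic.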

Table 2 CICIDS2017 dataset statistics.
Table 3 Car hacking dataset statistics.

Evaluation metrics

This paper uses standard classification metrics to evaluate the performance and robustness of the proposed model, including Accuracy (Acc), Precision (Pre), F1 score (F1), Detection Rate (DR), False Alarm Rate (FAR), and False Positive Rate (FPR). These metrics are calculated based on the values of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). The specific formulas are as follows:

  1. Accuracy: Accuracy is defined as the ratio of correctly predicted instances to the total number of instances. Higher accuracy indicates better model performance.

  2. Precision: Precision is defined as the proportion of correctly predicted positive instances among all instances predicted as positive by the model.

  3. Detection Rate (recall): The detection rate, also known as recall or the true positive rate (TPR), is the proportion of correctly predicted positive instances among all actual positive instances in the dataset.

    $$DR=\frac{{TP}}{{TP+FN}}$$
    (18)
  4. F1-score: The F1-score is defined as the harmonic mean of precision and recall, providing a balanced evaluation metric that is particularly useful for imbalanced datasets.

    $$F1=2 \times \frac{{Pre \times DR}}{{Pre+DR}}$$
    (19)
  5. False Positive Rate: FPR quantifies the proportion of actual negative instances that are incorrectly predicted as positive by the model.

    $$FPR=\frac{{FP}}{{FP+TN}}$$
    (20)
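The metrics above can be computed directly from the four confusion counts, as in the sketch below. It uses the standard confusion-matrix definitions, including accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), and FPR = FP / (FP + TN). The counts in the example are invented. For multi-class results such as per-attack tables, the same function would be applied one class at a time in a one-vs-rest fashion, which is an assumption about the evaluation protocol.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, detection rate, F1 and FPR from confusion counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0        # precision
    dr = tp / (tp + fn) if (tp + fn) else 0.0         # detection rate (recall, TPR)
    f1 = 2 * pre * dr / (pre + dr) if (pre + dr) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # false positive rate
    return {"accuracy": acc, "precision": pre, "detection_rate": dr, "f1": f1, "fpr": fpr}

# Invented counts: 100 attack samples and 900 normal samples.
m = binary_metrics(tp=90, tn=890, fp=10, fn=10)
```

In this example the accuracy is 0.98 while the detection rate is only 0.90, which shows why accuracy alone can hide missed attacks when normal traffic dominates.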

Parameter settings

During the entire experiment, we selected appropriate values through hyperparameter tuning. The model parameters are shown in Table 4.

Table 4 Model parameter settings.

In this experiment, the window size is a key parameter. When the window is too small, the model cannot learn complete contextual features. On the other hand, a window that is too large can harm the model’s real-time detection ability. Previous studies36 have shown that setting the window length between 50 and 100 data samples can significantly improve model performance. Based on this finding, we evaluated the model’s performance under different window sizes on the CICIDS2017 dataset, as shown in Fig. 4. The results show that when the window size is 80, the model achieves the best overall performance. Therefore, we set the window size to 80 in this experiment and used the same setting on the Car Hacking dataset.
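The windowing step can be sketched as follows. The function name and the `stride` parameter are hypothetical. The text fixes the window size at 80 but does not state whether consecutive windows overlap, so the default of non-overlapping windows is an assumption.

```python
def make_windows(messages, window_size=80, stride=80):
    """Split a message sequence into fixed-size windows; a short tail is dropped."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    last_start = len(messages) - window_size
    return [messages[s:s + window_size] for s in range(0, last_start + 1, stride)]

messages = list(range(250))                      # stand-in for 250 consecutive messages
windows = make_windows(messages)                 # non-overlapping windows of 80 messages
overlapping = make_windows(messages, stride=40)  # 50% overlap between neighbours
```

A smaller stride yields more training graphs from the same traffic, but also means a detector has to process more windows per second, which is the real-time trade-off described above.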

Fig. 4 Performance on the CICIDS2017 dataset at different window size N values.

Ablation study

To evaluate the generalization ability and performance of the model, we designed and built six GNN-based models, including traditional models (GAT-2, GCN-2, GSAGE-2) and Transformer-based variants (GAT-2-Former, GCN-2-Former, GSAGE-2-Former). The Car Hacking and CICIDS2017 datasets were used as benchmark datasets to test the performance of these models. To ensure the reliability and objectivity of the experimental results, the generated graph datasets were randomly split into training and testing sets at a ratio of 8:2.

As shown in Fig. 5, the performance evaluation of the six models on two benchmark datasets shows that all variant models perform better than the traditional ones. The main advantage comes from the temporal encoder module, which improves global dependency modeling, reduces over-smoothing, and enhances feature representation. On the Car Hacking dataset, GAT-2-Former and GSAGE-2-Former perform similarly to GCN-2-Former. However, on the more complex CICIDS2017 dataset, their performance is significantly lower. This gap is due to GCN’s better ability to adapt to complex graph structures through local neighborhood aggregation. When combined with a Transformer, it enables effective fusion of global and local features. In contrast, the node-level attention in GAT and the inductive learning of GSAGE are not sufficient to handle complex attack patterns. Among all models, GCN-2-Former achieves the best results on both datasets. Its strength lies in the combination of GCN and Transformer: GCN extracts local topological features, while Transformer captures temporal dependencies using self-attention. Especially on the Car Hacking dataset, the model reached 100% accuracy and F1 score, demonstrating its strong performance and practical value in vehicle network intrusion detection.

Fig. 5 Comparison of performance metrics of models on (a) Car Hacking dataset and (b) CICIDS2017 dataset.

As shown in Figs. 6 and 7, this paper provides a systematic analysis of the training dynamics of six models on two datasets. The experimental results show that the variant models converge faster and achieve higher accuracy than the traditional models within 20 training epochs on both datasets. Among all variant models, the proposed model quickly reaches the minimum loss (0.0001 on the Car Hacking dataset and 0.001 on the CICIDS2017 dataset) and the highest accuracy (1.0 on Car Hacking and 0.9998 on CICIDS2017) within 20 epochs, demonstrating better optimization efficiency and learning capability. In addition, the loss curves of the proposed model are smoother on both datasets, indicating that the overfitting problem is effectively controlled.

Fig. 6 Comparison of training loss curves of models on (a) Car Hacking dataset and (b) CICIDS2017 dataset.

Fig. 7 Comparison of accuracy curves of models on (a) Car Hacking dataset and (b) CICIDS2017 dataset.

Figure 8 shows the ROC curves and AUC values of the six models on two benchmark datasets. The comparison shows that the proposed model can effectively reduce the false alarm rate while detecting most types of intrusions, thus improving detection efficiency and reliability. Specifically, the proposed model achieves the highest AUC values on both datasets, which indicates strong generalization ability and stability in handling different types of attacks.

Fig. 8 Comparison of ROC curves and AUC values of models on (a) Car Hacking dataset and (b) CICIDS2017 dataset.

This paper further analyzes the performance of the proposed model on two benchmark datasets, as shown in Tables 5 and 6. The results show that the model achieves over 99.9% accuracy in all attack types and normal traffic classification tasks, with a maximum accuracy of 100%. It is worth noting that in the CICIDS2017 dataset, the recall rates for Heartbleed and Brute Force attacks are relatively low. This is because Heartbleed samples are extremely rare in the dataset, and the behavior of Brute Force attacks is very similar to normal traffic, which leads to some missed detections. In the Car Hacking dataset, the model achieved 99.99% accuracy and F1 score in detecting Fuzzy attacks, which effectively addresses the weakness of many models in identifying Fuzzy attacks. In addition, the False Positive Rate (FPR) for each attack category is mostly extremely low, reflecting the model’s strong ability to reduce misclassification and further verifying its reliability in practical applications where false alarms need to be minimized.

Table 5 Performance metrics of the proposed model on the Car Hacking dataset.
Table 6 Performance metrics of the proposed model on the CICIDS2017 dataset.

Comparison with baseline models

To more comprehensively and objectively evaluate the performance of the proposed model, this paper compares it with other advanced models on two benchmark datasets. The evaluation is based on experimental results, especially accuracy, precision, recall, and F1-score.

Table 7 shows the performance comparison between the proposed model and several recent models from the literature on the Car Hacking dataset. From the perspective of six key indicators, GCN-2-Former achieves 100% in accuracy, precision, recall, and F1-score, outperforming the other compared models. Specifically, the highest precision of DeepSecDrive is 98.59%, which is much lower than that of the proposed model. This is because its static structure design cannot capture the dynamic changes of real-time traffic, and its feature extraction is limited to local and non-local relationships, which leads to weak detection ability for complex attack patterns. DGIDS performs well in terms of accuracy and F1-score, but its testing time reaches 117 ms, far exceeding the 10 ms-level response requirement. The CNN model has accuracy and F1-score close to those of the proposed model and achieves the fastest training speed; however, its testing time of 5.96 ms remains comparatively high. Rec-CNN attains an accuracy of 99.90%, yet both its training and testing times are excessively long. SIDILDNG has the shortest testing time (0.007 ms), but its relatively low accuracy (95.84%) and F1-score (96.61%) result in limited detection capability. In summary, GCN-2-Former excels across all indicators, balances detection accuracy and efficiency, and provides a more reliable solution for in-vehicle network security monitoring.

Table 7 Performance comparison between the proposed model and existing methods on the car hacking dataset.

Table 8 shows the performance comparison between the proposed model and existing mainstream models on the CICIDS2017 dataset. The experimental results show that GCN-2-Former achieves the best performance across all evaluation metrics. Its accuracy (99.98%), precision (99.97%), recall (99.97%), and F1-score (99.98%) are all higher than those of existing methods, while its training time (284 s) and testing time (0.09 ms) are both lower than those of existing methods. Specifically, the XGBoost, IoVST, and CNN models perform comparably to the model proposed in this paper in terms of accuracy and F1-score, but their training and testing times are significantly higher than those of the proposed model. In addition, the accuracy of SVM-B is 98.92%. This further indicates that traditional machine learning methods have limitations in detecting threats in complex Internet of Vehicles (IoV) environments.

Table 8 Performance comparison between the proposed model and existing methods on the CICIDS2017 dataset.

In summary, the proposed model fully leverages the advantages of GCN in modeling graph-structured data, combined with Transformer to effectively capture global dependencies, achieving optimal detection performance. Compared to traditional machine learning algorithms, the proposed model demonstrates better detection performance. When compared to models that focus only on spatial or temporal feature extraction (such as Rec-CNN, SIDILDNG, and SVM-B), the proposed model achieves higher accuracy and F1-scores. Its outstanding performance not only validates the model’s effectiveness but also provides a more accurate and robust solution for intrusion detection in the IoV.

Model complexity analysis

To evaluate the practicality of the proposed model in resource-constrained in-vehicle environments, a quantitative analysis of model complexity was conducted from four dimensions: parameter scale, storage overhead, training efficiency, and inference performance.

As shown in Table 9, the proposed model demonstrates superior performance over similar models in terms of compactness metrics such as parameter scale and storage overhead. Compared with Rec-CNN and IoVST, the number of parameters of the proposed model is reduced by 20.7% and 52.9% respectively. Although MTH-IDS achieves a relatively compact model size (2.61 MB), its parameter count has not been publicly disclosed. In contrast, the proposed model has a total of 423,678 trainable parameters and a model size of 2.2 MB, resulting in low storage requirements. Its compact parameter scale and storage footprint make it suitable for deployment scenarios involving in-vehicle Electronic Control Units (ECUs) with limited memory resources.

Table 9 Model complexity comparison: parameter and storage metrics.

Furthermore, the evaluation of training efficiency on the experimental hardware shows that, on the CICIDS2017 dataset, the model takes 284 s to complete 20 epochs of training; on the larger-scale Car Hacking dataset, the training time is 2143 s (the difference is caused by the sample size). The training duration in both cases is within a reasonable range for iterative model optimization. Inference efficiency is a key indicator for real-time intrusion detection in in-vehicle networks. The model’s single-sample inference time on both datasets is 7.2 ms (a sample is a time-series graph constructed from 80 network messages). The throughput reaches 10,652 samples per second on the CICIDS2017 dataset and 8,803 samples per second on the Car Hacking dataset. The model therefore meets the 10 ms real-time detection requirement of in-vehicle networks and can efficiently handle high-frequency traffic data.
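The relationship between the reported throughput and the per-sample times can be checked with the short calculation below. A throughput of 10,652 samples per second corresponds to an amortized time of about 0.094 ms per sample, in line with the 0.09 ms testing time reported for CICIDS2017 in the baseline comparison. The 7.2 ms figure is the latency of processing one sample on its own. The last line estimates how many samples would have to be processed concurrently (for example, in one batch) for both figures to hold. This reading is an assumption, since the text does not state the batch size.

```python
def amortized_ms_per_sample(samples_per_second):
    """Average time per sample implied by a throughput figure, in milliseconds."""
    return 1000.0 / samples_per_second

def implied_concurrency(samples_per_second, latency_ms):
    """Samples in flight implied by throughput x latency (Little's law)."""
    return samples_per_second * latency_ms / 1000.0

cic_ms = amortized_ms_per_sample(10652)            # CICIDS2017: about 0.094 ms
car_ms = amortized_ms_per_sample(8803)             # Car Hacking: about 0.114 ms
cic_in_flight = implied_concurrency(10652, 7.2)    # about 77 samples at once
```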

Overall, the proposed model achieves a balance between detection performance and computational complexity. It has a compact parameter scale, low storage overhead, as well as efficient training and inference speeds, making it suitable for deployment in resource-constrained in-vehicle environments.

Conclusion and future work

This paper proposes GCN-2-Former, an intrusion detection model integrating GCN and Transformer architectures. The model combines GCN’s strengths in extracting spatial features with Transformer’s ability to capture global temporal dependencies, effectively learning spatial-temporal features of vehicular network data for high-accuracy detection of both intra-vehicle and external network attacks. Through the dynamic graph construction method, the model represents vehicle data as a spatial-temporal dynamic graph. It defines graph-level labels based on statistical features of node attributes within a sliding window. This significantly improves the model’s performance and robustness in detecting complex attack patterns. Experimental results show that the proposed model achieves the best performance on both the Car Hacking dataset and the CICIDS2017 dataset, with detection accuracy and F1-score reaching 100% and 99.98%, respectively. It significantly outperforms traditional IDS and other advanced IDS methods. In future work, we will focus on optimizing the model structure and parameters to reduce computational cost. We will also further explore unsupervised learning methods to improve the model’s ability to detect unknown attacks, aiming to build a more comprehensive and intelligent security system for the IoV.