Introduction

With the continuous advancement of autonomous driving technology, autonomous vehicles (AVs) are expected to enhance traffic efficiency, safety, and mobility for all users within the next decade or two, particularly benefiting the elderly and individuals with disabilities1,2. Once commercialized and widely deployed, AVs will share road space with traditional vehicles, non-motorized vehicles, and pedestrians in mixed traffic environments. In such settings, pedestrians, as one of the most vulnerable road users, will need to interact with both conventional vehicles and AVs, potentially leading to uncertainties in pedestrian behavior or unsafe situations. Predicting pedestrian crossing intention has therefore become a critical task for ensuring pedestrian safety. Such algorithms analyze pedestrian behavior and environmental information to predict whether a pedestrian intends to cross the road; they build on foundational technologies such as pedestrian detection and tracking, combined with higher-level video analysis techniques. Pedestrian crossing intention recognition provides AVs with crucial information, enabling them to understand pedestrian intentions and make appropriate decisions. This capability allows AVs to respond more intelligently to complex traffic scenarios, ensuring pedestrian safety while maintaining smooth traffic flow.

Pedestrian groups

In urban environments, interactions between multiple pedestrians and AVs are increasingly common, particularly when a group of people crosses the street together. In today’s traffic environment, which is still dominated by conventional (non-automated) vehicles, pedestrians rely on social information from those around them when making crossing decisions. For instance, Faria et al.3 observed that individuals tend to accelerate their crossing speed when they see others crossing directly in front of them. Furthermore, imitation behavior also plays a crucial role in pedestrian decision-making. Pedestrians tend to mimic the actions of those around them when deciding whether to comply with traffic rules4. Non-compliant group behavior can further encourage others to follow suit5. Social influence guides pedestrians to conform to the behavior of those around them, meaning that when assessing crossing situations, their decisions often align with social norms or group standards4.

Spatiotemporal features

Spatiotemporal features refer to the distribution characteristics of pedestrian behavior across temporal and spatial dimensions, which are essential for understanding pedestrian crossing interactions. These features encompass two main aspects: temporal and spatial characteristics. Temporal features capture the time-related distribution of pedestrian crossing behavior, including crossing intervals, crossing speed, and dwell time6. Spatial features describe the spatial distribution of pedestrian behavior, such as crossing locations, movement trajectories, and interaction distances between pedestrians7. To address the prediction of pedestrian intention in urban traffic environments, Khaled et al. proposed a real-time prediction framework that leverages spatiotemporal image sequences captured by monocular RGB cameras to better forecast pedestrian actions8. Utilizing this rich information, researchers developed an end-to-end fully convolutional long short-term memory (LSTM) encoder designed to model pedestrian flows within a crowd9. This approach effectively reduces tracking interruptions caused by issues such as long-distance tracking failures, trajectory discontinuities, or severe occlusions, significantly minimizing pedestrian ID switching incidents10.

Pedestrian intention prediction algorithm

Currently, algorithms based on pedestrian motion features are the mainstream approach for predicting pedestrian crossing intentions. These algorithms analyze pedestrian movement patterns to predict future trajectories, thereby determining crossing intentions. For instance, an algorithm employing Gaussian process dynamics models and probabilistic hierarchical trajectory matching predicts pedestrian paths using enhanced features extracted from dense optical flow, enabling the identification of crossing intentions11. Another approach combines the extended Kalman filter algorithm with dynamic models to predict pedestrian movement paths, providing detailed forecasts for four typical pedestrian motion types: crossing, standing, walking, and bending12. Additionally, research integrating inverse reinforcement learning with bidirectional recurrent neural networks (RNNs) has achieved high-precision trajectory predictions, with an average displacement error of less than five pixels13.

With the development of intelligent transportation systems, pedestrian crossing intention prediction has become a critical research direction for improving traffic safety. Traditional methods rely on hand-crafted features, but their performance is often limited in complex scenarios. Deep learning, particularly convolutional neural networks (CNNs) and LSTM networks, has demonstrated remarkable capabilities in image feature extraction and temporal sequence modeling. This paper proposes a pedestrian group crossing intention prediction model that integrates spatiotemporal features. The model effectively combines spatial and temporal characteristics to achieve accurate intention prediction.

Related work

Pedestrian group research

In transportation research, the analysis of pedestrian group behavior has become increasingly important, particularly in understanding collective crossing behavior and its impact on traffic flow. Moussaïd et al.14 observed 1,500 pedestrian groups in natural environments and found that the arrangement of group members changes with surrounding density. In low-density environments, groups often walk side by side to facilitate communication, while in medium-density scenarios, they tend to adopt a V-shaped formation, with the central member slightly trailing behind the side members. Gorrini et al.15 studied the behavior of groups of two, three, and four individuals within a sample of 1,600 pedestrians in low-density environments. Their results indicate that as group size increases, collective walking speed decreases. Pairs tend to walk side by side with lower dispersion, trios frequently adopt a V-shaped formation, and groups of four often split into smaller sub-structures, such as two pairs or a trio combined with a single individual. These group patterns suggest that even in low-density environments, larger groups maintain certain social boundaries, with common formations including side-by-side, V-shaped, and leader-follower arrangements16. These formations significantly influence pedestrian walking decisions and spatial distribution. Furthermore, even when walking in groups, individuals maintain reasonable distances from one another to avoid collisions17.

Research on human pose estimation

In recent years, significant progress has been made in 2D human pose estimation based on monocular vision. Prediction models built using deep convolutional neural networks have demonstrated the ability to recognize the intentions of vulnerable road users18. To further explore pedestrian group movement, Zaki19 proposed a trajectory-based group dynamics prediction method, which identifies pedestrian groups through spatio-temporal proximity and motion consistency, and analyzes their behaviors using trajectory features to reveal movement patterns in urban environments. Perdoch20 simulated the movement of robots working in groups to model the leader-follower concept observed in pedestrian group walking. Moreover, methods combining 2D human pose estimation with graph convolutional networks have shown advantages in predicting pedestrian crossing intentions, particularly in complex urban road environments. A model called “Pedestrian Graph” achieved state-of-the-art performance on the Joint Attention in Autonomous Driving (JAAD) dataset21.

Pedestrian crossing intention recognition can be considered a subtask of pedestrian action recognition. However, due to the diversity and complexity of pedestrian feature data, applying pedestrian crossing action recognition to video remains a challenging problem22. Current action recognition methods can be categorized into three main types: RGB video stream methods23, optical flow methods24, and skeleton modeling methods25. Among these, skeleton modeling predicts actions by estimating human poses, offering low dependency on environmental conditions and robustness to environmental changes. RNNs and CNNs are the most commonly used models in this domain. RNNs, capable of capturing action variations across temporal frames, include typical architectures such as bi-RNNs26, Deep-LSTMs27, feature fusion models, and attention-based models28. Skeleton-based pedestrian crossing action recognition usually begins with detecting the skeletal keypoints of the human body29. In addition, deep learning techniques have been applied to estimate pedestrian head pose and full-body orientation, utilizing supervised deep convolutional network models for prediction30. Another approach, RU-LSTM, predicts pedestrian crossing intentions by analyzing interactions among pedestrians, the surrounding environment, and other vehicles, leveraging multiple cues to enhance accuracy31.

Pedestrian crossing intentions are influenced by various factors, including pedestrian movement patterns, interactions with other road users, and individual characteristics. Observing pedestrian walking trajectories, patterns, and speeds can effectively predict future behaviors. For example, the use of fisheye monocular cameras has overcome blind spots in standard cameras, demonstrating excellent performance in pedestrian trajectory prediction32. Additionally, unsupervised learning methods can generate pedestrian trajectory intentions for multi-pedestrian tracking and rank the optimal trajectories using probabilistic approaches33. Some methods optimize trajectory prediction by analyzing irregular movement patterns34, while others employ graph convolutional networks to generate pedestrian crossing prediction graphs, enhancing performance by integrating multiple features35.

Another approach for human activity recognition is using pose estimation. 2D Human Pose Estimation (2D-HPE) is a fundamental problem in computer vision that involves detecting and localizing 2D keypoints from images or videos. With the gradual development of deep learning, 2D-HPE has achieved remarkable progress through the use of CNNs. A widely used method, OpenPose36, is a real-time multi-person pose detection technique that employs Part Affinity Fields (PAFs) to associate body parts with individuals in an image. CPN (Cascade Pyramid Network)37 introduced a two-stage framework: a global network for locating relatively simple keypoints and a refinement network designed to handle occluded and challenging keypoints. HRNet (High-Resolution Network)38 emphasizes that high-resolution features are crucial for position-sensitive vision tasks. Consequently, HRNet maintains high-resolution representations throughout the entire process of 2D human pose estimation.

Overall, significant progress has been made in the fields of pedestrian group behavior, crossing intention, and pedestrian behavior prediction based on 2D human pose estimation. Research indicates that group formations and interaction patterns with the environment have profound effects on traffic flow, while advancements in action recognition and pose estimation technologies provide powerful tools for pedestrian intention prediction. Additionally, the quality and balance of datasets are critical for the effective training and evaluation of predictive models.

Research methodology

This paper proposes an innovative method that integrates CNN and LSTM networks to predict pedestrian crossing intentions by fusing spatiotemporal features. As illustrated in Fig. 1, the method extracts multiple features from the dataset and employs a CNN module to capture the spatial information from images. Simultaneously, the LSTM module models the temporal dependencies across video frames.

During the feature fusion stage, the spatial features extracted by the CNN and the temporal features captured by the LSTM are combined to enhance the model’s understanding of pedestrian behavior patterns. Finally, the fused features are analyzed by a classification network, which outputs the pedestrian’s crossing intention.

Fig. 1. Pedestrian intention recognition process.

Feature extraction

In pedestrian crossing intention prediction research, the extraction and analysis of spatiotemporal features are critical to model performance. This study extracts five key spatiotemporal features: pedestrian groups, pedestrian 2D position trajectories, local context, global context, and pose keypoints. These features describe pedestrian behavior from multiple dimensions, including group interactions, motion trajectories, proximal environment, overall external factors, and individual posture, thereby constructing a comprehensive framework for intention prediction.

Pedestrian pose keypoints

Keypoints, also known as joints, are points of interest in an image. In this context, these points of interest represent joint positions such as shoulders, elbows, ankles, and more. The keypoints themselves provide spatial information based on their positions in the image, while their variations across multiple frames represent temporal information. Over time, predicted keypoints deliver spatiotemporal data that can be used to calculate a pedestrian’s trajectory and speed. This information is utilized to predict a pedestrian’s intention to cross the street.

In this study, a pre-trained OpenPose36 model is used to extract pedestrian keypoints. OpenPose is an open-source library for pose estimation widely applied in computer vision tasks, especially in human pose estimation, facial expression analysis, and hand gesture recognition. OpenPose can detect and analyze human keypoints in images and videos in real-time, including head, shoulders, elbows, wrists, hips, knees, and ankles. It also supports multi-person pose estimation, detecting multiple subjects in a single image or video and providing detailed skeletal keypoint information for each individual. Figure 2 illustrates the human pose keypoints extracted by the OpenPose model.

Fig. 2. Human pose keypoints.

This study extracts pedestrian pose keypoints \(P_i=\{p_i^{t-m},p_i^{t-m+1},\dots,p_i^{t},\dots,p_i^{t+m-1},p_i^{t+m}\}\), where \(p\) represents the 2D coordinates of 18 pose joints, specifically \(p_i^{t-m}=\{x_{i0}^{t-m},y_{i0}^{t-m},x_{i1}^{t-m},y_{i1}^{t-m},\dots,x_{i17}^{t-m},y_{i17}^{t-m}\}\). Here, \(i\) denotes the pedestrian ID, \(t\) the time step, and \(m\) a fixed constant defining the length of the observation window.
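To make the construction of this feature concrete, the following minimal Python sketch assembles the keypoint sequence \(P_i\) from per-frame joint detections. The function name, the dictionary-based frame lookup, and the zero-filling of missed detections are illustrative assumptions rather than details of the original implementation.

```python
import numpy as np

def build_keypoint_sequence(keypoints_by_frame, t, m):
    """Assemble P_i = {p_i^{t-m}, ..., p_i^{t+m}} for one pedestrian.

    keypoints_by_frame: dict mapping frame index -> (18, 2) array of (x, y)
    joint coordinates, e.g. as produced by an OpenPose detector.
    Frames with no detection are zero-filled so the sequence length stays 2m + 1.
    """
    sequence = []
    for frame in range(t - m, t + m + 1):
        joints = keypoints_by_frame.get(frame)
        if joints is None:
            joints = np.zeros((18, 2), dtype=np.float32)  # placeholder for missed detections
        sequence.append(np.asarray(joints, dtype=np.float32).reshape(-1))  # 36-dim vector p_i^f
    return np.stack(sequence)  # shape: (2m + 1, 36)
```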

Pedestrian groups

Based on observations from the JAAD dataset, pedestrians often tend to cross the street as part of a group. If one member of the group begins walking to cross the street, other members will spontaneously follow suit (as shown in Fig. 3). A small group of pedestrians can be defined as individuals interacting with one another, where such interaction may be indicated by their proximity, body posture, gaze, or active engagement through conversation. These small groups form while walking or waiting on the sidewalk and continue to exist as a unit when crossing the street. It should be noted that the pedestrian groups studied in this paper are defined purely by spatial proximity, i.e., the individuals need not share any social relationship.

Fig. 3. Illustration of pedestrian groups crossing the street.

This study defines the number of other pedestrians surrounding a given pedestrian as \(N_i=\{n_i^{t-m},n_i^{t-m+1},\dots,n_i^{t},\dots,n_i^{t+m-1},n_i^{t+m}\}\). Here, \(i\) denotes the pedestrian ID, \(t\) the time step, and \(m\) a fixed constant defining the length of the observation window. In the study by Hübner et al.39, pedestrians within a 2-meter radius are considered to have a sense of group affiliation. Based on this concept, the group size in this study is computed as follows: first, all other pedestrians within a 2-meter radius of the target pedestrian are identified. Next, for each of these identified pedestrians, a new 2-meter radius is constructed to determine whether additional pedestrians are present within the defined area. This process continues recursively until no further pedestrians are detected (as shown in Fig. 4). The final pedestrian count is defined as the total number of pedestrians surrounding the target individual. It is important to note that this method is applicable only to streets and intersections with relatively low pedestrian traffic.
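This recursive 2-meter neighborhood expansion can be sketched as a simple breadth-first search over pedestrian positions, as shown below. The assumption that ground-plane positions in meters are available for each pedestrian (the paper does not specify how image coordinates are converted to metric distances), as well as the function and variable names, are illustrative.

```python
from collections import deque
import numpy as np

def group_size(target_id, positions, radius=2.0):
    """Count pedestrians linked to `target_id` through chains of <= `radius` metres.

    positions: dict mapping pedestrian ID -> (x, y) ground-plane position (in metres)
    for a single frame. Returns N_i for that frame, i.e. the number of *other*
    pedestrians in the group surrounding the target pedestrian.
    """
    visited = {target_id}
    queue = deque([target_id])
    while queue:
        current = queue.popleft()
        cx, cy = positions[current]
        for pid, (px, py) in positions.items():
            if pid in visited:
                continue
            if np.hypot(px - cx, py - cy) <= radius:
                visited.add(pid)   # pedestrian joins the group
                queue.append(pid)  # expand a new 2 m radius around this pedestrian
    return len(visited) - 1        # exclude the target pedestrian itself
```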

Fig. 4. Schematic diagram of pedestrian group.

Pedestrian 2D position trajectory

In this study, the pedestrian’s visual features and 2D position trajectories are extracted using the YOLOv5-DeepSort model. As shown in Fig. 5, the pedestrian’s local context \(E_i=\{e_i^{t-m},e_i^{t-m+1},\dots,e_i^{t},\dots,e_i^{t+m-1},e_i^{t+m}\}\) is composed of a sequence of RGB image patches of size [224, 224] pixels surrounding the target pedestrian, while the pedestrian’s 2D position trajectory \(L_i=\{l_i^{t-m},l_i^{t-m+1},\dots,l_i^{t},\dots,l_i^{t+m-1},l_i^{t+m}\}\) is composed of the bounding box coordinates of the target pedestrian, specifically \(l_i^{t-m}=\{x_{ti}^{t-m},y_{ti}^{t-m},x_{bi}^{t-m},y_{bi}^{t-m}\}\). Here, \(x_{ti}^{t-m},y_{ti}^{t-m}\) represent the coordinates of the top-left corner of the bounding box, while \(x_{bi}^{t-m},y_{bi}^{t-m}\) denote the coordinates of the bottom-right corner; \(i\) denotes the pedestrian ID, \(t\) the time step, and \(m\) a fixed constant defining the length of the observation window.
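A minimal sketch of how the local context patches \(E_i\) and the bounding-box trajectory \(L_i\) could be assembled from tracker output is given below; the helper names and the dictionary-based storage of per-frame boxes are assumptions for illustration.

```python
import cv2
import numpy as np

def extract_local_context(frame, box, size=224):
    """Crop the region around one tracked pedestrian and resize it to size x size pixels.

    frame: BGR image of shape (H, W, 3); box: (x_t, y_t, x_b, y_b) from the tracker.
    """
    x_t, y_t, x_b, y_b = [int(v) for v in box]
    patch = frame[max(y_t, 0):y_b, max(x_t, 0):x_b]
    return cv2.resize(patch, (size, size))

def build_trajectory(boxes_by_frame, t, m):
    """Stack bounding-box coordinates into L_i = {l_i^{t-m}, ..., l_i^{t+m}}."""
    return np.stack([np.asarray(boxes_by_frame[f], dtype=np.float32)
                     for f in range(t - m, t + m + 1)])  # shape: (2m + 1, 4)
```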

Fig. 5. Pedestrian visual features and 2D position trajectories.

Road environment features

The road environment is a critical factor influencing pedestrian crossing decisions, reflected in the structural characteristics of the road, the behavioral patterns of other road users, and the dynamic interactions between road users. These factors collectively shape pedestrians’ judgments and decisions about whether to cross at a given moment. In this study, we use the DeepLabV3Plus40 model to automate the extraction of road environment features. By leveraging semantic segmentation, this model accurately identifies key elements such as roads, vehicles, and pedestrians, providing fine-grained environmental information for pedestrian crossing intention prediction. As shown in Fig. 6, this study selects road, street, pedestrian, and vehicle features as the global context \(C=\{c^{t-m},c^{t-m+1},\dots,c^{t},\dots,c^{t+m-1},c^{t+m}\}\). As part of the visual feature input, the semantic segmentation of all input frames is resized to [224, 224] pixels, consistent with the visual features of the pedestrians.
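As an illustration, the global context \(C\) could be produced with an off-the-shelf DeepLabV3+ implementation as sketched below. The use of the segmentation_models_pytorch package, the ResNet-50 encoder, and the class count are assumptions; the paper specifies only that a DeepLabV3Plus model is used.

```python
import torch
import segmentation_models_pytorch as smp

# Assumed implementation: the paper names DeepLabV3+ but not a specific library;
# here we use segmentation_models_pytorch with an ImageNet-pretrained ResNet-50
# encoder and a Cityscapes-style label set (road, person, car, ...).
seg_model = smp.DeepLabV3Plus(encoder_name="resnet50",
                              encoder_weights="imagenet",
                              classes=19)
seg_model.eval()

def global_context(frame_batch):
    """frame_batch: float tensor (B, 3, 224, 224), already resized and normalised."""
    with torch.no_grad():
        logits = seg_model(frame_batch)   # (B, 19, 224, 224)
    return logits.argmax(dim=1)           # per-pixel class map used as the global context C
```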

Fig. 6. Road environment features.

Model overview

The overall model framework is shown in Fig. 7, consisting of the CNN module, RNN module, attention mechanism, and feature fusion method.

Fig. 7. Model architecture.

In this study, the VGG19 model is selected as the CNN module. VGG19 is a deep convolutional neural network model42, which has shown excellent performance in image classification and feature extraction tasks. We use the pre-trained weights based on the ImageNet dataset43 and convert the classifier part of VGG19 (vgg19.classifier) into a Sequential model, removing the final layer to serve as the new classifier. Additionally, the parameters of the 16 convolutional layers in VGG19 are frozen, meaning they do not participate in the model’s training and optimization. This is a common technique used in transfer learning to retain the features of the pre-trained model while avoiding excessive computation and optimization of the pre-trained parameters during the training of the new model.
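A minimal PyTorch sketch of this backbone setup is shown below; the torchvision weights enum reflects recent torchvision versions and is an assumption about the exact loading call.

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained VGG19 and drop the final classification layer so the
# classifier outputs a 4096-dimensional feature vector per image patch.
vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg19.classifier = nn.Sequential(*list(vgg19.classifier.children())[:-1])

# Freeze the convolutional layers so they are not updated during training.
for param in vgg19.features.parameters():
    param.requires_grad = False
```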

In this study, an LSTM network is selected as the RNN module. LSTM is a special type of RNN that effectively captures long-term dependencies by learning which information to retain and which to forget, making it particularly suited for processing and predicting sequential data. The size of the LSTM hidden layer is set to 256. The calculation formulas for the LSTM are presented in Eqs. (1)–(5).

$$f_{t}=\sigma\left(W_{f}\cdot\left[h_{t-1},x_{t}\right]+b_{f}\right)$$
(1)
$$i_{t}=\sigma\left(W_{i}\cdot\left[h_{t-1},x_{t}\right]+b_{i}\right)$$
(2)
$$o_{t}=\sigma\left(W_{o}\cdot\left[h_{t-1},x_{t}\right]+b_{o}\right)$$
(3)
$$c_{t}=f_{t}\cdot c_{t-1}+i_{t}\cdot \tanh\left(W_{c}\cdot\left[h_{t-1},x_{t}\right]+b_{c}\right)$$
(4)
$$h_{t}=o_{t}\cdot \tanh\left(c_{t}\right)$$
(5)

where \(f_{t}\), \(i_{t}\), and \(o_{t}\) represent the forget, input, and output gate activations at time \(t\), and \(c_{t}\) is the cell state. Furthermore, \(W_{f}\), \(W_{i}\), \(W_{o}\), \(W_{c}\) and \(b_{f}\), \(b_{i}\), \(b_{o}\), \(b_{c}\) are the weight matrices and bias vectors of the gates above. In addition, \(x_{t}\) and \(h_{t}\) are the input and the hidden-state output of the memory cell at time \(t\).
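Rather than implementing Eqs. (1)–(5) by hand, the LSTM module can be instantiated directly with a standard deep-learning library; the sketch below assumes PyTorch and an input dimensionality of 36 (the 18 pose joints with two coordinates each), which is illustrative.

```python
import torch
import torch.nn as nn

# Standard LSTM encoder implementing Eqs. (1)-(5), with the 256-unit hidden layer
# used in this study; the input size of 36 is an illustrative assumption.
lstm_encoder = nn.LSTM(input_size=36, hidden_size=256, batch_first=True)

x = torch.randn(8, 16, 36)             # (batch, observed frames, features)
outputs, (h_n, c_n) = lstm_encoder(x)  # outputs: (8, 16, 256) hidden states h_t per frame
```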

The attention module is used to dynamically adjust the weights of different parts of the input data, enhancing the model’s ability to focus on important information. The attention mechanism is widely applied in deep learning, especially in Natural Language Processing (NLP) and Computer Vision (CV) tasks. The sequence features (e.g., the output of an RNN-based encoder) are represented as hidden states \(h=\{h_{1},h_{2},\dots,h_{i}\}\). The attention weights are computed as shown in Eq. (6):

$$a=\frac{\exp\left(score\left(h_{i},h_{s}\right)\right)}{\sum_{k}\exp\left(score\left(h_{i},h_{k}\right)\right)}$$
(6)

where \(score(h_{i},h_{s})={h_{i}}^{T}W_{s}h_{s}\) and \(W_{s}\) is a learnable weight matrix. The score measures the relationship between the current hidden state \(h_{i}\) and the source hidden state \(h_{s}\), and the softmax function normalizes the scores to obtain the attention weights.
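A possible PyTorch implementation of this multiplicative (Luong-style) attention, with \(score(h_i,h_s)={h_i}^{T}W_s h_s\) followed by a softmax over the sequence, is sketched below; the module name and the choice of using the final hidden state as the query are assumptions.

```python
import torch
import torch.nn as nn

class GeneralAttention(nn.Module):
    """Multiplicative attention of Eq. (6): score(h_i, h_s) = h_i^T W_s h_s."""

    def __init__(self, hidden_size=256):
        super().__init__()
        self.W_s = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, h_i, h_seq):
        # h_i: (B, H) query hidden state; h_seq: (B, T, H) source hidden states.
        scores = torch.bmm(self.W_s(h_seq), h_i.unsqueeze(2)).squeeze(2)  # (B, T)
        weights = torch.softmax(scores, dim=1)                            # attention weights a
        context = torch.bmm(weights.unsqueeze(1), h_seq).squeeze(1)       # weighted sum, (B, H)
        return context, weights
```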

This study adopts a hybrid approach to fuse the different inputs, as shown in Fig. 7, categorizing the features into visual and non-visual features.

The non-visual feature fusion integrates three elements: bounding boxes, pose keypoints, and pedestrian groups. These features are hierarchically fused according to their complexity. In the process depicted in Fig. 7a, the sequential pedestrian pose keypoints \(P_i\) are input into an LSTM encoder. The output of this first stage, along with the pedestrian’s 2D position trajectory \(L_i\), is then fed into a new LSTM encoder. Subsequently, the output of the second stage is combined with the pedestrian group feature \(N_i\) and input into the final LSTM encoder. Finally, the output of the last encoder passes through the attention module to obtain the final non-visual feature vector \(A_{i-nv}\).

The visual feature fusion integrates two elements: the local context (the magnified pedestrian appearance around the bounding box) and the global context (semantic segmentation of important objects in the entire scene). As shown in Fig. 7b, the local context \(E_i\) is processed by the CNN module, followed by the LSTM module to extract temporal features. The global context \(C\) is processed in the same way as the local context \(E_i\). These two features are then input into the attention module, generating the final visual feature vector \(A_{i-v}\).

Finally, the non-visual feature vector \(A_{i-nv}\) and the visual feature vector \(A_{i-v}\) are fed into an attention module, followed by a fully connected (FC) layer to complete the prediction, as shown in Eq. (7):

$$F_{i}=f_{FC}\left(f_{attention}\left(A_{i-nv};A_{i-v}\right)\right)$$
(7)
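The hierarchical non-visual fusion branch of Fig. 7a can be sketched as three stacked LSTM encoders whose inputs are progressively concatenated with the trajectory and group features, as shown below. The feature dimensions, the reuse of the GeneralAttention module from the previous sketch, and the exact way intermediate hidden states are concatenated are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class NonVisualFusion(nn.Module):
    """Sketch of the hierarchical non-visual fusion branch of Fig. 7a."""

    def __init__(self, hidden=256):
        super().__init__()
        self.lstm_pose = nn.LSTM(36, hidden, batch_first=True)           # pose keypoints P_i
        self.lstm_traj = nn.LSTM(hidden + 4, hidden, batch_first=True)   # + 2D trajectory L_i
        self.lstm_group = nn.LSTM(hidden + 1, hidden, batch_first=True)  # + group size N_i
        self.attention = GeneralAttention(hidden)                        # module sketched above

    def forward(self, pose, traj, group):
        # pose: (B, T, 36), traj: (B, T, 4), group: (B, T, 1)
        h1, _ = self.lstm_pose(pose)
        h2, _ = self.lstm_traj(torch.cat([h1, traj], dim=2))
        h3, _ = self.lstm_group(torch.cat([h2, group], dim=2))
        a_nv, _ = self.attention(h3[:, -1], h3)  # non-visual feature vector A_{i-nv}
        return a_nv
```

The visual branch and the final fusion of Eq. (7) would follow the same pattern, with \(A_{i-nv}\) and \(A_{i-v}\) passed through a further attention module and an FC layer.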

Data collection and analysis

This study utilizes the Joint Attention in Autonomous Driving (JAAD) dataset41, which comprises 346 videos, each lasting 5–10 s, recorded using cameras mounted on vehicles. The videos are captured at a resolution of 1920 × 1080 and a frame rate of 30 fps. The dataset includes ground truth annotations of pedestrian bounding boxes and behavioral labels describing the current state of each pedestrian. The dataset consists of two subsets: JAAD Behavior Data (JAADbeh) and JAAD All Data (JAADall). JAADbeh contains pedestrians either crossing the road (495 samples) or about to cross the road (191 samples). JAADall includes additional pedestrians performing non-crossing actions (2,100 samples). Furthermore, we created a custom dataset of urban road driving scenarios, which includes pedestrian bounding box annotations and pedestrian group information, as shown in Fig. 8. In the figure, green boxes represent annotated pedestrian bounding boxes, while red boxes indicate pedestrian group information. This dataset was collected from selected urban roads in Zhangdian District, Zibo, China. It consists of 78 videos, each lasting between 15 and 25 s, captured using an action camera mounted on a vehicle. The videos were recorded at a resolution of 1920 × 1080 and a frame rate of 60 fps.

Fig. 8. Custom urban road driving dataset.

This study incorporates pedestrian groups into non-visual features to investigate whether detecting the intention to cross the road of one member of a group enhances the detection of another member’s crossing intention. To explore this phenomenon, this study first employs YOLOv5-DeepSort to track and classify pedestrian groups in the JAAD dataset. Across 346 videos totaling 82,032 frames, 2,786 unique pedestrians were recorded. Table 1 presents the distribution of detected pedestrian groups in the JAAD dataset.

Table 1 Data distribution of detected Pedestrians.

Data analysis revealed that 38.45% of the frames contained only one pedestrian, while 57.52% of the frames included two or more pedestrians. Based on these observations, pedestrian groups were categorized into four types: single pedestrians, pairs, trios, and groups of four or more. Subsequently, the proposed model was used to train and detect these four categories of pedestrian groups separately. To ensure that the datasets for the four categories have the same scale, each was reduced to 10,000 frames using stratified random sampling, in which sampling follows the category proportions so that the training and test sets retain a consistent distribution and random splitting does not introduce class imbalance.
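A hedged sketch of this down-sampling step is shown below, assuming scikit-learn is used for the stratified draw; the function name and the fixed random seed are illustrative.

```python
from sklearn.model_selection import train_test_split

def stratified_subset(frame_ids, labels, n_target=10_000, seed=42):
    """Draw n_target frames while preserving the label proportions given in `labels`."""
    if len(frame_ids) <= n_target:
        return frame_ids
    subset, _ = train_test_split(frame_ids,
                                 train_size=n_target,
                                 stratify=labels,      # keep the class distribution
                                 random_state=seed)
    return subset
```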

Results and discussion

In the experiments, the proposed model was compared with the following baseline methods: SingleRNN44, SF-GRU45, and PCPA46. Building upon these baselines, the proposed model incorporates pedestrian group factors into the prediction of pedestrian crossing intention, aiming to explore the impact of group dynamics on pedestrian crossing behavior. This study employed a dropout rate of 0.5 in the attention module, an L2 regularization coefficient of 0.001 in the fully connected (FC) layer, a binary cross-entropy loss function, and the Adam optimizer with a learning rate of 6 × 10⁻⁴. The training process comprised 40 epochs with a batch size of 8, and a StepLR learning rate schedule was applied, reducing the learning rate by a factor of 10 every 5 epochs.
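For reference, this training configuration could be set up in PyTorch roughly as follows. The `model` and `train_loader` objects are assumed to be defined elsewhere, the model is assumed to end with a sigmoid so that it outputs a crossing probability, and the L2 term on the FC layer (which could be added via a weight_decay parameter group) is omitted for brevity.

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()                                   # binary cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4)  # Adam, learning rate 6e-4
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)  # /10 every 5 epochs

for epoch in range(40):                  # 40 epochs; batch size 8 is set in the data loader
    for visual_inputs, non_visual_inputs, labels in train_loader:
        optimizer.zero_grad()
        probs = model(visual_inputs, non_visual_inputs)  # crossing probability in [0, 1]
        loss = criterion(probs, labels.float())
        loss.backward()
        optimizer.step()
    scheduler.step()
```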

Table 2 presents the results of the proposed model on the JAADbeh dataset, compared with the SingleRNN, SF-GRU, and PCPA models. The proposed model demonstrated superior performance in terms of accuracy, precision, and F1-Score. The F1-Score, the harmonic mean of precision and recall, provides a comprehensive assessment of a classification model’s performance; a higher F1-Score indicates better performance. Notably, the proposed model achieved an approximate 3% improvement in F1-Score.

Table 2 Comparison of Models on the JAADbeh dataset.

Table 3 presents the results on the JAADall dataset, which includes all video sequences and associated annotations from the JAAD dataset. JAADall covers a broader range of scenarios and behaviors, featuring a larger dataset size and greater scene diversity, encompassing various traffic and environmental conditions. The proposed model demonstrated strong performance in terms of accuracy, AUC, precision, and F1-Score. Similar to the results on the JAADbeh dataset, the proposed model achieved notable performance across the three evaluation metrics: accuracy, precision, and F1-Score.

Table 3 Comparison of Models on the JAADall dataset.

By comparing the results in Tables 2 and 3, it can be observed that the proposed model outperforms other models on both the JAADbeh and JAADall datasets. This finding indicates that the proposed method demonstrates significant advantages across all evaluation metrics, further validating its effectiveness and reliability in predicting pedestrian crossing intentions.

In addition to the direct comparison with other models mentioned above, this study also compares different fusion schemes for non-visual features, as shown in Fig. 9. Three approaches for non-visual feature fusion are considered. Unlike the hierarchical fusion proposed in this study, scheme (a) inputs the three non-visual features separately into the LSTM encoder, followed by the attention module. Scheme (b) inputs the three non-visual features together into the LSTM encoder, followed by the attention module. Scheme (c) inputs the three non-visual features separately into the LSTM encoder, then into the attention module separately, and finally fuses them with the visual features.

Fig. 9. Non-visual feature fusion schemes.

Tables 4 and 5 present a comparison between the proposed model and the three different fusion scheme models on the JAADbeh and JAADall datasets, respectively. The models of the different fusion schemes are denoted as Mode(a), Mode(b), and Mode(c). It can be observed that the proposed model demonstrates a significant advantage over the three other fusion scheme models. Furthermore, on the JAADall dataset, all four evaluation metrics show good performance, indicating that the proposed model performs well on a larger dataset with higher scene diversity.

Table 4 Comparison of different fusion schemes on the JAADbeh Dataset.
Table 5 Comparison of different fusion schemes on the JAADall Dataset.

In addition, this study also investigates the impact of pedestrian group features on model performance. To this end, two models were trained: one that includes pedestrian group features and another that does not, while keeping other input features and network structures consistent. As shown in Fig. 10, under the same number of training epochs using the JAAD dataset, the proposed model that considers pedestrian group features exhibits significantly higher accuracy compared to the model that does not include these features. As the number of pedestrians in the group increases, the model’s prediction accuracy improves regardless of whether pedestrian group features are considered.

Fig. 10. Prediction accuracy of pedestrian crossing intention for different pedestrian groups in the two trained models.

Fig. 11. Pedestrian crossing intention prediction in the two trained models.

Furthermore, this study conducts validation on both the JAADall dataset and the custom urban road dataset. As shown in Fig. 11 (where the red box indicates a pedestrian crossing the street and the green box indicates a pedestrian not crossing), at time \(\:T\), when both models recognize the first pedestrian crossing the street, the model incorporating pedestrian group features consistently identifies subsequent pedestrians faster than the model that does not incorporate such features. The results indicate that when one pedestrian in the group begins crossing the street, the other pedestrians typically follow closely behind. Meanwhile, compared to an individual pedestrian, the likelihood of a pedestrian group crossing the street in the presence of vehicle pressure significantly increases. A single pedestrian may abandon crossing due to pressure, whereas the formation of a pedestrian group helps reduce the pressure felt by individuals, thereby enhancing the crossing intention of the entire group.

The above results further validate the important role of pedestrian group features, as part of the non-visual features, in predicting pedestrian crossing intention. By integrating pedestrian group information, the model is able to more accurately capture the influence of group dynamics on individual behavior, thereby improving both prediction accuracy and robustness.

Conclusion

This study proposes a novel method for predicting pedestrian crossing intentions based on the fusion of spatiotemporal features. The method utilizes CNN and LSTM modules to extract visual and non-visual features, respectively, and employs an attention mechanism for feature fusion, enabling the prediction of pedestrian intentions. The model treats pedestrian groups as a key non-visual feature and uses a hierarchical mixed fusion strategy for combining non-visual features. The main findings of the study are summarized as follows.

(1) The evaluation results on the JAADbeh and JAADall datasets show that the proposed model performs well in comparison with other pedestrian crossing intention prediction algorithms, achieving good results in accuracy, precision, and F1-Score. To further validate the effectiveness of the hierarchical fusion strategy, a comparison of different non-visual feature fusion schemes was conducted. The results demonstrate that the proposed model consistently outperforms the three other fusion strategies, particularly on the JAADall dataset, which features a larger data size and greater scene diversity, where all four evaluation metrics show strong performance. This indicates that the model proposed in this study has good robustness and adaptability when dealing with more complex scenarios and larger datasets.

(2) This study introduces pedestrian group features into the non-visual features to explore whether the prediction of crossing intentions for other group members can be improved when one member of the group begins crossing first. To this end, two models were trained: one that includes pedestrian group features and another that does not, while keeping other input features and network structures consistent. The results show that as the size of the pedestrian group increases, the prediction accuracy of the model improves, regardless of whether pedestrian group features are considered. Furthermore, validation on the JAAD and custom urban road datasets indicates that the model incorporating pedestrian group features consistently outperforms the model without these features in terms of recognition speed at subsequent time steps. This suggests that when one pedestrian in the group begins crossing, other pedestrians typically follow closely behind. Moreover, compared to individual pedestrians, pedestrian groups exhibit a higher tendency to cross in the presence of vehicle pressure. A single pedestrian may abandon crossing due to pressure, but when pedestrians form a group, the pressure experienced by individuals is alleviated, thereby enhancing the overall crossing intention of the group.

In practical applications, the proposed model is expected to provide effective pedestrian intention prediction for autonomous driving systems and intelligent traffic management, enhancing the system’s decision-making capabilities in complex dynamic environments. By incorporating pedestrian group features into the crossing intention prediction model, traffic systems can better recognize the crossing behavior of pedestrian groups, thereby reducing accidents and improving road safety. Future research could further explore the impact of other non-visual features, such as pedestrian status, on crossing intention prediction. Another potential direction is to expand the model’s adaptability to handle a wider range of traffic scenarios, such as pedestrian intention prediction in nighttime or adverse weather conditions. Additionally, research could focus on improving the computational efficiency of the model under real-time conditions to ensure its practical effectiveness and response speed in real-world traffic environments.