Introduction

Multi-object tracking (MOT) represents a critical research area within computer vision1,2,3,4, widely applied in scenarios such as human motion analysis2,5, autonomous driving6, and intelligent surveillance systems7. The primary objective of MOT is to predict the trajectories of multiple objects within a video sequence. Despite the many remarkable studies reported in5,6,8,9,10, challenges such as identifying similar or small targets and reidentifying lost objects6,10 substantially hinder tracking performance.

Fig. 1

Comparison of the various tracking frameworks. (a) the tracking-by-detection method; (b) other one-stage tracking methods; (c) our one-stage tracking method with a dual attention mechanism.

Currently, prevalent tracking algorithms5,6,9,11,12 employ the tracking-by-detection (TBD) framework, considering detection and tracking as distinct tasks. As illustrated in Fig. 1(a), the detection model identifies objects using bounding boxes, while the tracking model extracts appearance embeddings and then performs data association to generate trajectories.

However, this two-stage processing method lacks end-to-end optimization and suffers from significant computational overhead. One primary limitation of the TBD framework is that the separation of target detection and tracking means detection accuracy directly impacts tracking performance. For instance, noise or occlusions can lead to incorrect detections, which subsequently degrade tracking accuracy. Additionally, as the number of objects in an image increases, the inference time also grows, because appearance embedding extraction must be performed independently for each object. Although some studies have aimed to combine detection and tracking more tightly8,13, the inherent independence of the detection model means that the problem of inference time growing with the number of detected objects remains fundamentally unresolved.

One-stage MOT methods are gradually emerging and have garnered significant attention, as shown in Fig. 1(b). For instance,13 leverages the bounding box regression and classification capabilities of an object detector to predict the position of an object in the next frame. In contrast,10 introduces an innovative concept that combines detection features and appearance embeddings within a single unified framework. Despite its effectiveness, this method has certain limitations. Firstly, it employs a relatively heavy detector based on the Darknet53 backbone14, resulting in suboptimal real-time performance. Secondly, it lacks the ability to focus on the discriminative features of instances to effectively distinguish similar objects. Although15 introduces attention mechanisms to mitigate interference from irrelevant information or complex backgrounds, it is inadequate for reidentifying lost objects because it overlooks instance-level semantic relationships across samples. For instance, when an object appears across sequential frames, the contextual interactions between those appearances can be captured to track the object stably even after it is temporarily lost.

To alleviate these limitations, we propose a one-stage lightweight multi-object tracking method, illustrated in Fig. 1(c). Specifically, we employ the intra-sample local attention mechanism (SLAM), which enables the model to focus on discriminative regions, thereby enhancing the recognition of similar objects. Moreover, we utilize the inter-sample global attention mechanism (SGAM), which captures instance-level semantic information across samples, thus facilitating feature interaction between objects in different frames and improving the re-identification of lost objects. To validate our method, we conducted extensive comparison and ablation experiments. In summary, the main contributions of our work are as follows:

(1) We propose a dual attention mechanism to extract discriminative context features and capture instance-level semantic information shared among samples. This method significantly enhances the recognition performance for both similar objects and lost objects.

(2) We propose the STATION dataset, designed for substation scenarios, which includes real-world challenges such as occlusion and similar objects. This dataset is intended to evaluate the model’s ability to re-identify lost objects and distinguish between similar objects.

(3) Extensive experiments on MOT and STATION datasets demonstrate the effectiveness of the proposed method and achieve a tracking performance that outperforms other comparable methods.

The rest of this paper is organized as follows: In Section 2, we review the relevant literature. In Section 3, we discuss the specific implementation details of our method. Section 4 presents the experimental results. Finally, Section 5 provides the conclusions of this work.

Related work

Tracking by detection

Due to advances in target detection, mainstream MOT methods1,5,13,16,17 have largely adopted the TBD paradigm in recent years.1 integrated appearance information to improve tracking performance, enabling objects to be tracked over longer periods of time.16 emphasized the importance of the detector and utilized a CNN-based detector in conjunction with traditional tracking components such as the Kalman filter to ensure reliable performance. The separation between detection and tracking in the aforementioned methods1,16 leads to a lack of global information utilization, thereby failing to effectively address false alarms and the re-identification of lost objects in dense and occluded scenarios. While certain studies have attempted to tighten the integration of detection and tracking13, the intrinsic independence of the detection model leaves unresolved the problem that inference time increases with the number of detected objects. Li et al.18 introduced NanoTrack, a novel lightweight multi-object tracking method designed to enhance real-time performance by effectively integrating low-scoring detections. Although NanoTrack operates at higher speeds, its tracking accuracy remains limited. Compared to existing TBD methods, our method learns the hidden shared structure between the detection and embedding models, training the network in an end-to-end manner. This not only enhances tracking efficiency but also effectively extracts global discriminative semantic information to improve tracking performance.

One-stage MOT

The one-stage joint detection and tracking approach has gained widespread interest because of its streamlined and unified structure. A typical approach19,20,21 is to develop a tracking-specific branch on top of an object detector to forecast either object tracking offsets or re-ID embeddings for data association. As a notable advancement, JDE10 introduced a one-stage tracking method that integrates the appearance embedding model within a single-shot detector. This tracking model simultaneously generates detection and corresponding embedding outputs. Compared with the TBD method, it lowers computational costs and is not constrained by the number of detected objects. However, despite these advantages, the JDE method utilizes Darknet5314 as its backbone network, which limits feature extraction efficiency. Consequently, it is not optimally suited for scenarios requiring high real-time performance. Furthermore, JDE10 struggles to distinguish between similar objects and to reidentify lost objects within crowded and occluded scenes. Sun et al.22 proposed an adaptive one-stage multi-object tracking algorithm based on sub-trajectories that incorporates a novel weight-updating module and appearance update strategy; however, its tracking accuracy and robustness remain limited. Unlike existing one-stage methods, we propose a lightweight multi-object tracking method that not only enhances tracking efficiency but also improves the identification of similar and lost objects through a dual attention mechanism.

Attention mechanism in MOT

The attention mechanism has been widely employed in a range of research fields, including but not limited to image classification23,24, object detection14, semantic segmentation25,26, medical image processing27, spiking neural networks28,29, and domain adaptation30. Within the domain of multi-object tracking,31 proposed the Double Matching Attention Network for data association, which utilizes spatial and temporal attention mechanisms to determine whether a detection and a tracklet belong to the same object.32 introduced a spatial-temporal attention mechanism for object state evaluation to solve the drift problem caused by object occlusion. The aforementioned methods primarily utilize attention mechanisms for data association and state estimation, neglecting the extraction of discriminative semantic features within the unified detection and tracking framework. Furthermore, spatial-temporal attention is restricted to the candidate object, lacking awareness of global weight information. In contrast, our method utilizes a dual attention mechanism to extract discriminative context features and capture instance-level semantic information shared among samples.

Fig. 2

The overall architecture of the proposed network is as follows: The backbone with a lightweight model extracts fundamental visual features \(B1, B2, B3, B4\). Sample-perception features \(P2, P3, P4, E2, E3, E4\) are obtained through a dual attention mechanism (DAM), comprising an intra-sample local attention mechanism (SLAM) and an inter-sample global attention mechanism (SGAM). SLAM is utilized for extracting distinctive context information, while SGAM is adopted to capture shared instance-level semantics across samples. Finally, the tracking association component predicts and assigns object trajectories, ensuring accurate and reliable tracking.

Method

Overview and backbone

Our proposed lightweight multi-object tracking method, illustrated in Fig. 2, comprises two primary components: the backbone network and the dual attention mechanism. The backbone network is responsible for extracting primary visual features. The dual attention mechanism consists of intra-sample local attention mechanism (SLAM) and inter-sample global attention mechanism (SGAM). The SLAM allows the model to focus on discriminative regions, enhancing the recognition of similar objects. Furthermore, the SGAM captures instance-level semantic information across samples and facilitates feature interaction between objects in different frames, thereby improving the re-identification performance for lost objects.

To enhance the real-time performance of multi-object tracking, a lightweight backbone network with minimal model parameters and FLOPs was employed. This backbone is designed to extract multi-scale visual features \(B2, B3, B4\) based on the Feature Pyramid Network (FPN)33. During the top-down and bottom-up phases, a dual attention mechanism is utilized to obtain sample-perception features \(P2, P3, P4, E2, E3, E4\), effectively capturing the interactive attention information both within and across samples.

Finally, each of the three task heads processes the fused feature maps at three scales to produce dense prediction maps for class, bounding box, and appearance embedding. These predictions are then combined with online association algorithms to generate object trajectories.
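To make the prediction-head structure concrete, the following is a minimal PyTorch sketch of one per-scale task head with class, bounding-box, and appearance-embedding branches; the channel width, anchor count, embedding dimension, and layer names are illustrative assumptions rather than the exact configuration of our network.

```python
import torch.nn as nn

class TrackingHead(nn.Module):
    """Per-scale head producing dense class, box, and embedding maps (sketch)."""
    def __init__(self, in_channels=128, num_anchors=4, emb_dim=128):
        super().__init__()
        # class scores, box offsets (4 per anchor), and appearance embeddings
        self.cls_branch = nn.Conv2d(in_channels, num_anchors, 3, padding=1)
        self.box_branch = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.emb_branch = nn.Conv2d(in_channels, emb_dim, 3, padding=1)

    def forward(self, x):
        return self.cls_branch(x), self.box_branch(x), self.emb_branch(x)

# one head per fused scale (e.g., the three scales P2/E2, P3/E3, P4/E4)
heads = nn.ModuleList([TrackingHead() for _ in range(3)])
```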

Dual attention mechanism

We propose the dual attention mechanism, which consists of two key components. Firstly, it adopts SLAM, enabling the model to focus on discriminative regions to retrieve instance context, thereby facilitating the effective distinguishing of similar objects. Secondly, it utilizes SGAM, which captures instance-level semantic information across samples. This promotes feature interaction between objects in different frames, thus enhancing the re-identification performance for lost objects.

Intra-sample local attention

SLAM is designed to capture detailed attention focused on the discriminative regions of an image, while minimizing interference from irrelevant information. It comprises two key components: a parallel channel module and an autocorrelation spatial module. The parallel channel module is responsible for extracting generalized features enriched with semantic information, whereas the autocorrelation spatial module suppresses background interference to extract discriminative features effectively. As illustrated in Fig. 3, the parallel channel module processes the input features \(X=[\textrm{x}_1,\textrm{x}_2,...,\textrm{x}_c] \in \mathbb {R}^{H\times W\times C}\) by splitting them into two branches for pooling operations. Subsequently, the channel dimension is compressed using a \(1 \times 1\) convolution operation to produce a feature map with dimensions \(C/2 \times 1 \times 1\):

$$\begin{aligned} {\left\{ \begin{array}{ll} F_{up} = \delta \left( f(g_1(X)) \right) \\ F_{down} = \delta \left( f(g_2(X)) \right) , \\ \end{array}\right. } \end{aligned}$$
(1)

where \(g_1\) denotes global average pooling and \(g_2\) denotes global max pooling. The function f refers to the \(1 \times 1\) convolution operation with shared weights across the two branches, while \(\delta\) denotes the activation function. The features \(F_{up}\) and \(F_{down}\) are the outputs of the two parallel branches.

Furthermore, the outputs of the two branches are concatenated to produce the feature map of dimensions \(C \times 1 \times 1\). Subsequently, by applying the \(\text {Sigmoid}\) function, we can obtain the parallel channel feature:

$$\begin{aligned} F_{channel} = \sigma (\text {Concat}(F_{up}, F_{down})), \end{aligned}$$
(2)

where, \(\sigma\) is the \(\text {Sigmoid}\) activation function.
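For illustration, a minimal PyTorch sketch of the parallel channel module in Eqs. (1)-(2) is given below; the choice of ReLU for the activation \(\delta\) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelChannelModule(nn.Module):
    """Sketch of Eqs. (1)-(2): two pooled branches sharing one 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // 2, kernel_size=1)  # shared-weight 1x1 conv, C -> C/2
        self.delta = nn.ReLU(inplace=True)                          # activation (assumed ReLU)

    def forward(self, x):                                           # x: (B, C, H, W)
        f_up = self.delta(self.f(F.adaptive_avg_pool2d(x, 1)))      # g1: global average pooling
        f_down = self.delta(self.f(F.adaptive_max_pool2d(x, 1)))    # g2: global max pooling
        return torch.sigmoid(torch.cat([f_up, f_down], dim=1))      # F_channel: (B, C, 1, 1)
```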

The autocorrelation spatial module comprises three branches, with each branch individually performing a \(1 \times 1\) convolution operation on the input feature map, allowing us to obtain:

$$\begin{aligned} {\left\{ \begin{array}{ll} \tilde{F_{up}} = \tilde{f}(X) \\ \tilde{F_{mid}} = \tilde{f}(X) \\ \tilde{F_{lw}} = \tilde{f}(X), \\ \end{array}\right. } \end{aligned}$$
(3)

where, X denotes the input features, and \(\tilde{f}\) represents the \(1 \times 1\) convolution. The features \(\tilde{F_{up}}\), \(\tilde{F_{mid}}\), and \(\tilde{F_{lw}}\) correspond to the upper, middle, and lower branches after the convolution, respectively. After convolution, the channel dimension of the feature map is reduced to 1/8 of its original dimension.

Fig. 3

SLAM comprises a parallel channel module and an autocorrelation spatial module. The parallel channel module is designed to extract rich semantic information, thereby enhancing the generalization performance of the features. Meanwhile, the autocorrelation spatial module is employed to suppress background interference, enabling the extraction of more discriminative features.

The upper and middle branches perform average pooling and max pooling operations, respectively, resulting in two feature maps of dimensions \(1 \times W \times H\). These are then concatenated to form a feature map of dimensions \(2 \times W \times H\). Subsequently, the channels are compressed by half using a \(7 \times 7\) convolution. Finally, the Sigmoid function is applied to the resulting feature of dimensions \(1 \times W \times H\) to obtain the spatial feature:

$$\begin{aligned} F_{spatial} = \sigma (\hat{f}(\text {Concat}(\tilde{g_1}(\tilde{F_{up}}), \tilde{g_2}(\tilde{F_{mid}})))) \end{aligned}$$
(4)

where, \(\tilde{g_1}\) denotes global average pooling and \(\tilde{g_2}\) denotes global max pooling, \(\hat{f}\) refers to the \(7 \times 7\) convolution.

Simultaneously, the convolution results from the upper and middle branches are multiplied and subjected to a softmax operation to produce a matrix of dimensions \((W \times H) \times (W \times H)\). The convolution result of the lower branch is then multiplied by this matrix, followed by a \(1 \times 1\) convolution to obtain the sample-wise feature, resulting in a feature map of dimensions \(C \times W \times H\):

$$\begin{aligned} F_{sample} = \tilde{f}(\tilde{F_{lw}}\cdot \tilde{\sigma }(\tilde{F_{up}}\cdot \tilde{F_{mid}})), \end{aligned}$$
(5)

where \(\tilde{\sigma }\) represents the \(\text {Softmax}\) activation function. Subsequently, the sample feature \(F_{sample}\) is summed with the input features \(X\). The resulting sum is then multiplied by the spatial feature \(F_{spatial}\) to generate the autocorrelation spatial weight:

$$\begin{aligned} F_{ACspatial} = (X + F_{sample}) \cdot F_{spatial}. \end{aligned}$$
(6)

Ultimately, the intra-sample local attention weight can be obtained:

$$\begin{aligned} W_{intra-sample} = F_{channel} \cdot F_{ACspatial}. \end{aligned}$$
(7)
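The autocorrelation spatial branch of Eqs. (3)-(7) can be sketched in PyTorch as follows; the ordering of the matrix products in Eq. (5) and the module interface (the channel feature \(F_{channel}\) is passed in from the parallel channel module) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AutocorrelationSpatialModule(nn.Module):
    """Sketch of Eqs. (3)-(7); channels are reduced to C/8 by the 1x1 convs, as in the text."""
    def __init__(self, channels):
        super().__init__()
        c_red = channels // 8
        self.f_up = nn.Conv2d(channels, c_red, 1)
        self.f_mid = nn.Conv2d(channels, c_red, 1)
        self.f_lw = nn.Conv2d(channels, c_red, 1)
        self.f_out = nn.Conv2d(c_red, channels, 1)                  # maps C/8 back to C after Eq. (5)
        self.f_spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # 7x7 conv of Eq. (4)

    def forward(self, x, f_channel):                 # x: (B, C, H, W), f_channel: (B, C, 1, 1)
        b, c, h, w = x.shape
        up, mid, lw = self.f_up(x), self.f_mid(x), self.f_lw(x)                  # Eq. (3)

        # Eq. (4): channel-wise avg/max pooling -> (B, 2, H, W), 7x7 conv, Sigmoid
        pooled = torch.cat([up.mean(dim=1, keepdim=True),
                            mid.max(dim=1, keepdim=True).values], dim=1)
        f_spatial = torch.sigmoid(self.f_spatial(pooled))                        # (B, 1, H, W)

        # Eq. (5): (HW x HW) affinity with Softmax, applied to the lower branch
        up_flat, mid_flat, lw_flat = up.flatten(2), mid.flatten(2), lw.flatten(2)
        affinity = torch.softmax(up_flat.transpose(1, 2) @ mid_flat, dim=-1)     # (B, HW, HW)
        f_sample = self.f_out((lw_flat @ affinity).view(b, -1, h, w))            # (B, C, H, W)

        f_acspatial = (x + f_sample) * f_spatial                                 # Eq. (6)
        return f_channel * f_acspatial, f_spatial                                # Eq. (7) and F_spatial
```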

Inter-sample global attention

We utilize the global correlation between samples to generate interaction-aware weights, capturing semantic information across samples and enabling the learning of representations with consistent re-identification performance.

Fig. 4

SGAM utilizes global correlation between samples to generate batch-interaction weights, capturing semantic information across samples.

As illustrated in Fig. 4, we utilize the previously defined spatial feature \(F_{spatial} \in \mathbb {R}^{B \times 1 \times W \times H}\) and channel feature \(F_{channel} \in \mathbb {R}^{B \times C \times 1 \times 1}\) to compute the inter-sample global weight. Initially, a pooling operation compresses the spatial dimension to obtain \(F_{spatial}^{'} \in \mathbb {R}^{B \times 1 \times 1 \times 1}\). Concurrently, a mean operation reduces the dimension of \(F_{channel}\) along the channel, resulting in \(F_{channel}^{'} \in \mathbb {R}^{B \times 1 \times 1 \times 1}\). Therefore, the equation can be represented as:

$$\begin{aligned} \begin{aligned} {\left\{ \begin{array}{ll} F_{spatial}^{\prime }=AvgPooling(F_{spatial}) \\ F_{channel}^{\prime }=Mean(F_{channel}). \\ \end{array}\right. } \end{aligned} \end{aligned}$$
(8)

By applying pooling and mean operations, the semantic information from the original spatial and channel dimensions is transformed into batch interaction information. We then perform a pixel-wise summation between \(F_{spatial}^{'}\) and \(F_{channel}^{'}\) to obtain the inter-sample interaction features \(F_{batch} \in \mathbb {R}^{B \times 1 \times 1 \times 1}\):

$$\begin{aligned} F_{batch}=F_{channel}^{\prime } + F_{spatial}^{\prime }. \end{aligned}$$
(9)

More specifically, \(F_{batch}\) can be represented as a vector comprising the interaction information \(F_{i}\) corresponding to each sample \(X_{i} \in \mathbb {R}^{C \times W \times H}\). This is defined as:

$$\begin{aligned} F_{batch}=[F_1,F_2,...,F_B]\in \mathbb {R}^{B\times 1\times 1\times 1}, \end{aligned}$$
(10)

where, B indicates the total number of images in the current batch. Furthermore, we use the softmax function to generate the weight of the interaction information of each sample:

$$\begin{aligned} \omega _i=\frac{e^{F_i}}{\sum _{j=1}^Be^{F_j}}, \end{aligned}$$
(11)
Fig. 5

Tracking Association component predicts and assigns object trajectories, ensuring accurate and reliable tracking.

thus, the global attention weight vector across samples can be written as:

$$\begin{aligned} W_{inter-sample}=[\omega _1,\omega _2,...,\omega _B]. \end{aligned}$$
(12)

In the end, we perform a pixel-wise multiplication of the input feature X with the intra-sample attention \(W_{intra-sample}\) and the inter-sample attention \(W_{inter-sample}\) to obtain the output feature:

$$\begin{aligned} X_{out}=X\cdot W_{intra-sample}\cdot W_{inter-sample}. \end{aligned}$$
(13)
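A minimal sketch of Eqs. (8)-(13) follows, assuming the SLAM spatial and channel features are available with the shapes stated above; the function names are ours, for illustration only.

```python
import torch

def sgam_weights(f_spatial, f_channel):
    """Sketch of Eqs. (8)-(12): inter-sample (batch-level) attention weights.

    f_spatial: (B, 1, H, W); f_channel: (B, C, 1, 1). Returns weights of shape (B, 1, 1, 1).
    """
    f_spatial_p = f_spatial.mean(dim=(2, 3), keepdim=True)   # Eq. (8): average pooling over H, W
    f_channel_p = f_channel.mean(dim=1, keepdim=True)        # Eq. (8): mean over the channel dimension
    f_batch = f_spatial_p + f_channel_p                      # Eq. (9): (B, 1, 1, 1)
    w_inter = torch.softmax(f_batch.flatten(), dim=0)        # Eq. (11): softmax across the batch
    return w_inter.view(-1, 1, 1, 1)                         # Eq. (12)

def dual_attention_output(x, w_intra, w_inter):
    """Eq. (13): weight the input features by the intra- and inter-sample attention."""
    return x * w_intra * w_inter
```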

Online association

The Multi-Object Tracking (MOT) model is designed to detect and track multiple objects, outputting their respective bounding boxes and appearance embeddings. Each bounding box is defined by the parameters \((x, y, a, h)\), where \(x\) and \(y\) denote the coordinates of the center, \(a\) represents the aspect ratio, and \(h\) indicates the height. Furthermore, we establish the dynamic model for object tracking as follows:

$$\begin{aligned} {\left\{ \begin{array}{ll} x = x + \nu _{x} \\ y = y + \nu _{y} \\ a = a + \nu _{a} \\ h = h + \nu _{h}, \end{array}\right. } \end{aligned}$$
(14)

where \(\nu _{x}, \nu _{y}, \nu _{a}, \nu _{h}\) represent the velocity components, and \(X^{k}=(x,y,a,h,\nu _{x},\nu _{y},\nu _{a},\nu _{h})\) denotes the object’s motion state. As illustrated in Fig. 5, the object motion state \(X^{k}_{t-1}\) at time \(t-1\) is used to estimate the object motion state \(X^{k}_{t}\) at time \(t\) by employing a Kalman filter in conjunction with the velocity vector \((\nu _{x}, \nu _{y}, \nu _{a}, \nu _{h})\).
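For illustration, the constant-velocity model of Eq. (14) corresponds to the standard Kalman prediction step sketched below, assuming a unit frame interval and a given process-noise covariance Q.

```python
import numpy as np

# Constant-velocity transition for the 8-D state (x, y, a, h, vx, vy, va, vh) of Eq. (14)
dim = 4
F = np.eye(2 * dim)
for i in range(dim):
    F[i, dim + i] = 1.0   # x <- x + vx, y <- y + vy, a <- a + va, h <- h + vh

def kalman_predict(state, covariance, Q):
    """One Kalman prediction step: propagate the state mean and covariance by the motion model."""
    state = F @ state
    covariance = F @ covariance @ F.T + Q
    return state, covariance
```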

In the process of conducting online tracking associations, we consider the Intersection over Union (\(IoU\)), the appearance embedding distance, and the Mahalanobis distance between the detected state (observations) and the estimated state (trajectories). We compute the cost matrix based on the appearance embeddings of the trajectories and observations, and then proceed with cost matching. When an observation is successfully matched with a trajectory, the appearance embedding of the trajectory is subsequently updated.

$$\begin{aligned} f_t=\eta f_{t-1}+(1-\eta )\hat{f}, \end{aligned}$$
(15)

where \(f_t\) represents the appearance embedding of the trajectory at time \(t\), \(\hat{f}\) denotes the appearance embedding of the matched observation, and \(\eta\) is the smoothing coefficient for the appearance embedding. If a trajectory does not match any observation, it is considered lost. The trajectory will be removed from the trajectory pool after being lost for a predetermined period \(T\); however, if it is re-matched with an observation within this period, the trajectory will return to its normal tracking state. Conversely, if an observation does not match any existing trajectory, a new trajectory is initialized with that observation.
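A minimal sketch of the appearance-embedding update in Eq. (15) and the lost-track handling described above is given below; the value of \(\eta\), the re-normalization step, and the threshold T = 30 frames are illustrative assumptions.

```python
import numpy as np

def update_track_embedding(f_prev, f_obs, eta=0.9):
    """Eq. (15): exponential moving average of the trajectory appearance embedding (eta assumed)."""
    f_new = eta * f_prev + (1.0 - eta) * f_obs
    return f_new / (np.linalg.norm(f_new) + 1e-12)    # re-normalize the embedding (assumed)

def keep_lost_track(track, max_lost=30):
    """Mark one more unmatched frame; drop the track once it has been lost for T = max_lost frames."""
    track["lost_frames"] += 1
    return track["lost_frames"] <= max_lost           # False -> remove from the trajectory pool
```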

MOT Loss

The objective of the multi-object tracking network is to minimize the discrepancy between the predicted values and the ground truth. To achieve this, the Focal Loss is employed for the category branch’s task head. The loss function is defined as follows:

$$\begin{aligned} \mathcal {L}_{f}(p_t)=-\theta _t(1-p_t)^\phi \log (p_t), \end{aligned}$$
(16)

where \(\theta _{t}\) represents the weighting factor, \((1-p_t)^\phi\) denotes the modulating factor, and \(\phi\) stands for the adjustable focusing parameter. Since object tracking is typically treated as an instance-based classification problem, the appearance embedding employs the cross-entropy loss function:

$$\begin{aligned} \mathcal {L}_{e}(\textrm{x},\textrm{y})=\sum _{i=1}^N-\log (\frac{\exp (\textrm{x})}{\sum _j\exp (\textrm{x}_j)})[\textrm{y}_i], \end{aligned}$$
(17)

where x denotes the prediction and y denotes the ground truth. In the detection branch, the DIoU loss is used. If the Intersection over Union (IoU) between an anchor box and a ground-truth box is greater than or equal to 0.5, the anchor box is assigned as a positive sample. If the IoU falls within [0, 0.4), the anchor box is considered background. If the IoU lies within [0.4, 0.5), the anchor box is excluded from the loss calculation. Each anchor box is matched to the ground-truth box with which it has the highest IoU. The DIoU loss is defined as follows:

$$\begin{aligned} \mathcal {L}_{d}=1-IoU+\frac{\rho ^2(b,b^{gt})}{c^2}, \end{aligned}$$
(18)

where b and \(b^{gt}\) denote the center points of the anchor and ground-truth boxes, respectively, \(\rho (\cdot )\) represents the Euclidean distance, and c signifies the diagonal length of the smallest enclosing box covering both bounding boxes. To address the issue of unbalanced loss in multi-task learning, we employ a task-independent uncertainty evaluation method. The objective function, which inherently balances the loss, is formulated as follows:

$$\begin{aligned} \mathcal {L}=\sum _{i=1}^{3}\left[ \alpha \cdot ( \frac{\mathcal {L}_{f}^{i}}{e^{\omega _{f}^{i}}}+\omega _{f}^{i})+ \beta \cdot (\frac{\mathcal {L}_{d}^{i}}{e^{\omega _{d}^{i}}}+\omega _{d}^{i})+ \gamma \cdot (\frac{\mathcal {L}_{e}^{i}}{e^{\omega _{e}^{i}}}+\omega _{e}^{i}) \right] , \end{aligned}$$
(19)

where \(\alpha , \beta , \gamma\) represent hyperparameters, i indicates different task heads, and \(\omega\) signifies the distinct branch loss weights for the category \((\omega _{f}^{i})\), the bounding box \((\omega _{d}^{i})\), or the appearance embedding \((\omega _{e}^{i})\). All of these parameters are learnable.
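The following PyTorch sketch shows how the branch losses of Eqs. (16) and (18) can be combined under the uncertainty weighting of Eq. (19); the focal-loss constants and the per-head loop are illustrative assumptions, with \(\alpha =\beta =\gamma =0.5\) taken from the hyperparameter analysis reported later.

```python
import torch
import torch.nn as nn

def focal_loss(p_t, theta_t=0.25, phi=2.0):
    """Eq. (16); the values of theta_t and phi are common defaults, assumed here."""
    return -theta_t * (1.0 - p_t) ** phi * torch.log(p_t.clamp(min=1e-12))

def diou_loss(iou, center_dist_sq, diag_sq):
    """Eq. (18): 1 - IoU + rho^2(b, b_gt) / c^2."""
    return 1.0 - iou + center_dist_sq / diag_sq

class UncertaintyWeightedLoss(nn.Module):
    """Sketch of Eq. (19): task-independent uncertainty weighting over the three task heads."""
    def __init__(self, num_heads=3, alpha=0.5, beta=0.5, gamma=0.5):
        super().__init__()
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        # learnable per-head weights for the class, box, and embedding branches
        self.w_f = nn.Parameter(torch.zeros(num_heads))
        self.w_d = nn.Parameter(torch.zeros(num_heads))
        self.w_e = nn.Parameter(torch.zeros(num_heads))

    def forward(self, l_f, l_d, l_e):
        # l_f, l_d, l_e: lists of per-head focal, DIoU, and embedding (cross-entropy) losses
        total = 0.0
        for i in range(len(self.w_f)):
            total = total + self.alpha * (l_f[i] / torch.exp(self.w_f[i]) + self.w_f[i])
            total = total + self.beta * (l_d[i] / torch.exp(self.w_d[i]) + self.w_d[i])
            total = total + self.gamma * (l_e[i] / torch.exp(self.w_e[i]) + self.w_e[i])
        return total
```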

Table 1 Lightweight Model.

Experiment

Implementation details

We implemented the training of the proposed network using the PyTorch framework, with the final hyperparameters set as follows: the batch size was set to 8, and the model was trained for 200 epochs. Standard data augmentation techniques, including horizontal flipping, affine transformation, and color jittering, were employed. We utilized the SGD optimizer34 for training, with momentum and weight decay parameters set to 0.9 and 5e-4, respectively. The initial learning rate was 0.025, which was reduced by a factor of ten at the 100th and 150th epochs. The training process took roughly 80 hours on an NVIDIA GeForce RTX 3090 GPU. We also employ a multi-scale dense connection structure (DCS) to enhance semantic feature extraction across various scales, and utilize a feature enhancement module (FEM) to balance detection bias within the unified network. Moreover, as shown in Table 1, we adopt a lightweight model to achieve more efficient tracking.
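The training schedule described above corresponds to the following sketch; the model constructor and the per-epoch training helper are hypothetical placeholders.

```python
import torch

model = build_tracker()   # hypothetical constructor for the proposed network

# SGD with momentum 0.9 and weight decay 5e-4; initial learning rate 0.025
optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                            momentum=0.9, weight_decay=5e-4)
# learning rate reduced by a factor of ten at the 100th and 150th epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    train_one_epoch(model, optimizer)   # hypothetical training loop (batch size 8, data augmentation)
    scheduler.step()
```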

Fig. 6

The substation dataset contains the challenges of occlusion, crowding, and similar objects.

Table 2 Comparison with other methods on MOT15, MOT16 and MOT17 test datasets at large 1088x608 resolution.

Datasets

We utilize the ETH55 and CityPerson56 datasets, which provide only bounding box annotations, to train the detection branch. Additionally, we employ the CalTech57, CUHK-SYSU58, MOT17 training sets59, and PRW60 datasets, which offer both bounding box and identity annotations, to train both the detection and embedding branches. The fully trained model is subsequently evaluated on the MOT1561, MOT1659, and MOT1759 test datasets.

Furthermore, we propose a multi-object tracking dataset, STATION, specifically designed for substation scenarios, to provide a more challenging benchmark. As shown in Fig. 6, the substation staff wear uniforms with a similar appearance, making them difficult to distinguish effectively. The STATION dataset encompasses various real-world scenes, including occlusion and similar objects, which present challenges in re-identifying lost targets and distinguishing between similar objects. This dataset comprises 9 sequences with a total of 3142 frames.

Comparison with other methods

We compared our proposed method with several remarkable methods using the test sets of the MOT15, MOT16, and MOT17 benchmarks, with all evaluation results obtained directly from the official MOT Challenge server.

As shown in Table 2, our method outperforms all other methods on the MOT15 dataset in terms of the IDF1, MT, and ML metrics. Although the lightweight design and the absence of an additional, time-consuming re-identification module result in our method having slightly lower MOTA and higher IDs than AP-HWDPL38, our tracking speed is five times faster. Secondly, although our method is online, and offline methods generally achieve higher MOTA because they leverage both past and future information for object tracking, our method still outperforms offline methods such as NOMT35, CRF_Track36, and TSML_CDEnew37.

On the MOT16 dataset, our method outperforms other TBD and one-stage multi-object tracking methods, with our MOTA score exceeding that of the second-best method, HDTR43, by 3.3%. Furthermore, our method runs at a speed that is 11 times faster than HDTR43. This demonstrates the robust tracking performance of our method, its adaptability to high-resolution scenes in the MOT16 dataset, and its capability to meet real-time tracking requirements.

On the MOT17 dataset, we compared our method with the latest comparable multi-object tracking methods, MapTrack5 and TADN6. Our method surpasses these two approaches in five key metrics: MOTA, IDF1, MT, ML, and FPS. This comprehensive performance evaluation not only validates the effectiveness of our method but also highlights its superior speed and capability in real-time tracking scenarios.

Table 3 Comparison with other methods on MOT15, MOT16 and MOT17 train and test datasets at small 576x320 resolution.

We also compared our method with representative one-stage MOT techniques, namely JDE10 and TADN6. Given that our primary optimization targets the time-consuming nature of JDE's detection process, we implemented a lightweight backbone design to enhance the MOT system's performance in real-time tracking scenarios. Our evaluation focused on comparing the performance of our method and JDE at a small resolution of 576x320. As illustrated in Table 3, our method substantially exceeded the performance of JDE and TADN on the MOT15, MOT16, and MOT17 datasets, with FPS increases of approximately 30% over JDE and roughly fourfold over TADN. This demonstrates the superior tracking and speed performance of our method.

Table 4 Comparison with other methods on our STATION dataset.

Furthermore, we conducted comparative experiments on the proposed STATION dataset. As illustrated in Table 4, our method outperformed JDE and TADN in all six metrics: MOTA, IDF1, MT, ML, IDs, and FPS. In particular, MOTA surpassed JDE by 10.5% and FPS was four times faster than TADN, attaining a speed of 61.8 frames per second. These experimental results demonstrate that our method effectively reidentifies lost objects and distinguishes between similar objects, particularly in scenarios involving occlusions and similar objects.

Ablation study

To evaluate the individual impact of each module in our algorithm, we devised five ablation comparison methods by deactivating one module at a time. Each comparison method is described as follows:

  • w/o FEM: We deactivated all the proposed modules, utilizing only the backbone network and task head to perform detection and tracking tasks.

  • w/o DCS: We deactivated the DCS multi-scale dense connection structure, maintaining only the FEM feature enhancement module which extracts appearance embedding features with strong semantics, aimed at balancing the bias of the unified network towards detection tasks and achieving joint optimization.

  • w/FEM & DCS: Utilizing only the FEM module and DCS structure, excluding the attention module; by integrating the DCS multi-scale dense connection structure, object detection is enhanced, especially for small targets.

  • w/SLAM: We incorporated SLAM into the FEM and DCS configuration to extract intra-sample attention information, allowing the model to better focus on the discriminative areas of the image, reduce redundant information interference, and obtain rich features, thereby improving MOT performance.

  • w/SLAM & SGAM: Based on FEM, DCS, and SLAM, we adopted the SGAM to learn inter-sample correlations and generate interactive perception weights, to perceive semantic information across samples and extract consistent representations that mitigate the effects of crowding and occlusion.

We performed ablation studies on the MOTA and IDF1 metrics using both the validation and test sets of MOT15, as well as the validation set of MOT16. As illustrated in Fig. 7, the introduction of the FEM and DCS modules resulted in noticeable improvements in both MOTA and IDF1 metrics. This indicates that the enhanced features and the use of multiple scales contribute positively to the performance of multi-object tracking (MOT). Furthermore, the utilization of DAM, including SLAM and SGAM, led to significant enhancements in both MOTA and IDF1. This improvement demonstrates that the DAM effectively extracts discriminative contextual features and captures instance-level semantic information shared among samples, thereby substantially enhancing the recognition performance for both similar and lost objects.

Fig. 7

The ablation experiment of our proposed method on the MOT15 test, MOT15 validation and MOT16 validation datasets.

Table 5 Analysis of LR_Coeff hyperparameter on MOT16 train datasets.
Fig. 8

We conducted comprehensive experiments on the TUD-Crossing, AVG-TownCentre, PETS09-S2L2, and KITTI-19 datasets to validate our method's robust tracking performance in static, crowded, and varied illumination scenarios. Additionally, these experiments demonstrated our method's strong reidentification capabilities for lost objects.

Table 6 Analysis of \(\alpha\), \(\beta\) and \(\gamma\) Loss weights on MOT16 train datasets.

Furthermore, we explored the impact of various hyperparameter settings on the network’s performance. As illustrated in Table 5, the learning rate coefficient (LR_Coeff) has three values, each corresponding to the contributions of the backbone, detection, and tracking to the learning rate, respectively. The results indicate that when the LR_Coeff values are set to [1, 1, 1], the MOTA and IDF1 metrics achieve their optimal values. Additionally, as shown in Table 6, we analyzed different configurations of loss weights, where \(\alpha\) represents the categorical branch, \(\beta\) represents the regression box branch, and \(\gamma\) represents the appearance embedding branch. The results demonstrate that MOTA and IDF1 achieved their best values when \(\alpha =0.5, \beta =0.5, \gamma =0.5\).

Fig. 9

We conducted comparative experiments on the STATION and MOT16 test datasets against the JDE method to demonstrate our method’s superior reidentification capabilities for lost objects and enhanced distinguishing capabilities for similar objects.

Adaptability to various scenes

In Fig. 8, the TUD-Crossing dataset61 (first row) depicts a scenario where numerous pedestrians cross an intersection in opposite directions.

In frame 43 (first column), targets with track IDs 8 and 10 are observed walking in tandem, and both are successfully tracked. In frame 73 (second column), the target with track ID 8 is occluded and disappears from view due to the target with track ID 12 moving in the opposite direction, resulting in only the target with track ID 10 being tracked. In frame 79 (third column), the target with track ID 8 is reidentified after being temporarily lost, and the target with track ID 10 continues to be tracked despite severe occlusion. This successful re-identification and tracking under challenging conditions can be attributed to our DAM. This mechanism employs SGAM, capturing instance-level semantic information across samples and facilitating feature interaction between objects in different frames, thereby enhancing the re-identification performance for lost objects. In the dataset KITTI-1961 (the second row), the image aspect ratio is quite extreme at 3.3:1, whereas our training data primarily features ratios ranging from 1:1 to 2:1. Moreover, significant illumination changes affect performance, presenting challenges for tracking. Nonetheless, experiments reveal that our method remains robust against variations in illumination and scale. Furthermore, the datasets AVG-TownCentre61 (the third row) and PETS09-S2L261 (the fourth row) illustrate that our method consistently maintains strong tracking performance in static, crowded, and varied illumination conditions.

Furthermore, to demonstrate the superiority of our method, we conducted comparative experiments on the MOT16 dataset and our proposed STATION dataset against the JDE method. As illustrated in Fig. 9, within the STATION dataset (first row), the target with track ID 83 is partially visible due to occlusion in the first column and completely occluded in the second column. Given the uniform clothing and similar appearance of the substation staff, which makes identifying lost targets challenging, our method successfully reidentifies the target in the third column. In contrast, the JDE method (second row) exhibits an ID switch in the third column after the target is lost. This demonstrates that the introduction of the DAM in our method effectively extracts discriminative contextual features and captures instance-level semantic information shared among samples. Consequently, this mechanism significantly enhances the recognition performance for both similar and lost objects. Moreover, in the MOT16-06 sequence, the JDE method (fourth row) erroneously detects a billboard as a pedestrian with track ID 216, whereas our method avoids such false detections, underscoring its robustness and accuracy in challenging tracking scenarios.

Visualization

As illustrated in Fig. 10, the dual attention mechanism is visualized. Despite variations in illumination and scale, as well as the challenges posed by crowding and occlusion in these images, the adoption of SLAM and SGAM extracts distinct intra-sample information and cross-sample semantic context. The mechanism pays more attention to the discriminative regions of the images while minimizing the impact of background and noise interference, thereby enhancing the model's tracking capabilities.

Fig. 10

Visualization of the attention mechanism in various scenarios demonstrates that the adoption of the dual attention mechanism enables the model to focus on the discriminative regions of the images.

Conclusion and future work

In this work, we propose an effective multi-object tracking method, which utilizes a dual attention mechanism to extract discriminative contextual features and capture instance-level semantic information shared among samples. This method significantly enhances recognition performance for both similar and lost objects. Additionally, we propose the STATION dataset, which encompasses various real-world scenes characterized by severe occlusion and similar objects. Extensive experiments conducted on the MOT and STATION datasets demonstrate the effectiveness of the proposed method, achieving superior tracking performance compared to other comparable methods. In future work, we will explore the use of graph neural networks to learn instance-wise graph relationships, aiming to extract more distinctive features.