Introduction

With the rapid advancement of drone remote sensing technology, the application scope of drone aerial imagery has significantly widened, encompassing crucial domains such as environmental monitoring, urban planning, agricultural management, and disaster relief1. In these application scenarios, accurate and efficient tracking of multiple objects in aerial imagery is paramount. Multi-object tracking technology provides real-time information about object positions and movement trajectories, offering vital data support for decision-makers2,3. However, drones face many challenges when performing complex tasks, such as accurately tracking small objects on the ground (such as wildlife, pedestrians, or specific objects). Drones frequently change their perspective and position during mission execution, resulting in rapid changes to the appearance and position of the tracked object in the image. Traditional algorithms struggle to maintain stable tracking in such dynamic environments. When drones fly at high altitudes, they often face more complex and varied backgrounds than on the ground and are easily obstructed by objects such as trees and buildings. This increases the difficulty of object detection and tracking. Drones are constrained by their payload and energy consumption, so the demand for computing resources and communication bandwidth must be as low as possible. Therefore, algorithms need to minimize computational complexity and data transmission while ensuring performance.

Traditional multi-object tracking methods heavily rely on manually designed feature extractors and complex tracking algorithms, often exhibiting subpar performance when dealing with complex backgrounds and occlusion scenarios4. In recent years, deep learning technology has provided new solutions for multi-object tracking. Deep learning models such as convolutional and recurrent neural networks have made significant strides in multi-object tracking5,6,7. However, these methods typically involve separating object detection from tracking into two independent steps, which increases computational complexity and may lead to decreased tracking performance8.

To overcome these challenges, researchers have begun exploring end-to-end multi-object tracking methods, integrating object detection and tracking tasks into a unified model, aiming for one-stop solutions9,10,11. In this context, the Transformer model has made breakthrough progress in various domains, including natural language processing and computer vision, due to its robust feature representation capabilities and self-attention mechanism. The Transformer model can capture global dependency relationships, offering inherent advantages in handling complex backgrounds and inter-object interactions in aerial imagery12.

Hence, we present End-to-End Drone Multiple Object Tracking (ETDMOT), a novel approach utilizing Transformers to efficiently and accurately track multiple objects in drone aerial footage. By integrating object detection and tracking into a unified Transformer framework, our approach leverages self-attention mechanisms to capture complex relationships among objects, significantly enhancing the tracking pipeline. The primary contributions of this work are as follows:

  1. (1)

    Object Detection and Long-term Tracking Matching (ODLTM) We introduce ODLTM to address the challenges of object matching discrepancies, such as missing objects and non-overlapping matches. By utilizing a Gaussian distribution to model the effective response field of feature points, we calculate the similarity between detections and trajectories using receptive field distance. This comprehensive method enhances both precision and robustness in tracking.

  2. (2)

    Cross Frame Self-Attention Mechanism (CF-SA) The CF-SA mechanism accurately retrieves rich semantic information from recorded trajectories, capturing both horizontal and vertical details through strip attention. Techniques like linear projection and attention processes combine horizontal and vertical features, yielding a thorough representation of the object’s trajectory for more precise tracking.

  3. (3)

    Enhanced Self Characteristics (ESC) ESC is designed to achieve semantic alignment by leveraging residual-form multi-head self-attention. This module optimizes individual feature extraction by incorporating information from various points on the feature map. The inclusion of a sine function encoding method addresses the challenge of position discrimination in attention mechanisms, resulting in semantically aligned feature maps and improved tracking performance.

  4. (4)

    Storage Attention Layer This layer ensures object consistency and stability by extracting long-term characteristics from all previous object states. A long-term memory vector, updated using an exponential decay moving average, facilitates information sharing between objects with different IDs. This process improves feature representation, ensuring accurate and continuous tracking across frames by preserving past state data.

Our approach innovatively combines object detection and tracking into a unified pipeline, streamlining the process and enhancing both accuracy and efficiency. Through the use of self-attention mechanisms, ETDMOT is able to capture complex relationships between objects, making it particularly well-suited for dynamic and cluttered environments typical of drone footage.

Related work

Multiple object tracking

Multi-object tracking (MOT) technology has garnered significant attention due to its wide-ranging applications, sparking considerable interest among scholars13. In the early stages of research, scholars primarily focused on utilizing optimization algorithms to deduce object trajectories. One of the simplest MOT methods is IOUTracker, which relies solely on the intersection over the union of bounding boxes to achieve object tracking14. Over time, researchers gradually introduced motion models and Kalman filters to predict object positions in the next frame. Despite these improvements enhancing speed and performance, MOT algorithms still exhibit unsatisfactory performance in complex occlusion and object loss scenarios15. To overcome these challenges, researchers began incorporating re-identification (ReID) features as appearance models, utilizing visual features of objects across different frames to match objects and improve the correlation accuracy between trajectories and detection results16. Apart from ReID, some studies utilize image segmentation techniques for object identification and tracking to handle occlusion scenarios better17.

Additionally, some researchers have started using recurrent neural networks or attention mechanisms to model the spatiotemporal relationships between objects to enhance tracking accuracy and stability. Recently, attention mechanisms and recurrent neural networks have been used to model spatiotemporal relationships between objects, improving tracking accuracy and stability. However, many of these methods rely on a single matching strategy and fail to consider the diverse attributes of different object types, leading to suboptimal performance. Complex combinations of motion, appearance, and spatiotemporal features further reduce effectiveness, especially in dynamic environments. To address these limitations, we propose ETDMOT framework that unifies detection and tracking within a Transformer-based architecture. This approach leverages self-attention mechanisms to capture intricate relationships between objects, providing a more robust and efficient solution for multi-object tracking in drone-based footage.

Object feature-based multi-object tracking methods

As a result of the rapid development of object detectors, feature-based modeling has become widely used in drone-based multi-object tracking algorithms. It achieves multi-object tracking by capturing unique features of objects, such as color, texture, and optical flow. These extracted features must be distinctive to ensure their effective differentiation in feature space, enabling the identification of the most similar objects in the next frame based on similarity criteria. SCTrack employs a three-stage data association method, integrating object appearance models, spatial distance, and explicit occlusion handling units. The system relies on tracked objects’ motion patterns and takes environmental constraints into account, demonstrating exemplary performance in handling occluded objects18. The OSIM network trains a vast residual network on the VeRi dataset to extract object appearance features. They use the Mahalanobis distance of detected bounding boxes as a motion metric and compute the minimum cosine distance of pixels inside bounding boxes as an appearance similarity metric. By combining and weighting these two metrics and using cascade matching to connect data, we can do strong multi-object tracking19. Appearance, similarity, and consistency are often fused to address the subjective setting of fusion proportions between appearance and motion. We find out how similar two objects look to other objects nearby and use the social LSTM network to guess how they will move. Then, we use the weighted appearance similarity and motion prediction to make connections between this object and the object from the previous frame20.

Despite advancements, existing feature-based multi-object tracking methods face key limitations: high computational costs, making real-time tracking in drone footage challenging; frequent false detections due to dynamic aerial environments; and complex object association, as balancing appearance and motion features can be inconsistent and subjective. These challenges highlight the need for improvement in scalability, accuracy in cluttered environments, and unified feature integration. To address these issues, our approach, ETDMOT, integrates detection and tracking while introducing innovations such as Enhanced Self Characteristics (ESC), which improves semantic alignment by optimizing feature extraction across frames. Combined with CF-SA’s powerful feature extraction and the Storage Attention Layer for long-term consistency, ETDMOT addresses these limitations, optimizing both accuracy and computational efficiency in drone-based environments.

Joint detection and tracking multi-object methods

To enhance the operational speed of the entire drone aerial multi-object tracking system, researchers actively explore methods that combine object detection and feature extraction to achieve greater levels of sharing in computation. JDE is the first work to attempt this approach, innovatively integrating the feature extraction branch into the YOLOv3 single-stage detector21. FairMOT balances the handling between detection and recognition tasks more evenly by reducing anchor ambiguity using the anchor-free detector CenterNet22. In addition to these joint detection and feature embedding methods, other single-stage trackers have emerged. GLOA creates global-local perception blocks that take in input frames and extract scale variance feature information. It then adds identity embedding branches to prediction heads to get more discriminative identity information23. CenterTrack24 and Chained Tracker25 predict bounding boxes in consecutive frames using a multi-frame approach to facilitate efficient short-term correlation, ultimately forming long-term object trajectories. However, these methods often suffer from numerous identity switches due to difficulties in capturing long-term dependency relationships. Moreover, these methods cannot simultaneously consider the multiple features of objects and the differences in features between different categories, making the tracking of small objects prone to loss. SOE-Net26 incorporates long-range context into pointwise local descriptors. The CASSPR method27 takes out and groups low-resolution data through a sparse voxel branch and gets detailed local data through a pointwise branch. Text2Loc28 uses max pooling and hierarchical transformers to find the global locations of things. It does this by keeping the balance between positive and negative samples through text subgraph contrastive learning and capturing the changing relationships between each sample.

Despite advancements, joint detection and tracking methods face key limitations: frequent identity switches due to difficulties in maintaining long-term dependencies, inadequate handling of diverse object features (especially for small or distinct objects), and limited long-term tracking capabilities. These issues undermine tracking performance, particularly in complex drone environments. By leveraging ODLTM’s precise matching and the Storage Attention Layer’s memory-driven tracking, ETDMOT significantly reduces identity switches and enhances long-term tracking accuracy. These contributions, coupled with the self-attention mechanism, enable better tracking of diverse object features, improving overall efficiency and performance in drone-based applications.

Transformer-based multi-object tracking methods

In recent years, Transformer-based models have achieved tremendous success in computer vision, particularly excelling in object detection, spawning several Transformer-based methods in the drone MOT domain. Swin-JDE29, STDFormer30, and MOTR31 are highly regarded, all built upon the DETR32 and its derivative models, becoming representatives of online tracking. The Transformer’s long-range modeling abilities are used by STDFormer to find intent information in time motion and decision information in space interaction. With the help of Transformer, Swin-JDE matches each trajectory and detection information by taking into account the detection confidence, appearance embedding distance, and IoU distance factors. Moreover, MOTR achieves end-to-end object tracking by iteratively updating tracking queries, eliminating the need for complex post-processing steps. MeMOT33, similar to MOTR, is based on attention mechanisms and can be predicted by focusing on object states. Although these methods pave the way for new tracking paradigms, their performance still falls short of state-of-the-art tracking algorithms. Standard self-attention can capture fine-grained interactions between short and long distances. However, when paying attention to high-resolution feature maps, they suffer from high computational costs, leading to explosive growth in time and memory costs. EANTrack34 enhances the model’s discriminative ability by integrating relevant feature information through a complementary structure and functionality. ASTMT35 utilizes a spatial-temporal memory network to store scene information and reduce interference, along with an aligned matching module to alleviate occlusion effects, thereby enhancing tracking accuracy. RPformer36, a novel parallel Transformer-based tracker, integrates the principles of Siamese and Transformer methodologies through its design, which includes two complementary Transformer modules and a feature fusion head.

While Transformer-based approaches introduce innovative tracking paradigms, they face several limitations: high computational costs due to the self-attention mechanism on high-resolution feature maps, challenges in handling occlusions and maintaining trajectory consistency, and difficulty in differentiating diverse object features like size and movement. These issues hinder real-time tracking, increase identity switches, and reduce performance in complex drone environments. Our proposed ETDMOT framework optimizes the self-attention mechanism for greater computational efficiency, improves occlusion handling through ESC, and ensures long-term trajectory consistency using the Storage Attention Layer. These enhancements enable more accurate, efficient, and reliable tracking in complex drone environments.

Methodology

ETDMOT consists of four main modules, as shown in Fig. 1: (1) Object detection and trajectory feature joint extraction module, which detects objects and provides candidate object and trajectory information for each frame of the current time step; the Backbone network efficiently extracts object features from video frames and passes these features to the Transformer Encoder for information aggregation and integration. We choose ResNet-50 as our backbone. Subsequently, the aggregated features and detection and tracking queries serve as inputs to the L-layer decoder for detecting newly appearing objects or continuously tracking labeled objects. (2) In the matching stage, we adopt the ODLTM strategy for label allocation, which comprehensively considers the object’s appearance, space, and Gaussian features to ensure the label allocation’s accuracy and rationality–used to obtain rich contextual information and enhance the expression ability of features. (3) Historical information extraction module SAL (Storage Attention Layer). We have created independent storage for each object to support long-term stable tracking. After fusing the trajectory information of the current frame with the previous frame, the feature, long-term memory vector, and current frame output vector are used as Key, Query, and Value, which are input into the cross-frame long-term interaction module to deeply explore the long-term features of the object, providing strong information support for stable tracking of the object. (4) To ensure the semantic consistency of the input of the cross-frame long-term interaction module, we designed an ESC module (Enhanced Self Characteristics) to extract the semantic features of the trajectory information between the current frame and the previous frame. At the same time, we introduced the Cross Frame Self-Attention mechanism to extract the deep semantic information of stored trajectories accurately. Finally, we combine this processed information with Trajectory Storage as input for the subsequent frame processing to achieve cyclic iteration and ensure the continuity and accuracy of tracking.

Fig. 1
Fig. 1The alternative text for this image may have been generated using AI.
Full size image

The framework of ETDMOT. The ETDMOT framework efficiently extracts object features from video frames using a Backbone network and integrates them with a Transformer Encoder. With ResNet-50 as the backbone, the framework detects new objects and tracks labeled ones using the P-layer decoder. We use the ODLTM strategy for label assignment, considering appearance, spatial, and Gaussian features. Each object has independent storage for long-term, stable tracking. Trajectory information from consecutive frames is fused and used as input to the Cross-frame Long-term Interaction module for extracting long-term features. The ESC module guarantees semantic consistency, whereas the Cross Spatial-SA mechanism extracts deep semantic information. Combined with trajectory storage, continuous and accurate tracking is achieved through cyclic iteration.

Object detection and long-term tracking matching

ODLTM selects appearance, space, and Gaussian features as the basic features for tracking. The appearance features provide unique visual information about the object, which helps to distinguish and recognize it in complex backgrounds. The spatial features describe the object’s motion trajectory and dynamic changes, which helps predict its possible position in the next frame. Gaussian features provide a description of the object’s local or global statistical properties, which helps to address issues of partial occlusion or deformation. These three features each have their own advantages in object tracking and can complement each other, thereby improving tracking accuracy and robustness.Although small objects occupy fewer pixels in the image, their appearance features (such as color, brightness contrast, etc.) are often more prominent and easily distinguishable from other objects. Meanwhile, information such as velocity and acceleration in spatial features is particularly important in small object tracking, as they can more quickly reflect the motion trend and direction of small objects. As for Gaussian features, they also have certain advantages in dealing with slight deformations or partial occlusions of small objects because Gaussian models can tolerate feature changes and noise interference to a certain extent.

The framework of ODLTM is shown in Fig. 2. At moment \(t-1\), there are \(N^{t-1}\) queries categorized into two types: \(N_{Tra}^{t-1}\) tracking queries \(Q_{Tra}\) and \(N_{Det}^{t-1}\) detection queries \(Q_{Det}^{t-1}\), where \(N^{t-1}=N_{Det}^{t-1}+N_{Tra}^{t-1}\). All the queries self-attend each other and then cross-attend the image feature tokens via L decoders, and the output embeddings of the l-th decoder are denoted as \(ET_{Tra}^{l}\) and \(ED_{Det}^{l}\).

Fig. 2
Fig. 2The alternative text for this image may have been generated using AI.
Full size image

The framework of ODLTM. Match the current frame’s detection results with the previous frame’s tracking results through comprehensive matching. High-scoring results in the detection are assigned new IDs, high-scoring tracks in the tracking are retained while low-scoring ones are considered out of sight. Finally, the tracking results of the previous frame are used as Q, and the re-matched trajectories are used as K and V inputs to the Object Feature Cluster, achieving feature clustering. Finally, merge with new trajectories and enter the next iteration.

At moment t, there are \(M_{G}^{t}\) ground truth boxes. Among them, \(M_{Tra}^{t}\) previously tracked objects, denoted as \(\widehat{ET}\), are assigned to \(N_{Tra}^{t-1}\) tracking queries, where \(M_{Tra}^{t} \le N_{Tra}^{t-1}\) as some objects disappear. Formally, the j-th tracking embedding \(e_{j}^{p}\) will be assigned to the same identity with the previous timestamp if still alive at this moment, otherwise zero (disappearing). Besides, \(M_{Det}^{t}\) newborn objects, denoted as \(\widehat{ED}\), are assigned to \(N_{Det}^{t-1}\) detection queries. Specifically, the Hungarian matching algorithm is used to find the optimal pairing between \(F^{i}\) and \(\widehat{F}\) for each decoder by a cost function

$$\begin{aligned} L_{m}=L_{f}(a)+L_{1}(b)+L_{g}(c) \end{aligned}$$
(1)

That takes into account the class scores and box overlapping. Where \(L_{f}(a)\) represents the focal loss for classification, \(L_{1}(b)\) represents the \(L_{1}\) cost of the bounding box, and \(L_{g}(c)\) represents the Kullback-Leibler divergence distance between the object and trajectory frame Gaussian distribution. Next, we provide a detailed introduction to \(L_{g}(c)\).

Traditional trackers often use IOU as an evaluation criterion. However, it is sensitive to small offsets and incredibly ineffective when there is no overlap, leading to missed detections and matches for small objects. To address this issue, we introduce a method based on Gaussian distribution, where the effective response field of feature points is simulated as a Gaussian distribution, and the similarity between detection and trajectory is measured using receptive field distance. Due to the smooth transition of the Gaussian distribution, the influence range of each feature point covers the entire image, with weights gradually decreasing from the center to the edge but always positive, thus achieving a balanced selection of positive samples for objects of different sizes.

As the exact Effective Receptive Field (ERF) of each feature point accounts for only a fraction of the entire Total Receptive Field (TRF), measuring precise ERFs is challenging37. Both ERF and TRF share the same center point, and the theoretical receptive field TRF of the n-th layer in a standard convolutional neural network is represented as \(tr_{n}\):

$$\begin{aligned} tr_{n}=tr_{n-1}+(k_{n}-1)\prod _{i=1}^{n-1}s_{i} \end{aligned}$$
(2)

where \(tr_{n}\) denotes the TRF for each point on the n-th convolutional layer, \(k_{n}\) and \(s_{n}\) represent the kernel size and stride of the convolution operation on the n-th layer. The half radius of TRF is approximated as the half radius of ERF (\(er_{n}\)). The square of \(er_{n}\) is used as the covariance of the two-dimensional Gaussian distribution for standard square convolutional kernels. The range of ERF is modeled as a two-dimensional Gaussian distribution \(N_{e}(\mu _{e},\Sigma _{e})\):

$$\begin{aligned} \mu _{e}=\begin{bmatrix} x_{n} \\ y_{n} \end{bmatrix}, \Sigma _{e}=\begin{bmatrix} er_{n}^{2} & 0 \\ 0 & er_{n}^{2} \end{bmatrix} \end{aligned}$$
(3)

To better measure the matching degree between ERF and trajectory frame, the trajectory frame \((x_{g},y_{g},w_{g},h_{g})\) is also modeled as a standard two-dimensional Gaussian distribution \(N_{g}(\mu _{g},\Sigma _{g})\). Where the center point of each annotated box is used as the mean vector of the Gaussian, and the square of half the side length is used as the covariance matrix:

$$\begin{aligned} \mu _{g}=\begin{bmatrix} x_{g} \\ y_{g} \end{bmatrix}, \Sigma _{g}=\begin{bmatrix} \frac{w_{g}^{2}}{4} & 0 \\ 0 & \frac{w_{g}^{2}}{4} \end{bmatrix} \end{aligned}$$
(4)

The Kullback-Leibler divergence is used to measure the statistical distance between two Gaussian distributions, and the KLD between ERF and trajectory frame is as follows:

$$\begin{aligned} L_{g}(c)=\frac{1}{2}\left( tr(\Sigma _{g}^{-1}\Sigma _{e})+(\mu _{g}-\mu _{e})^{T}\Sigma _{g}^{-1}(\mu _{g}-\mu _{e}) +\ln \left| \frac{\Sigma _{g}}{\Sigma _{e}}\right| \right) \end{aligned}$$
(5)

The high scores in the detection results are assigned a new ID, while the high scores in the tracking continue to be retained, while the low scores are considered out of sight. Input the tracking result of the previous frame as Q and the rematched trajectory as K and V into the Object Feature Cluster to achieve feature clustering. Finally, merge with the new trajectory and enter the next loop.

Cross frame self attention block

The structure of the Cross Frame Self Attention Block is shown in Fig. 3. In CF-SA, we employ distribution multi-head attention to focus on adjacent distributions within a window of input frames for capturing sequential details. This approach consists of four simultaneous components. Let Y represent the features extracted from a collection of input frames in \(\mathbb {R}^{T \times H \times W \times C}\) , where T, H, W, and C denote the number of frames, height, width, and channels, respectively.

Fig. 3
Fig. 3The alternative text for this image may have been generated using AI.
Full size image

The framework of cross frame self attention block. The storage of each trajectory is first randomly divided into four parts, each without overlap. After segmentation, each feature is sent to a different distribution multi-head attention to extract trajectory information from multiple time instances. Finally, the features of each random block are re-stitched to extract long-term features.

For each frame \(Y_i \in \mathbb {R}^{H \times W \times C}\), \(i \in \{1,2,\ldots , T\}\), we first normalize them with Layer Normalization (LN), we partition \(Y \in \mathbb {R}^{T \times H \times W \times C}\) into \(Y^1\), \(Y^2\), \(Y^3\), and \(Y^4\) that are non-overlapping and situated in the exact location. These partitions serve as the input features for every self-attention block.

$$\begin{aligned} Y^1,Y^2,Y^3,Y^4 =\text {Split}(LN(Y)) \end{aligned}$$
(6)

These distributions are denoted as \(Y_i^j \in \mathbb {R}^{T \times W \times \frac{C}{2}}\), where \(j \in \{1,2,\ldots ,4\}\) denote the number of SAs. Then, we generate \(Q_{i,m}^j\), \(K_{i,m}^j\), and \(V_{i,m}^j \in \mathbb {R}^{T \times W \times \frac{C}{4M}}\) as queries, keys, and values respectively. These are derived from a linear projection of \(Y_i^j\).

$$\begin{aligned} Q_{i,m}^j,K_{i,m}^j,V_{i,m}^j =Y_i^j P_m^Q,Y_i^j P_m^K,Y_i^j P_m^V \end{aligned}$$
(7)

where \(P_m^Q\), \(P_m^K\), and \(P_m^V \in \mathbb {R}^{{C}\times \frac{C}{4M}}\), and \(m \in \{1,2,\ldots , M\}\) respectively denote linear transformation matrices for the queries, keys, and values. \(M=8\) is the number of heads for the multi-head distribution attention. Following that, \(Q_{i,m}^j\),\(K_{i,m}^j\), and \(V_{i,m}^j\) are transformed into tensors within \(\mathbb {R}^{T \times W \times \frac{C}{4M}}\). Subsequently, the attention mechanism operates on a strip-wise basis in the temporal dimension, resulting in attended outputs denoted as \(O_{i,m}^{j} \in \mathbb {R}^{T \times W \times \frac{C}{2M}}\).

$$\begin{aligned} O_{i,m}^{j} =\text {Softmax}\left( \frac{Q_{i,m}^j (K_{i,m}^j )^\top }{\sqrt{\frac{W \times C}{2M}}}\right) V_{i,m}^j \end{aligned}$$
(8)

Subsequently, \(O_{i,m}^{j}\) undergoes reshaping into \(T \times W \times \frac{C}{4M}\) for all heads, followed by concatenation along the channel dimension, resulting in \(O_i^{j} \in \mathbb {R}^{T \times W \times \frac{C}{4}}\). These H attended outputs, \(O_i^{j}\), are then consolidated into \(O^{j} \in \mathbb {R}^{T \times H \times W \times \frac{C}{4}}\), serving as the reweighted features in the horizontal direction. Following the same procedure, we obtain the final output:

$$\begin{aligned} \widehat{O}^{CF}= & \text {Conv}_{1 \times 1}(O^{1} \oplus O^{2} \oplus O^{3} \oplus O^{4})+Y \end{aligned}$$
(9)
$$\begin{aligned} {O}^{CF}= & MLP(LN(\widehat{O}^{CF})+\widehat{O}^{CF},\widehat{O}^{CF} ) \end{aligned}$$
(10)

we then generate the final CF-SA output \(O^{CF} \in \mathbb {R}^{T \times H \times W \times C}\). Its total space complexity is \(O((H+W)T^4)\).

Enhanced self characteristics

The structure of ESC is illustrated in Fig. 4. ESC dynamically integrates information from various positions of the feature map using multi-head self-attention in a residual manner. The attention mechanism cannot discern the positional information of the input feature sequence. Therefore, we introduce a spatial positional encoding process to the input \(X \in \mathbb {R}^{d \times N_x}\). We employ a sine function to generate spatial positional encoding. Ultimately, the mechanism of ESC can be summarized as follows:

$$\begin{aligned} X_{\text {ESC}} = X + \text {Multihead}(X + P_x, X + P_x, X) \end{aligned}$$
(11)

where \(P_x \in \mathbb {R}^{d \times N_x}\) represents the spatial positional encodings, and \(X_{\text {ESC}} \in \mathbb {R}^{d \times N_x}\) denotes the output of ESC.

Fig. 4
Fig. 4The alternative text for this image may have been generated using AI.
Full size image

The ESC module is based on multi-head self-attention in a residual form.

Storage attention layer

Whenever a new object appears, we create a corresponding long-term memory vector \(S_{i}^{tra}\) using its feature vector. Since the object’s appearance gradually changes over time, it is necessary to update this long-term memory vector continuously. The storage attention layer structure is shown in Fig. 5. Considering that the changes in objects between consecutive frames are typically smooth, we use the exponential moving average method to update this vector, calculating the weighted average of all previous vectors with weights decaying exponentially. The specific expression is as follows:

$$\begin{aligned} SAL_{t} = \alpha \left[ S_{tra}^{1} + (1 - \alpha )S_{tra}^{2} + (1 - \alpha )^{2}S_{tra}^{3} + \cdots \right] = \alpha \sum _{i=1}^{t}(1 - \alpha )^{i-1}S_{tra}^{i} \end{aligned}$$
(12)

where \(\alpha \in (0, 1]\). In practice, \(\alpha\) is set to 0.01 to ensure that the long-term memory vector of the same object undergoes only slight changes across consecutive frames. This ensures smooth and stable updates of the memory vector, continuously maintaining the latest feature representation of the object and preserving its stability and continuity in long-term memory. After the computation of one frame is completed, all output vectors \(O_{Tra}^{t}\) corresponding to the tracked objects are input into the subsequent module, and the tracking embedding \(E_{Tra}^{t+1}\) for use by the next frame decoder is updated accordingly.

Fig. 5
Fig. 5The alternative text for this image may have been generated using AI.
Full size image

The framework of storage attention layer.

The current frame’s output vector \(O_{Tra}^{t}\) is dynamically combined with the previous frame’s output vector \(O_{Tra}^{t-1}\) to generate the fused feature \(\widehat{O}_{Tra}^{t}\). This method of enhancement using adjacent frames effectively strengthens the representation of objects in the video and produces more robust features. The Storage-Attention Layer employs a separate layer of self-attention mechanism. The fused feature \(\widehat{O}_{Tra}^{t}\), long-term memory vector \(M_{Tra}^{t},\) and current frame output vector \(O_{Tra}^{t}\) of each object are respectively used as inputs for the query, key, and value of the self-attention mechanism. This allows interaction between objects with different IDs, enabling the learning of more discriminative feature representations beneficial for subsequent frame tracking. We inject the corresponding object’s long-term memory vector for each object vector outputted by the Storage-Attention Layer through simple addition. Since the update of long-term memory is stable and smooth, this step helps to stabilize subsequent tracking embeddings over time, thereby improving tracking performance.

Experiments

Dataset and evaluation metrics

The proposed algorithm underwent comprehensive evaluations on the VisDrone MOT38 and UAVDT39 datasets, which encompass diverse drone-captured scenes, facilitating a thorough assessment of the proposed methods’ practical effectiveness. Extensive evaluations compared the algorithm with other leading multi-object trackers across various scenarios and conditions. Established MOT evaluation metrics were utilized to assess performance comprehensively, aiming to gauge overall effectiveness and pinpoint potential weaknesses in each model. The metrics include:

  1. (1)

    FP (\(\downarrow\)): Number of false positives in the entire video.

  2. (2)

    FN (\(\downarrow\)): Number of false negatives in the entire video.

  3. (3)

    IDSW (\(\downarrow\)): Number of identity switches in the entire video.

  4. (4)

    FM (\(\downarrow\)): Number of ground truth trajectories interrupted during the tracking process.

  5. (5)

    IDF1 (\(\uparrow\)): Ratio of correctly identified detections to the computed detections and ground truth.

  6. (6)

    MOTA (\(\uparrow\)): Combined FP, FN, and IDSW, scored as follows:

    $$\begin{aligned} MOTA = 1- \frac{FN+FP+IDSW}{GT} \end{aligned}$$
    (13)

    7) MOTP (\(\uparrow\)): Mismatch between ground truth and predicted results calculated as:

    $$\begin{aligned} MOTP = 1-\frac{ {\Sigma }_{t,i} d_{t,i}}{{\Sigma }_{t} c_{t}} \end{aligned}$$
    (14)

    These metrics contribute to a comprehensive assessment of MOT algorithm performance in various aspects, providing in-depth insights into system effectiveness.

Experimental settings

The detector is initialized with pre-existing weights obtained from training on the COCO dataset. We train the detector using SGD with the following parameters: 150 epochs, a batch size of 16, a learning rate of 0.02, momentum set to 0.9, and decay set to 0.0001. We train the detector on both the VisDrone and UAVDT datasets and perform validation using the same set of verification images. We execute the testing on hardware (NVIDIA RTX 4090 with 24GB of memory) and calculate the average over the top 100 most reliable detection results.

Comparative experiments

The comparison was conducted between DeepSORT40, ByteTrack41, BoT-SORT42, UAVMOT43, DCMOT44, MTTJDT45, SimpleTrack46 and JDE47, as well as Transformer-based methods including TransTrack48, TrackFormer49, TransCenter50, MOTR31, MeMOT33, GTR51, TR-MOT52, STN-Track53, STDFormer30 and GAO-Tracker54. In the datasets, the distribution of object categories is uneven, leading to performance differences in detection models across different classes. Different thresholds were set for different object categories to balance the detection accuracy of each class during evaluation. The threshold for cars was set at 0.3, trucks at 0.1, pedestrians at 0.4, and buses, which are visually more variable, were set at a lower threshold of 0.05 to capture bus objects in various scenes more effectively.

Tables 1 and 2 comprehensively compare ETDMOT with other popular trackers on the VisDrone MOT and UAVDT datasets. Evaluation includes critical metrics such as MOTA, MOTP, IDF1, and IDSW, along with comparisons with other methods. ETDMOT demonstrates outstanding performance by effectively utilizing position and appearance information. In comparison, DeepSORT independently associates each class based solely on position information to extend its framework for handling multiple classes. ByteTrack utilizes low-score detections for similarity tracking and background noise filtering. BoT-SORT introduces camera motion compensation to enhance matching. UAVMOT enhances object feature association through an ID feature update module. SimpleTrack creates a new association matrix by merging object embedding cosine and GIOU distances. TransTrack, based on Transformer, tracks existing objects in the current frame using a Query-Key mechanism while completing new object detection. TrackFormer considers position, occlusion, and object recognition features simultaneously. TransCenter globally and robustly predicts the heatmap of object centers for the association. MOTR models the entire trajectory of a object with a track query. MeMOT encodes information from all previous frames as clues for tracking. GTR expands the window length of matching and fully utilizes interaction information. TR-MOT achieves reliable association using visual temporal features. STDFormer utilizes Transformer’s remote modeling ability to extract intent information in temporal motion and decision information in spatial interaction. However, these methods use a single matching rule for all detections, failing to accurately model various object categories, resulting in inferior tracking performance. MTTJDT improves tracking efficiency by using adjacent frames as shared feature inputs. STN-Track uses the Swin Transformer as the neck of YOLOX for the detection module and the G-Byte data association method for the tracking module.

Table 1 Comparison between ETDMOT and the latest multiple trackers tested on the VisDrone dataset.
Table 2 Comparison between ETDMOT and the latest multiple trackers tested on the UAVDT dataset.

Figures 6 and 7 illustrate a significant comparison between ETDMOT and other trackers on the VisDrone and UAVDT datasets. Leveraging its outstanding performance, ETDMOT can accurately identify diverse objects from the perspective of drones, covering various objects such as pedestrians and vehicles. It not only accurately separates objects from cluttered backgrounds, but it also delineates precise bounding boxes around these objects. Meanwhile, EDTMOT achieved smooth frame rates of 20.4 FPS and 22.1 FPS on the VisDroneMOT and UAVDT datasets, respectively, meeting real-time processing standards. At the same time, it demonstrates extraordinary computing power with 751 GFLOPs of floating-point operations per second, maintaining a small number of parameters and further highlighting its powerful processing performance. Thanks to the jointly optimized object detection and tracking strategy, ETDMOT maintains stable and efficient tracking performance even in complex scenarios involving object occlusion and high-speed movement. Remarkably, despite facing multiple complex factors, such as frequent changes in drone perspective, fluctuations in lighting conditions, and significant variations in object sizes, the algorithm still demonstrates stable tracking capabilities, ensuring the continuity and accuracy of the tracking.

Fig. 6
Fig. 6The alternative text for this image may have been generated using AI.
Full size image

Comparison of visualization results between ETDMOT and other latest trackers on the VisDrone dataset.

Fig. 7
Fig. 7The alternative text for this image may have been generated using AI.
Full size image

Comparison of visualization results between ETDMOT and other latest trackers on the UAVDT dataset.

Ablation experiments

On the VisDrone and UAVDT datasets designed explicitly for drone aerial scenes, we conducted comprehensive ablation experiments to validate the effectiveness of our proposed Transformer-based end-to-end multi-object tracking method. Specifically, we focused on assessing the performance impact of several key components, including the new trajectory detection label matching strategy, cross-frame self-attention mechanism, Enhanced Self Characteristics module, and Storage Attention Layer.

Impact of ODLTM

To validate the effectiveness of our ODLTM, we compared it with traditional methods, including IOU, Cooperation label assignment, and Tracklet-Aware Label Assignment. Baseline uses IOU matching. Table 3 presents the performance evaluation results of the proposed ETDMOT combined with different matching strategies. In the evaluation results on the UAVDT dataset, the Cooperation label assignment yielded the best performance among the matching strategies, with MOTA, MOTP, and IDF1 metrics reaching 62.1%, 75.4%, and 66.4%, respectively. Additionally, using ODLTM resulted in the lowest IDSW count. However, on the more challenging VisDrone dataset, the performance advantage of ODLTM was more pronounced. It achieved 38.9% MOTA, 76.9% MOTP, and 65.7% IDF1, significantly reducing FP count.

Table 3 Performance evaluation of the proposed ETDMOT model combined with different label assignment.

Since VisDrone contains many crowded scenes, the experimental results indicate that ODLTM can effectively improve MOT tracking performance in congested situations. The experimental results demonstrate that our new strategy significantly enhances the accuracy and rationality of label assignment by comprehensively considering object appearance, spatial position, and Gaussian features, thereby improving tracking performance. The architecture of CALA is more suitable for the UAVDT dataset, while the data augmentation method used by ODLTM is not as suitable as CALA for the UAVDT dataset. However, ODLTM performs poorly in MOTA metrics, but overall performance is still acceptable and has better generalization.

Impact of cross spatial self attention block

Self-attention mechanisms excel within a single frame but perform poorly in modeling long-term dependencies. Multi-head attention mechanisms effectively segment attention but have limited capabilities in modeling long-term dependencies. Cross-attention mechanisms handle different information sources effectively but are unsuitable for modeling long-term dependencies. Cross-frame self-attention mechanisms focus on long-term dependency relationships, improving object tracking accuracy by leveraging cross-frame information. Experimental results in Table 4 demonstrate that cross-frame self-attention mechanisms excel in capturing long-term dependencies between objects, providing richer contextual information for the tracking process. This enables our tracking method to maintain high accuracy and stability when facing complex scenarios and long-duration tracking tasks.

Table 4 Performance comparison of different cross frame feature extraction methods.

Impact of enhanced self characteristics

It was compared with two other methods to gain deeper insights into the role of the Enhanced Self Characteristics (ESC) module in cross-frame long-term interactions. Firstly, we examined the scenario without the ESC module, resulting in a lack of extraction and enhancement of semantic trajectory information when processing cross-frame information, thereby affecting the stability and accuracy of tracking significantly when objects are occluded, deformed, or exhibit complex motion trajectories. Secondly, we compared with attempts to fuse features, which lacked in-depth exploration and specialized processing of trajectory information, resulting in limited effectiveness. Finally, the model with the ESC module could extract semantic features and effectively improve the continuity and accuracy of tracking by combining with Trajectory Storage. This ensures the semantic consistency of input information and enhances the model’s understanding and processing capabilities of trajectory information. As shown in Table 5, the experimental results demonstrate that the ESC module performs excellently in extracting semantic features, effectively enhancing the continuity and accuracy of tracking.

Table 5 Performance comparison of different semantic alignment methods.

Impact of storage attention layer

The Storage Attention Layer aims to optimize the utilization of stored information, thereby enhancing tracking stability and accuracy. In the absence of the Storage Attention Layer, when the model processes cross-frame information, it fails to effectively integrate historical trajectory information, leading to frequent loss and confusion of tracking information. Particularly in scenarios where objects are occluded, situated in complex backgrounds, or exhibit rapid motion, the model struggles to maintain stable tracking of the objects, resulting in a significant decline in tracking performance. Subsequently, we compared with methods integrating historical trajectory information through weighted averaging. Although this approach considers historical information to some extent, due to the lack of effective filtering and appropriate weight allocation, it cannot fully utilize stored trajectory information nor effectively distinguish the importance of different trajectories. Therefore, this method has limited effectiveness in improving tracking performance. The results are shown in Table 6.

Table 6 Performance comparison of different trajectory information storage methods.

In contrast, the Storage Attention Layer introduces an attention mechanism that adaptively adjusts the weights of different trajectory information, enabling the model to focus more on historical trajectories closely related to the objects in the current frame. Additionally, this layer effectively integrates and updates stored information, ensuring the coherence and consistency of information during the tracking process. Experimental results demonstrate that the Storage Attention Layer’s introduction significantly improves the model’s tracking performance.

Real scenario validation

To test the effectiveness and adaptability of our algorithm in drone object tracking, we selected DJI drone as the data acquisition platform. Flight tests were conducted in multiple real-world scenarios, and video data containing different lighting conditions, background complexity, and object motion states were collected. It covers various environments such as urban streets, parks and green spaces, farmland, and forests, ensuring the breadth and represent ativeness of the experiment. The result is shown in Fig. 8. Algorithms can accurately identify and track objects in videos, ensuring stable tracking even when the object is moving rapidly, changing posture, or partially occluded.

Fig. 8
Fig. 8The alternative text for this image may have been generated using AI.
Full size image

Real scenario validation.

Conclusion

The proposed Transformer-based end-to-end multi-object tracking method in the domain of drone aerial imagery has achieved significant results. Integrating object detection and tracking into a unified model and introducing new modules such as trajectory detection label matching strategy, cross-frame self-attention mechanism, and Enhanced Self Convolution enables accurate and efficient tracking of objects. Experimental results demonstrate that the proposed method performs excellently in multi-object tracking tasks, especially in complex backgrounds and occlusion scenarios. Future research directions include further optimizing model performance, improving tracking robustness and real-time capabilities, and exploring more effective feature extraction and correlation modeling methods to address increasingly complex scenarios and objects and, additionally, combining reinforcement learning and other methods to enhance the intelligence level of tracking algorithms further, enabling them to adapt to changing environments and task requirements, and realizing more intelligent, adaptive multi-object tracking systems.

Algorithms have limitations in terms of computational complexity, parameter tuning, and dependence on data quality. Due to the combination of multiple features in the method, high computational resources are usually required for computation. At the same time, spatial features may also become unreliable when objects move at high speeds or when there is occlusion. When integrating multiple features, it is necessary to carefully adjust the weights and parameters of each feature to ensure that they can work together and achieve the best results, which requires a lot of experimentation and tuning work. In addition, the effectiveness of feature extraction largely depends on the quality of the input data. If the input image quality is poor (such as low resolution, high noise, etc.), the accuracy and reliability of feature extraction will be affected. Meanwhile, given that the current model implementation heavily relies on large GPU resources, we plan to research knowledge distillation techniques with the aim of effectively reducing model size and complexity through this technology, thereby achieving seamless migration and application of models to edge devices.

In future research, efforts will be made to improve the accuracy, robustness, real-time performance, and intelligence level of multi-object tracking. We will focus on more efficient deep learning models and continue to explore and develop more efficient and lightweight deep learning models to reduce computational resource consumption and improve real-time processing capabilities. Research on how to effectively fuse different modalities of data (such as images, videos, text, sound, etc.) to provide more comprehensive information support and improve the accuracy and robustness of multi-object tracking. For example, introducing point cloud processing technology into image tracking to enhance the ability to handle complex occlusion situations55. By combining natural language processing and computer vision technology, natural language descriptions can be used to assist in object tracking.