Fig. 1 | Scientific Reports

Fig. 1

From: End-to-end multiple object tracking in high-resolution optical sensors of drones with transformer models

The framework of ETDMOT. The ETDMOT framework efficiently extracts object features from video frames using a Backbone network and integrates them with a Transformer Encoder. With ResNet-50 as the backbone, the framework detects new objects and tracks labeled ones using the P-layer decoder. We use the ODLTM strategy for label assignment, considering appearance, spatial, and Gaussian features. Each object has independent storage for long-term, stable tracking. Trajectory information from consecutive frames is fused and used as input to the Cross-frame Long-term Interaction module for extracting long-term features. The ESC module guarantees semantic consistency, whereas the Cross Spatial-SA mechanism extracts deep semantic information. Combined with trajectory storage, continuous and accurate tracking is achieved through cyclic iteration.
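The per-object storage described in the caption can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `TrajectoryStore`, the window size `max_len`, and the element-wise mean used as the fusion step are all assumptions made for illustration, since the caption does not specify how trajectory information is fused before it reaches the Cross-frame Long-term Interaction module.

```python
from collections import defaultdict, deque


class TrajectoryStore:
    """Hypothetical per-object memory: one bounded history per track ID."""

    def __init__(self, max_len=4):
        # max_len is an assumed window size, not a value from the paper.
        self.max_len = max_len
        self._tracks = defaultdict(lambda: deque(maxlen=max_len))

    def update(self, track_id, feature):
        """Append this frame's feature vector for one tracked object."""
        self._tracks[track_id].append(list(feature))

    def fused(self, track_id):
        """Element-wise mean over stored frames (a stand-in for real fusion)."""
        history = self._tracks.get(track_id)
        if not history:
            return None  # unknown or empty track
        n = len(history)
        return [sum(column) / n for column in zip(*history)]


# Cyclic iteration over three frames for the object with track ID 7.
store = TrajectoryStore(max_len=2)
for frame_features in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    store.update(7, frame_features)
fused_track_7 = store.fused(7)  # only the two most recent frames remain
```

Keeping a separate bounded queue per track ID means one object's history never overwrites another's, which is one plausible reading of "independent storage" in the caption. In a real model the mean would be replaced by a learned interaction module.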
