Fig. 1

The framework of ETDMOT. A backbone network (ResNet-50) extracts object features from video frames, and a Transformer encoder integrates them. A P-layer decoder then detects new objects and tracks labeled ones. Labels are assigned with the ODLTM strategy, which considers appearance, spatial, and Gaussian features. Each object has its own independent storage to support stable long-term tracking. Trajectory information from consecutive frames is fused and fed to the Cross-frame Long-term Interaction module, which extracts long-term features. The ESC module maintains semantic consistency, while the Cross Spatial-SA mechanism extracts deep semantic information. Combined with the trajectory storage, continuous and accurate tracking is achieved through cyclic iteration.
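The per-object storage mentioned in the caption can be pictured with a small sketch. This is only an illustration under stated assumptions, not the ETDMOT implementation: the class name `TrajectoryBank`, the `max_len` buffer size, and the use of a plain element-wise mean as the fusion step are all hypothetical stand-ins for whatever the framework actually uses.

```python
from collections import defaultdict, deque


class TrajectoryBank:
    """Hypothetical per-object store: one bounded feature buffer per track ID."""

    def __init__(self, max_len=4):
        # max_len is an assumed history length, not a value from the paper.
        self.max_len = max_len
        self._store = defaultdict(lambda: deque(maxlen=max_len))

    def update(self, track_id, feature):
        # Append this frame's feature; the oldest entry is dropped once
        # the buffer for this object is full.
        self._store[track_id].append(list(feature))

    def fused(self, track_id):
        # Assumed fusion: element-wise mean over the stored frames.
        # Returns None if the object has no history yet.
        history = self._store.get(track_id)
        if not history:
            return None
        n = len(history)
        return [sum(vals) / n for vals in zip(*history)]


bank = TrajectoryBank(max_len=2)
for feat in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    bank.update(track_id=1, feature=feat)
print(bank.fused(1))
```

Because each track ID owns a separate buffer, updating one object never alters another object's history, which is the property the caption describes as independent storage.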