Abstract
With the rapid advancement and application of Unmanned Aerial Vehicles (UAVs), target detection in urban scenes has made significant progress. Achieving precise 3D reconstruction from oblique imagery is essential for accurate urban object detection in UAV images. However, challenges persist due to low detection accuracy caused by subtle target features, complex backgrounds, and the prevalence of small targets. To address these issues, we introduce the Polysemantic Cooperative Detection Transformer (Pc-DETR), a novel end-to-end UAV image target detection network. Our primary innovation, the Polysemantic Transformer (PoT) Backbone, enhances visual representation by leveraging contextual information to guide a dynamic attention matrix. This matrix, formed through convolutions, captures both static and dynamic features, resulting in superior detection. Additionally, we propose the Polysemantic Cooperative Mixed-Task Training scheme, which employs multiple auxiliary heads for diverse label assignments, boosting the encoder’s learning capacity. This approach customizes queries and optimizes training efficiency without increasing inference costs. Comparative experiments show that Pc-DETR achieves a 3% improvement in detection accuracy over the current state-of-the-art MFEFNet, setting a new benchmark in UAV image detection and advancing methodologies for intelligent UAV surveillance systems.
Introduction
The advancement of UAV technology has led to the increased use of drones for aerial imaging in urban settings, aiding tasks such as traffic monitoring and urban reconstruction. In a recent study1, researchers introduced a template-guided frequency-attention mechanism and an adaptive cross-entropy loss function to improve the robustness of UAV visual tracking; the method extracts frequency-domain information effectively and handles complex motion patterns well. MobileTrack2 designed an efficient mobile tracking system based on a lightweight Siamese network for high-frame-rate UAV applications, significantly reducing computational overhead and offering clear advantages in computational efficiency and speed. Further work3 explored multi-UAV cooperative single-target tracking, improving tracking accuracy through consistency characterization mining. Xue et al.4 proposed a query-guided redetection-based visual tracking method for UAVs that effectively mitigates the target occlusion problem. SmallTrack5 combines wavelet pooling with graph-enhanced classification for UAV small-target tracking and markedly improves performance in complex scenes; although its wavelet pooling strategy handles different scales well, its real-time performance and its generalization to different types of small targets in dynamic scenes remain limited. Overall, despite significant advances in UAV detection and tracking, existing algorithms still suffer from poor real-time performance and low accuracy for small targets. Moreover, UAV images frequently feature small, easily obscured targets, which further complicates accurate detection.
To address this issue, data augmentation techniques, such as rotation, cropping, and splicing, have been employed to improve the representation of small targets in typically sparse datasets. For instance, Zhang et al.6 proposed cropping UAV images into smaller segments for better feature extraction, although this method risks slicing through targets. Tang et al.7 used rotational copying to augment datasets, but this approach fails to effectively capture small-target features. Additionally, Chen et al.8 introduced the CSRGAN network to enhance low-resolution images, though this can amplify background noise. Zhou et al.9 combined SRGAN with FPN to better detect small targets, although up-sampling may degrade image quality, while Hou et al.10 developed a GAN with centroid weights to refine feature mapping, which also risks increasing background clutter.
Context learning has shown promise in enhancing small target detection by providing additional feature information. For example, Oliva et al.11 highlighted the importance of context in target detection, and Liang et al.12 measured spatial distances between targets, although this method requires high computational resources. Hong et al.13 created a contextual attention module that adjusts feature map scales but risks data overload. These systems, depicted in Fig. 1a, mainly rely on isolated interactions between queries and keys to form attention matrices but often overlook the contextual data provided by adjacent keys.
Traditional detection methods, such as the Viola-Jones algorithm and HOG, struggle with speed and accuracy14,15,16,17. In contrast, CNNs dominate modern detection, with single-stage algorithms like YOLOv318 offering faster detection at a slight accuracy cost. Two-stage algorithms, like Faster R-CNN19, require more computational resources but provide higher accuracy. Recent advancements focus on multi-scale fusion, attention mechanisms, and super-resolution feature generation.
Object detection has emerged as an important field, with the R-CNN family19,20,21 and a series of variants such as ATSS22, RetinaNet23, FCOS24, and PAA25 achieving remarkable breakthroughs. Their main idea is a one-to-many label-matching rule: the detector assigns multiple coordinates, typically fitted with anchors23, to each ground-truth box to supervise the generation of the final prediction. While these models demonstrate superior performance, they are heavily reliant on hand-designed components (e.g., non-maximum suppression). To simplify the implementation of end-to-end detectors, the DEtection TRansformer (DETR)26 emerged. This framework views object detection as a set-prediction problem, forming a one-to-one assignment through a Transformer encoder and decoder in which each ground-truth box is matched with a single query, thereby avoiding redundant hand-designed components. Transformers significantly advanced natural language processing and have since been adapted to visual recognition, and DETR26 revolutionized target detection by conceptualizing it as a set-prediction task. However, its one-to-one matching mechanism often leads to inferior performance compared with classical detectors that utilize one-to-many label assignment: the one-to-one set-matching rule yields only a small number of effective positive queries, resulting in inefficient training.
Based on previous studies, two main algorithmic issues need to be addressed to achieve accurate target detection in UAV aerial imagery:
-
The traditional self-attention module fails to capture contextual information between adjacent keys, leading to reduced feature richness and difficulty in accurately identifying small targets.
-
One-to-one matching in end-to-end object detectors results in poorer performance compared to models that assign multiple labels to a single item.
Comparison of ordinary self-attention with the Polysemantic Transformer (PoT). (a) Conventional self-attention uses only isolated query-key pairs to compute the attention matrix, so the contextual information among keys is not adequately exploited. (b) The PoT module first applies a 3\(\times\)3 convolution to mine the static contextual information among keys; then, from the query and the context keys, two consecutive 1\(\times\)1 convolutions generate the attention matrix and the dynamic contextual information, and the two are finally combined to form the output.
In Fig. 1b, this work introduces the Polysemantic Transformer (PoT), which integrates contextual mining with self-attention to enhance feature representation. The PoT employs a 3\(\times\)3 convolution to gather contextual information from adjacent keys, followed by 1\(\times\)1 convolutions to generate attention matrices that leverage the relationships between queries and keys.
Furthermore, we present the Pc-DETR architecture for UAV image detection, which utilizes a one-to-many label assignment strategy to enhance training efficiency. By employing auxiliary heads with diverse label assignments, we optimize the learning capabilities of the encoder, significantly improving detection performance.
Our contributions are outlined as follows:
-
1.
Development of the Polysemantic Transformer (PoT) block to enhance visual recognition in UAV aerial imagery by leveraging contextual information.
-
2.
Introduction of the Polysemantic Cooperative Mixed-Task Training (Pc-DETR) scheme to optimize DETR-based detectors through diverse label assignment methods.
-
3.
Improvement in training efficiency for the decoder by extracting the coordinates of positive instances from auxiliary heads, streamlining inference without adding computational burden.
-
4.
Experimental results on the VisDrone dataset demonstrate that our approach outperforms leading UAV detection algorithms, achieving a 3% higher mean average precision (mAP) score compared to MFEFNet.
Review work
Review of the transformer
The Transformer’s self-attention mechanism has shown excellent results in natural language processing (NLP) tasks, which has sparked researchers’ interest in applying self-attention mechanisms to visual scenes. Initially in NLP, the self-attention mechanism27 was designed to capture long-distance dependencies in sequence modeling. In the field of computer vision (CV), a straightforward way to adapt this mechanism from NLP to CV is to perform self-attention operations on feature vectors at different spatial locations within an image.
One of the early attempts to introduce self-attention into ConvNets was the non-local operation28, which applies self-attention as an additional module on top of the convolutional output. In29, convolutional operations were augmented with a global multi-head self-attention mechanism to improve image classification and object detection. However, because global self-attention28,29 scales poorly when applied to the entire feature map, later work30,31 proposed applying self-attention within local blocks (e.g., 3\(\times\)3 grids). This local self-attention design effectively reduces the parameters and computation required by the network, allowing it to potentially replace convolutional operations throughout the deep architecture.
Recently, self-supervised representation learning has been investigated by reshaping the original image into a one-dimensional sequence and employing the Sequence Transformer32 for autoregressive prediction. Subsequently, pure Transformers were applied directly to sequences of local features or image patches for object detection and image recognition26,33. A more recent study34 designed a high-performance backbone network by replacing the last three 3\(\times\)3 convolutional layers of ResNet with global self-attention layers. In this work, we likewise focus on the self-attention mechanism in the visual backbone. A main drawback of directly exploiting traditional self-attention is that it neglects to explicitly model the rich contextual information between neighboring keys. In contrast, the Polysemantic Transformer (PoT) fully incorporates this contextual information into the feature maps, unifying context mining and self-attention learning within a single architecture while maintaining a favorable number of parameters.
BiFormer35 proposes a bi-level routing attention mechanism that splits the attention computation into local and global levels to improve the efficiency and performance of Vision Transformers in multi-scale feature processing. By pruning redundant computation, the method improves computational efficiency while preserving the ability to express complex image features, effectively balancing performance and resource consumption. Building on this line of research, we propose the Polysemantic Transformer (PoT) module, whose core idea is improved contextual modelling, enabling the model to better capture the relationships among different regions of an image in visual tasks. Whereas BiFormer concentrates on making attention computation more efficient through bi-level routing and on multi-scale feature processing, PoT focuses on fusing global and local information through an explicit contextual mechanism to improve the model's understanding of complex scenes. PoT does not rely solely on local features; it also takes the more global contextual information of the image into account, allowing the model to recognize relationships between objects in complex visual scenes and to handle visual tasks more effectively.
Review of matching rules
One-to-one target matching rules are prevalent in contemporary research. In DETR26, this strategy is implemented using the Hungarian algorithm, which addresses an optimal assignment problem to minimize the total cost between predicted and true labels. The cost function typically comprises cross-entropy loss for category prediction, along with L1 loss and GIoU loss for bounding box prediction.
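To make the one-to-one assignment concrete, the following minimal sketch (assuming NumPy and SciPy; the cost uses only a class-probability term and an L1 box term with illustrative weights rather than the full cross-entropy + L1 + GIoU combination, and the function name is ours) applies the Hungarian algorithm via scipy.optimize.linear_sum_assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes,
                    w_cls=1.0, w_l1=5.0):
    """Toy one-to-one matcher: pred_probs (N, C), pred_boxes (N, 4),
    gt_labels (M,), gt_boxes (M, 4). Returns matched (pred_idx, gt_idx)."""
    # Classification term: negative probability of the ground-truth class.
    cost_cls = -pred_probs[:, gt_labels]               # (N, M)
    # Box term: pairwise L1 distance between predicted and true boxes.
    cost_l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = w_cls * cost_cls + w_l1 * cost_l1            # (N, M) total cost
    pred_idx, gt_idx = linear_sum_assignment(cost)      # minimise total cost
    return pred_idx, gt_idx

# Usage: 5 queries, 2 ground-truth objects, 3 classes.
rng = np.random.default_rng(0)
probs = rng.random((5, 3)); probs /= probs.sum(-1, keepdims=True)
boxes = rng.random((5, 4))
print(hungarian_match(probs, boxes, np.array([0, 2]), rng.random((2, 4))))
```

Each ground-truth box thus receives exactly one query, which is the source of the supervision sparsity discussed later.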
DN-DETR36 accelerates DETR training by introducing a query-denoising strategy, feeding noised ground-truth boxes as auxiliary queries with known targets, yet it retains the Hungarian algorithm for its set-matching strategy, preserving the fundamental one-to-one matching framework. DINO37 advances the design of DN-DETR by incorporating contrastive denoising and an improved deformable attention module to bolster detection performance; like its predecessors, DINO also relies on the Hungarian algorithm for one-to-one matching. DAB-DETR38 integrates the concept of anchor boxes into DETR by formulating queries as dynamic anchor boxes, enhancing the model's ability to detect small-sized targets; even with this reformulation, it continues to employ the Hungarian algorithm for one-to-one matching between predictions and actual labels, ensuring the integrity of the set predictions. Group-DETR39 introduces multiple groups of object queries and conducts one-to-one matching within each group, so that the groups collectively provide denser supervision; within each group, the Hungarian algorithm is still used for set matching. H-DETR40 adds an auxiliary one-to-many matching branch alongside the original one-to-one branch during training to densify supervision, and in its primary branch it likewise persists in using the Hungarian algorithm to ensure effective correspondence between predicted and actual labels.
In summary, although DETR and its variants differ in model architecture and specific implementations, they consistently employ the Hungarian algorithm as the central strategy for one-to-one set matching. The essence of this approach lies in transforming target detection into an optimization problem, where the best match is determined by minimizing the total cost between predicted and true labels, leading to efficient and accurate detection.
One-to-many target matching rules are also prevalent in contemporary research. Faster R-CNN41 employs a Region Proposal Network (RPN) and subsequent Region of Interest (RoI) pooling layer to generate potential target candidate regions. During the RPN stage, each anchor point is classified as foreground or background based on its Intersection over Union (IoU) with any real bounding box. These candidate regions are further refined through a series of network layers, leading to category classification and bounding box regression.
RetinaNet23 utilizes Focal Loss, which mitigates category imbalance by down-weighting the loss contribution of the large number of easily classified negative samples (background). It arranges anchors at multiple scales, predicting categories and bounding-box offsets for each, and matches anchors to true bounding boxes based on the IoU, with those above a certain threshold (e.g., 0.5) considered positive samples. FCOS42 (Fully Convolutional One-Stage Detector) discards the traditional anchor-box mechanism in favor of point-level prediction: each position directly predicts a category and a bounding box relative to that position, and a centerness score is used to refine the matching and reduce inaccuracies caused by positional bias. ATSS22 optimizes training-sample selection by dynamically choosing a positive-sample IoU threshold for each ground-truth box based on the IoU distribution of its nearby anchors, improving upon the biases and instabilities of traditional static thresholds. PAA25 (Probabilistic Anchor Assignment) introduces a probability-based anchor assignment mechanism: by estimating the likelihood that each anchor belongs to the foreground and dynamically selecting positive samples accordingly, it reduces the detrimental effects of incorrect label assignments and enhances model performance. A sketch of the basic IoU-threshold form of such one-to-many assignment follows this paragraph.
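As a contrast to one-to-one matching, the sketch below (NumPy; the 0.5 threshold and helper names are illustrative and do not reproduce the exact ATSS or PAA procedures) shows the basic IoU-threshold rule by which several anchors can be assigned to the same ground-truth box:

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N, 4) and gt_boxes (M, 4), boxes as x1,y1,x2,y2."""
    lt = np.maximum(anchors[:, None, :2], gt_boxes[None, :, :2])
    rb = np.minimum(anchors[:, None, 2:], gt_boxes[None, :, 2:])
    inter = np.clip(rb - lt, 0, None).prod(-1)
    area_a = (anchors[:, 2:] - anchors[:, :2]).prod(-1)
    area_g = (gt_boxes[:, 2:] - gt_boxes[:, :2]).prod(-1)
    return inter / (area_a[:, None] + area_g[None, :] - inter + 1e-9)

def one_to_many_assign(anchors, gt_boxes, pos_thr=0.5):
    """Return, for each anchor, the index of its matched gt box (-1 = background)."""
    iou = iou_matrix(anchors, gt_boxes)         # (N, M)
    best_gt = iou.argmax(axis=1)                # best gt for each anchor
    best_iou = iou.max(axis=1)
    assign = np.where(best_iou >= pos_thr, best_gt, -1)
    return assign                               # several anchors may share one gt

# Usage: three anchors, one ground-truth box.
anchors = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
gt = np.array([[0, 0, 10, 10]], float)
print(one_to_many_assign(anchors, gt))          # -> [ 0  0 -1]
```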
In summary, the matching strategies of these models focus on efficiently assigning multiple predictions to real labels and managing the relationships between these predictions and real labels. Each model employs unique mechanisms to optimize this process, thereby enhancing the accuracy and efficiency of detection.
To address the issues of low detection accuracy and imbalance in UAV aerial imagery, as well as challenges related to the extraction of feature information among contextual images or excessive extraction of irrelevant information, we introduce the Polysemantic Transformer module. This module is designed to thoroughly integrate the image feature extraction capabilities across both static and dynamic contexts. Furthermore, we propose the Pc-DETR model, which effectively addresses the issues of supervised sparsity and preservation of the encoder’s discriminative properties. This model enables the realization of an end-to-end detector equipped with multiple auxiliary heads, enhancing the efficiency of detecting positive samples in UAV surveillance and significantly ameliorating the prevalent challenges in UAV aerial image detection.
Review of DETR
DETR, introduced by Carion et al. in 202026, is a target detection methodology that relies entirely on the Transformer architecture. The model leverages the Transformer's self-attention mechanism to address challenges in target detection, entirely eliminating the need for anchor boxes and complex post-processing steps, such as non-maximum suppression (NMS), traditionally employed in detection algorithms. DETR redefines target detection as a direct set-prediction problem, simplifying the approach while maintaining efficacy.
As illustrated in Fig. 2, the DETR (Detection Transformer) model comprises three principal components: a Convolutional Neural Network (CNN) for feature extraction, a Transformer encoder-decoder, and a simple Feedforward Network (FFN) for prediction. Initially, the input image is processed through a standard CNN, such as ResNet, to extract feature maps. These feature maps are subsequently fed into the Transformer’s encoder for further refinement. The encoder processes these features, enhancing their contextual relationships through self-attention mechanisms and a feedforward network. The decoder employs a fixed number of learned object queries, each of which progressively refines its corresponding output via self-attention and encoder-decoder attention mechanisms.
In DETR, the Transformer's self-attention mechanism is computed as
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V.$$
Here, \(Q\), \(K\), and \(V\) represent the query, key, and value matrices, respectively, and \(d_k\) denotes the dimension of the key. This mechanism enables the model to exchange information across different parts of the input sequence, thereby capturing complex spatial relationships.
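As a concrete illustration, the following minimal sketch (assuming PyTorch; a single-head formulation for clarity, whereas DETR uses multi-head attention with learned projections) computes this scaled dot-product attention:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)             # attention matrix
    return weights @ V                              # weighted sum of values

# Usage: 4 queries attending over 6 key/value pairs of dimension 8.
Q, K, V = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([4, 8])
```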
The output of the decoder is processed through a linear layer to predict the category of the targets and a vector consisting of four scalars to predict the bounding box (coordinates of the center point, height, and width).
For training DETR, the authors designed a matching loss based on the Hungarian algorithm, in which each output of the decoder must be matched to a unique real object. The loss comprises a classification term and a bounding-box regression term:
$$\mathcal{L}_{\text{Hungarian}} = \sum_{i=1}^{N}\Big[-\log \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\textbf{L}_{\text{box}}\big(b_i, \hat{b}_{\sigma(i)}\big)\Big].$$
In this formula, \(c_i\) and \(b_i\) denote the category and bounding box of the \(i\)-th real object, \(\hat{p}_{\sigma(i)}(c_i)\) and \(\hat{b}_{\sigma(i)}\) are the corresponding predictions, \(\sigma\) is the permutation produced by the Hungarian algorithm that minimizes the matching cost, and \(\textbf{L}_{\text{box}}\) usually combines the L1 loss and the generalized Intersection over Union (GIoU) loss. This approach enables DETR to directly learn to predict a set of objects from an image without complex post-processing, marking a novel breakthrough in the field of object detection.
Despite DETR's achievements, it still lacks the capability to extract polysemantic contextual information from images, which hampers feature extraction, and it cannot match the accuracy advantages offered by one-to-many matching rules. To address low detection accuracy and imbalance, as well as the difficulty of extracting relevant contextual features without accumulating irrelevant information, we introduce the Polysemantic Transformer module, which thoroughly integrates feature extraction across both static and dynamic contexts. Furthermore, we propose the Pc-DETR model, which alleviates supervision sparsity while preserving the encoder's discriminative characteristics. The model implements an end-to-end detector equipped with multiple auxiliary heads, enhancing the detection of positive samples. Applied to drone aerial image detection, it achieves excellent results and substantially eases the difficulty of detecting small targets from drones.
Pc-DETR
This section provides a detailed introduction to our model, the Polysemantic Cooperative Detection Transformer (Pc-DETR), illustrated in Fig. 3. We make significant modifications to the traditional transformer in the backbone, termed the Polysemantic Transformer (PoT) Backbone; the details of these modifications are discussed in sections “Multi-projection attention mechanisms in visual backbone” and “Polysemantic transformer module”. Following the integration of the PoT feature extraction block, training proceeds in a parallel auxiliary manner, with specifics provided in sections “Cooperative mixed-task training” and “Specific positive query generation”. We conclude with section “Explanation of how Pc-DETR works”, where we explain how Pc-DETR operates and the benefits it brings to image detection tasks.
Overall network architecture of our Pc-DETR, including the core Polysemantic Backbone module and the supplementary Cooperative Mixed-Task Training module. Note: auxiliary branches are generated only during the training phase and are removed at test/evaluation time. The core architecture comprises the PoT Backbone of our design and the multi-head auxiliary training mode with one-to-many label-assignment auxiliary heads, which achieves the best auxiliary effect by instantiating a certain number of parallel auxiliary heads. The remaining unspecified parts follow the conventional, generic form of DETR variants.
Multi-projection attention mechanisms in visual backbone
This study explores the limitations of conventional attention mechanisms in traditional visual backbones, particularly in detecting very small targets in high-altitude urban UAV imagery. To resolve this issue, we propose the Polysemantic Transformer (PoT), an innovative transformer block designed to handle dense and complex UAV tilt images. This design surpasses the constraints of traditional self-attention by utilizing aggregated semantics between input keys, thereby enhancing self-attention learning and improving the network’s representational abilities. Additionally, we introduce two aggregated semantic transform networks based on ResNet43, where PoT blocks replace the standard \(3 \times 3\) convolutions throughout the architecture.
In this section, we present the mathematical framework of the scalable local multi-head self-attention used in visual backbones30,31, depicted in Fig. 4a. We start with a 2D feature map \(\chi\) of dimensions \(h \times \omega \times c\), where \(h\) is the height, \(\omega\) the width, and \(c\) the number of channels. \(\chi\) is transformed into queries \(q = \chi \omega _q\), keys \(k = \chi \omega _k\), and values \(v = \chi \omega _v\) using the embedding matrices \(\{\omega _q, \omega _k, \omega _v\}\), implemented as \(1 \times 1\) convolutions. The local relation \(r = k \odot q\) is obtained by locally multiplying \(k\) and \(q\), where \(r\) is a tensor of shape \(\mathbb {R}^{h \times \omega \times (k \times k \times c_h)}\).

Here \(c_h\) denotes the number of heads, and \(\odot\) denotes local matrix multiplication, which evaluates the similarity between each \(q\) and the keys \(k\) within a \(k \times k\) grid. Each feature \(r^{(i)}\) at the \(i\)-th spatial position is a \(k \times k \times c_h\) vector containing \(c_h\) local query-key relations. The relation is then enriched with positional information within each \(k \times k\) grid, \(\hat{r} = r + p\), where \(p \in \mathbb {R}^{k \times k \times c_h}\) is the 2D relative position embedding shared across all \(c_h\) heads.

The attention matrix \(a = \text{softmax}(\hat{r})\) is generated by applying the softmax to \(\hat{r}\) after normalizing it along the channel dimension of each head. The final output feature maps are produced by aggregating, with the \(c_h\) localized attention matrices obtained by reshaping \(a\), all values within each \(k \times k\) grid.

Each head's local attention matrix aggregates its channel-wise split of \(v\), and the final output \(y\) is the concatenation of the aggregated feature maps from all heads.
Polysemantic transformer module
Traditional self-attention mechanisms facilitate the interaction of features across different spatial locations based solely on the input itself. However, these mechanisms learn query-key relationships independently without considering the rich semantics between them, which presents challenges for dense target detection in urban UAV imagery. This limitation diminishes the effectiveness of self-attention in visual representation learning via 2D feature maps, resulting in lower detection accuracy.
To tackle this issue, we introduce the Polysemantic Transformer (PoT) block, depicted in Fig. 4b. This block integrates semantic information mining and self-attention learning into a unified architecture. Our goal is to exploit the polysemantic contextual information between neighboring keys to enhance self-attention learning and improve the quality of the output feature map.
Let \(\chi\) be a 2D feature map of dimensions \(h \times \omega \times c\). We define the keys, queries, and values as \(k = \chi\), \(q = \chi\), and \(v = \chi \omega _v\), respectively. Instead of the \(1 \times 1\) convolutions typically used in self-attention, the PoT block applies a \(k \times k\) group convolution over the neighboring keys within each \(k \times k\) lattice to contextualize every key representation.
It is important to note that the context keys \(k^1 \in \mathbb {R}^{h \times \omega \times c}\) serve as the static contextual representation of \(\chi\), reflecting the static polysemantic contextual information among local neighboring keys. The attention matrix is then computed from the concatenation of \(k^1\) and \(q\) using two successive \(1 \times 1\) convolutions (\(\omega _\theta\) with ReLU activation and \(\omega _\delta\) without):
$$a = \big[k^1, q\big]\,\omega _\theta \,\omega _\delta .$$
Thus, instead of relying on isolated query-key pairs, the local attention matrix of each head is obtained by integrating the query features with the polysemantic context key features, yielding a more comprehensive attention matrix; incorporating the static polysemantic context \(k^1\) in this way substantially strengthens self-attention learning. Using the aggregated semantic attention matrix \(a\), we then construct the attended feature map \(k^2\) by aggregating all values \(v\), following the conventional self-attention procedure, \(k^2 = v \odot a\).
Since \(k^2\) captures dynamic feature interactions, it is referred to as the dynamic semantic representation. Through an attention-based fusion, the PoT module integrates the static semantics \(k^1\) and the dynamic semantics \(k^2\) to produce the final output \(y\).
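To make the data flow concrete, the following is a minimal PyTorch sketch of a PoT-style block under our own simplifying assumptions (the group count, the sigmoid-gated aggregation of values, and the additive fusion of \(k^1\) and \(k^2\) are illustrative stand-ins for the exact PoT operations):

```python
import torch
import torch.nn as nn

class PoTBlockSketch(nn.Module):
    """Simplified Polysemantic-Transformer-style block: static context via a 3x3
    group convolution over keys, dynamic context via an attention map produced
    by two 1x1 convolutions applied to [static context, query]."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # 3x3 group convolution mines static context among neighbouring keys
        # (channels must be divisible by the group count).
        self.key_conv = nn.Conv2d(channels, channels, kernel_size,
                                  padding=kernel_size // 2, groups=4, bias=False)
        self.value_conv = nn.Conv2d(channels, channels, 1, bias=False)
        # Two consecutive 1x1 convolutions produce the attention map from [k1, q].
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))

    def forward(self, x):                          # x: (B, C, H, W)
        k1 = self.key_conv(x)                      # static polysemantic context
        v = self.value_conv(x)                     # values
        a = self.attn(torch.cat([k1, x], dim=1))   # attention from context keys + queries
        k2 = torch.sigmoid(a) * v                  # dynamic context (simplified aggregation)
        return k1 + k2                             # fuse static and dynamic semantics

# Usage
x = torch.randn(1, 64, 32, 32)
print(PoTBlockSketch(64)(x).shape)                 # torch.Size([1, 64, 32, 32])
```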
Backbone of the polysemantic transformer
PoT is a cohesive self-attention component designed to substitute conventional convolutions in ConvNet. This enables the improvement of the visual framework by incorporating semantic self-attention. We integrate PoT blocks into the advanced ResNet architecture43 without substantially augmenting the parameter allocation. Table 1 displays different configurations of the aggregated semantic transformer network (PoT) that utilize the ResNet-50 backbone. These configurations are collectively referred to as PoT-50.
PoT-50 is generated by substituting all \(3 \times 3\) convolutions in ResNet-50 (in the ResNet-2, ResNet-3, ResNet-4, and ResNet-5 stages) with PoT blocks. Due to the computational similarity between PoT blocks and standard convolutions, the number of parameters and FLOPs in PoT-50 is similar to those in ResNet-50.
In addition, we analyze the intricate connections and distinctions between the Polysemantic Transformer and other prominent visual frameworks.
-
1.
Blueprint Separable Convolution44: This technique approximates traditional convolution by employing a \(1 \times 1\) pointwise convolution, followed by a \(k \times k\) depthwise convolution to decrease redundancy in the depth dimension. It exhibits resemblances to Transformer-style blocks, such as our PoT block, in that they both apply \(1 \times 1\) pointwise convolution to transform inputs into values, then use \(k \times k\) local attention matrices depth-wise to perform aggregation calculations. Furthermore, Transformer-style blocks employ a channel-sharing technique for aggregation, which can be perceived as a packaged block convolution. This strategy involves sharing filters across channel blocks without compromising accuracy.
-
2.
Dynamic Region-Aware Convolution45: This technique employs a filter generator module that utilizes two \(1 \times 1\) convolutions to obtain filters for area characteristics at various spatial locations. In contrast, the attention matrix generator in our PoT block exploits the rich feature interactions between polysemantic context keys and queries to facilitate self-attention learning, whereas the filter generator described above relies solely on the primary input feature maps.
-
3.
Bottleneck Transformer34: This approach enhances ConvNet’s self-attention by replacing \(3 \times 3\) convolutions with Transformer-style modules. Its global multi-head self-attention layer is more computationally costly than our PoT block’s local self-attention. BoT50 uses Bottleneck Transformer blocks to replace the last three convolutions in the ResNet backbone. In comparison, our PoT block may replace all \(3 \times 3\) convolutions in the architecture. By using the broad polysemantic context of input keys, our PoT block improves self-attention learning.
Utilizing the conventional Transformer design, we incorporate input detection images into a backbone network and encoder to generate an initial feature map of possible items. Within the decoder, pre-established object queries engage with these features through cross-attention. To enhance the learning of features in the encoder and attention in the decoder, we propose the use of Pc-DETR. This approach combines a collaborative hybrid task training scheme with tailored forward query generation. We provide a comprehensive explanation of these processes and analyze their efficacy.
Cooperative mixed-task training
Auxiliary heads that leverage one-to-many label-assignment paradigms such as ATSS and Faster R-CNN can alleviate the sparse supervision of the encoder output caused by the small number of active decoder queries. These varied label assignments strengthen the supervision of the encoder output, keeping it discriminative enough for the head training to converge.
To convert the potential features \(\varvec{A}\) of the encoder into a feature pyramid, we employ a multiscale adapter. This adapter generates a set of feature maps \(\{\varvec{A}_1, \ldots , \varvec{A}_\textbf{J}\}\), where \(\varvec{A}_\textbf{J}\) corresponds to feature maps that have undergone a downsampling step of \(2^{2 + \textbf{J}}\). Like ViTDet46, we generate feature pyramids by extracting features from separate feature maps in the single-scale encoder. However, we utilize bilinear interpolation and PoT blocks for the upsampling process. For example, the encoder generates single-scale features, which are then transformed into a feature pyramid by repeatedly reducing the size (e.g., using convolution with a step size of 2 and a \(3 \times 3\) kernel) or increasing the size of the features.
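The following minimal PyTorch sketch illustrates this kind of multiscale adapter under illustrative assumptions (the channel width, number of levels, and layer choices are ours and do not reproduce the exact Pc-DETR adapter):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAdapterSketch(nn.Module):
    """Build a 4-level pyramid {A1..A4} from a single-scale encoder feature map."""
    def __init__(self, channels=256):
        super().__init__()
        # Strided 3x3 convolutions produce the coarser levels.
        self.down1 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.smooth = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat):                      # feat: (B, C, H, W), e.g. stride-16 features
        up = F.interpolate(feat, scale_factor=2, mode="bilinear",
                           align_corners=False)   # finer level via bilinear upsampling
        a1 = self.smooth(up)                      # stride 8
        a2 = feat                                 # stride 16 (original scale)
        a3 = self.down1(feat)                     # stride 32
        a4 = self.down2(a3)                       # stride 64
        return [a1, a2, a3, a4]

# Usage
feat = torch.randn(1, 256, 64, 64)
print([t.shape[-1] for t in MultiScaleAdapterSketch()(feat)])  # [128, 64, 32, 16]
```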
The feature pyramid in the multiscale encoder is built by downsampling only the coarsest features. We define a set of \(\textbf{K}\) collaborating heads with corresponding label assignments \(\textbf{A}_\textbf{K}\). To obtain the prediction \(\hat{\textbf{P}}_i\), we feed \(\{\varvec{A}_1, \ldots , \varvec{A}_\textbf{J}\}\) to the \(i\)-th collaborating head. For the \(i\)-th head, its label assignment \(\textbf{A}_i\) computes the supervised targets for the positive and negative samples using the ground-truth set \(\textbf{G}\).
The sets \(\{Pos\}\) and \(\{Neg\}\) denote the positive and negative coordinate pairs, respectively, where a pair consists of a feature index \(j\) in \(\{\varvec{A}_1, \ldots , \varvec{A}_\textbf{J}\}\) (determined by \(\textbf{A}_i\)) and a spatial coordinate. \(\textbf{B}_i^{\{Pos\}}\) is the collection of positive spatial coordinates, while \(\textbf{P}_i^{\{Pos\}}\) and \(\textbf{P}_i^{\{Neg\}}\) are the supervised targets at those coordinates, including categories and regression offsets. The specific characteristics of each variable are summarized in Table 2. The loss of the \(i\)-th auxiliary head is defined as
$$\mathfrak{L}_i^{Enc} = \mathfrak{L}\big(\hat{\textbf{P}}_i^{\{Pos\}}, \textbf{P}_i^{\{Pos\}}\big) + \mathfrak{L}\big(\hat{\textbf{P}}_i^{\{Neg\}}, \textbf{P}_i^{\{Neg\}}\big),$$
where the regression loss for negative samples is disregarded. The training objective over the \(\textbf{K}\) auxiliary heads is then
$$\mathfrak{L}^{Enc} = \sum _{i=1}^{\textbf{K}} \mathfrak{L}_i^{Enc}.$$
Specific positive query generation
In the one-to-one matching paradigm, each ground-truth box is assigned as the supervised target of a single query, which keeps the supervision unambiguous. However, as shown in Fig. 5, such a restricted number of positive queries can lead to inefficient cross-attention learning in the Transformer decoder. To address this issue, we assign labels with every auxiliary head according to its assignment \(\textbf{A}_i\), generating a sufficient number of customized positive queries and ultimately enhancing the effectiveness of the model's training.
The positive samples of the \(i\)-th auxiliary head are represented by the coordinates \(\textbf{B}_i^{\{Pos\}} \in \mathbb {R}^{\textbf{M}_i \times 4}\), where \(\textbf{M}_i\) is the number of positive samples. Additional customized positive queries are generated from these coordinates: the positive coordinates are encoded with positional encoding \(\text {PE}(\cdot )\), and the index pair \((\textbf{J}, \varvec{A}_\textbf{J})\) is used to select the corresponding encoder features \(\text {E}(\cdot )\) at the positive coordinates; the two are combined to form the queries.
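The sketch below (PyTorch) conveys the general idea under stated assumptions: positive box coordinates from an auxiliary head are turned into sinusoidal positional encodings and combined with encoder features sampled at those locations; the encoding scheme, feature-selection rule, and all function names are illustrative rather than the exact Pc-DETR implementation:

```python
import math
import torch

def sine_positional_encoding(coords, num_feats=128, temperature=10000):
    """coords: (M, 4) normalised box coordinates in [0, 1] -> (M, 4 * num_feats)."""
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = coords[..., None] * 2 * math.pi / dim_t               # (M, 4, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(1)                                        # (M, 4 * num_feats)

def make_positive_queries(pos_boxes, feat_map, proj):
    """pos_boxes: (M, 4) normalised cx,cy,w,h; feat_map: (C, H, W); proj: nn.Linear."""
    C, H, W = feat_map.shape
    cx = (pos_boxes[:, 0] * (W - 1)).long().clamp(0, W - 1)
    cy = (pos_boxes[:, 1] * (H - 1)).long().clamp(0, H - 1)
    sampled = feat_map[:, cy, cx].t()                            # (M, C) features at box centres
    pe = sine_positional_encoding(pos_boxes)                     # (M, 512)
    return proj(torch.cat([sampled, pe], dim=1))                 # (M, D) customised queries

# Usage: 5 positive boxes, a 256-channel feature map, query dimension 256.
proj = torch.nn.Linear(256 + 512, 256)
queries = make_positive_queries(torch.rand(5, 4), torch.randn(256, 32, 32), proj)
print(queries.shape)                                             # torch.Size([5, 256])
```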
Multiple sets of queries, \(\textbf{K} + 1\) in total, are used during training: the one-to-one set-matching branch uses one set, while the auxiliary branches that adopt one-to-many label assignment employ the remaining \(\textbf{K}\) sets. Hungarian matching is unnecessary in the auxiliary branches, because all of their queries are positive. The loss of the \(l\)-th decoder stage in the \(i\)-th auxiliary branch is
$$\tilde{\mathfrak{L}}_{i,l}^{Dec} = \tilde{\mathfrak{L}}\big(\tilde{\textbf{P}}_{i,l}, \textbf{P}_i^{\{Pos\}}\big),$$
where \(\tilde{\textbf{P}}_{i,l}\) denotes the prediction of the \(l\)-th decoder stage in the \(i\)-th auxiliary branch. The principal training objective is
$$\mathfrak{L}^{Global} = \sum _{l=1}^{L}\Big( \tilde{\mathfrak{L}}_{l}^{Dec} + \alpha _1 \sum _{i=1}^{\textbf{K}} \tilde{\mathfrak{L}}_{i,l}^{Dec} \Big) + \alpha _2 \sum _{i=1}^{\textbf{K}} \mathfrak{L}_{i}^{Enc},$$
where \(L\) is the number of decoder stages, \(\tilde{\mathfrak {L}}_l^{Dec}\) is the loss of the original one-to-one matching branch at the \(l\)-th decoder stage, and \(\alpha _1\) and \(\alpha _2\) are coefficients used to balance these losses.
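In implementation terms, the global objective is simply a weighted sum of the branch losses; a minimal sketch follows (PyTorch, with dummy loss values and the coefficient values \(\alpha _1 = 1.0\) and \(\alpha _2 = 2.0\) adopted later in the experiments):

```python
import torch

def global_training_loss(one2one_dec_losses, aux_dec_losses, aux_enc_losses,
                         alpha1=1.0, alpha2=2.0):
    """one2one_dec_losses: list over decoder stages; aux_dec_losses: list (stages) of
    lists (heads); aux_enc_losses: list over heads. Returns the combined scalar loss."""
    total = torch.zeros(())
    for l, l_dec in enumerate(one2one_dec_losses):
        total = total + l_dec                                  # one-to-one matching branch
        total = total + alpha1 * sum(aux_dec_losses[l])        # auxiliary positive-query branches
    total = total + alpha2 * sum(aux_enc_losses)               # one-to-many auxiliary heads
    return total

# Usage with dummy per-stage / per-head losses.
dec = [torch.tensor(1.0), torch.tensor(0.8)]                    # 2 decoder stages
aux_dec = [[torch.tensor(0.5)], [torch.tensor(0.4)]]            # K = 1 auxiliary head
aux_enc = [torch.tensor(0.6)]
print(global_training_loss(dec, aux_dec, aux_enc))              # tensor(3.9000)
```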
Explanation of how Pc-DETR works
Pc-DETR greatly improves detectors based on DETR. We assess the efficacy of the method by employing both qualitative and quantitative measures. Specifically, we utilize Deformable-DETR47 with a ResNet-50 backbone in a training configuration that spans 42 epochs.
Regarding encoder supervision, a scarcity of positive queries results in sparse supervision, since only one query per ground truth is supervised by the regression loss. A one-to-many label-assignment scheme provides more focused supervision of positive samples and thereby improves the learning of latent features. We analyze the latent features produced by the encoder to investigate the impact of sparse supervision on model training.
We measure the discriminability of the encoder output with an IoF-IoB curve, where IoF denotes the intersection over the foreground and IoB the intersection over the background. As seen in Fig. 6, images collected from the web are used for feature visualization to ensure objectivity, and the IoF-IoB curve is computed from this representation.
Once the encoder features at level \(j\) (\(\varvec{A}_j \in \mathbb {R}^{c \times h_j \times \omega _j}\)) have been obtained, we compute the \(\textbf{L}^2\)-norm map \(\hat{\varvec{A}}_j \in \mathbb {R}^{1 \times h_j \times \omega _j}\) and resize it to the image size \(h \times \omega\). The discriminability score \(\textbf{D}(\varvec{A})\) is then obtained by averaging the resized maps over all levels:
$$\textbf{D}(\varvec{A}) = \frac{1}{\textbf{J}} \sum _{j=1}^{\textbf{J}} \hat{\varvec{A}}_j .$$
Figure 6 shows the discriminability scores of ATSS, Deformable-DETR, and Pc-Deformable-DETR. Both ATSS and Pc-Deformable-DETR discern crucial object regions better than Deformable-DETR, which is frequently distracted by the background. The foreground and background metrics are defined as follows. Let \(\textbf{S}\) be a predetermined threshold and let \(\mathbb {Z}(\chi )\) equal 1 when \(\chi\) is true and 0 otherwise, so that \(\mathbb {Z}(\textbf{D}(\varvec{A}) > \textbf{S}) \in \mathbb {R}^{h \times \omega }\) is an element-wise indicator map. The foreground mask \(\textbf{M}^{fg} \in \mathbb {R}^{h \times \omega }\) takes the value 1 if the coordinate \((h, w)\) lies inside the foreground and 0 otherwise. The intersection over the foreground (IoF) \(\textbf{I}^{fg}\) is computed as
$$\textbf{I}^{fg} = \frac{\sum _{h,w} \mathbb {Z}\big(\textbf{D}(\varvec{A})_{h,w} > \textbf{S}\big)\, \textbf{M}^{fg}_{h,w}}{\sum _{h,w} \textbf{M}^{fg}_{h,w}} .$$
The intersection over the background (IoB) is computed analogously, with the background mask in place of \(\textbf{M}^{fg}\). The curve in Fig. 5, obtained by varying the threshold \(\textbf{S}\), plots IoF against IoB. ATSS and Pc-Deformable-DETR achieve better IoF-IoB trade-offs than Deformable-DETR, suggesting that the encoder benefits from the one-to-many label assignment.
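A minimal sketch (PyTorch; the feature shapes, normalization, and threshold sweep are illustrative) of how such an IoF-IoB curve can be computed from encoder features and a binary foreground mask:

```python
import torch
import torch.nn.functional as F

def discriminability_score(feats, out_size):
    """feats: list of (C, Hj, Wj) encoder feature maps -> (H, W) averaged L2-norm map."""
    maps = []
    for f in feats:
        norm = f.norm(p=2, dim=0, keepdim=True)[None]           # (1, 1, Hj, Wj)
        norm = F.interpolate(norm, size=out_size, mode="bilinear", align_corners=False)
        maps.append(norm[0, 0])
    score = torch.stack(maps).mean(0)
    return (score - score.min()) / (score.max() - score.min() + 1e-9)  # normalise to [0, 1]

def iof_iob_curve(score, fg_mask, thresholds):
    """score: (H, W) in [0, 1]; fg_mask: (H, W) booleans; returns list of (IoF, IoB)."""
    bg_mask = ~fg_mask
    curve = []
    for s in thresholds:
        keep = score > s
        iof = (keep & fg_mask).float().sum() / fg_mask.float().sum().clamp(min=1)
        iob = (keep & bg_mask).float().sum() / bg_mask.float().sum().clamp(min=1)
        curve.append((iof.item(), iob.item()))
    return curve

# Usage: two feature levels and a square dummy foreground mask.
feats = [torch.randn(256, 64, 64), torch.randn(256, 32, 32)]
score = discriminability_score(feats, out_size=(256, 256))
fg = torch.zeros(256, 256, dtype=torch.bool); fg[64:192, 64:192] = True
print(iof_iob_curve(score, fg, thresholds=[0.3, 0.5, 0.7]))
```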
Improving cross-attention learning also requires reducing the instability of Hungarian matching. As the most basic one-to-one set-matching algorithm, Hungarian matching is unstable because the same image can be assigned different positive queries at different stages of training. Figure 7 shows that our method improves the stability of this process.
To provide a more precise measurement of cross-attention optimization, we computed the attention score using the IoF-IoB curve. To obtain several IoF-IoB pairs, we established different criteria for the attention score. Figure 5 illustrates a comparison between Deformable-DETR, Group-DETR, and Pc-Deformable-DETR. DETRs with a greater number of active queries typically show larger IoF-IoB curves compared to Deformable-DETR, indicating the success of our method.
Our approach is distinct from previous methods such as Group-DETR48, H-DETR40, and SQR49, which realize one-to-many assignment by duplicating query groups and ground-truth boxes. Pc-DETR instead assigns several positive queries to each ground-truth spatial coordinate, exploiting densely supervised signals to directly enhance the discriminability of the latent feature map. Although those approaches introduce additional positive queries, they still suffer from the instability of Hungarian matching, whereas our method uses consistent one-to-many assignments while preserving a precise correspondence between positive queries and ground-truth boxes.
We are pioneers in examining detectors using both classic one-to-many allocation and one-to-one matching, which allows us to gain insights into their disparities and synergies. This enables us to improve DETR learning by utilizing standard one-to-many assignment designs without requiring any additional specific efforts.
Unlike other approaches that result in a large number of negative queries-which leads to a significant increase in GPU memory utilization-our decoder does not generate negative queries. As indicated in Table 11, our method exclusively handles positive coordinates in the decoder, resulting in decreased memory usage.
Experiments
In this section, we conduct ablation studies on each innovative aspect of our model and comparative experiments against other models. In section “Polysemantic transformer evaluation”, we focus on the innovative Polysemantic Transformer (PoT) Backbone, conducting comparative experiments that evaluate its efficiency and practicality from multiple dimensions. In section “Experiments on target detection in tilted images of UAVs”, we assess the performance enhancements brought about by applying the PoT module to traditional models.

In section “Depth experiments with Pc-DETR”, we undertake in-depth experiments on drone detection using the Pc-DETR model integrated with PoT, including an analysis of the advantages offered by various auxiliary branches and levels. In section “Comparative analysis of advanced model experiments”, we compare our modules against various advanced models to demonstrate their robust capabilities. Section “Ablation experiments on Pc-DETR” thoroughly examines the performance differences resulting from varying numbers of auxiliary heads, as well as the performance of individual auxiliary heads, to identify the best auxiliary matching options.

Finally, in section “Specifically designed for UAV image inspection”, we compare our model with the most advanced drone aerial image detection models, showing a significant overall enhancement in capability.
Datasets
We use the public VisDrone dataset50 for our experiments, which encompass the training, validation, and testing phases. Our extensive model comparisons and ablation studies are based on this dataset.
To showcase our model’s capabilities, we selected the VisDrone dataset, known for target classification and object detection. This dataset includes images captured from UAVs in various scenes, weather conditions, and times of day. It features scenarios such as city streets, countryside, coastlines, and car parks. The dataset contains 10,209 images: 6471 for training, 3190 for testing, and 548 for validation, each with a resolution of 2000\(\times\)1500 pixels. It includes 342,391 objects categorized into 10 classes: car, truck, pedestrian, motor, tricycle, van, bus, people, awning-tricycle, and bicycle.
Additionally, to validate the model’s capabilities, we also use the UAVDT dataset51, which contains a total of 40,735 images, including 24,206 images in the training set and 16,529 images in the validation set. Unlike the VisDrone dataset, UAVDT focuses primarily on vehicle detection from the UAV viewpoint and contains only three predefined vehicle classes: car, bus, and truck.
Polysemantic transformer evaluation
This section presents an assessment of the usefulness of the Polysemantic Transformer (PoT) as a backbone through empirical evaluations on several computer vision tasks, such as image recognition and object detection. First, PoT is trained from scratch on 30% of the VisDrone dataset for the purpose of image classification. Afterwards, we perform pre-training on the complete VisDrone dataset using PoT. We then evaluate the capacity of the pre-trained Pc-DETR for object detection on the entire VisDrone dataset to determine its generalization capabilities.
Setup
Our initial step involves pre-training on 30% of the VisDrone benchmark dataset, which consists of 10 categories: vehicle, truck, pedestrian, motor, bicycle, tricycle, van, bus, people, and awning-tricycle. The model’s performance is assessed by measuring its top-1 and top-5 accuracies on the validation set. This evaluation is conducted under two different training settings: default and advanced.
The default parameters adhere to traditional vision architectures such as ResNet and SENet, employing typical preprocessing techniques for approximately 100 epochs. The input image is resized to dimensions of 224 \(\times\) 224 and undergoes random cropping and flipping with a 50% chance. All hyperparameters follow the official implementation without any further adjustments.
The PoT backbone is trained end-to-end using stochastic gradient descent (SGD)52 with a momentum of 0.9 and label smoothing of 0.1. The batch size is 512, distributed evenly across 8 GPUs. During the first five epochs, the learning rate is linearly warmed up from 0 to \(0.1 \times \text{batch size} / 256\); afterwards it is decayed following a cosine schedule. During training, we use an exponential moving average of the weights with a decay of 0.99999.
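For illustration, the schedule described above (linear warm-up over the first five epochs to a peak of \(0.1 \times \text{batch size}/256\), followed by cosine decay; the exact implementation details are our assumptions) can be sketched as:

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=5, batch_size=512, base_lr=0.1):
    """Linear warm-up to base_lr * batch_size / 256, then cosine decay to 0."""
    peak_lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# Usage: peak learning rate is 0.1 * 512 / 256 = 0.2
print([round(lr_at_epoch(e), 4) for e in (0, 4, 5, 50, 99)])
# [0.04, 0.2, 0.2, 0.1083, 0.0001]
```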
The expanded training setup incorporates longer training durations and improved data augmentation and regularization techniques to ensure a fair comparison with state-of-the-art backbones such as ResNeSt53, EfficientNet54, and LambdaNetworks55. In this configuration, the PoT model is trained for 350 epochs using additional data augmentation techniques such as RandAugment56 and mixup57, as well as regularization techniques like dropout58 and DropConnect59.
Comparative performance experiments
We evaluate the performance of various cutting-edge visual backbones on the dataset using two training configurations: default and advanced. Table 3 shows the performance comparison for both settings at the same depth. We developed several versions of the PoT model with 50 and 101 layers, termed PoT-50 and PoT-101, respectively. In the advanced settings, we introduce an enhanced version of PoT called SE-PoTD-101, which replaces the 3 \(\times\) 3 convolutions in the ResNet-4 and ResNet-5 stages with PoT blocks under the SE-ResNetD-50 trunk. We also present the model’s performance using exponential moving averages in the default configuration to fairly compare with Lambda-ResNet.
Results in Table 3 consistently show that PoT-50 and PoT-101 outperform existing visual backbones, such as ResNet-50/101 and attention-based models like Stand-Alone and AA-ResNet-50/101, in top-1 and top-5 accuracy at the same depth. This improvement is achieved through efficient parameter utilization, highlighting the benefits of polysemantic contextual information in self-attentive learning for visual recognition tasks.
LRNet-50 and Stand-Alone models improve performance by effectively managing remote feature interactions through local self-attention, unlike ResNet-50. AA-ResNet-50 and Lambda-ResNet-50 enhance performance with global self-attention across the feature graph, but still lag behind more advanced ConvNet models that use channel feature recalibration to improve visual representation.
PoTNet-50 outperforms SE-ResNet-50 by replacing all 3 \(\times\) 3 convolutions with PoT blocks throughout the ResNet-50 structure. This study demonstrates that integrating polysemantic contextual mining with self-attentive learning and key aggregation in a single architecture significantly enhances representation learning and visual recognition. Additionally, applying the exponential moving average technique from Lambda-ResNet boosts the top-1 accuracy of PoT-50 and PoT-101 to 33.89% and 34.72%, respectively, surpassing the highest performance of TPH-YOLOv5 on the VisDrone dataset.
Ablation experiments on tilted images of UAV cities
This section analyzes the influence of each design component of the PoT block on the overall performance of PoT-50. The PoT block first employs a 3\(\times\)3 convolution to extract the static polysemantic context among keys; by combining the query with these context keys, it then generates dynamic polysemantic contexts via self-attention, and finally integrates the static and dynamic contexts to produce the output. We also examine a modified version of the PoT block, referred to as linear fusion, in which the two contexts are combined by direct addition. Table 4 reports the results of the different ways of exploiting polysemantic contextual information in the PoT-50 backbone. Relying exclusively on the static polysemantic context, which operates similarly to a ConvNet without self-attention, yields a top-1 accuracy of 31.37%. Exploiting dynamic polysemantic contexts through self-attention improves performance. Combining the static and dynamic contexts through linear fusion reaches a top-1 accuracy of 33.15%, demonstrating that the two are complementary, and the full PoT block, which fuses both contexts through attention, achieves a top-1 accuracy of 33.83%.
As shown in Table 5, we determined the best configuration by applying \(k \times k\) grids of different sizes in the ResNet-2 to ResNet-5 stages of the PoT modules.
To demonstrate the correlation between performance and the number of PoT block replacement stages, we systematically replace stages (ResNet-2 \(\rightarrow\) ResNet-3 \(\rightarrow\) ResNet-4 \(\rightarrow\) ResNet-5) with PoT blocks in the ResNet-50 backbone and analyze the outcomes. Table 6 shows that increasing the number of PoT block replacements generally improves performance, even with a modest decrease in parameters and FLOPs.
Examining the throughput and accuracy figures reveals that substituting PoT blocks in the final two stages (ResNet-4 and ResNet-5) already yields most of the gain, whereas additional substitutions in the earlier stages (ResNet-1 and ResNet-2) provide only a slight further improvement (about 0.2% in top-1 accuracy overall) at the cost of a 1.18\(\times\) longer inference time. To achieve a better balance between speed and accuracy, we developed SE-PoT-D-50, which replaces only the \(3 \times 3\) convolutions of ResNet-4 and ResNet-5 with PoT blocks in the SE-ResNet-D-50 backbone. The SE-ResNet-D-50 backbone is a modified ResNet-50 that incorporates ResNet-D and Squeeze-and-Excitation60 in all bottleneck blocks. Table 6 demonstrates that SE-PoT-D-50 outperforms SE-ResNet-D-50 with only a slight decrease in throughput.
Experiments on target detection in tilted images of UAVs
We assess the performance of the pre-trained Polysemantic Transformer (PoT) backbones on the complete VisDrone dataset for target detection. We employ Faster R-CNN and Cascade R-CNN as the underlying detectors, substituting their ResNet backbones with PoT. Following the settings specified in reference61, we evaluate the model on the VisDrone training set (6.5K images) and validation set (0.5K images) using the single-scale standard Average Precision (AP) metric.
During the training process, the size of the shorter edge of each input image is randomly selected from a range of 640–800 pixels. Feature Pyramid Networks (FPN) and batch normalization are utilized in training all models, employing a 1x learning rate scheme. To achieve a fair comparison with other visual backbones, we standardized all hyperparameters and detection heads based on the specifications provided in the reference paper by Zhang et al.53.
Table 7 presents a summary of the performance comparison on the VisDrone dataset. The comparison is based on the use of Faster R-CNN and Cascade R-CNN for target identification, utilizing various pre-trained backbones. We categorize visual trunks based on their network depth, which can be either 50 or 101 layers. The performance of our pre-trained PoT model (PoT-50/101) is significantly better than that of ConvNet backbones (ResNet-50/101) across all IoU thresholds and object sizes. The results demonstrate the benefits of integrating self-attentive learning with polysemantic contextual information mining in PoT, resulting in excellent accuracy in recognizing small UAV targets.
Depth experiments with Pc-DETR
We perform comprehensive model trials on the VisDrone dataset, showcasing results specifically for the validation subset. Additionally, we present the outcomes of the model’s evaluation on the test development subset, which consists of 3.2K images.
Experiment details
We incorporate Pc-DETR into existing DETR-like pipelines, following the same training settings as the respective baselines. We use ATSS and Faster R-CNN as auxiliary heads when \(\textbf{K} = 2\), and retain only ATSS when \(\textbf{K} = 1\). The number of learnable object queries is set to 300, and the coefficients \(\{\alpha _1, \alpha _2\}\) are assigned the values \(\{1.0, 2.0\}\). For Pc-DINO-Deformable-DETR++, we employ large-scale jittering with copy-paste augmentation, as described by Ghiasi et al.63.
Core results
In this section, we evaluate the efficacy and capacity for generalization of Pc-DETR in comparison to different forms of DETR, as demonstrated in Tables 8 and 9. All results are replicated using the mmdetection framework (Chen et al.64). Initially, we utilized collaborative hybrid task training on single-scale DETRs using C5 features. Notably, Conditional-DETR and DAB-DETR achieved improvements of 2.5% and 2.2% in average precision (AP), respectively, compared to the baseline when using the long-term training approach.
With multi-scale features, Deformable-DETR improves markedly, with AP rising from 27.1% to 33%. The overall gain (+3.6% AP) persists over the longer 42-epoch schedule. Applied to the improved Deformable-DETR (Deformable-DETR++), our method yields a further 2.2% AP increase, and on the advanced DINO-Deformable-DETR it reaches 49.5% AP, 1.6% higher than the competing baseline.
We then upgrade the backbone from ResNet-50 to PoT-50 on two state-of-the-art baselines. Table 3 shows that Pc-DETR attains 51.3% AP, surpassing the Deformable-DETR++ baseline by a notable 1.5% AP, while DINO-Deformable-DETR with PoT-50 improves from 51.2% to 52.6% AP.
Comparative analysis of advanced model experiments
We use the \(\textbf{K} = 2\) setting for both Deformable-DETR++ and DINO. Additionally, our Pc-DINO-Deformable-DETR incorporates the generalized focal loss65 and non-maximum suppression (NMS).
As shown in Table 10, our approach converges faster on the VisDrone validation set than competing methods. Pc-DINO-Deformable-DETR reaches 35.7% AP in only 18 epochs with the PoT-50 backbone. With a 1\(\times\) schedule, our PoT-50 variant achieves 35.3% AP, surpassing other cutting-edge frameworks trained with 3\(\times\) schedules.
Our leading model, Pc-DINO-Deformable-DETR++, achieves an impressive 31.2% Average Precision (AP) when using ResNet-50 as the backbone and 37.1% AP with PoT-50 after 42 epochs of training. It is worth noting that our model outperforms all other detectors that use the same backbone.
Ablation experiments on Pc-DETR
The ablation tests were conducted on a Deformable-DETR model with a PoT-50 backbone. By default, the number of auxiliary heads \(\textbf{K}\) is set to 1 and the total batch size to 42.
Ablation on the choice of auxiliary heads
We analyze the criteria for choosing auxiliary heads in Tables 11 and 12. The results show that adding an auxiliary head with one-to-many label assignment consistently improves the baseline, with ATSS performing best. Accuracy continues to improve as the number of auxiliary heads \(\textbf{K}\) increases, but only while \(\textbf{K}\) is less than 3; performance degrades at \(\textbf{K} = 6\), most likely because of conflicts among the auxiliary heads. As \(\textbf{K}\) grows, inconsistent feature learning among the auxiliary heads hinders further improvement.
The supplementary analysis shows that any head can serve as an auxiliary head, with ATSS and Faster R-CNN most often giving the best performance when \(\textbf{K} \le 2\). To mitigate optimization conflicts, we avoid using an excessive number of distinct heads, such as six.
We also visualize Pc-DETR results on the VisDrone test-challenge set, drawing different categories with different coloured bounding boxes. The model localizes tiny, densely packed, and motion-blurred objects well, and captures targets with fewer spurious regions.
Conflicts can hinder detector training when the same spatial coordinates are assigned to different foreground boxes, or are treated as foreground by one auxiliary head and as background by another. To quantify this optimization conflict, we measure the average gap of head \(\textbf{H}_i\) and the gap between heads \(\textbf{H}_i\) and \(\textbf{H}_j\); the distance is based on the class activation maps of the heads, computed over the input images of the dataset, together with their dispersion.
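To make this measure concrete, one plausible instantiation (an illustrative assumption rather than the exact definition used in our experiments) is the mean \(L_1\) discrepancy between the class activation maps of two heads over the dataset \(\mathcal{D}\), averaged over all head pairs when \(\textbf{K} > 1\):
\[
\mathrm{dist}(\textbf{H}_i, \textbf{H}_j) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \bigl\lVert \mathrm{CAM}_{\textbf{H}_i}(x) - \mathrm{CAM}_{\textbf{H}_j}(x) \bigr\rVert_1, \qquad \overline{\mathrm{dist}} = \frac{2}{\textbf{K}(\textbf{K}-1)} \sum_{i<j} \mathrm{dist}(\textbf{H}_i, \textbf{H}_j).
\]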
We compute the average distance between auxiliary heads for \(\textbf{K} > 1\) and the distance between the DETR head and the single auxiliary head for \(\textbf{K} = 1\), as shown in Fig. 8. For \(\textbf{K} = 1\), the distance to each auxiliary head remains small, which is consistent with the results in Table 12: with \(\textbf{K} = 1\), the DETR head is improved by any auxiliary head. As \(\textbf{K}\) increases to 2, the distance grows only modestly, and this setting yields the best performance, as indicated in Table 11.
However, as \(\textbf{K}\) increases from 3 to 6, the distance spikes sharply, indicating severe optimization conflicts among the auxiliary heads and a corresponding drop in performance. For example, replacing 6 copies of ATSS with 6 distinct heads reduces the AP from 42.6% to 41.8%. We therefore hypothesize that using more than three distinct auxiliary heads intensifies these conflicts; the optimization conflict depends on both the number of auxiliary heads and how they differ from one another.
Collaborative training with two ATSS heads (39.2% AP) outperforms training with a single ATSS head (38.9% AP), which we attribute to the complementary nature of the ATSS and DETR heads. Incorporating a different auxiliary head such as Faster R-CNN, instead of duplicating the same head, improves performance further, reaching 42.6% AP. This confirms our earlier finding: a small number of distinct heads (\(\textbf{K} \le 2\)) yields the best performance with little conflict, whereas a large number of distinct heads (\(\textbf{K} \ge 3\)) leads to substantial conflicts.
Composition analysis experiments
We conducted component ablations to evaluate the influence of each element, as shown in Table 13. Adding the auxiliary heads brings significant gains, as the denser spatial supervision improves the encoder's ability to discriminate between inputs. In addition, the customized positive queries further improve the results and make the set-matching training more effective. Together, these strategies improve both convergence speed and final performance.
Overall, the improvements stem from more discriminative encoder features and more effective attention learning in the decoder. As shown in Table 14, Deformable-DETR does not benefit from extended training and reaches a plateau, whereas Pc-DETR both accelerates convergence and raises the peak performance. Furthermore, Pc-DETR consistently improves the auxiliary heads, as demonstrated in Table 15. This shows that our training scheme strengthens the encoder's discriminative ability, benefiting both the decoder and the auxiliary heads.
Comparison of original and customized query distributions
Figure 9a illustrates the spatial distributions of the original and customized positive queries. Each image shows a single object enclosed in a green box. In the decoder, positive queries obtained from Hungarian matching are marked in red, while the customized positive queries from ATSS and Faster R-CNN are marked in blue and orange, respectively. The customized queries concentrate on the central region of each instance and provide the detector with rich supervisory signals. As shown in Fig. 9b, we also compute the mean distance between the original and customized queries. The mean distance between the original queries and the customized negative queries is considerably smaller than that between the original negative queries and the customized positive queries. Because the distribution gap between the original and customized queries is modest, training remains stable.
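As a rough sketch of how such customized positive queries can be derived from an auxiliary head's positive boxes, the example below encodes each normalized box \((c_x, c_y, w, h)\) into a query embedding with a sinusoidal scheme; the encoding and the 256-dimensional embedding size are assumptions made for illustration.

```python
import torch

def boxes_to_positive_queries(pos_boxes: torch.Tensor,
                              img_size: tuple,
                              dim: int = 256) -> torch.Tensor:
    """Sketch: turn positive boxes from an auxiliary head's one-to-many
    assignment (e.g. ATSS or Faster R-CNN) into positional query embeddings
    for the decoder. pos_boxes: (N, 4) in (x1, y1, x2, y2) pixel coordinates."""
    h, w = img_size
    cx = (pos_boxes[:, 0] + pos_boxes[:, 2]) / 2 / w
    cy = (pos_boxes[:, 1] + pos_boxes[:, 3]) / 2 / h
    bw = (pos_boxes[:, 2] - pos_boxes[:, 0]) / w
    bh = (pos_boxes[:, 3] - pos_boxes[:, 1]) / h
    coords = torch.stack([cx, cy, bw, bh], dim=-1)            # (N, 4), normalized

    # Sinusoidal embedding of each coordinate, concatenated to `dim` channels.
    num_freq = dim // 8                                       # 4 coords * sin/cos
    freqs = torch.pow(10000.0, -torch.arange(num_freq) / num_freq)
    angles = coords[..., None] * freqs                        # (N, 4, num_freq)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)     # (N, 4, 2*num_freq)
    return emb.flatten(1)                                     # (N, dim)


queries = boxes_to_positive_queries(torch.tensor([[100., 120., 180., 220.]]),
                                    img_size=(800, 1333))
print(queries.shape)  # torch.Size([1, 256])
```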
Comparison and visualization with advanced models designed for UAV image detection
To illustrate the advantages of our Pc-DETR model for UAV image detection, we compare it not only with several state-of-the-art baseline target detection models but also with leading UAV image detectors, namely TPH-YOLOv5 (Zhu et al.72) and MFEFNet (Zhou et al.73).
As shown in Table 16, our model achieves a mean average precision (mAP) of 54.9%, 3% higher than the most recent state-of-the-art (SOTA) model, MFEFNet. In particular, our model reaches 36.9% on the AP\(_s\) metric, exceeding existing state-of-the-art detectors in small-target detection. Its FPS is competitive with other advanced models, enabling potential deployment in realistic scenarios, and it maintains high accuracy across nearly all categories. As depicted in Fig. 10, the attention heat maps of our model capture a wider range of targets more accurately than those of MFEFNet.
In Fig. 11, we show test results of the Pc-DETR model whose parameters were optimized on the VisDrone2019 dataset. The results demonstrate that our model detects various types of targets with high accuracy.
Finally, as shown in Table 17, to validate the generalization ability of our model, we also evaluate it on the UAVDT dataset. Compared with advanced UAV target detection models, our model achieves high performance on all metrics, remains substantially ahead in small-target detection, and maintains fast inference, making it suitable for real-time applications in a variety of real-world scenarios.
Discussion
The comprehensive ablation experiments conducted in this research showcase the distinct capabilities of each individual component. The PoT Backbone module effectively substitutes typical convolutions in ResNet designs while adhering to a favorable parameter budget. By merging contextual mining and self-attention into a single architecture, this module utilizes contextual data from input keys to support self-attention learning, enhancing the visual representation of UAV images.
In the Pc-DETR framework, we demonstrate the benefits of employing multiple label assignments and parallel auxiliary heads, which are governed by one-to-many label assignments. This approach improves the efficiency and effectiveness of end-to-end UAV image detectors and enhances the encoder’s ability to learn while optimizing the training of positive samples in the decoder.
Nevertheless, Pc-DETR still has the potential for further optimization. Although the accuracy is significantly improved, the addition of more auxiliary heads results in increased computational requirements and parameter counts during training, leading to higher time and cost. Future studies will prioritize enhancing precision while minimizing computational requirements and the number of parameters used in training. This will help preserve the model’s lightweight attributes during both training and inference, making it more deployable and suitable for real-time detection of small targets in UAV aerial photography.
Conclusion
To address the challenges of detecting objects in UAV aerial photos, we propose a new module, the Polysemantic Transformer (PoT) block, built on the Transformer architecture. The module enhances target visual representation by exploiting the contextual information among input keys to guide the construction of dynamic attention matrices.
Furthermore, we introduce the Polysemantic Cooperative Mixed-Task Training scheme (Pc-DETR), which employs several label assignment techniques to improve the efficiency of DETR-based detectors. This approach enhances the learning capacity of the encoder in end-to-end detectors for UAV images. Additionally, we incorporate customized positive queries and extract positive positioning from the auxiliary head to further improve the learning efficacy of positive samples in the decoder.
Our Pc-DETR network has been extensively tested using comparison and ablation experiments. The results clearly demonstrate that our network improves visual representation and provides a more efficient end-to-end UAV image detector by integrating various label assignment methods into the learning process. The Pc-DETR network shows a 3% increase in mAP@0.5 compared to the state-of-the-art MFEFNet UAV detection network on the VisDrone public dataset.
While notable progress has been made in detecting objects in UAV aerial images, the auxiliary heads introduced during training add computational overhead, although this cost does not carry over to inference. Future research will focus on reducing the training cost to obtain a model that is both efficient and accurate at inference, and we will continue to advance research on UAV target detection.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Xue, Y., Jin, G., Shen, T., Tan, L. & Wang, L. Template-guided frequency attention and adaptive cross-entropy loss for uav visual tracking. Chin. J. Aeronaut. 36, 299–312 (2023).
Xue, Y. et al. Mobiletrack: Siamese efficient mobile network for high-speed uav tracking. IET Image Proc. 16, 3300–3313 (2022).
Xue, Y. et al. Consistent representation mining for multi-drone single object tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 56 (2024).
Xue, Y. et al. Handling occlusion in uav visual tracking with query-guided redetection. IEEE Trans. Instrum. Meas. 2024, 96 (2024).
Xue, Y. et al. Smalltrack: wavelet pooling and graph enhanced classification for uav small object tracking. IEEE Trans. Geosci. Remote Sens. 2023, 785 (2023).
Zhang, X., Izquierdo, E. & Chandramouli, K. Dense and small object detection in uav vision based on cascade network. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019).
Tang, T., Deng, Z., Zhou, S. et al. Fast vehicle detection in uav images. In 2017 International Workshop on Remote Sensing with Intelligent Processing (RSIP) 1–5 (IEEE, 2017).
Chen, Y., Li, J., Niu, Y. et al. Small object detection networks based on classification-oriented super-resolution gan for uav aerial imagery. In 2019 Chinese Control And Decision Conference (CCDC) 4610–4615 (IEEE, 2019).
Zhou, J. et al. Scale adaptive image cropping for uav object detection. Neurocomputing 366, 305–313 (2019).
Hou, X. et al. Object detection in drone imagery via sample balance strategies and local feature enhancement. Appl. Sci. 11, 3547 (2021).
Oliva, A. & Torralba, A. The role of context in object recognition. Trends Cogn. Sci. 11, 520–527 (2007).
Liang, X. et al. Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 30, 1758–1770 (2019).
Hong, M. et al. Sspnet: Scale selection pyramid network for tiny person detection from uav images. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2021).
Viola, P. & Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 1 I–I (IEEE, 2001).
Viola, P. & Jones, M. J. Robust real-time face detection. Int. J. Comput. Vis. 57, 137–154 (2004).
Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1 886–893 (IEEE, 2005).
Felzenszwalb, P., McAllester, D. & Ramanan, D. A discriminatively trained, multiscale, deformable part model. In 2008 IEEE Conference on Computer Vision and Pattern Recognition 1–8 (IEEE, 2008).
Redmon, J. & Farhadi, A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
Ren, S. et al. Faster r-cnn: towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28, 859 (2015).
Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision 1440–1448 (2015).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision 2961–2969 (2017).
Zhang, S., Chi, C., Yao, Y. et al. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 9759–9768 (2020).
Lin, T.-Y., Goyal, P., Girshick, R. et al. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision 2980–2988 (2017).
Tian, Z., Shen, C., Chen, H. & He, T. Fcos: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355 (2019).
Kim, K. & Lee, H. S. Probabilistic anchor assignment with iou prediction for object detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 355–371 (Springer International Publishing, 2020).
Carion, N., Massa, F., Synnaeve, G. et al. End-to-end object detection with transformers. In European Conference on Computer Vision 213–229 (Springer International Publishing, 2020).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 236 (2017).
Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7794–7803 (2018).
Bello, I., Zoph, B., Vaswani, A. et al. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision 3286–3295 (2019).
Hu, H., Zhang, Z., Xie, Z. et al. Local relation networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision 3464–3473 (2019).
Zhao, H., Jia, J. & Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 10076–10085 (2020).
Chen, M. et al. Generative pretraining from pixels. In International Conference on Machine Learning 1691–1703 (PMLR, 2020).
Dosovitskiy, A., Beyer, L., Kolesnikov, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Srinivas, A., Lin, T.-Y., Parmar, N. et al. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 16519–16529 (2021).
Zhu, L., Wang, X., Ke, Z., Zhang, W. & Lau, R. W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 10323–10333 (2023).
Li, F., Zhang, H., Liu, S. et al. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 13619–13627 (2022).
Zhang, H., Li, F., Liu, S. et al. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022).
Liu, S., Li, F., Zhang, H. et al. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329 (2022).
Chen, Q., Chen, X., Zeng, G. et al. Group detr: Fast training convergence with decoupled one-to-many label assignment. arXiv preprint arXiv:2207.13085 (2022).
Jia, D., Yuan, Y., He, H. et al. Detrs with hybrid matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 19702–19712 (2023).
Ren, S. et al. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2016).
Tian, Z., Shen, C., Chen, H. et al. Fcos: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355 (2019).
He, K., Zhang, X., Ren, S. et al. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
Haase, D. & Amthor, M. Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved mobilenets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 14600–14609 (2020).
Chen, J., Wang, X., Guo, Z. et al. Dynamic region-aware convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 8064–8073 (2021).
Li, Y., Mao, H., Girshick, R. et al. Exploring plain vision transformer backbones for object detection. In European Conference on Computer Vision 280–296 (Springer Nature Switzerland, 2022).
Zhu, X., Su, W., Lu, L. et al. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020).
Chen, Q., Chen, X., Wang, J. et al. Group detr: Fast detr training with group-wise one-to-many assignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision 6633–6642 (2023).
Chen, F., Zhang, H., Hu, K. et al. Enhanced training of query-based object detection via selective query recollection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 23756–23765 (2023).
Du, D., Zhu, P., Wen, L. et al. Visdrone-det2019: the vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019).
Du, D. et al. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV) 370–386 (2018).
Loshchilov, I. & Hutter, F. Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016).
Zhang, H. et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2736–2746 (2022).
Tan, M. & Le, Q. Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, Proceedings of Machine Learning Research 6105–6114 (2019).
Bello, I. Lambdanetworks: modeling long-range interactions without attention. arXiv preprint arXiv:2102.08602 (2021).
Cubuk, E. D., Zoph, B., Shlens, J. & Le, Q. V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 702–703 (2020).
Zhang, H., Cisse, M., Dauphin, Y. N. & Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Wan, L., Zeiler, M., Zhang, S., Le Cun, Y. & Fergus, R. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, Proceedings of Machine Learning Research 1058–1066 (2013).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7132–7141 (2018).
Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1492–1500 (2017).
Cai, Z. & Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6154–6162 (2018).
Ghiasi, G. et al. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2918–2928 (2021).
Chen, K. et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019).
Li, X. et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural. Inf. Process. Syst. 33, 21002–21012 (2020).
Meng, D. et al. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision 3651–3660 (2021).
Wang, Y., Zhang, X., Yang, T. & Sun, J. Anchor detr: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36 2567–2575 (2022).
Gao, Z., Wang, L., Han, B. & Guo, S. Adamixer: A fast-converging query-based object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 5364–5373 (2022).
Wu, T.-H., Wang, T.-W. & Liu, Y.-Q. Real-time vehicle and distance detection based on improved yolo v5 network. In 2021 3rd World Symposium on Artificial Intelligence (WSAI) 24–28 (IEEE, 2021).
Zhou, L. et al. A multi-scale object detector based on coordinate and global information aggregation for uav aerial images. Remote Sens. 15, 3468 (2023).
Wang, G. et al. Uav-yolov8: a small-object-detection model based on improved yolov8 for uav aerial photography scenarios. Sensors 23, 7190 (2023).
Zhu, X., Lyu, S., Wang, X., Zhao, Q. & Wang, X. Tph-yolov5: Improved yolov5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2778–2788 (2021).
Zhou, L., Zhao, S., Wan, Z. & Zhang, Y. Mfefnet: a multi-scale feature information extraction and fusion network for multi-scale object detection in uav aerial images. Drones 8, 186 (2024).
Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. Yolox: exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).
Acknowledgements
The authors would like to express their gratitude to the National Natural Science Foundation of China (Grant No. 11761066) and the Doctoral Fund of Xinjiang University (Grant No. 62031224735) for their financial support of this study.
Author information
Contributions
Conceptualization, X.L. and A.R.; Methodology, X.L. and X.G.; Software, X.L. and X.G. ; Validation, X.L. and X.G.; Formal Analysis, X.L.; Investigation, X.L.; Resources, X.L.; Data Curation, X.L.; Writing-Original Draft Preparation, X.G. and Y.Z.; Writing-Review & Editing, A.R.; Visualization, X.G. and X.L.; Supervision, A.R. and A.H.; Project Administration, A.R. and A.H.; Funding Acquisition, A.R. and A.H.; All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liao, X., Guo, X., Rozi, A. et al. End to end polysemantic cooperative mixed task trainer for UAV target detection. Sci Rep 14, 29775 (2024). https://doi.org/10.1038/s41598-024-81201-8