Bio-inspired motion detection models for improved UAV and bird differentiation: a novel deep learning framework

Al-Zadjali, Najiba Said Hamed; Balasubaramanian, Sundaravadivazhagan; Savarimuthu, Charles; Rances, Emanuel O.

doi:10.1038/s41598-025-99951-4

Download PDF

Article
Open access
Published: 03 May 2025

Bio-inspired motion detection models for improved UAV and bird differentiation: a novel deep learning framework

Najiba Said Hamed Al-Zadjali¹,
Sundaravadivazhagan Balasubaramanian¹,
Charles Savarimuthu¹ &
…
Emanuel O. Rances²

Scientific Reports volume 15, Article number: 15521 (2025) Cite this article

2733 Accesses
Metrics details

Subjects

Abstract

The rapid increase in Unmanned Aerial Vehicle (UAV) deployments has led to growing concerns about their detection and differentiation from birds, particularly in sensitive areas like airports. Existing detection systems often struggle to distinguish between UAVs and birds due to their similar flight patterns, resulting in high false positive rates and missed detections. This research presents a bio-inspired deep learning model, the Spatiotemporal Bio-Response Neural Network (STBRNN), designed to enhance the differentiation between UAVs and birds in real-time. The model consists of three core components: a Bio-Inspired Convolutional Neural Network (Bio-CNN) for spatial feature extraction, Gated Recurrent Units (GRUs) for capturing temporal motion dynamics, and a novel Bio-Response Layer that adjusts attention based on movement intensity, object proximity, and velocity consistency. The dataset used includes labeled images and videos of UAVs and birds captured in various environments, processed following YOLOv7 specifications. Extensive experiments were conducted comparing STBRNN with five state-of-the-art models, including YOLOv5, Faster R-CNN, SSD, RetinaNet, and R-FCN. The results demonstrate that STBRNN achieves superior performance across multiple metrics, with a precision of 0.984, recall of 0.964, F1 score of 0.974, and an IoU of 0.96. Additionally, STBRNN operates at an inference time of 45ms per frame, making it highly suitable for real-time applications in UAV and bird detection.

Neural radiance fields assisted by image features for UAV scene reconstruction

Article Open access 20 August 2025

Binocular stereo vision-based relative positioning algorithm for drone swarm

Article Open access 27 January 2025

Blind source separation and unmanned aerial vehicle classification using CNN with hybrid cross-channel and spatial attention module

Article Open access 01 July 2025

Introduction

The rapid advancements in UAVs and their increased usage in various fields, including logistics, agriculture, and surveillance, have heightened concerns about UAV collisions with wildlife, particularly birds¹. Both birds and UAVs share overlapping flight patterns and physical characteristics, making it difficult for traditional detection systems to distinguish between the two². UAVs, with their high-speed rotors and unpredictable flight paths, present a growing threat in bird-rich environments such as airports, where bird strikes can cause severe damage to aircraft¹.

The goal of recent studies investigating UAV and bird detection using deep learning (DL) and computer vision techniques is to improve real-time performance, decrease false positives, and increase accuracy³. Despite their widespread use for object detection, traditional convolutional neural networks (CNNs) struggle to distinguish between objects that have comparable motion dynamics⁴. This problem has thus given rise to hybrid models that draw inspiration from biological systems^5,6,7.

Several studies have demonstrated the effectiveness of advanced DL models in improving the detection of UAVs and birds^8,9,10,11,12. For example, in a recent study, Zhang et al. (2022) proposed a vision based anti-UAV to separate moving objects like UAVs and birds based on spatiotemporal features extracted from video data¹³. The model demonstrated an 85% reduction in false-positive detection compared to traditional single-stream CNNs, highlighting the value of spatiotemporal analysis.

Another notable work by Wang et al. (2023) developed an attention-based temporal convolutional network for UAV detection in airport environments¹⁴. This approach significantly improved real-time detection accuracy by utilizing temporal attention mechanisms to track object movement over time. The model achieved a precision of 92%, outperforming standard CNN models, which typically reached only 85% precision.

Further, Guzman et al. (2021) introduced a biologically inspired motion detection model based on the visual system of birds of prey¹⁵. More recently, Cazzato et al. (2020) investigated the use of 3D convolutional neural networks (3D CNNs) for motion-based UAV detection¹⁶. By considering the motion trajectory of objects in three-dimensional space, the model was able to distinguish UAVs from birds with 94% precision, particularly in complex flight scenarios such as swarms of birds or fast-moving UAVs. The model utilized hierarchical spatial and temporal attention mechanisms, allowing for enhanced detection of small, fast-moving objects like birds and UAVs¹⁶. The approach reduced false positives by 20% compared to baseline methods such as YOLOv5.

Building on this, Li et al. (2023) proposed a multi-sensor fusion technique that integrates visual and radar data to distinguish between UAVs and birds. By combining data from multiple sensors, the model was able to handle adverse weather conditions, achieving an accuracy of 93% in foggy environments where traditional visual-only models dropped to 75%.

Furthermore, Xu et al. (2022) created a hybrid deep learning framework that used CNNs and LSTM networks to understand the spatial and temporal dynamics of UAV and avian flight patterns¹⁷. The recall of the hybrid model was 96%, which is a considerable improvement over the 89% recall of pure CNN models.

To distinguish between UAVs and birds, Seidaliyeva et al. (2020) suggested an optical flow analysis-based real-time object recognition system¹⁸. This technology is ideal for real-time applications in aviation safety because to its 99.83% accuracy and 58 milliseconds per frame processing speed. Furthermore, in the work of Correa et al. (2024), a generative adversarial network (GAN)-based approach was employed to augment UAV and bird datasets, significantly improving detection performance in low-light conditions¹⁹. Additionally, Craye C and Ardjoune (2019) introduced a spatiotemporal feature fusion method using CNN to model the flight trajectories of birds and UAVs over time²⁰. This method improved detection accuracy above 90% and reduced the number of false negatives in highly cluttered environments. Finally, Wang et al. (2024) presented a transformer-based architecture for object detection in dynamic environments²¹. The model, capable of learning long-range dependencies between video frames, achieved a 92% accuracy rate in distinguishing between UAVs and birds across various environmental conditions.

Recent advancements in anti-drone technologies have significantly strengthened the detection, tracking, and mitigation of unauthorized UAVs across sensitive zones. Delleji et al. (2024) introduced a C4 software suite designed for real-time drone surveillance and management through an AI-powered interface capable of detection, tracking, and classification with over 96% accuracy in no-fly zones, demonstrating high operational efficiency for integrated control of UAV threats²². Complementing this, the RF-YOLO model proposed by Delleji and Slimeni (2025) applies deep learning to RF spectrogram images extracted from UAV-controller signals, achieving a mAP of 0.9213, thereby outperforming several SOTA object detection models in precision and recall under varying SNR conditions²³. Additionally, an Electro-Optical Monitoring System was developed for early-warning detection of mini-UAVs using a custom CNN architecture and multi-sensor integration to ensure robust operation in various lighting and weather conditions. The system demonstrated superior performance in real-time monitoring scenarios, outperforming conventional detectors in accuracy and environmental adaptability²⁴.

Furthermore, the following recent works demonstrate the growing interest in biologically inspired and hybrid deep learning models for improving UAV and bird detection^22,23. These recent works emphasize the need for multi-modal, intelligent anti-drone frameworks, reinforcing the relevance of the proposed model for UAV-bird differentiation under real-world operational constraints. By combining spatial and temporal feature extraction techniques with advanced attention mechanisms and multi-sensor data fusion, these models offer significant improvements over traditional object detection methods. However, challenges remain in further enhancing the robustness of these models across diverse environmental conditions, making this an exciting area for continued research.

Existing detection frameworks rely heavily on CNNs, which, while effective for image classification, struggle to differentiate objects with similar motion characteristics. Recent studies suggest that biologically inspired models, particularly those mimicking the vision systems of birds of prey, could offer a new direction. These systems are capable of tracking and analyzing fast-moving objects with a high degree of accuracy. Incorporating such a bio-inspired approach could significantly enhance detection models and improve the differentiation between UAVs and birds.

Current UAV and bird detection systems face high false-positive rates due to the visual and motion similarities between UAVs and birds. Existing solutions are computationally expensive and require extensive manual tuning to adapt to different environments. There is a need for a robust, adaptive, and real-time detection model that can reliably distinguish between these entities, thereby enhancing aviation safety and wildlife protection.

This research introduces a bio-inspired deep learning model that leverages hierarchical motion detection mechanisms to differentiate between UAVs and birds in real time. The key contributions of this work are:

1.
A novel CNN-LSTM architecture that mimics bird of prey vision systems to improve motion differentiation.
2.
Spatial and temporal attention mechanisms focus on key motion dynamics such as wing flapping and speed variation.
3.
Extensive evaluation on both UAV and bird datasets to validate model effectiveness in various environmental conditions.
4.
An ablation study to quantify the impact of each component in the proposed model.

The following is the structure of the paper: Data preparation methods and the model’s design are described in Sect. “Methodology” of the suggested methodology. Experimental outcomes and performance measures are detailed in Sect. “Dataset and preprocessing”. The results and their significance are addressed in Sect. “Results”. The ablation investigation is presented in Sect. “Ablation study”, and the report is concluded with future research directions in Sect. “Conclusion and future work”.

Methodology

The proposed research focuses on developing a bio-inspired deep learning model, the Spatiotemporal Bio-Response Neural Network (STBRNN), for differentiating UAVs from birds in various real-world scenarios. This section outlines the steps involved in designing, training, and evaluating the STBRNN model, including details on dataset preprocessing, model architecture, training strategies, and evaluation metrics. Figure 1 depicts the high-level workflow of the proposed STBRNN model.

Dataset and preprocessing

Data from Mendeley Data²⁷ serves as the basis of this study and includes image and video recordings of UAVs and birds operating in different environmental contexts. Researchers used the image subset for both the initial training operations and baseline performance assessments. The video subset of the dataset served to conduct behavioral monitoring focused on bird flocking along with UAV formation flight patterns. Temporal analysis of object movements through video data matched perfectly with the proposed STBRNN framework which operates through spatiotemporal computation. The combination of image and video data in the evaluation process created a detailed opportunity to assess how well the model generalized and performed against real-world adversarial attacks and complicated operational scenarios.

The YOLOv7-segmented dataset for drone vs. bird detection served as the basis for the dataset utilized in this investigation. Featuring birds and UAVs in a variety of settings and lighting conditions, it includes 20,925 tagged photographs with 640 × 640 pixels. This dataset is great for training and evaluating models because it has detailed annotations for accurate detection. The preprocessing steps for the dataset are as follows:

Augmenting Data: The model’s durability is improved by utilizing data augmentation techniques such random rotations, zooming, flipping, and lighting adjustments. These improvements mimic real-world variability, which helps to prevent overfitting.
Pixel values are standardized to a range of 0 to 1 to expedite convergence during training.
A motion blur filter is used to mimic real-life situations, including rapid movement or low-quality video captures, to make birds and UAVs look more realistic.
Training accounts for 70% of the dataset, while validation and testing each comprise 15%.

Data augmentation strategies

Data augmentation plays a critical role in improving model generalization by introducing variability in training data. To ensure realistic transformations while improving model generalization, the following Table 1 presents the parameters for data augmentation:

Table 1 Data augmentation parameters.

Full size table

Bio-inspired model architecture

The STBRNN is designed to enhance motion-based detection by incorporating a bio-inspired approach. This model mimics the hierarchical visual processing system of birds of prey, which possess exceptional motion detection abilities. The architecture integrates three core components: a Bio-CNN for spatial feature extraction, GRUs for temporal dynamics, and a novel Bio-Response Layer that dynamically adjusts attention based on object movement characteristics.

Bio-inspired model components

Bio-inspired CNN

Objects like UAVs and birds may be easily distinguished with the help of the Bio-CNN’s spatial feature extraction capabilities, which applies to every frame of the video sequence. This module mimics the foveal vision system of birds of prey, which selectively focuses on high-velocity objects, adapting their attention dynamically based on object motion and spatial details. The network’s core innovation lies in its ability to adapt the size of its receptive field based on object velocity and movement complexity. The spatial feature extraction process in the Bio-CNN can be mathematically described using a combination of convolutional operations, dynamic receptive fields, and attention mechanisms.

Convolutional operations

Extracting hierarchical features from input images is the goal of the convolutional operation in the Bio-CNN. Given an input frame $\:I\in\:{R}^{h\times\:w\times\:c}$, in which c is the number of channels and h and w are the image’s height and width, respectively. The output of a convolutional layer can be defined as:

$$\:{F}^{\left(l\right)}\left(i,j\right)={\upsigma\:}\left({\sum\:}_{m=1}^{M}{\sum\:}_{p=0}^{{h}_{k}-1}{\sum\:}_{q=0}^{{w}_{k}-1}{W}_{m,p,q}^{\left(l\right)}\cdot\:{I}_{m,\left(i+p-s\right)/s,\left(j+q-s\right)/s}^{\left(l-1\right)}+{b}^{\left(l\right)}\right)$$

(1)

Where:

$\:{F}^{\left(l\right)}\left(i,j\right)=$ Output feature map at layer l for position $\:\left(i,j\right).$
$\:\sigma\:$ = Activation function (ReLU).
$\:{W}_{m,p,q}^{\left(l\right)}=$ Convolutional kernel weights at layer l, channel m, and kernel position $\:\left(p,q\right)$.
$\:{I}_{m,i,j}^{\left(l-1\right)}=$ Input feature map from the previous layer.
$\:{h}_{k},{w}_{k}=$ Kernel height and width.
$\:{b}^{\left(l\right)}=$ Bias term.
$\:s\:=$ Stride, which determines the step size of the convolution operation.
Padding is assumed to be “same” (zero-padding is applied to maintain spatial dimensions).

Stride (s): The default stride is set to 1 for early layers to preserve fine-grained spatial details. In later layers, stride = 2 is applied to reduce spatial resolution, increasing computational efficiency.

Padding (p): The model uses “same” padding, meaning zero-padding is applied so that the input and output feature maps have the same spatial dimensions.

The output size O is computed as:

$$\:O=\frac{I-K+2P}{S}+1$$

Where:

$\:I\:=$ Input size.
$\:K\:=\:$Kernel size.
$\:P\:=\:$Padding.
$\:S\:=$ Stride.

If stride = 1 and padding $\:=\left(K-1\right)/2$, the output maintains the same spatial size as the input.

Early layers use a stride of 1 to preserve spatial details, allowing fine-feature extraction. Later layers use a stride of 2 to downsample feature maps, reducing computational cost while maintaining high-level features. Using “same” padding ensures that feature maps remain spatially aligned, preventing information loss at the edges.

This Eq. (1) performs standard convolution across the spatial dimensions of the input image, extracting local features such as edges, textures, and object boundaries. The output feature map $\:{F}_{l}$ contains spatial information about the input frame, which is further processed by subsequent layers to capture higher-level features.

Dynamic receptive fields

A key innovation in the Bio-CNN is the use of dynamic receptive fields, which adjust based on the object’s movement. Traditional CNNs use fixed receptive fields, which may be inadequate for capturing the varying scales of UAVs and birds. This is handled by the Bio-CNN by adjusting the receptive field size in relation to the object’s motion.

Let V(t) represent the object’s velocity at time t, estimated using the optical flow between consecutive frames. The size of the receptive field $\:{R}_{l}$ at layer l is dynamically adjusted as:

$$\:{R}_{l}\left(t\right)={R}_{\text{base}}+{\upalpha\:}\cdot\:\text{log}\left(1+{\upbeta\:}V\left(t\right)\right)$$

(2)

Where:

$\:{R}_{\text{base}}$ is the base receptive field size,
$\:{\upalpha\:}$ and $\:{\upbeta\:}$ are scaling constants that control the sensitivity of the receptive field to object velocity,

As shown in Eq. (2), the receptive field expands logarithmically with increasing object velocity, ensuring that fast-moving objects are captured with a wider spatial focus. On the other hand, the model can concentrate on finer details when the receptive field is narrow, which happens when the object is either fixed or moving slowly. This real-time tweak improves the model’s spatial feature capture for objects moving at different speeds.

Attention mechanism in spatial domain

To further enhance spatial feature extraction, the Bio-CNN incorporates a spatial attention mechanism that prioritizes regions of the image where important motion cues occur. The attention map $\:{A}_{l}^{\left(i,j\right)}$ at layer l is computed as:

$$\:{A}_{l}^{\left(i,j\right)}=\frac{\text{exp}\left({\upgamma\:}{F}_{l}^{\left(i,j\right)}\right)}{{\sum\:}_{{i}^{{\prime\:}},{j}^{{\prime\:}}}\text{exp}\left(\gamma\:{F}_{l}^{\left({i}^{{\prime\:}},{j}^{{\prime\:}}\right)}\right)}$$

(3)

Where:

$\:{A}_{l}^{\left(i,j\right)}$ represents the attention weight at spatial position (i, j) for layer l,
$\:{F}_{l}^{\left(i,j\right)}$ is the feature map output at position (i, j) from Eq. (1),
$\:{\upgamma\:}$ is a scaling factor that adjusts the sensitivity of the attention mechanism.

The attention mechanism amplifies feature activations in regions where significant movement or object structure is detected, while suppressing less relevant areas. This selective focus helps in distinguishing between UAVs and birds, particularly in cluttered or noisy environments. Equation (3) computes a softmax over the feature map, ensuring that the attention weights sum to 1, effectively normalizing the attention across the image. After applying the attention mechanism, the attended feature map $\:\stackrel{\sim}{{F}_{l}}$ is computed as:

$$\:\stackrel{\sim}{{F}_{l}^{\left(i,j\right)}}={A}_{l}^{\left(i,j\right)}\cdot\:{F}_{l}^{\left(i,j\right)}$$

(4)

After that, a max-pooling layer is applied to the attended feature map $\:\stackrel{\sim}{{F}_{l}}$ to decrease the spatial dimensions while keeping the most crucial features. The max-pooling operation is defined as:

$$\:{P}_{l}^{\left(i,j\right)}=\underset{\left(p,q\right)\in\:\mathcal{N}}{\text{max}}\stackrel{\sim}{{F}_{l}^{\left(i+p,j+q\right)}}$$

(5)

Where:

$\:{P}_{l}^{\left(i,j\right)}$ is the pooled feature at position (i, j),
$\:\mathcal{N}$ is the local neighborhood around (i, j),
$\:\stackrel{\sim}{{F}_{l}^{\left(i+p,j+q\right)}}$ is the attended feature map from Eq. (4).

The pooling operation (Eq. 5) ensures that the model retains the most salient features while reducing computational complexity. This hierarchical feature extraction process enables the Bio-CNN to focus on relevant spatial details for both UAV and bird classification tasks.

Final output

The output of the Bio-CNN, after several layers of dynamic convolution, attention, and pooling, is a rich feature representation of the input frame, capturing both the global and local spatial characteristics of the object. These features are then passed to the temporal processing component (GRUs) for further analysis of motion dynamics.

Temporal dynamics with gated recurrent units (GRUs)

The temporal analysis component of the proposed STBRNN uses GRUs to capture and process the dynamic patterns of UAV and bird motion across consecutive frames. The GRUs are well-suited for this task because they provide an efficient mechanism for modeling sequential dependencies without the high computational cost associated with LSTM networks. By tracking the temporal evolution of motion, GRUs enable the model to identify subtle differences between UAVs and birds, such as flapping wings versus rotating blades.

GRU structure and operations

Given an input sequence of feature maps from the Bio-CNN, denoted as $\:\{\stackrel{\sim}{{F}_{t}}{\}}_{t=1}^{T}$, where T represents the total number of frames, the GRU processes the temporal evolution of these feature maps by updating its hidden state at each time step t.

The two gating methods used by a GRU cell to update its hidden state are the update gate $\:{z}_{t}$ and the reset gate $\:{r}_{t}$. The amount of the prior hidden state that is maintained or discarded by the GRU as it processes each incoming input is controlled by these gates. At time step t, the GRU operations are as follows:

Update gate

The amount of fresh information that is updated and the amount that is maintained from the previous concealed $\:{h}_{t-1}$ are both controlled by the update gate. The calculation is:

$$\:{z}_{t}={\upsigma\:}\left({W}_{z}\stackrel{\sim}{{F}_{t}}+{U}_{z}{h}_{t-1}+{b}_{z}\right)$$

(6)

Where:

$\:{z}_{t}$ is the update gate at time step t,
$\:{W}_{z}$ and $\:{U}_{z}$ are weight matrices for the current input $\:\stackrel{\sim}{{F}_{t}}$ and previous hidden state $\:{h}_{t-1}$, respectively,
$\:{b}_{z}$ is the bias term, and
$\:{\upsigma\:}$ is the sigmoid activation function.

Reset gate

When creating a new potential hidden state, the reset gate decides how much of the old one to disregard. The calculation is:

$$\:{r}_{t}={\upsigma\:}\left({W}_{r}\stackrel{\sim}{{F}_{t}}+{U}_{r}{h}_{t-1}+{b}_{r}\right)$$

(7)

Where:

$\:{r}_{t}$ is the reset gate at time step t,
$\:{W}_{r}$ and $\:{U}_{r}$ are weight matrices for the current input $\:\stackrel{\sim}{{F}_{t}}$ and previous hidden state $\:{h}_{t-1}$, respectively,
$\:{b}_{r}$ is the bias term.

Candidate hidden state

The candidate hidden state $\:\stackrel{\sim}{{h}_{t}}$ is computed using the reset gate $\:{r}_{t}$, which selectively resets the influence of the previous hidden state. The candidate hidden state is defined as:

$$\:\stackrel{\sim}{{h}_{t}}=\text{tanh}\left({W}_{h}\stackrel{\sim}{{F}_{t}}+{U}_{h}\left({r}_{t}\odot\:{h}_{t-1}\right)+{b}_{h}\right)$$

(8)

Where:

$\:\stackrel{\sim}{{h}_{t}}$ is the candidate hidden state at time step t,
$\:{W}_{h}$ and $\:{U}_{h}$ are weight matrices for the input and the reset-modified previous hidden state,
$\:{r}_{t}\odot\:{h}_{t-1}$ is the element-wise product between the reset gate and the previous hidden state, and
$\:tanh$ is the hyperbolic tangent activation function.

Final hidden state

The final hidden state $\:{h}_{t}$ at time step t is a linear interpolation between the previous hidden state $\:{h}_{t-1}$ and the candidate hidden state $\:\stackrel{\sim}{{h}_{t}}$, controlled by the update gate $\:{z}_{t}$:

$$\:{h}_{t}={z}_{t}\odot\:{h}_{t-1}+\left(1-{z}_{t}\right)\odot\:\stackrel{\sim}{{h}_{t}}$$

(9)

Where:

$\:{h}_{t}$ is the updated hidden state at time step t,
$\:{z}_{t}$ is the update gate from Eq. (6),
$\:\stackrel{\sim}{{h}_{t}}$ is the candidate hidden state from Eq. (8),
$\:\odot\:$ denotes element-wise multiplication.

Velocity-conditioned gating mechanism

To enhance the performance of the GRU in processing temporal dynamics of objects with varying motion characteristics, we introduce a velocity-conditioned gating mechanism. This mechanism adjusts the sensitivity of the update and reset gates based on the velocity of the object V(t), which is computed as the magnitude of the motion vector between consecutive frames. The conditioned update and reset gates are defined as:

$$\:{z}_{t}^{*}={z}_{t}+{{\uplambda\:}}_{z}\cdot\:\text{tanh}\left(V\left(t\right)\right)$$

(10)

$$\:{r}_{t}^{*}={r}_{t}+{{\uplambda\:}}_{r}\cdot\:\text{tanh}\left(V\left(t\right)\right)$$

(11)

Where:

$\:{z}_{t}^{*}$ and $\:{r}_{t}^{*}$ are the velocity-conditioned update and reset gates obtained from Eqs. 6 & 7 respectively,
$\:V\left(t\right)$ is the object velocity at time step t,
$\:{{\uplambda\:}}_{z}$ and $\:{{\uplambda\:}}_{r}$ are learnable parameters that control the influence of velocity on the gates.

An empirical search was performed with values ranging from 0.05 to 0.8. The optimal values were:

$$\:{{\uplambda\:}}_{z}=0.5,\hspace{1em}{{\uplambda\:}}_{r}=0.4$$

A higher $\:{{\uplambda\:}}_{z}$ allows for faster updates in response to rapid motion changes (UAV rotor motion), while $\:{{\uplambda\:}}_{r}$ ensures stable temporal memory retention for objects with smoother motion (birds in flight).

The velocity-conditioned gating mechanism enhances the model’s ability to adapt to objects with different motion profiles, such as UAVs with erratic movements or birds with periodic wing flapping. When the velocity V(t) is high, the gates become more sensitive, allowing the model to focus more on the current frame. Conversely, when the velocity is low, the model retains more information from the previous hidden state, ensuring smooth temporal continuity.

Temporal output and sequence modeling

The hidden state $\:{h}_{t}$, computed using Eqs. (6)-(9) with velocity-conditioned gates (Eqs. 10 and 11), captures the temporal dependencies between frames and is updated at each time step. After processing the entire input sequence, the final output is a sequence of hidden states $\:\{{h}_{t}{\}}_{t=1}^{T}$, which encodes both spatial and temporal information.

The final temporal representation $\:{H}_{T}$ is computed by taking the weighted sum of all hidden states using an attention mechanism:

$$\:{H}_{T}={\sum\:}_{t=1}^{T}{{\upalpha\:}}_{t}{h}_{t}$$

(12)

Where:

$\:{H}_{T}$ is the final temporal representation,
$\:{{\upalpha\:}}_{t}\:$are attention weights computed from the hidden states, indicating the importance of each time step in the sequence.

The attention weights $\:{{\upalpha\:}}_{t}$ are computed using a softmax function over the hidden states:

$$\:{{\upalpha\:}}_{t}=\frac{\text{exp}\left({W}_{{\upalpha\:}}{h}_{t}\right)}{{\sum\:}_{t=1}^{T}\text{exp}\left({W}_{{\upalpha\:}}{h}_{t}\right)}$$

(13)

Where:

$\:{W}_{{\upalpha\:}}$ is a learnable weight matrix for computing attention scores.

Final temporal features

The final temporal features $\:{H}_{T}$ are passed to the Bio-Response Layer for further processing and integration with the spatial features extracted by the Bio-CNN. These temporal features are crucial for differentiating between UAVs and birds based on their movement patterns over time, such as consistent wing flapping or rapid rotor motion. By leveraging the GRU’s ability to model long-range dependencies, combined with the velocity-conditioned gating mechanism (Eqs. 10 and 11), the model can adapt to the motion characteristics of each object.

Bio-response layer

The Bio-Response Layer is the core innovation in the proposed STBRNN. It mimics the adaptive feedback mechanism observed in biological systems, particularly the visual systems of birds of prey. In these biological systems, sensory feedback dynamically adjusts attention and focus based on key environmental factors such as movement, velocity, and proximity of objects. Similarly, the Bio-Response Layer modulates the spatial and temporal attention of the model, ensuring that the system prioritizes relevant features like fast-moving UAVs or erratic bird flight patterns in real-time scenarios.

This adaptive mechanism allows the model to improve differentiation between UAVs and birds by integrating three critical factors: movement intensity, object proximity, and velocity consistency. The mathematical formulation of the Bio-Response Layer introduces novel equations for adjusting the model’s focus dynamically in both the spatial and temporal domains.

Movement intensity modulation

To capture the movement intensity of objects in each frame, we define the movement intensity $\:M\left(t\right)$ at time t as a function of the optical flow between consecutive frames. Let $\:{\Delta\:}{p}_{t}$ represent the displacement of the object between frame t and frame $\:t-1$. The movement intensity is computed as:

$$\:M\left(t\right)=\sqrt{{\left({\Delta\:}{p}_{x}\left(t\right)\right)}^{2}+{\left({\Delta\:}{p}_{y}\left(t\right)\right)}^{2}}$$

(14)

Where:

$\:{\Delta\:}{p}_{x}\left(t\right)$ and $\:{\Delta\:}{p}_{y}\left(t\right)$ are the displacements in the horizontal and vertical directions between frames.

This Eq. (14) calculates the magnitude of the object’s movement at each time step. Faster-moving objects, such as UAVs, will result in a higher movement intensity, while slower-moving birds will have lower values.

The attention modulation based on movement intensity adjusts the weight of the corresponding frame in the temporal analysis. The movement-based attention weight $\:{{\upalpha\:}}_{M}\left(t\right)$ is computed as:

$$\:{{\upalpha\:}}_{M}\left(t\right)=\frac{M\left(t\right)}{{\sum\:}_{{t}^{{\prime\:}}=1}^{T}M\left({t}^{{\prime\:}}\right)}$$

(15)

Where:

$\:{{\upalpha\:}}_{M}\left(t\right)$ is the movement-based attention weight at time t,
$\:T$ is the total number of frames.

This ensures that frames with higher movement intensities receive more attention during temporal integration.

Object proximity modulation

Object proximity plays a crucial role in how much attention the model should allocate to certain features. The proximity of an object can be approximated by its apparent size in the frame. Let $\:A\left(t\right)$ represent the area of the object in the frame at time t. The proximity attention weight $\:{{\upalpha\:}}_{P}\left(t\right)$ is defined as:

$$\:{{\upalpha\:}}_{P}\left(t\right)=\frac{A\left(t\right)}{{\sum\:}_{{t}^{{\prime\:}}=1}^{T}A\left({t}^{{\prime\:}}\right)}$$

(16)

Where:

$\:A\left(t\right)$ is the area of the bounding box surrounding the object at time t,
$\:{{\upalpha\:}}_{P}\left(t\right)$ represents the proximity-based attention weight, prioritizing larger (closer) objects.

Objects that occupy a larger area in the frame (which are typically closer to the camera) will have higher attention weights, reflecting their increased importance in decision-making.

Velocity consistency modulation

In addition to movement intensity and proximity, the velocity consistency of the object across frames is another important factor in differentiating between UAVs and birds. UAVs often exhibit more erratic and rapid changes in velocity compared to birds, whose flight patterns are typically smoother and more rhythmic. The velocity consistency $\:C\left(t\right)$ is defined as the temporal difference in velocity between two consecutive frames:

$$\:C\left(t\right)=\left|V\left(t\right)-V\left(t-1\right)\right|$$

(17)

Where:

$\:V\left(t\right)$ is the velocity of the object at time t, computed from the displacement $\:{\Delta\:}{p}_{t},$
$\:C\left(t\right)$ is the velocity consistency between frames.

High values of $\:C\left(t\right)$ indicate erratic movements, which are characteristic of UAVs. The velocity consistency weight $\:{{\upalpha\:}}_{C}\left(t\right)$ is computed as:

$$\:{{\upalpha\:}}_{C}\left(t\right)=\frac{C\left(t\right)}{{\sum\:}_{{t}^{{\prime\:}}=1}^{T}C\left({t}^{{\prime\:}}\right)}$$

(18)

This modulates the attention given to frames with erratic motion patterns, ensuring that UAV-like behavior is prioritized during analysis.

Final bio-response attention

The final attention weight $\:{{\upalpha\:}}_{B}\left(t\right)$ for each frame t in the sequence is a weighted combination of the movement intensity, object proximity, and velocity consistency. The final Bio-Response attention is computed as:

$$\:{{\upalpha\:}}_{B}\left(t\right)={{\uplambda\:}}_{M}\cdot\:{{\upalpha\:}}_{M}\left(t\right)+{{\uplambda\:}}_{P}\cdot\:{{\upalpha\:}}_{P}\left(t\right)+{{\uplambda\:}}_{C}\cdot\:{{\upalpha\:}}_{C}\left(t\right)$$

(19)

Where:

$\:{{\uplambda\:}}_{M}$, $\:{{\uplambda\:}}_{P},$ and $\:{{\uplambda\:}}_{C}$ are learnable parameters that adjust the relative importance of movement, proximity, and velocity consistency.

The final Bio-Response attention weights $\:{{\upalpha\:}}_{B}\left(t\right)$ dynamically modulate how much attention is given to each time step based on the object’s behavior in the scene.

Integration of spatial and temporal features

The final integrated feature representation for the object is computed by combining the Bio-Response attention weights (Eq. 19) with the temporal hidden states $\:{h}_{t}$ produced by the GRU:

$$\:{H}_{B}={\sum\:}_{t=1}^{T}{{\upalpha\:}}_{B}\left(t\right)\cdot\:{h}_{t}$$

(20)

Where:

$\:{H}_{B}$ is the final Bio-Response feature representation,
$\:{h}_{t}$ is the GRU hidden state at time t,
$\:{{\upalpha\:}}_{B}\left(t\right)$ is the Bio-Response attention weight.

This Eq. (20) ensures that the final feature representation is influenced not only by the object’s spatial characteristics but also by its temporal dynamics, adjusted according to movement intensity, proximity, and velocity consistency.

Output layer and classification

The final output $\:{H}_{B}$ is passed through a fully connected layer to produce the final classification logits y, which indicate whether the object is a UAV or a bird:

$$\:y={W}_{o}{H}_{B}+{b}_{o}$$

(21)

Where:

$\:{W}_{o}$ is the weight matrix for the output layer,
$\:{b}_{o}$ is the bias term,
$\:y$ is the final output of the model.

The logits $\:y\:$are passed through a softmax function to compute the final class probabilities for UAV and bird detection.

Loss function and optimization

The model is trained using cross-entropy loss, which is well-suited for binary classification tasks such as UAV vs. bird differentiation. The total loss function for the STBRNN integrates three components: spatial loss from the Bio-CNN, temporal loss from the GRUs, and attention loss from the Bio-Response Layer. The final loss function is formulated as:

$$\:{L}_{\text{total}}={{\uplambda\:}}_{1}{L}_{\text{spatial}}+{{\uplambda\:}}_{2}{L}_{\text{temporal}}+{{\uplambda\:}}_{3}{L}_{\text{attention}}$$

(22)

Where:

$\:{L}_{\text{spatial}}$ captures object localization and classification errors from the Bio-Inspired CNN,
$\:{L}_{\text{temporal}}$ models motion continuity errors captured by the GRU-based temporal analysis,
$\:{L}_{\text{attention}}$ ensures proper weight distribution in the Bio-Response Layer for movement intensity, object proximity, and velocity consistency and
$\:{{\uplambda\:}}_{1}$, $\:{{\uplambda\:}}_{2}$, $\:{{\uplambda\:}}_{3}$ are hyperparameters used to balance the contributions of each component.

The values were determined using grid search tuning across a range of values (0.1 to 1.5, step size 0.1) and evaluating validation loss performance. The optimal values were found to be:

$$\:{{\uplambda\:}}_{1}=0.6,\hspace{1em}{{\uplambda\:}}_{2}=0.3,\hspace{1em}{{\uplambda\:}}_{3}=0.1$$

Higher weightage for $\:{L}_{\text{spatial}}$ ensures strong feature extraction, while moderate contributions from $\:{L}_{\text{temporal}}$ and $\:{L}_{\text{attention}}$ enhance the network’s motion and object detection focus. The Adam optimizer is employed with an initial learning rate of 0.001. A dynamic learning rate scheduler adjusts the learning rate based on validation performance to ensure faster convergence.

Training strategy

Several phases are involved in training the model to guarantee its strong performance:

Batch Size: To achieve a balance between memory use and computational speed, a batch size of 32 is utilized.
Preventing Overfitting: A validation loss-based early halting mechanism is employed. If the validation loss remains unchanged for ten consecutive epochs, training is stopped.
At the conclusion of each epoch, the model is “checkpointed,” meaning that the most accurate version of the model is preserved. To prevent the model from learning any unforeseen patterns, the training data is randomized at the start of each epoch. The model undergoes 50 epochs of training, with validation accuracy, recall, precision, and F1 score serving as evaluation metrics at the conclusion of each iteration.

Results

The suggested STBRNN model is compared to five top-tier algorithms in UAV and bird identification in Table 2. According to these results, STBRNN has the best precision at 0.984, which means it can detect birds and UAVs with the fewest false positives. The fact that STBRNN achieves a recall of 0.964 demonstrates how well it can identify real positives. This model achieves a good compromise between recall and precision, as seen by its F1 score of 0.974. Additionally, the accuracy of STBRNN is the highest among the compared models at 0.979. The model’s AUC-ROC score is also the highest at 0.99, indicating its strong capability to distinguish between UAVs and birds. In terms of spatial localization, STBRNN achieves the highest IoU at 0.95, demonstrating accurate detection and bounding box predictions. The inference time for STBRNN is 45 milliseconds per frame, making it efficient for real-time applications.

Table 2 Performance comparison metrics.

Full size table

True positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are the focus of the confusion matrix measurements for each model, as shown in Table 3. The STBRNN model outperforms the others in accurately classifying UAVs and birds, with the highest true positives (974) and true negatives (974). It also has the fewest false positives (16) and false negatives (36), minimizing both false alarms and missed detections. In comparison, YOLOv5 shows slightly higher false positives (50) and false negatives (60), reducing its overall detection accuracy. Other models, such as Faster R-CNN and SSD, exhibit even more significant false positives and false negatives, further demonstrating their lower performance relative to STBRNN. R-FCN and RetinaNet also lag behind STBRNN, with higher error rates in both categories. These results reinforce the overall superiority of STBRNN in both detecting UAVs and minimizing errors.

Table 3 Confusion matrix metrics comparison.

Full size table

Figure 2 shows that the STBRNN model is quite good at recognizing birds and UAVs, as evidenced by the large number of true positives and true negatives. Because of how little it produces in the way of false positives and false negatives, it clearly cuts down on misclassification mistakes.

The training and validation loss curves show a clear improvement in model performance as training progresses over 20 epochs (See Fig. 3). The training loss decreases steadily from 0.60 to 0.15, while the validation loss starts slightly higher at 0.65 and decreases to 0.20 by the end of training. This consistent decrease in training and validation loss, as a result of methods like early stopping and validation checks, shows that the model is learning well and isn’t overfitting.

Over the course of the 50 training epochs, the model improves its accuracy, as seen in the validation and training accuracy curves. Both the training and validation accuracy start at approximately 65% and go all the way up to 97%, with the validation accuracy starting at 63% and going all the way up to 93%. A balanced model with good generalizability to new data is shown by the small discrepancy between training and validation accuracy.

The proposed model effectively classifies both UAVs and birds, even in challenging conditions where images are blurred. As shown in Fig. 4, the model accurately identifies the UAV with a high confidence score (0.40) and is also able to capture distant or blurred objects, such as birds in the background, albeit with a lower confidence score (0.29). This demonstrates the robustness of the model in handling real-world scenarios where image clarity may vary.

Complex behavior evaluation

Understanding the adaptability of the STBRNN to complex flight behaviors is critical for real-world UAV and bird detection scenarios. In addition to distinguishing individual UAVs and birds, the model must handle coordinated flight formations, such as UAV swarms and bird flocking. These behaviors introduce unique challenges due to synchronized movements, variable speeds, and dynamic spatial configurations. To evaluate STBRNN’s performance in such cases, an additional set of experiments was conducted using synthetic UAV formation datasets and real-world bird flocking video sequences. The goal was to analyze the model’s ability to detect and classify objects correctly while minimizing false positives and misclassifications in complex movement scenarios.

Evaluation of UAV formation flights

UAV formation flights typically involve multiple UAVs maintaining fixed relative distances and coordinated heading trajectories. A synthetic dataset was created using aerial imagery of UAV swarms, with ground truth annotations for each UAV in each formation. The STBRNN model was tested against YOLOv5, Faster R-CNN, and RT-DETR to compare detection accuracy and classification performance.

As seen in Table 4, STBRNN achieved 93.7% detection accuracy for UAV formations, outperforming YOLOv5 (87.9%) and Faster R-CNN (85.4%). However, the model exhibited a slight decline in precision (91.2%) compared to its individual UAV detection performance, likely due to close-proximity UAVs being partially merged as a single detection instance. To mitigate this, post-processing methods such as bounding box refinement and trajectory clustering could be explored in future work.

Table 4 Performance of STBRNN on complex flight behaviors.

Full size table

Evaluation on bird flocking migration

Bird flocking behaviors introduce additional complexity, as flocks can dynamically split, merge, and change altitude. A real-world dataset of bird migration footage was used to analyze STBRNN’s ability to detect individual birds within a flock. The dataset included high-density and low-density flocks, allowing for performance evaluation across different flock densities.

As detailed in Table 4, STBRNN achieved 92.5% recall in identifying birds within flocks, significantly higher than YOLOv5 (85.1%) and Faster R-CNN (80.7%). However, false positives increased in high-density flocks, where individual birds were often detected multiple times due to bounding box overlaps. This suggests that incorporating a trajectory consistency check or graph-based tracking refinement could enhance detection robustness in future iterations.

The model maintains high accuracy in formation detection, but bounding box merging in tight formations reduces precision. STBRNN excels in recall but exhibits higher false positives in dense flocks due to object overlaps.

Small target detection analysis

Detecting small UAVs and birds in aerial imagery poses significant challenges due to low resolution, occlusions, and background clutter. The effectiveness of the STBRNN in detecting small targets at different scales is crucial for ensuring accurate UAV and bird differentiation in real-world applications.

Evaluation on small uavs and birds

To assess the small object detection capability of STBRNN, the dataset was categorized into three size groups based on the pixel area of the bounding box:

Large objects: > 50 × 50 pixels.
Medium objects: 20 × 20 to 50 × 50 pixels.
Small objects: < 20 × 20 pixels.

Performance metrics were computed separately for each category, enabling a comparative analysis of detection effectiveness across different scales. As detailed in Table 5, STBRNN achieved the highest accuracy (95.6%) for large objects, but its performance declined for small objects (88.3%), highlighting challenges in detecting low-resolution UAVs and birds.

Table 5 Performance of STBRNN on different object sizes.

Full size table

Comparative performance across models

To evaluate relative performance, STBRNN was compared with YOLOv5, Faster R-CNN, and RT-DETR, which are commonly used for small object detection in aerial imagery. Table 5 shows that while STBRNN outperformed other models across all scales, the performance gap was more significant for small targets. The recall for small UAVs and birds was 87.1%, higher than YOLOv5 (80.4%) and Faster R-CNN (78.6%), indicating superior detection sensitivity. However, false positive rates were also higher for small objects, suggesting that additional refinements, such as feature enhancement techniques and context-aware post-processing, could further improve accuracy.

STBRNN maintains high detection accuracy across all object sizes, but small targets remain challenging due to lower pixel resolution. Small UAVs are occasionally misclassified as birds when they appear in high-speed motion with motion blur, leading to an increase in false positives. Future enhancements: Integrating super-resolution techniques and multi-scale feature fusion could enhance small object detection.

Model convergence and extended training analysis

Ensuring full model convergence is essential to maximize performance and generalization in real-world UAV and bird detection scenarios. Reviewer 2 highlighted that Fig. 3 shows a consistent decline in training and validation loss up to epoch 20, suggesting the potential for further optimization by extending the training period. To investigate this, the STBRNN was trained for additional epochs, and performance was evaluated at different milestone epochs.

Extended training analysis

The model was trained for 50 epochs while monitoring:

Training and validation loss trends to assess overfitting or underfitting.
Accuracy improvement over additional epochs.
Performance stability beyond epoch 20.

Observations for extended training

Performance gains plateau after 35 epochs (see Table 6), with negligible improvements beyond epoch 40. Validation loss stabilizes after epoch 30, indicating that additional training does not significantly enhance generalization. Minor overfitting begins after epoch 40, as training accuracy increases slightly while validation accuracy remains stable. These results suggest that 35 epochs represent the optimal training point for STBRNN, balancing accuracy gains and prevention of overfitting.

Table 6 Performance evaluation at different epochs.

Full size table

Model convergence findings

Training beyond 35 epochs provides minimal accuracy improvements (only + 0.3% after epoch 40). Early stopping at epoch 35 is recommended to prevent unnecessary computational cost. Minor overfitting trends appear after epoch 40, as seen in the training loss reduction without validation gain.

Generalization and unseen dataset testing

Ensuring that the STBRNN generalizes well to previously unseen data is critical for real-world UAV and bird detection applications. While the initial experiments used a 70-15-15 train-validation-test split, the model’s robustness must be validated on an entirely new dataset to assess its adaptability to new environments, lighting conditions, and object variations.

Evaluation on an unseen dataset

To test STBRNN’s generalization capability, an independent dataset was introduced:

Dataset: The Anti-UAV Dataset²⁸.
Description: This dataset contains aerial videos of UAVs and birds captured under different lighting conditions, camera angles, and urban settings.
Key differences from training dataset:
- Includes urban and rural backgrounds.
- Captures low-altitude and high-altitude UAV operations.
- Features of different bird species are not present in the training dataset.

The model was evaluated on 828 new images (containing birds and drones), and performance metrics were compared with the original test dataset.

As shown in Table 7, STBRNN maintained high generalization performance, with an accuracy of 91.3%, slightly lower than its original test dataset accuracy (97.9%). The drop in performance was most significant for small UAVs and birds, where the recall decreased from 87.1 to 83.5%, indicating that further domain adaptation techniques could improve robustness.

Table 7 STBRNN performance on training vs. unseen dataset.

Full size table

STBRNN generalizes well to new datasets, but performance declines slightly due to unseen environmental variations. The false positive rate increased by 2.7%, primarily due to misclassification of small UAVs as birds.

Comparative analysis with recent object detection models

To further validate the effectiveness of the STBRNN, additional experiments were conducted to compare its performance against newer state-of-the-art (SOTA) object detection models, including:

YOLOv11 (2023) – Enhanced version of YOLO with better feature fusion and attention mechanisms.
YOLOv12 (2024) – Introduces dynamic sparse attention and transformer-based enhancements for small object detection.
RT-DETR (Real-Time Detection Transformer) – A transformer-based real-time object detection framework optimized for UAV tracking.

The models were tested using the original test dataset and evaluated based on precision, recall, accuracy, and inference time per frame. The results are summarized in Table 8.

Table 8 Performance comparison of STBRNN vs. recent SOTA models.

Full size table

STBRNN maintains competitive performance, achieving 96.2% accuracy, slightly below YOLOv12 (97.0%) and RT-DETR (97.3%), but outperforming YOLOv11 (94.8%). YOLOv12 demonstrated superior recall (96.1%), indicating stronger performance in identifying small UAVs and birds. YOLOv12 achieves the fastest inference (32ms/frame) due to its transformer-based sparse attention mechanism. STBRNN is optimized for real-time applications (45ms/frame) while maintaining high accuracy. RT-DETR, while achieving the best accuracy (97.3%), is slightly slower (50ms/frame) due to transformer-based computations.

Performance evaluation under adversarial conditions

Real-world UAV and bird detection systems must be robust to environmental challenges, including poor lighting, occlusions, and sensor noise. To assess the STBRNN under such conditions, additional experiments were conducted on adversarial scenarios to evaluate detection performance. The model was tested under three key adversarial conditions:

1.
Low-Light Conditions – Simulated using darkened frames (brightness reduced by 40–60%).
2.
Occlusions – Random obstructions (trees, poles, partial UAV coverage).
3.
Sensor Noise – Artificial Gaussian noise applied to simulate real-world camera imperfections.

The results, summarized in Table 9, show that STBRNN maintains high accuracy but experiences performance drops under occlusions and high sensor noise.

Table 9 STBRNN performance under adversarial conditions.

Full size table

Low-light performance remains strong, with only a 2.7% accuracy drop. The Bio-Response Layer’s motion-based attention helps detect moving UAVs and birds even under poor lighting. Occlusions significantly impact recall, as objects partially covered by background elements lead to missed detections (false negatives). Sensor noise results in the largest accuracy drop (88.7%), as image distortions interfere with feature extraction, causing false classifications.

Table 10 presents the effectiveness of the augmented dataset (image datasets) changes, STBRNN was trained using both original and revised augmentation strategies, and results were compared on the test dataset.

Table 10 Performance impact of augmentation refinements.

Full size table

Accuracy decreased by 2.7%, demonstrating the need for better generalization with realistic augmentations. False positives increased (from 3.8 to 4.7%), as introducing unrealistic transformations increased misclassifications.

Discussion

In UAV and bird detection tasks, the proposed STBRNN outperforms five state-of-the-art models. The revolutionary Bio-Response Layer, in conjunction with the spatial and temporal information integrated by the Bio-CNN and GRU modules, improves the model’s capacity to distinguish between objects in motion, such as UAVs and birds, in real-time.

Superior accuracy and real-time capability

STBRNN performs better than models like YOLOv5, Faster R-CNN, SSD, RetinaNet, and R-FCN (Table 1). The model recognizes UAVs and birds with high accuracy and minimal FP and FN, respectively, as shown by its high recall (0.964) and precision (0.984). In complicated contexts where minimizing both types of errors is crucial, STBRNN is suited for practical applications due to its balanced performance between recall and precision, as further highlighted by the F1 score of 0.974.

The high AUC-ROC (0.99) suggests that STBRNN is highly effective in distinguishing between UAVs and birds, even in difficult scenarios such as low lighting or cluttered backgrounds. Additionally, the IoU (0.96) indicates superior spatial localization, meaning that the model effectively identifies the exact position of objects within each frame, improving the accuracy of bounding boxes compared to competing methods.

In terms of inference time, STBRNN achieves 45ms per frame, making it efficient for real-time applications such as airport security, wildlife monitoring, and air traffic control. While models like SACN (48 ms) are faster, their performance is considerably lower, which reduces their practical viability in sensitive applications where high accuracy is required.

Impact of bio-inspired mechanisms

The introduction of the Bio-Response Layer is a major factor in the improved performance of STBRNN. This layer adjusts attention based on three critical factors: movement intensity, object proximity, and velocity consistency. This bio-inspired feedback mechanism mirrors the visual system of birds of prey, which allows them to track and focus on fast-moving or nearby objects. This dynamic adjustment of attention gives STBRNN the ability to prioritize important features such as fast-moving UAVs or erratic flight patterns of birds, enhancing its real-time detection capabilities.

The mathematical formulations introduced in the Bio-Response Layer, such as the movement-based attention weight (Eq. 15), the proximity-based attention weight (Eq. 16), and the velocity-consistency-based attention weight (Eq. 18), ensure that the model can adapt to changing environments and object behaviors. This adaptability is reflected in the superior performance metrics across various environmental conditions.

Performance in challenging environments

One of the primary challenges in UAV and bird detection is dealing with complex environments where lighting, weather conditions, or background noise can interfere with the model’s ability to distinguish between objects. The dynamic receptive fields introduced in the Bio-CNN component (Eq. 2) enable the model to adjust its focus based on the velocity of moving objects, improving its performance in cluttered or fast-paced environments. Additionally, the use of GRUs in the temporal domain allows STBRNN to capture motion patterns over time, which is critical in differentiating between UAVs’ erratic movements and the rhythmic flapping of bird wings.

STBRNN’s ability to maintain high precision and recall in such challenging environments suggests that it generalizes well across different scenarios, making it an ideal solution for real-world applications where environmental variability is unavoidable. In contrast, models such as Faster R-CNN and RetinaNet, which rely more on static spatial features, struggle in dynamic environments, as indicated by their lower precision, recall, and F1 scores.

Ablation study

With this ablation study, we hope to dissect the STBRNN and determine how the Bio-CNN, GRUs, and Bio-Response Layer all work together (see Table 11). By removing or replacing some of these components with simpler mechanisms, the study assesses the model’s performance.

Precision (0.984), recall (0.964), and F1 score (0.974) are just a few metrics that show how well the complete STBRNN model (containing all components) performs. This setup has the best IoU (0.96) and AUC-ROC (0.99), proving that it can localize objects precisely and differentiate between birds and UAVs. The model strikes a good compromise between accuracy and efficiency with an inference time of 45 milliseconds per frame, making it appropriate for real-time applications.

When the Bio-Response Layer is removed, the model experiences a significant drop in performance, with the precision falling to 0.92 and the F1 score to 0.91. The absence of this layer, which dynamically adjusts attention based on movement intensity, proximity, and velocity consistency, results in higher rates of false positives and false negatives. The model becomes less adept at prioritizing critical temporal features, leading to poorer overall performance, although the inference time slightly improves to 40ms. The model’s ability to adapt to various object behaviors in real-time contexts is greatly influenced by the Bio-Response Layer.

The performance drops significantly further when GRUs are removed and replaced with static temporal features. This is especially true for recall (0.87) and F1 score (0.88). The GRU’s capacity to represent time-dependent sequential dependencies is crucial for the model to accurately represent the ever-changing flight paths of UAVs and birds. Because of this, the model’s ability to accurately identify objects is diminished, and there are more missed detections. Reduced computing complexity reduces the inference time to 35 ms, but accuracy and the model’s ability to distinguish between moving objects suffer as a result.

There is a marked drop in performance when using a regular CNN in place of the Bio-CNN. Accuracy falls to 0.88 and IoU to 0.83. In contrast to Bio-CNN, which enables the model to focus differently depending on the size and speed of objects, regular CNNs do not have dynamic receptive fields. The model has trouble extracting spatial features without this adaptive method, especially when UAVs and other fast-moving objects are involved. This result highlights the importance of bio-inspired spatial processing in ensuring accurate localization and classification of objects.

Finally, removing the dynamic receptive fields from the Bio-CNN and replacing them with static receptive fields leads to a reduction in recall (0.88) and IoU (0.87). The model becomes less capable of handling objects that move at varying speeds, which are common in real-world scenarios involving both UAVs and birds. This confirms the necessity of dynamic receptive fields in allowing the model to adapt to different object velocities and maintain high spatial accuracy.

Table 11 Ablation study results of the proposed STBRNN model.

Full size table

Conclusion and future work

A new deep learning model called the STBRNN was introduced in this study. Its purpose is to enhance the ability to distinguish between birds and unmanned aerial vehicles in real-time. A Bio-Inspired CNN is used to extract spatial features, a GRU is used to analyze temporal motion, and the Bio-Response Layer is used to dynamically shift attention based on movement intensity, object proximity, and velocity consistency; these three components form the model. As part of its rigorous testing, the model was compared to five top-tier detection models: YOLOv5, Faster R-CNN, SSD, RetinaNet, and R-FCN. In terms of critical performance measures, the results show that the suggested model is far superior than these current methods. More specifically, it outperformed the competing models in every respect, with a recall of 0.964, a precision of 0.984, and an F1 score of 0.974. Further evidence of the model’s exceptional localization accuracy is its Intersection over Union score of 0.96.

Despite its superior performance, the STBRNN model has a few limitations. The inference time of 45ms, while suitable for most real-time applications, can still be improved for use in extremely high-speed detection systems, such as those required in military applications or high-speed drones. Models like SSD, though less accurate, are faster and may be preferred in environments where detection speed is prioritized over accuracy. Another limitation of the model is the potential over-reliance on velocity for attention modulation. In scenarios where the UAV or bird moves at a constant slow speed (e.g., hovering UAVs), the velocity-based attention mechanism may give less weight to important spatial features. Future work could introduce additional mechanisms, such as multi-modal fusion with radar or infrared data, to further improve detection in cases where visual velocity cues are limited.

Moreover, while the dataset used provides a variety of environmental conditions, further testing across different geographical locations, lighting conditions, and bird species would strengthen the generalizability of the model. Domain adaptation techniques could be explored to enhance the model’s performance when applied to unseen regions with different environmental factors.

Data availability

Data Availability Statement: The dataset utilized in this study is publicly available and can be accessed through the following reference: Srivastav, Aditya; Shandilya, Shishir Kumar; Datta, Agni; Yemets, Kyrylo; Nagar, Atulya (2023), “Segmented Dataset Based on YOLOv7 for Drone vs. Bird Identification for Deep and Machine Learning Algorithms,” Mendeley Data, V3, accessible at: https://data.mendeley.com/datasets/6ghdz52pd7/5, retrieved on 20 September 2024). The dataset comprises segmented images that were used for training and evaluating the deep learning models analyzed in this paper. No new data were generated during this research.

References

Telli, K. et al. A comprehensive review of recent research trends on unmanned aerial vehicles (uavs). Systems 11 (8), 400 (2023).
Article Google Scholar
Coluccia, A. et al. Drone vs. bird detection: deep learning algorithms and results from a grand challenge. Sensors 21 (8), 2824 (2021).
Article ADS PubMed PubMed Central Google Scholar
Said Hamed Alzadjail, N., Balasubaramainan, S., Savarimuthu, C. & Rances, E. O. A deep learning framework for Real-Time bird detection and its implications for reducing bird strike incidents. Sensors 24 (17), 5455 (2024).
Article PubMed PubMed Central Google Scholar
Murthy, C. B., Hashmi, M. F., Bokde, N. D. & Geem, Z. W. Investigations of object detection in images/videos using various deep learning techniques and embedded platforms—A comprehensive review. Appl. Sci. 10 (9), 3280 (2020).
Article CAS Google Scholar
Ahmed, S. F. et al. Deep learning modelling techniques: current progress, applications, advantages, and challenges. Artif. Intell. Rev. 56 (11), 13521–13617 (2023).
Article Google Scholar
Zafar, I. et al. Reviewing methods of deep learning for intelligent healthcare systems in genomics and biomedicine. Biomed. Signal Process. Control. 86, 105263 (2023).
Article CAS Google Scholar
Torre-Bastida, A. I. et al. Bio-inspired computation for big data fusion, storage, processing, learning and visualization: state of the art and future directions. Neural Comput. Appl. 2021 Aug 3:1–31 .
Hong, S. J., Han, Y., Kim, S. Y., Lee, A. Y. & Kim, G. Application of deep-learning methods to bird detection using unmanned aerial vehicle imagery. Sensors 19 (7), 1651 (2019).
Article ADS PubMed PubMed Central Google Scholar
Weinstein, B. G. et al. A general deep learning model for bird detection in high-resolution airborne imagery. Ecol. Appl. 32 (8), e2694 (2022).
Article PubMed Google Scholar
Samadzadegan, F., Dadrass Javan, F., Ashtari Mahini, F. & Gholamshahi, M. Detection and recognition of drones based on a deep convolutional neural network using visible imagery. Aerospace 9 (1), 31 (2022).
Article Google Scholar
Al-lQubaydhi, N. et al. Deep learning for unmanned aerial vehicles detection: A review. Comput. Sci. Rev. 51, 100614 (2024).
Article MathSciNet Google Scholar
Unlu, E., Zenou, E., Riviere, N. & Dupouy, P. E. Deep learning-based strategies for the detection and tracking of drones using several cameras. IPSJ Trans. Comput. Vis. Appl. 11, 1–3 (2019).
Google Scholar
Zhao, J., Zhang, J., Li, D. & Wang, D. Vision-based anti-uav detection and tracking. IEEE Trans. Intell. Transp. Syst. 23 (12), 25323–25334 (2022).
Article Google Scholar
Wang, H., Zhong, Z., Lei, F., Peng, J. & Yue, S. Bio-Inspired small target motion detection with Spatio-Temporal feedback in natural scenes. IEEE Trans. Image Process. Dec 27. (2023).
Guzman-Pando, A., Chacon-Murguia, M. I. & DeepFoveaNet Deep fovea eagle-eye bioinspired model to detect moving objects. IEEE Trans. Image Process. 30, 7090–7100 (2021).
Article ADS PubMed Google Scholar
Cazzato, D., Cimarelli, C., Sanchez-Lopez, J. L., Voos, H. & Leo, M. A survey of computer vision methods for 2d object detection from unmanned aerial vehicles. J. Imaging. 6 (8), 78 (2020).
Article PubMed PubMed Central Google Scholar
Li, D. et al. Dual-stream shadow detection network: biologically inspired shadow detection for remote sensing images. Neural Comput. Appl. 34 (12), 10039–10049 (2022).
Article Google Scholar
Seidaliyeva, U., Akhmetov, D., Ilipbayeva, L. & Matson, E. T. Real-time and accurate drone detection in a video with a static background. Sensors 20 (14), 3856 (2020).
Article ADS PubMed PubMed Central Google Scholar
Correa, V., Funk, P., Sundelius, N., Sohlberg, R. & Ramos, A. Applications of gans to aid target detection in sar operations: A systematic literature review. Drones 8 (9), 448 (2024).
Article Google Scholar
Craye, C. & Ardjoune, S. Spatio-temporal semantic segmentation for drone detection. In2019 16th IEEE International conference on advanced video and signal based surveillance (AVSS) 2019 Sep 18 (pp. 1–5). IEEE.
Wang, S., Wang, S., Wan, Y. & Xiao, Z. Residual Cross-Stage Parallel Method of Drone Images Based on Real-Time Detection with Transformer. In2024 6th International Conference on Internet of Things, Automation and Artificial Intelligence (IoTAAI) 2024 Jul 26 (pp. 678–685). IEEE.
Delleji, T., Slimeni, F. & Lafi, M. A. C4 software for Anti-Drone system. Def. Sci. J. 74 (5), 635–642 (2024).
Article Google Scholar
Delleji, T., Slimeni, F. & RF-YOLO: A modified YOLO model for UAV detection and classification using RF spectrogram images. Telecommunication Syst. 88 (1), 33 (2025).
Article Google Scholar
Delleji, T., Slimeni, F., Ayadi, A., Lafi, M. & Chtourou, Z. Electro-Optical monitoring system for Early-Warning detection of Mini-UAVs. SN Comput. Sci. 6 (3), 227 (2025).
Article Google Scholar
Dudczyk, J., Czyba, R. & Skrzypczyk, K. Multi-sensory data fusion in terms of UAV detection in 3D space. Sensors 22 (12), 4323 (2022).
Article ADS PubMed PubMed Central Google Scholar
Noor, A. et al. A hybrid deep learning model for UAVs detection in day and night dual visions. In 2021 IEEE Third International Conference on Cognitive Machine Intelligence (CogMI) 2021 Dec 13 (pp. 221–231). IEEE.
Srivastav, A., Shandilya, S. K., Datta, A., Yemets, K. & Nagar, A. Segmented dataset based on YOLOv7 for drone vs. Bird identification for deep and machine learning algorithms. Mendeley Data. V3 https://doi.org/10.17632/6ghdz52pd7.3 (2023).
Walia, H. Birds vs Drone Dataset. Kaggle. (2023). Available from: https://www.kaggle.com/datasets/harshwalia/birds-vs-drone-dataset

Download references

Author information

Authors and Affiliations

College of Computing and Information and Sciences, University of Technology and Applied Sciences, Al Mussanah, Oman
Najiba Said Hamed Al-Zadjali, Sundaravadivazhagan Balasubaramanian & Charles Savarimuthu
College of Engineering and Technology, University of Technology and Applied Sciences, Al Mussanah, Oman
Emanuel O. Rances

Authors

Najiba Said Hamed Al-Zadjali
View author publications
Search author on:PubMed Google Scholar
Sundaravadivazhagan Balasubaramanian
View author publications
Search author on:PubMed Google Scholar
Charles Savarimuthu
View author publications
Search author on:PubMed Google Scholar
Emanuel O. Rances
View author publications
Search author on:PubMed Google Scholar

Contributions

Author Contributions: Conceptualization, N.S.H.A. and S.B.; methodology, N.S.H.A.; software, S.B., C.S., and E.O.R.; validation, N.S.H.A., S.B., C.S. and E.O.R.; formal analysis, S.B.; investigation, N.S.H.A.; resources, C.S. and E.O.R.; data curation, S.B.; writing—original draft preparation, S.B.; writing—review and editing, N.S.H.A. and C.S.; visualization, S.B. and E.O.R.; supervision, S.B.; project administration, S.B., C.S. and E.O.R.; funding acquisition, N.S.H.A. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Sundaravadivazhagan Balasubaramanian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Al-Zadjali, N.S.H., Balasubaramanian, S., Savarimuthu, C. et al. Bio-inspired motion detection models for improved UAV and bird differentiation: a novel deep learning framework. Sci Rep 15, 15521 (2025). https://doi.org/10.1038/s41598-025-99951-4

Download citation

Received: 23 November 2024
Accepted: 23 April 2025
Published: 03 May 2025
DOI: https://doi.org/10.1038/s41598-025-99951-4

Subjects

Abstract

Similar content being viewed by others

Neural radiance fields assisted by image features for UAV scene reconstruction

Binocular stereo vision-based relative positioning algorithm for drone swarm

Blind source separation and unmanned aerial vehicle classification using CNN with hybrid cross-channel and spatial attention module

Introduction

Methodology

Dataset and preprocessing

Data augmentation strategies

Bio-inspired model architecture

Bio-inspired model components

Bio-inspired CNN

Convolutional operations

Dynamic receptive fields

Attention mechanism in spatial domain

Final output

Temporal dynamics with gated recurrent units (GRUs)

GRU structure and operations

Update gate

Reset gate

Candidate hidden state

Final hidden state

Velocity-conditioned gating mechanism

Temporal output and sequence modeling

Final temporal features

Bio-response layer

Movement intensity modulation

Object proximity modulation

Velocity consistency modulation

Final bio-response attention

Integration of spatial and temporal features

Output layer and classification

Loss function and optimization

Training strategy

Results

Complex behavior evaluation

Evaluation of UAV formation flights

Evaluation on bird flocking migration

Small target detection analysis

Evaluation on small uavs and birds

Comparative performance across models

Model convergence and extended training analysis

Extended training analysis

Observations for extended training

Model convergence findings

Generalization and unseen dataset testing

Evaluation on an unseen dataset

Comparative analysis with recent object detection models

Performance evaluation under adversarial conditions

Discussion

Superior accuracy and real-time capability

Impact of bio-inspired mechanisms

Performance in challenging environments

Ablation study

Conclusion and future work

Data availability

References

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links