Introduction

In the realm of low-altitude security, especially in cropland monitoring, anomaly detection has emerged as a crucial aspect1,2,3,4. Anomaly detection in this context refers to the identification of any deviation from normal patterns in the cropland environment. These anomalies can range from irregular crop growth due to pests, diseases, or adverse weather conditions to unauthorized intrusions. Detecting such anomalies at an early stage is vital for ensuring crop health, maximizing yields, and maintaining security in agricultural areas.

In the past, various approaches have shown potential for addressing anomaly detection in cropland monitoring. On one hand, machine-learning algorithms have demonstrated promising performance in anomaly detection. Yang et al.5 conducted a study on the use of a wireless sensor system for orchard management. Fent et al.6 designed an automated apple orchard monitoring system using the Internet of Things (IoT) to minimize resource use, enhance apple quality, and provide comprehensive data. The research7 presented a robotic platform specifically developed to monitor the state of plants. To address the monitoring difficulties encountered by apple orchards, Ref.8 devised a wireless sensor-driven system for monitoring apple orchards. The study9 introduced an algorithm to optimize drone functionality, addressing the fragility issue caused by high-weight functionalities in existing systems; it enhances drone performance across various paths by analyzing radial functions and data transmission coverage, and by incorporating motion signatures and a special identification system. The work of Ref.10 presented a method using cascading k-means clustering and the C4.5 decision tree for classifying anomalous and typical computer network operations. Recently, Shitharth Selvarajan11 investigated developments in bio-inspired optimization techniques, analyzing their unique characteristics, optimization performance, and operational paradigms, and demonstrated their potential for solving complex engineering problems. On the other hand, deep learning methods have shown favorable results in the field of anomaly detection. Convolutional networks (ConvNets) that include multi-scale and hierarchical architectures have significantly influenced the development of object detection12.
In their study, Grignaffini et al.13 introduced a convolutional neural network (CNN) model that incorporated handcrafted texture features of dermoscopic images as supplementary input during the training phase. He et al.14 suggested a method for constructing a high-performance framework for monitoring aberrant ECG rhythms, which effectively reduces the amount of data transmission; it offers a more efficient hardware implementation and uses fewer hardware resources than current models. Zeng et al.15 introduced a hierarchical spatio-temporal graph convolutional neural network for detecting anomalies in videos.

Despite the progress in related research, several research gaps remain. Existing machine-learning algorithms for anomaly detection in low-altitude airspace, especially those related to cropland monitoring using the Internet of Drones (IoD), struggle to efficiently handle the vast amounts of data generated. Many current methods rely on ConvNets, which may miss long-range relationships in the data. Additionally, there is a lack of comprehensive frameworks that can effectively utilize the multiple-view (multi-view) data obtained from drones in cropland monitoring. Since Vaswani et al. first proposed the transformer model for natural language processing (NLP)16, deep learning models based on transformers have been extensively used in the domain of machine vision, with self-attention as their crucial element. Furthermore, the vision transformer (ViT)17 differs from the hierarchical transformers often used in computer vision: it functions as a resilient, non-hierarchical framework, serving as a foundational structure for image categorization. In contrast, hierarchical transformer models like Swin18, MViT19, PVT20, and PiT21 reintroduced ConvNet concepts such as locality and pooling. UViT22 used the breadth, depth, and input resolution of ViT models, together with a progressive attention mechanism, to proficiently manage high-resolution images. Carion et al.23 presented DETR, a transformer-based framework for object detection. Wu et al.24 used a ConvNet to extract visual tokens and obtain the representation; transformers were then used to transform the extracted tokens and model the relationships between them. Kobayashi et al.25 introduced channel attention blocks as a method for anomaly identification, with the purpose of emphasizing important channel information.
The Partial Semantic Aggregation Vision Transformer, proposed by Yao et al.26, is a scalable framework for anomaly identification in industrial videos that allows simultaneous multi-category anomaly detection.

Bearing the above-mentioned analysis in mind, this study proposes an innovative transformer-based framework. The proposed framework is designed to effectively handle the intrinsic correlation among the multiple views obtained from the IoD in cropland scenarios. To reduce the global inductive bias, the proposed model is pre-trained on the large-scale ImageNet-ISLVRC dataset27 before being fine-tuned on manually collected farmland images. An attention mechanism is also presented, including a dynamic attention module based on a shifting window, and a novel loss function is used to enhance the accuracy of anomaly detection. Through rigorous experiments on 6803 frames from farmland scenes, the proposed strategy demonstrates superiority over current deep-learning methods.

The contributions of this study include the following:

  • This is an early exploration of anomaly detection in cropland monitoring.

  • A transformer-based model is proposed to realize the anomaly detection task.

  • Rigorous assessments establish the excellence of this work compared to the most advanced algorithms and illustrate its resilience across different workloads.

The remainder of this paper is organized as follows: the “Methods” section details the proposed methods, elaborating on the dataset, implementation steps, and key techniques involved. The “Results” section presents the experimental results, including the experimental settings and a comprehensive analysis of the obtained outcomes. The “Discussion” section explores the implications of the results and addresses potential limitations. Finally, the “Conclusion” section concludes the paper, summarizing the main findings, highlighting the contributions of the research, and suggesting directions for future work.

Methods

Dataset and image preprocessing

The Multi-View Vision Transformer (MVVT) introduced in this research was first trained on the ImageNet-ISLVRC database27, a publicly available dataset that has been widely used to enhance the precision of object detection and classification since 2010. The training dataset comprises 50,000 images, each assigned a label from a pool of 1000 categories. A total of 6803 frames were extracted from the images taken by 16 sets of drones (model: DJI JY03-4K; size: 31–40 cm; channels: 4; material: plastic). Images were captured in the sRGB color space under white fluorescent lighting. Data collection took place on a university campus in Zibo, Shandong Province, China. In general, the drones were arranged in a well-structured IoD system, where each drone was allocated a preplanned route to traverse across the campus. To maintain continuous operation, each drone was recharged every 15 min throughout the flight and collected scene images every 30 s. The collected images have a resolution of 8192 pixels in width and 4096 pixels in height. After data collection, each sample frame was classified as either anomalous or normal by a majority vote of three machine vision professionals. The dataset, consisting of 6803 image-label pairs, potentially contains missing pairs. To address this, a manual check was systematically performed to identify such missing pairs, and the reasons for their absence, such as data collection errors, were logged. For outlier detection, a combination of automated and manual methods was employed. Initially, automated screening flagged images with low resolution or abnormal dimensions. Subsequently, the research team manually inspected these flagged images to discriminate between genuine outliers and legitimate cases within the context of cropland scenarios.
Regarding imbalanced data, simple random oversampling, which involved duplicating minority class samples, and undersampling, through randomly deleting majority class samples, were applied to balance the distribution of normal and abnormal cropland scenarios.
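The rebalancing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the choice of the mean class size as the common target are assumptions, since the paper does not specify the balancing ratio.

```python
import random

def rebalance(samples, labels, seed=0):
    """Naive rebalancing sketch: random oversampling of the minority
    class (duplicating samples) combined with random undersampling of
    the majority class (deleting samples), meeting at the mean of the
    two class sizes (an assumption)."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]   # abnormal
    neg = [s for s, y in zip(samples, labels) if y == 0]   # normal
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    target = (len(pos) + len(neg)) // 2
    # oversample the minority class by duplicating random samples
    minority = minority + [rng.choice(minority)
                           for _ in range(target - len(minority))]
    # undersample the majority class by randomly deleting samples
    majority = rng.sample(majority, target)
    return minority, majority
```

With 8 normal and 2 abnormal samples, both classes end up with 5 samples each.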

It should be mentioned that the initially collected images were labeled using the annotation tool LabelMe. The annotated images were stored in the MS COCO format28, with both the images and the corresponding JSON file generated. Furthermore, a sequence of modifications was applied to the manually collected images in order to enhance their diversity. The changes included both horizontal and vertical mirroring, as well as rotation. Note that each image and its modified variants share the same label.
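The label-preserving augmentations above can be sketched as follows; the 90-degree rotation step is an assumption, since the paper only states that rotation was applied.

```python
import numpy as np

def augment(image, label):
    """Generate the augmented variants described above: horizontal and
    vertical mirroring plus rotation. Every variant keeps the original
    label, matching the annotation-preserving augmentation in the text."""
    variants = [
        np.fliplr(image),          # horizontal mirroring
        np.flipud(image),          # vertical mirroring
        np.rot90(image, k=1),      # rotation (90 degrees, assumed)
    ]
    return [(v, label) for v in variants]
```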

The proposed multi-view vision transformer framework

Initially, a single-view vision transformer (ViT) was utilized. However, to better handle the local structures in cropland data, the shifted window structure from the Swin Transformer was incorporated, including W-MSA and SW-MSA. This change reduced computational complexity while enhancing local pattern recognition. Subsequently, considering the multi-view nature of the captured cropland data samples, a dual-view Swin Transformer was introduced to integrate complementary information from different perspectives, aiming to improve anomaly detection accuracy.

This subsection details the proposed MVVT architecture for detecting anomalies in farmland. The model is constructed based on a vision transformer that incorporates shifted windows and attention mechanisms. It is worth mentioning that the attention mechanism has been used in several studies, such as Refs.18,19,20, either alone or in conjunction with convolutional layers. However, the analysis demonstrates that transformer-based models may provide results comparable to CNN and hybrid designs. The architecture of the proposed MVVT is shown in Fig. 1; it is adapted from the works of ViT29 and Swin18. The MVVT architecture takes a series of image patches as its input.

Fig. 1

The proposed pipeline for anomaly detection in a cropland.

Input of the proposed model

The model takes split image patches as input, obtained from images gathered by the drones. The proposed vision transformer incorporates position embeddings into its input to provide spatial information. In a real-life deployment, each UAV begins its flight 30 s prior to acquiring 1 s of video footage. Throughout the flight, each UAV gathers video recordings at 30-s intervals. Note that the velocity of the UAVs is below 3 m per second, and the UAVs are capable of capturing images from distances beyond 100 m.

This model utilizes 2-dimensional (2D) embeddings for each view, following the vision transformer. To provide input to the proposed model, the input image \(x\in {\mathbb {R}}^{H\times W\times C}\) is reshaped into image patches \(x_p\in {\mathbb {R}}^{N\times (P^2\cdot C)}\). The variables H, W, and C indicate the height, width, and number of channels of the image, respectively, and P specifies the width and height of a patch. Each patch is then linearly projected into a vector of length D.

Like the vision transformer, a trainable class token is prepended to create the sequence of embeddings (\(z_0^0=x_{class}\)), and the corresponding output of the transformer (\(z_L^0\)) is denoted y. Moreover, position embeddings are added to the sequence of patches to encode location information.

$$\begin{aligned} z_0=[x_{class};x_p^1E; x_p^2E;...;x_p^NE]+E_{pos}, \end{aligned}$$
(1)

where \(E\in {\mathbb {R}}^{(P^2\cdot C)\times D}\) and \(E_{pos}\in {\mathbb {R}}^{(N+1)\times D}\).
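For concreteness, the patch-embedding step of Eq. (1) can be sketched as follows. The random matrices stand in for the trainable parameters E, \(x_{class}\), and \(E_{pos}\); the function name is illustrative rather than part of the proposed implementation.

```python
import numpy as np

def embed_patches(x, P, D, rng):
    """Sketch of Eq. (1): split an H x W x C image into N = HW/P^2
    patches, project each flattened patch to length D with a matrix E,
    prepend a class token, and add position embeddings. Random
    initialisation stands in for trained parameters."""
    H, W, C = x.shape
    N = (H // P) * (W // P)
    # x -> x_p in R^{N x (P^2 * C)}
    patches = (x.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(N, P * P * C))
    E = rng.standard_normal((P * P * C, D))        # E in R^{(P^2*C) x D}
    x_class = rng.standard_normal((1, D))          # trainable class token
    E_pos = rng.standard_normal((N + 1, D))        # position embeddings
    z0 = np.concatenate([x_class, patches @ E]) + E_pos
    return z0                                      # shape (N + 1, D)
```

An 8 x 8 x 3 input with P = 4 yields N = 4 patches, so the embedded sequence has 5 tokens including the class token.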

Encoder

The previously stated \(z_0\) serves as the input of the proposed transformer. Building upon the research of Vaswani et al.16, the model treats the input patches as individual tokens. Each encoder has L layers, each consisting of a multi-head self-attention (MSA) module and a multi-layer perceptron (MLP) module. A LayerNorm is applied before each module, and a residual connection is employed after each module. A two-layer MLP with a Gaussian error linear unit (GELU), serving as the classification head, is connected to \(z_L^0\).

The MSA module is based on the self-attention (SA) mechanism described in Ref.16. Self-attention quantifies the similarity between a query and its related keys and uses the resulting weights to aggregate the values; the output is thus the weighted sum of all the values. More precisely, the input \(Z\in {\mathbb {R}}^{N\times D}\), consisting of N vectors of length D, is leveraged.

$$\begin{aligned} {[}Q,K,V{]}=ZW_{QKV}, \end{aligned}$$
(2)

where \(W_{QKV}\) denotes the weight matrix, which is updated during training. The attention weights are normalized into the probabilities P with the following function:

$$\begin{aligned} P=softmax(\frac{QK^T}{\sqrt{D}}), \end{aligned}$$
(3)

where D is the length of each vector in Q, K, and V. Finally, the output of the SA mechanism can be mathematically expressed as:

$$\begin{aligned} SA(Z)=PV. \end{aligned}$$
(4)

Furthermore, the MSA mechanism applies the SA mechanism several times in parallel, allowing each head to extract information from the input individually. The result of the MSA is the concatenation of the outputs of all the heads, which is represented as:

$$\begin{aligned} MSA(Z)={[}SA_1(Z);SA_2(Z);...;SA_h(Z){]}W_{MSA}, \end{aligned}$$
(5)

where h denotes the number of heads in the MSA module, and Z represents the feature map.
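Eqs. (2)-(5) can be sketched as follows; the random weight matrices stand in for trained parameters, and per-head dimensionality reduction (used in practice) is omitted for clarity.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z, W_qkv):
    """Eqs. (2)-(4): project Z into Q, K, V, compute the scaled
    dot-product weights P, and return the weighted sum of the values."""
    D = Z.shape[-1]
    Q, K, V = np.split(Z @ W_qkv, 3, axis=-1)   # [Q, K, V] = Z W_QKV
    P = softmax(Q @ K.T / np.sqrt(D))           # Eq. (3)
    return P @ V                                # Eq. (4)

def msa(Z, heads):
    """Eq. (5): run h independent SA heads and mix the concatenation
    with W_MSA. Random matrices stand in for trained weights."""
    rng = np.random.default_rng(0)
    N, D = Z.shape
    outs = [self_attention(Z, rng.standard_normal((D, 3 * D)))
            for _ in range(heads)]
    W_msa = rng.standard_normal((heads * D, D))
    return np.concatenate(outs, axis=-1) @ W_msa
```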

In contrast to the MSA module, this research introduces the multi-view attention (MVA) module, shown in Fig. 2. Each view generates its associated Q, K, and V matrices. Furthermore, in order to record the connections between the two views, the K matrices are exchanged between them. The output of the MVA is the combination of the outputs of both views, which is expressed as:

$$\begin{aligned} & MVA(Z)=<O_{view1},O_{view2}>, \end{aligned}$$
(6)
$$\begin{aligned} & O_{view1}=FC\left( \left( \sigma (Q_{view1}\cdot K_{view1}^T)\bigoplus \sigma (Q_{view1}\cdot K_{view2}^T)\right) \cdot V_{view1}\right) , \end{aligned}$$
(7)
$$\begin{aligned} & O_{view2}=FC\left( \left( \sigma (Q_{view2}\cdot K_{view2}^T)\bigoplus \sigma (Q_{view2}\cdot K_{view1}^T)\right) \cdot V_{view2}\right) , \end{aligned}$$
(8)

where FC denotes a fully connected layer providing the linear operation, \(\sigma\) is the activation function, and \(\bigoplus\) represents the concatenation operation. In addition, \(Q_{view1}\), \(K_{view1}\), \(V_{view1}\) and \(Q_{view2}\), \(K_{view2}\), \(V_{view2}\) represent the Q, K, and V matrices of the two views, respectively.
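The cross-view exchange of K matrices in Eqs. (6)-(8) can be sketched as follows. The parenthesisation of Eqs. (7)-(8) admits more than one reading; this sketch concatenates the self- and cross-attended value tensors and mixes them with the FC layer, which is one consistent interpretation, and random matrices stand in for trained weights.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mva(Z1, Z2, rng):
    """Sketch of Eqs. (6)-(8): each view builds its own Q, K, V; the K
    matrices are exchanged between views so that cross-view relations
    are recorded; the self- and cross-attended values are concatenated
    and mixed by an FC layer (one plausible reading of the equations)."""
    N, D = Z1.shape
    W = {v: rng.standard_normal((D, 3 * D)) for v in (1, 2)}
    Q1, K1, V1 = np.split(Z1 @ W[1], 3, axis=-1)
    Q2, K2, V2 = np.split(Z2 @ W[2], 3, axis=-1)
    W_fc = rng.standard_normal((2 * D, D))      # FC: 2D -> D
    O1 = np.concatenate([softmax(Q1 @ K1.T) @ V1,       # self-attention
                         softmax(Q1 @ K2.T) @ V1],      # cross-view K
                        axis=-1) @ W_fc
    O2 = np.concatenate([softmax(Q2 @ K2.T) @ V2,
                         softmax(Q2 @ K1.T) @ V2],
                        axis=-1) @ W_fc
    return O1, O2
```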

Fig. 2

The self-attention and cross-attention modules provided in the proposed vision transformer.

Furthermore, inspired by the Swin Transformer18, both the regular windowing MSA (W-MSA) and shifted windowing MSA (SW-MSA) modules have been adapted into regular windowing multi-view attention (W-MVA) and shifted windowing multi-view attention (SW-MVA) modules, as seen in Fig. 2. It is important to mention that both the W-MVA and SW-MVA modules (shown in Fig. 3) use the MVA module as the internal attention mechanism.

Fig. 3

The successive W-MVA and SW-MVA modules.

The successive W-MVA and SW-MVA modules can be mathematically formulated as:

$$\begin{aligned} & Z_{l}^{\prime }=\text {W-MVA}(LN(Z_{l-1}))+Z_{l-1}, \end{aligned}$$
(9)
$$\begin{aligned} & Z_{l}=MLP(LN(Z_{l}^{\prime }))+Z_{l}^{\prime }, \end{aligned}$$
(10)
$$\begin{aligned} & Z_{l+1}^{\prime }=\text {SW-MVA}(LN(Z_{l}))+Z_{l}, \end{aligned}$$
(11)
$$\begin{aligned} & Z_{l+1}=MLP(LN(Z_{l+1}^{\prime }))+Z_{l+1}^{\prime }. \end{aligned}$$
(12)

To be specific, the shifted windowing mechanism adopted in the SW-MVA module is illustrated in Fig. 4.

Fig. 4

The diagram of the adopted shifted windowing mechanism. L denotes one specific layer in the proposed vision transformer.
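The window partition and cyclic-shift steps of this mechanism can be sketched as follows; the half-window shift of w // 2 follows the Swin Transformer convention, and the window size in the usage example is an illustrative assumption.

```python
import numpy as np

def window_partition(fmap, w):
    """Split an (H, W) token grid into non-overlapping w x w windows,
    as in the regular windowing (W-MVA) step."""
    H, W = fmap.shape
    return (fmap.reshape(H // w, w, W // w, w)
                .transpose(0, 2, 1, 3)
                .reshape(-1, w, w))

def shifted_windows(fmap, w):
    """Cyclically shift the grid by w // 2 before partitioning, as in
    the SW-MVA step, so that the new windows straddle the previous
    window borders and border tokens can interact."""
    shifted = np.roll(fmap, shift=(-(w // 2), -(w // 2)), axis=(0, 1))
    return window_partition(shifted, w)
```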

Additionally, the vision transformer structure concludes with a linear layer that combines the derived feature maps from both viewpoints.

$$\begin{aligned} y=Linear(LayerNorm{[}(Z_L^0)_{view1}+(Z_L^0)_{view2}{]}), \end{aligned}$$
(13)

where Linear(.) denotes a linear layer, L = 1 or 2, and \((Z_L^0)_{view1}\) and \((Z_L^0)_{view2}\) represent the outputs of the two views, respectively.

It is important to note that, unlike the standard vision transformer, the transformer discussed here receives the sequence of patches from the distinct views individually. More precisely, the position embeddings represent the order of the image patches within each of the two views.

Interpretability of the proposed approach

The proposed model, designed for cropland anomaly detection, commences with a comprehensive feature extraction process over the multi-view drone-captured images. The layers of the proposed model are adept at discerning texture-based features. Accordingly, they can identify fine-scale patterns in crop canopies, such as irregular leaf arrangements or abnormal growth directions, which might be indicative of crop stress or disease. At the classification stage, the model assigns a probability score to each sub-region within the image; this score is based on the aggregated information from the previously extracted features.

Results

Implementation details

In summary, the images from ImageNet-ISLVRC27 were used to pre-train the proposed transformer model. The main configurations consist of RMSprop as the optimizer, a learning rate of 0.002 with a reduction factor of 0.2, and a batch size of 16 images. Training was carried out with PyTorch30 on 4 NVIDIA Tesla V100 GPUs equipped with 64GB of HBM2 memory each.

Firstly, the influence of three factors, namely the number of layers (L), model size (D), and number of heads (h), on the proposed transformer was investigated, using a subset of the whole dataset. The transformer was then pre-trained on the ImageNet-ISLVRC dataset27 using the optimal parameter combination. In addition, the manually collected image samples were exploited to further enhance the transformer via fine-tuning. Comparison tests were conducted between existing deep learning models and the proposed technique; the results indicate that the proposed transformer surpasses the existing state-of-the-art models in terms of performance metrics such as sensitivity, specificity, accuracy, and F1 score. Ultimately, an ablation study was carried out to evaluate the effectiveness of the proposed model. To mitigate overfitting, 10-fold, 15-fold, and 20-fold cross-validation techniques were leveraged in the experiments.
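The k-fold protocol mentioned above can be sketched as follows; the interleaved fold assignment is an assumption, since the paper does not specify how samples are partitioned into folds.

```python
def k_fold_indices(n, k):
    """Partition n sample indices into k folds; fold i serves as the
    validation split while the remaining folds form the training split.
    Interleaved assignment is an illustrative choice."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, val))
    return splits
```

Running with k = 10, 15, and 20 reproduces the three cross-validation settings used in the experiments.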

Loss function

The transformer-based pipeline integrates the losses of the multi-view components into its overall loss function.

$$\begin{aligned} Loss_{model}=Loss_{view1}+Loss_{view2}, \end{aligned}$$
(14)

where \(Loss_{view1}\) and \(Loss_{view2}\) denote the cross-entropy losses of the two views, respectively. By adding a penalty term to the loss function, L2 regularization helps reduce the complexity of the model by shrinking the weights. This also prevents the model from over-emphasizing specific patterns in the training data.
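A minimal sketch of Eq. (14) together with the L2 penalty described above, assuming per-view logits and an integer class label; the penalty coefficient lam is an illustrative assumption, as the paper does not report its value.

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single sample."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def model_loss(logits_v1, logits_v2, label, weights, lam=1e-4):
    """Eq. (14) plus L2 regularization: the cross-entropy losses of the
    two views are summed, and the squared weights are added as a
    penalty term (coefficient lam is an assumption)."""
    l2 = lam * sum((w ** 2).sum() for w in weights)
    return cross_entropy(logits_v1, label) + cross_entropy(logits_v2, label) + l2
```

For two classes with equal logits and no weights, each view contributes log 2, so the total loss is 2 log 2.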

Evaluation metrics

The parameters used for assessment in this research are sensitivity, specificity, accuracy, and F1 score. Specifically, sensitivity measures the model’s ability to identify actual abnormal cropland areas (positive cases). In cropland monitoring, high sensitivity can prevent the omission of key anomalies and reduce crop losses. Specificity is used to evaluate the model’s ability to correctly identify normal cropland areas (negative cases). High specificity can effectively reduce the misjudgment of normal cropland as abnormal and avoid waste of resources. Accuracy reflects the overall correctness of the model’s predictions. It can intuitively demonstrate the model’s comprehensive ability to judge the normal and abnormal states of cropland. F1-score can comprehensively evaluate the model’s performance in detecting anomalies. A high F1-score indicates that the model performs well in both identifying positive cases and controlling false alarms. The metrics used in the trials can be characterized as follows:

  • Sensitivity: The ratio between true positive (TP) cases and \((TP + FN)\), where FN denotes false negatives.

    $$\begin{aligned} Sensitivity=\frac{TP}{TP+FN}. \end{aligned}$$
    (15)
  • Specificity: The ratio between the true negatives (TN) and \((TN+FP)\), where FP denotes false positives.

    $$\begin{aligned} Specificity=\frac{TN}{TN+FP}. \end{aligned}$$
    (16)

    There is a trade-off between sensitivity and specificity. Increasing sensitivity makes the model more likely to judge a sample as positive. Although it can capture more real-world anomalies, it may increase false alarms and reduce specificity. Conversely, increasing specificity makes the model more cautious in judging positive cases, which may miss some real-world anomalies and lead to a decrease in sensitivity.

  • Accuracy:

    $$\begin{aligned} Accuracy=\frac{TP+TN}{TP+FN+TN+FP}. \end{aligned}$$
    (17)
  • F1 score:

    $$\begin{aligned} & F1 =2\times \frac{Precision \times Sensitivity}{Precision+Sensitivity}, \end{aligned}$$
    (18)
    $$\begin{aligned} & Precision=\frac{TP}{TP+FP}. \end{aligned}$$
    (19)

Note that in the cropland anomaly detection task, sensitivity is the most critical metric. Avoiding the omission of abnormal areas is crucial for ensuring crop yields and reducing economic and ecological losses. At the same time, it is also necessary to maintain reasonable specificity to control false alarms.
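Eqs. (15)-(19) can be computed directly from the confusion-matrix counts; the following sketch implements the four evaluation metrics used throughout the experiments.

```python
def metrics(tp, fn, tn, fp):
    """Eqs. (15)-(19): sensitivity, specificity, accuracy, and F1
    computed from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                    # Eq. (15)
    specificity = tn / (tn + fp)                    # Eq. (16)
    accuracy = (tp + tn) / (tp + fn + tn + fp)      # Eq. (17)
    precision = tp / (tp + fp)                      # Eq. (19)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (18)
    return sensitivity, specificity, accuracy, f1
```

For example, TP = 8, FN = 2, TN = 9, FP = 1 gives sensitivity 0.8, specificity 0.9, and accuracy 0.85.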

Ablation study

Given that the suggested model is a hybrid design, the first step was to quantify the disparity between the individual-view and multi-view models. The discrimination performance of the single views was calculated using 30% of the manually gathered dataset. As seen in Fig. 5, the hybrid model outperforms the separate models in terms of sensitivity, specificity, accuracy, and F1 score.

Fig. 5

The comparison between the single view and multi-view architectures on 30% of the presented dataset.

In order to maximize the use of both intra-image and inter-image information, the multi-view transformer was proposed. Furthermore, to obtain a precise classification result, the newly developed loss function was presented. The benefit of the multi-view structure has been shown by the results of both the comparative trials and the ablation studies.

In addition, Table 1 is provided to determine whether removing outliers from the dataset makes any difference for the proposed approach, using 30% of the manually gathered dataset.

Table 1 Comparison of the performance of the proposed approach before and after outlier removal.

Influence of the hyper-parameters on the proposed transformer

Comparative tests were performed on a portion of the gathered images, evaluating different parameter settings to determine the most effective combination of the three parameters for the suggested model. This combination is also expected to provide enhanced classification outcomes on the whole dataset.

As seen in Table 2, successive combinations of these three factors were evaluated. The combinations are identified by the prefix MVVT followed by the hyper-parameter values.

Table 2 The combinations of the 3 parameters in the proposed transformer.

It is important to mention that, at this point, only three parameters were chosen; monitoring more variations would be impractical. The best-performing configuration was MVVT_4_128_8, as seen in the comparison result in Fig. 6.

Fig. 6

The influence of 3 parameters on the proposed transformer.

Furthermore, the comparative tests were performed including the mean square error (MSE) loss, cross entropy (CE) loss, and the suggested loss function (as seen in Table 3).

Table 3 Influence of various loss functions on the proposed approach.

Comparison between the state-of-the-arts and the proposed transformer

The suggested technique demonstrates exceptional performance in terms of sensitivity, specificity, accuracy, and F1 score, as seen in Table 4. This implies that the proposed technique may have greater benefits compared to the present leading methods in the field of anomaly detection.

Table 4 Performance comparison between state-of-the-art techniques and this work in terms of sensitivity (%), specificity (%), accuracy (%), and F1 score (%).

Specifically, the single-view ViT had a sensitivity of 88.5%, specificity of 86.4%, accuracy of 87.1%, and an F1-score of 87.1%. After adopting the Swin Transformer, these metrics increased. The proposed approach (10-fold) achieved a sensitivity of 92.8%, specificity of 93.1%, accuracy of 93.5%, and an F1-score of 94.1%, demonstrating the effectiveness of the proposed method enhancements.

Furthermore, different random initializations of the starting weights were evaluated during the training of the proposed transformer on ImageNet-ISLVRC. The deterministic strategy achieves convergence in fewer than 15 epochs, whereas the stochastic setting requires more than 30 epochs. On the other hand, the transformer trained on ImageNet-ISLVRC has a higher starting value, and the differences in losses follow a regular pattern.

Moreover, the confusion matrices of the anomaly detection task using the competing methods are provided in Table 5. The confusion matrix serves as a critical instrument for visualizing the performance of the competing models.

Table 5 Confusion matrices for the competing methods.

Discussion

This research presents a novel multi-view anomaly detection approach specifically designed for low-altitude circumstances. It represents a further instance of applying a vision transformer-based algorithm in low-altitude settings. The experimental results suggest that the proposed transformer can provide accurate detection results by leveraging its hybrid design.

It is important to mention that the vision transformer acts as the basis of the proposed anomaly detection system. Unlike other anomaly identification approaches published in the literature, the proposed system can incorporate information from both viewpoints of the input image samples. Although the presented technique utilizes an end-to-end learning approach, further improvements are necessary to fully meet the objectives of anomaly detection. By using the attention mechanism common in transformer-based methods, it is feasible to identify the relationships between global pixels in an image captured by UAVs. The experimental results provide evidence that the proposed approach can guarantee the effectiveness of anomaly detection. Consequently, it serves as a useful instrument for cropland monitoring.

Furthermore, compared with traditional multi-view methods based on convolutional neural networks, the vision transformer architecture in this study abandons local convolutional operations and divides the input multi-view images into patches and processes them as sequences. By using the self-attention mechanism, it can directly model the relationships between patches of each view, effectively integrating information from different perspectives and constructing a more comprehensive feature representation. In contrast, CNN-based methods have limitations in capturing long-range dependencies across views. In terms of feature extraction, many similar methods rely on hand-crafted features, while this multi-view vision transformer adopts an end-to-end learning approach to automatically explore complex patterns and relationships in multi-view data. Taking cropland images as an example, traditional methods rely on predefined rules to extract features, while this model can learn subtle features such as the co-occurrence of crop growth patterns in different views through its self-attention layers. In addition, when processing multi-view data, some methods simply concatenate the features of each view at an early stage and then process them jointly. However, this study uses a cross-attention mechanism to explicitly model the interactions between different views, enabling the model to selectively focus on relevant information from each view according to the task requirements. Compared with the early-stage concatenation methods, it can more effectively utilize the complementary information between views and improve the performance of the model.

Moreover, the proposed approach encounters several limitations that warrant attention. Firstly, the computational requirements of the proposed model are substantial. Training on the 6803-frame farmland dataset consumed a significant amount of time on a high-end GPU. This not only hampers practical implementation in real-time cropland monitoring, where quick results are crucial for timely intervention, but also poses challenges for large-scale deployments. In such scenarios, multiple drones may collect data simultaneously, and the high computational load could lead to bottlenecks in data processing and analysis. Secondly, the model’s performance in complex and rare anomaly scenarios is sub-optimal. The misclassification cases show that the proposed model fails to accurately identify such complex situations. This can mainly be attributed to the limited presence of these scenarios in the training data: with only a small proportion of the training data representing such intricate anomalies, the model lacks sufficient exposure to learn and generalize effectively. Accordingly, the advantages and disadvantages of the proposed method are provided in Table 6.

Table 6 Advantages and disadvantages of the proposed method.

Conclusion

This study aimed to address the crucial issue of anomaly detection in cropland monitoring within the low-altitude security domain, leveraging the IoD. This research introduced a novel approach that significantly deviates from existing methods.

In the future, the potential of the proposed framework will be explored. This includes integrating data from other types of sensors, such as thermal, infrared, or LIDAR, to enhance the robustness and accuracy of anomaly detection across diverse environmental conditions. Additionally, efforts will be made to make the vision transformer model more interpretable and explainable, fostering trust and transparency in security-critical applications. Next steps also aim to enhance the deep-learning pipeline to reduce computational complexity and latency, making it more suitable for real-time anomaly detection in resource-constrained environments.