Abstract
Deep learning model optimization has notably enhanced weed identification accuracy. However, there is a shortfall in detailed research on optimizing models for weed identification with images from mobile embedded systems. Moreover, existing methods generally rely on large, slow multi-layer convolutional neural networks (CNNs), which are impractical for mobile embedded devices. To address these issues, we propose a lightweight weed identification model based on an enhanced MobileViT architecture that effectively balances high accuracy with real-time performance. Our approach begins with a multi-scale retinal enhancement algorithm featuring color restoration to preprocess image data. This step improves the clarity of images, particularly those with blurred edges or significant shadow interference. We then introduce an optimized MobileViT model that incorporates the Efficient Channel Attention (ECA) module into the weed feature extraction network. This design ensures robust feature extraction while reducing the model’s parameters and computational complexity. The MobileViT modules within our feature extraction network concurrently learn local and global semantic information, allowing the model to accurately distinguish subtle differences between weeds and crops with a minimal number of modules. In experiments, our model achieved an F1 score of 98.51% and an average identification time of 89 milliseconds per image. These results underscore its suitability for lightweight deployment, maintaining high accuracy while minimizing model complexity.
Introduction
Weeds proliferate rapidly and have short growth cycles, competing with crops in their early growth stages for light, water, and nutrients. If not controlled in a timely manner, they can severely impact crop yield and quality1,2. Automatic weed identification methods based on computer vision can provide accurate information on field weed distribution, which is essential for implementing automated and precise weeding operations. Earlier automatic weed identification methods relied primarily on manually designed features such as shape and texture, combined with support vector machines. More recently, sensor-based real-time weed identification has become common in advanced systems, where target identification models built on CNN technology can be deployed to identify weeds in real-time settings3,4.
In the interdisciplinary field of computer vision and agricultural automation, CNNs have demonstrated exceptional performance in weed identification tasks. For instance, weed identification has employed advanced deep learning models such as InceptionV3 and ResNet-505, providing a robust foundation for multi-class weed identification. Furthermore, by incorporating channel attention mechanisms and DropBlock6 regularization modules on top of the DenseNet7 architecture, researchers have achieved an average accuracy of 98.63% in identifying corn seedlings and accompanying weeds, showcasing the potential of deep learning technologies for weed identification in smart agriculture8.
Despite the impressive achievements of deep learning in weed identification, current research still faces three primary challenges. Firstly, existing studies predominantly focus on the design and optimization of model frameworks, with relatively little research on their application in real agricultural settings. Specifically, studies on real-time weed identification and removal using mobile embedded devices (such as weeding robots) in the field are notably scarce. As a result, laboratory-developed models suffer significant performance degradation when applied in real-world scenarios. The complexity of farm environments, including changes in lighting, weed occlusion, and variations in growth stages, demands higher generalization capability from the models9,10,11. Secondly, the core processing units of mobile embedded devices such as weeding robots are often constrained by computational power and storage capacity. While existing high-performance CNN models, such as multi-layer deep CNNs, excel in feature extraction and target identification, their large model size, massive parameter count, and slow inference speeds make them challenging to deploy directly on resource-limited embedded systems or mobile devices12,13,14,15. Lastly, real-time weed identification and processing are critical in actual agricultural production. Although CNN-based object identification models are theoretically suitable for real-time weed identification, deploying dense CNN structures on mobile embedded devices faces severe challenges: model inference often incurs significant computational delays, which not only impair the system’s real-time responsiveness but may also cause decision-making delays in weeding robots, thereby reducing operational efficiency16,17,18,19.
To address the challenge of balancing high accuracy with real-time performance in weed identification, we propose a lightweight method based on an enhanced MobileViT model. This approach incorporates several innovative technologies to surpass the limitations of existing techniques in practical applications. Firstly, we employ a multi-scale retinal enhancement algorithm with color restoration for feature enhancement, coupled with diverse data augmentation strategies including rotation, cropping, brightness adjustment, and noise injection. These techniques not only improve image quality but also preserve crucial color information, significantly enhancing the model’s generalization capability and identification accuracy.
Secondly, we optimize the MobileViT model for mobile embedded devices, combining the local feature extraction capabilities of CNNs20 with the global modeling advantages of ViT. This includes incorporating a self-attention mechanism to facilitate high-quality feature learning under resource-constrained conditions. This design substantially reduces the reliance on large-scale datasets, making it more suitable for weed identification tasks with limited data availability. Additionally, we design a hybrid network structure that integrates CNN and MobileViT modules, utilizing the ECA module to enhance focus on key locations in feature maps. The final loss function is employed for model parameter optimization, enhancing the model’s ability to discern subtle differences in weed images while reducing model parameters and computational complexity. This innovative architectural design effectively learns fine-grained features in weed images, achieving real-time identification performance while maintaining high accuracy.
Finally, experimental results demonstrate that this method exhibits superior performance on multiple public weed datasets, outperforming existing methods in terms of accuracy, inference speed, and resource consumption. We significantly improve the model’s deployment efficiency and real-time performance in actual agricultural environments, effectively reducing the model’s computational complexity and storage requirements while maintaining high identification accuracy.
In summary, the major contributions of this work are as follows:
- We propose a novel combination of a multi-scale retinal enhancement algorithm with color restoration for feature enhancement, coupled with diverse data augmentation strategies. This approach significantly improves image quality, preserves crucial color information, and enhances the model’s generalization capability and identification accuracy.
- We design an enhanced MobileViT model specifically optimized for mobile embedded devices. This design combines the strengths of CNNs and ViTs, incorporating a self-attention mechanism and an ECA module. The hybrid network structure effectively learns fine-grained features in weed images, achieving high-quality feature learning under resource-constrained conditions while reducing dependence on large-scale datasets.
- In experiments, our model performs strongly on public weed datasets, outperforming existing methods in terms of accuracy, inference speed, and resource consumption. It significantly improves deployment efficiency and real-time performance in actual agricultural environments, effectively reducing computational complexity and storage requirements while maintaining high identification accuracy.
The rest of this paper is organized as follows. Section “Related work” summarizes related work. Section “Proposed method” elaborates on the overall system design of the weed identification model. Section “Experimental results and analysis” presents comparative experiments on the identification models. The last section concludes the work.
Related work
Weed management in the field can be accomplished using weeding robots, but the core processing units of these robots have limited computational and storage resources. Further research is needed to reduce model parameters and computational complexity while maintaining high identification accuracy for crop seedlings and weeds, thereby improving weed identification speed.
The challenge of weed identification has attracted the attention of several researchers. Studies21,22 have employed CNNs to establish deep learning-based weed identification models, with improved models achieving mean Average Precision (mAP) scores exceeding 74%. However, most of these studies utilize multi-layer deep CNNs for feature extraction and object identification, resulting in models that are large, parameter-heavy, and slow, making them difficult to deploy on small mobile devices.
Researchers have begun to focus on lightweight identification models for object identification. Wang et al.23 designed a lightweight YOLOv4-tiny model, which demonstrated over 80% speed improvement compared to the YOLOv4 model when tested on GPUs. Zeng et al.24 addressed the issues of low identification accuracy, poor real-time performance, and robustness in corn field weed identification by constructing an SSD weed identification model using lightweight CNNs combined with feature layer fusion mechanisms, achieving an mAP of 88.27% and an identification speed of 32.26 f/s. Wang et al.25 addressed the issues of low identification accuracy and slow identification speed in natural field environments by constructing a weed identification model based on lightweight CNNs using the Xception CNN as a foundation, achieving an mAP of 98.63% with a memory footprint of 83.5 MB. Li et al.26 tackled the problem of high parameter count and computational complexity in fruit identification algorithms for apple-picking robots by using MobileNetv3 as the backbone network to create a lightweight YOLOv4 weed identification model, achieving an mAP of 92.23% and an identification speed of 15.11 f/s on embedded platforms. Rai et al.12 developed a multi-channel depth-wise separable convolution model based on depth-wise separable convolutions and residual blocks for fast and accurate identification of sugar beets and weeds. The model achieved an average identification accuracy of 87.58% with a speed of 42.064 frames per second (f/s). Yang et al.14 proposed a model compression method based on the SENet attention mechanism and dynamic sparse constraints, validating it on the VGG16 model using the classic CIFAR10 multi-classification dataset, resulting in a 43.97% reduction in parameters with only a 0.91% point decrease in average accuracy. Currently, there is still room for improvement in the lightweight design of weed identification models, and the balance between identification accuracy and speed requires further investigation.
In recent years, Vision Transformers have demonstrated superior performance compared to CNNs across various visual tasks27,28,29. Vision Transformers apply self-attention mechanisms directly to sequences of image patches, effectively capturing important regions within images30. Compared to CNNs, they can learn richer semantic information. Owing to their excellent performance, Vision Transformers have also garnered widespread attention in the agricultural domain. Zhang et al.31 combined Vision Transformers with CNNs, adopting a dual-branch structure to extract global and local features separately, achieving effective disease identification in apple leaves. Other researchers have applied Vision Transformers to weed identification. Wang et al.10 proposed a weed identification method based on a shifted-window Transformer network, utilizing an improved Swin Transformer as the backbone to recognize corn and weed targets under overlapping and occlusion conditions, achieving fine-grained segmentation of corn and weeds. While these studies have demonstrated excellent identification accuracy, the introduction of self-attention mechanisms leads to substantial computational requirements and a need for large-scale training data, which results in longer training times and increased computational resources. Moreover, the identification speed during actual deployment is relatively slow, failing to meet the real-time requirements of field weed identification.
To address these issues, we propose a lightweight field weed identification method based on an improved MobileViT32 model. This method effectively reduces the model’s computational complexity and storage requirements while maintaining high identification accuracy. The approach aims to strike a balance between model efficiency and performance, making it more suitable for real-time applications in agricultural settings. By leveraging the strengths of Vision Transformers and addressing their limitations, this research contributes to the ongoing efforts to develop efficient and accurate weed identification systems for precision agriculture. The proposed method has the potential to enhance the practicality of automated weed management systems, enabling more effective and timely interventions in crop fields.
Proposed method
Focusing on corn seedlings and various small-target weeds, we devise a lightweight weed identification method based on an improved MobileViT model to address the high parameter counts, large model sizes, and slow identification speeds of existing weed identification models in agricultural environments.
Initially, the method employs multi-scale retinal enhancement with color restoration for image preprocessing, preventing model overfitting during training and increasing the diversity and quantity of the data sample set. Subsequently, a hybrid structure combining MobileViT modules and convolutional layers serves as the weed feature extraction network. The MobileViT module, incorporating a self-attention mechanism, models the long-distance semantic information in images of weeds and corn seedlings to capture more discriminative fine-grained features. Standard convolution and depthwise separable convolution are utilized to learn local information while forming multi-scale features through feature map downsampling.
An ECA module is then used to further enhance focus on key positions in the feature maps, and the final loss function optimizes model parameters. The classification layer is responsible for outputting the predicted categories of weeds. Building on parameter adjustments in the original MobileViT model, we leverage the ECA mechanism to further enhance the model’s identification capabilities, thereby better balancing identification accuracy and speed.
Fig. 1. A lightweight weed identification model based on an improved MobileViT model. The enhanced images are processed through two convolutional operations to generate feature maps. These are then processed by a Transformer module to obtain a global feature sequence. The weed feature extraction network, incorporating ECA across five stages, achieves a balance between identification accuracy and processing speed.
From Fig. 1, the MobileViT module is an innovative lightweight network architecture that seamlessly integrates traditional convolution operations with the Transformer mechanism to elegantly process both local and global image information. This structure is particularly suitable for resource-constrained environments, such as image recognition tasks on mobile devices or embedded systems.
Initially, the module receives a three-dimensional input feature map \(X_f\) with dimensions \(H \times W \times C\), representing the height, width, and channel count of the image. It first captures the image’s details and local features using a \(3 \times 3\) convolution kernel. Subsequently, these features are mapped to a higher-dimensional feature space using a \(1 \times 1\) convolution kernel, enhancing the semantic associations between features.
The processed feature map \(X_{fl}\) is then divided into multiple small blocks, each containing \(P = w \times h\) pixels. These blocks are unfolded into feature sequences \(X_o\) with dimensions \(P \times N \times d\) to facilitate learning by the Transformer module. In this step, the Transformer module processes these sequences to learn global dependencies between image blocks, thereby generating a global feature sequence \(X_G\).
Following this, \(X_G\) is refolded into a new feature map \(X_{GF}\), which retains the spatial dimensions of the input feature map \(X_f\) but is enriched with global semantic information. This feature map undergoes a \(1 \times 1\) convolution to match the channel count of the original input feature map, and it is then concatenated with \(X_f\) to form a feature map with doubled channel count. Finally, these features are merged through a \(3 \times 3\) convolution kernel, and the output feature map’s dimensions are mapped back to the original channel count \(C\).
Moreover, the design of the MobileViT module allows for the computation of self-attention without the use of positional encodings, as it preserves the positional information within and between image blocks. This characteristic enables the module to process image data more efficiently, excelling in capturing local textures as well as understanding the overall layout of the image. Thus, the MobileViT module not only enhances the performance of image processing tasks but also significantly reduces computational complexity, making it an ideal choice for resource-limited environments.
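To make the data flow concrete, the following PyTorch sketch assembles the block described above: a 3×3 local convolution, a 1×1 projection to d dimensions, unfolding into P-pixel patches, a stack of Transformer layers, folding back, a 1×1 projection to C channels, and a 3×3 fusion after concatenation with the input. This is a minimal illustration rather than the authors’ released code; the hidden width d, depth, and head count are assumed values.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Minimal sketch of the MobileViT block described above (not the authors' released code).
    Patch size h = w = 2; the hidden width d, depth and head count are assumed values."""
    def __init__(self, channels: int, d: int = 96, depth: int = 2, patch: int = 2, heads: int = 4):
        super().__init__()
        self.patch = patch
        # Local representation: 3x3 conv for spatial detail, then 1x1 projection to d dimensions.
        self.local_3x3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.proj_in = nn.Conv2d(channels, d, 1)
        # Global representation: stacked Transformer layers over per-position patch sequences.
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, dim_feedforward=2 * d, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Fusion: project back to C, concatenate with the input (2C), fuse with a 3x3 conv.
        self.proj_out = nn.Conv2d(d, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.patch
        y = self.proj_in(self.local_3x3(x))                  # (B, d, H, W)
        # Unfold into N = HW/P patches of P = p*p pixels: sequence shape (B*P, N, d).
        y = y.reshape(b, -1, h // p, p, w // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(b * p * p, (h // p) * (w // p), -1)
        y = self.transformer(y)                              # global dependencies across patches
        # Fold back to a feature map X_GF with the original spatial size.
        y = y.reshape(b, p, p, h // p, w // p, -1).permute(0, 5, 3, 1, 4, 2).reshape(b, -1, h, w)
        y = self.proj_out(y)                                 # back to C channels
        return self.fuse(torch.cat([x, y], dim=1))           # concat with X_f, 3x3 fusion

# Example: a 64x64x16 feature map keeps its shape through the block.
print(MobileViTBlockSketch(channels=16)(torch.randn(1, 16, 64, 64)).shape)  # torch.Size([1, 16, 64, 64])
```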
Image enhancement
We utilize the public weed dataset CornWeed33 for model training and evaluation to validate the effectiveness of the proposed lightweight weed identification method. The CornWeed dataset contains 5,998 images spanning five categories, covering corn seedlings and their primary accompanying weeds (sedges, goosefoot, spiny amaranth, and foxtail). These images were collected under varying conditions of time, lighting, and soil environments, reflecting the complex backgrounds of agricultural fields. Notably, certain categories, such as foxtail, corn seedlings, and sedges, exhibit significant morphological similarities, which increases the difficulty of the classification task.
We conduct field data collection, ensuring the authenticity and accuracy of the experiments. The collection system was mounted on a tracked robot, with an STM32 microcontroller as the core processor. The system was equipped with a camera module for capturing plant images, as well as temperature, humidity, and light sensors for collecting environmental data. All data were transmitted in real-time to a self-hosted server via a LoRa communication module.
From Fig. 2, among the 2,560 collected images, quality issues were identified: 425 images affected by shadow occlusion, 262 images with excessive brightness, and 348 images with blurred edge details.
We present a Multi-Scale Retinal Enhancement algorithm with Color Restoration (MSRECR) designed to address image quality issues in agricultural weed recognition. The algorithm employs a five-level Gaussian pyramid \((\sigma_i = \sigma_o \cdot 2^i,\ \sigma_o = 1.0)\) for multi-scale decomposition and calculates center-surround differences \(D_i(x,y) = G_i(x,y) - G_{i+s}(x,y)\). Multi-scale responses are fused using the weighting coefficients \(w = [0.5, 0.25, 0.125, 0.0625]\) to produce an enhancement map \(E(x,y) = L(x,y) + \alpha \cdot R(x,y) \cdot (1 - \gamma \cdot L(x,y))\), where \(\alpha \in [0.3, 0.7]\) controls the enhancement intensity and \(\gamma = 0.5\) adjusts local contrast.
During the color restoration process, we introduce a color restoration coefficient \(\beta \in [0.2, 0.4]\) and a chroma preservation factor \(K = 1.2\). The color restoration gain is calculated as \(C(x,y) = \beta \cdot S(x,y) / (S(x,y) + K)\). Depending on the quality defect, the algorithm dynamically adjusts its parameters: higher \(\alpha\) values (0.65) for shadowed areas, reduced \(\alpha\) values (0.35) for overexposed areas, and increased high-frequency weights (0.65) for edge-blurred areas.
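As a concrete reference, the following NumPy/SciPy sketch implements the MSRECR pipeline along the lines described above. The luminance and chroma definitions (L, S), the surround offset s = 1, and the way the final gain is applied are assumptions made for illustration, since the paper does not publish its implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def msrecr_sketch(img, alpha=0.5, gamma=0.5, beta=0.3, K=1.2):
    """Simplified MSRECR pass over a float32 RGB image in [0, 1]. The luminance/chroma
    definitions, the surround offset s = 1 and the gain application are assumptions."""
    L = img.mean(axis=2)                                  # assumed luminance channel
    sigmas = [1.0 * 2 ** i for i in range(5)]             # five-level pyramid: sigma_i = sigma_o * 2**i
    G = [gaussian_filter(L, s) for s in sigmas]
    weights = [0.5, 0.25, 0.125, 0.0625]
    # Center-surround differences D_i = G_i - G_{i+1}, fused with the fixed weights.
    R = sum(w * (G[i] - G[i + 1]) for i, w in enumerate(weights))
    E = L + alpha * R * (1.0 - gamma * L)                 # enhancement map E = L + a*R*(1 - g*L)
    S = img.max(axis=2) - img.min(axis=2)                 # simple chromatic signal (assumption)
    C = beta * S / (S + K)                                # color restoration gain C = b*S/(S + K)
    gain = (E / np.clip(L, 1e-6, None))[..., None] * (1.0 + C[..., None])
    return np.clip(img * gain, 0.0, 1.0)

# Example: stronger enhancement (alpha = 0.65) for a shadow-dominated image.
enhanced = msrecr_sketch(np.random.rand(256, 256, 3).astype(np.float32), alpha=0.65)
```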
Quantitative assessments demonstrate that the algorithm improves the Structural Similarity Index by 18.3% on the 425 shadow-obscured images, enhances the Peak Signal-to-Noise Ratio by 4.2 dB on the 262 overexposed images, and increases Gradient Magnitude Similarity by 22.5% on the 348 edge-blurred images. Overall, the MSRECR algorithm raises the feature matching success rate from 67.8% to 91.4%, providing more robust feature representations for subsequent weed recognition models.
By combining data augmentation techniques such as rotation (±30°), cropping (0.8–0.95 times), brightness adjustment (±0.15), and Gaussian noise (\(\sigma = 0.01\)), the dataset was expanded from 2,560 to 4,000 images, significantly mitigating overfitting during model training and increasing validation-set accuracy by 6.2 percentage points. Experiments show that this method effectively enhances image quality in complex agricultural environments, laying a solid foundation for precise weed recognition.
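A torchvision pipeline with the quoted augmentation settings might look as follows; treating the crop factor as an area fraction and the composition order shown here are assumptions.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise with standard deviation sigma to a tensor image."""
    def __init__(self, sigma=0.01):
        self.sigma = sigma
    def __call__(self, x):
        return (x + torch.randn_like(x) * self.sigma).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                 # rotation within +/- 30 degrees
    transforms.RandomResizedCrop(256, scale=(0.8, 0.95)),  # cropping to 0.8-0.95 of the image
    transforms.ColorJitter(brightness=0.15),               # brightness adjustment of +/- 0.15
    transforms.ToTensor(),
    AddGaussianNoise(sigma=0.01),                          # Gaussian noise with sigma = 0.01
])
# Applied to each PIL image when building the expanded training set.
```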
To enhance data quality and model performance, a multi-scale retinal enhancement algorithm with color restoration was used to process low-quality images. Additionally, data augmentation techniques such as rotation, cropping, brightness adjustment, and noise addition were employed to expand the dataset to 4,000 images. These measures not only improved the clarity and recognizability of image features but also increased the quantity and diversity of the samples, helping to prevent model overfitting, as shown in Fig. 2. Regarding dataset division, an 80% training and 20% testing split was used. It is important to emphasize that the distribution ratio of original and feature-enhanced images was maintained consistently in both the training and testing sets. Through this rigorous data handling and division method, we constructed an experimental framework that both reflects the complexity of real agricultural environments and fully assesses model performance. This provides a reliable data foundation for validating the effectiveness of the lightweight weed identification method, aiding in a more accurate evaluation of the model’s performance in practical application scenarios.
Weed identification based on the MobileViT model
From Fig. 1, this paper denotes the output vector given by the teacher model \(Model_t\) just before the Softmax function in the output layer as \(z_t\), where \(\widehat{y_t} = Softmax(z_t)\) represents the inferred probability distribution of input sample classes based on the model. Similarly, the output vector of the student model is denoted as \(z_s\), with the predicted probability distribution being \(\widehat{y_s}\). Therefore, the loss function used during the training process of the student model deployed at node k, \(Model_s^k\), is defined as follows.
The MobileViT module combines standard convolution with the Transformer mechanism to learn both local and global information within feature maps. The self-attention mechanism, in particular, enables high-quality learning under resource-limited conditions. This module forms the core of the lightweight weed identification method proposed in this paper, and its structure is illustrated in Fig. 1. Assuming the input feature map \(X_f\) of the MobileViT module has dimensions \(H \times W \times C\) (where \(H\) is the height, \(W\) is the width, and \(C\) is the number of channels), a \(3 \times 3\) convolution kernel is used to model the local spatial information in the feature map. Subsequently, a \(1 \times 1\) convolution maps the feature map to a higher \(d\)-dimensional feature space, enriching the semantic information learned by the convolution.
After two convolution operations, the input feature map \(X_f\) is transformed into a local feature map \(X_{fl}\) of the same size. Next, \(X_{fl}\) is divided into \(N\) equal-sized image blocks, each containing \(P\) pixels. These are then unfolded into a sequence of features \(X_o\) of size \(P \times N \times d\) to learn the global semantic information within the feature map. Here, \(P = w \times h\) and \(N = (H \times W)/P\), where \(w\) and \(h\) represent the preset width and height of the image blocks and \(d\) represents the feature dimension. Within \(X_o\), features at the same position across different image blocks are processed through a series of \(L\) Transformer modules, ultimately yielding a global feature sequence \(X_G(p) = Tr(X_o(p)),\ 1 \le p \le P\), where \(Tr(\cdot)\) represents the Transformer model function.
The global feature sequence \(X_G\) obtained after processing through the Transformer module has dimensions \(P \times N \times d\), where \(X_G(p)\) represents the features at the \(p\)-th position within each image block. Unlike the original Vision Transformer, MobileViT preserves the positional information within and between image blocks; thus, positional encoding is not required when calculating self-attention. Subsequently, \(X_G\) is folded to produce the feature map \(X_{GF}\), with dimensions \(H \times W \times d\), where \(H, W\) are the same as those of the input feature map \(X_f\). The unfolding and folding operations are implemented through a combination of Transpose and Reshape functions. Then, \(X_{GF}\) is mapped to the same dimension \(C\) as the input feature map \(X_f\) of the MobileViT module using a \(1 \times 1\) convolution. At this point, \(X_{GF}\) has dimensions \(H \times W \times C\) and can be concatenated with the input feature map \(X_f\) to form a new feature map with dimension \(2C\). Finally, a \(3 \times 3\) convolution kernel is used to fuse the concatenated feature map and map its dimensions back to \(C\).
Here, \(X_o\) represents the local semantic information of the \(3 \times 3\) area covered by the convolution, and \(X_G(p)\) encodes the global semantic information of the \(p\)-th position across different image blocks, so that each pixel can encode information from all pixels in \(X_f\). We set \(h = w = 2\) to ensure that the effective receptive field of the MobileViT module covers the spatial resolution \(H \times W\) of the input feature map.
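The unfolding and folding mentioned above reduce to a fixed combination of reshape and transpose operations. The sketch below, assuming patch size h = w = 2, shows one possible implementation and verifies that folding exactly inverts unfolding.

```python
import torch

def unfold_patches(x, p=2):
    """Unfold a (B, d, H, W) map into per-position patch sequences of shape (B*P, N, d),
    with P = p*p and N = (H*W)/P, using only reshape and transpose (permute)."""
    b, d, h, w = x.shape
    x = x.reshape(b, d, h // p, p, w // p, p)            # split H, W into patch grid + in-patch offsets
    x = x.permute(0, 3, 5, 2, 4, 1)                      # (B, p, p, H/p, W/p, d)
    return x.reshape(b * p * p, (h // p) * (w // p), d)  # (B*P, N, d)

def fold_patches(seq, b, d, h, w, p=2):
    """Inverse of unfold_patches: (B*P, N, d) back to (B, d, H, W)."""
    x = seq.reshape(b, p, p, h // p, w // p, d)
    x = x.permute(0, 5, 3, 1, 4, 2)                      # (B, d, H/p, p, W/p, p)
    return x.reshape(b, d, h, w)

x = torch.randn(1, 96, 64, 64)
assert torch.equal(fold_patches(unfold_patches(x), 1, 96, 64, 64), x)  # folding inverts unfolding
```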
Lightweight design of the weed identification model
The weed feature extraction network used in this paper is based on the original MobileViT architecture and comprises five stages, as illustrated in Figs. 1 and 3. The algorithm takes an RGB three-channel image as input.
We propose a weed feature extraction network based on the enhanced MobileViT architecture, which improves the network’s adaptability and efficiency for weed recognition tasks through the integration of the Efficient Channel Attention (ECA) module. This network structure is designed with a particular focus on efficient image processing in resource-constrained environments.
In the network design, this study innovatively embeds the ECA module within the foundational architecture of MobileViT and adjusts the convolution stride and kernel size of the initial layers to better handle large-sized input images. These improvements allow the network to reduce information loss during preliminary feature extraction, while the introduction of the ECA module further strengthens the expression of these features. The ECA module enhances the features of key channels, improving the network’s focus and recognition accuracy.
In Stage 1, a \(4 \times 4\) convolution with a stride of 4 downsamples the input image of size \(256 \times 256 \times 3\) to a feature map of size \(64 \times 64 \times 16\) to facilitate subsequent computations. Then, an ECA module is employed to enhance the feature map; unlike the original MobileViT structure, this paper does not use a \(3 \times 3\) convolution with stride 2, because a larger convolution kernel and stride better accommodate the redundancy in the mapping from image to feature maps. The structure of the ECA module used in this paper is shown in Fig. 3, where the ECA module enhances key features through interactions among different channels of the feature map.
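For reference, a standard ECA block of the kind embedded after Stage 1 can be sketched as follows; the adaptive kernel-size rule follows the original ECA-Net formulation and is assumed here rather than quoted from the paper.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: re-weight channels with a k-sized 1D convolution over the
    globally pooled channel descriptor (kernel-size rule taken from the ECA-Net paper)."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))   # adaptive, odd kernel size
        k = k if k % 2 else k + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                 # x: (B, C, H, W)
        y = self.pool(x).squeeze(-1).transpose(1, 2)      # (B, 1, C) channel descriptor
        y = torch.sigmoid(self.conv(y))                   # local cross-channel interaction
        return x * y.transpose(1, 2).unsqueeze(-1)        # channel re-weighting

# Example: enhance the 64x64x16 feature map produced by Stage 1.
print(ECA(16)(torch.randn(1, 16, 64, 64)).shape)  # torch.Size([1, 16, 64, 64])
```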
In Stage 2, the feature map is fed into two stacked MobileNetv2 modules, which further extract features while downsampling. As in Stage 1, Stage 2 also employs an ECA module to enhance the downsampled feature map.
Stage 3 consists of multiple MobileViT modules and downsampling MobileNetv2 modules. The MobileViT modules are responsible for capturing global semantic features while learning local semantic features. The MobileNetv2 modules downsample the feature map and increase its channel dimension to form a multi-scale feature representation. Stages 4 and 5 function similarly to Stage 3 but differ in the internal parameter settings of the MobileViT modules and the number of MobileNetv2 modules. The internal parameters of the five stages of the improved MobileViT feature extraction network are shown in Fig. 1.
Experimental results and analysis
From Fig. 4, the algorithm proposed in this paper was executed in an Ubuntu/Linux 18.04 operating system environment, using code written in Python 3.10; the deep neural network models were constructed with the PyTorch 1.13 deep learning framework. Both model development and preliminary experiments were carried out on a computing platform equipped with two Nvidia RTX 3090Ti graphics cards, an 8-core Intel Core i7-10700F processor running at 2.9 GHz, and 32 GB of RAM, as shown in Fig. 4B. To validate the lightweight approach of this paper, a Raspberry Pi 4B (4 GB) was selected as the embedded test platform for weed identification, as shown in Fig. 4A.
The hyperparameter settings for model training are as follows: the batch size was set to 64. When training the CNN models, the optimizer used was Stochastic Gradient Descent (SGD) with a momentum of 0.9. For training the proposed method and MobileViT, the AdamW optimizer was employed for 350 rounds of learning. The initial learning rate was set to 0.001 for all models, with a cosine-annealing learning-rate decay strategy. The maximum number of iterations for training all network models was 50.
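A minimal training loop reflecting these settings is sketched below; the stand-in model, the synthetic data loader, and the cosine-annealing period T_max = 50 are assumptions for illustration only.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data so the loop runs; swap in the enhanced MobileViT and the CornWeed loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, 5))
train_loader = DataLoader(TensorDataset(torch.randn(64, 3, 256, 256),
                                        torch.randint(0, 5, (64,))), batch_size=64)

optimizer = AdamW(model.parameters(), lr=1e-3)            # AdamW at the initial learning rate 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=50)        # cosine-annealing decay (T_max assumed)
criterion = nn.CrossEntropyLoss()

for epoch in range(350):                                  # 350 rounds of learning
    for images, labels in train_loader:                   # batch size 64
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```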
Evaluation metrics
We employ four evaluation metrics: accuracy, precision, recall, and F1 score. To assess the model’s performance in practical deployment, inference time, measured in milliseconds (ms), is also used as a performance metric. Inference time refers to the duration required for the network model to predict a single image on an embedded platform.
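These metrics and the per-image inference time can be computed as sketched below; macro averaging over the five classes is an assumption, as the paper does not state the averaging mode.

```python
import time
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(model, loader, device="cpu"):
    """Return accuracy, precision, recall, F1 (macro over classes, an assumption)
    and the mean per-image inference time in milliseconds."""
    model.eval().to(device)
    preds, labels, elapsed, n_images = [], [], 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            start = time.perf_counter()
            out = model(x.to(device))                     # forward pass timed per batch
            elapsed += time.perf_counter() - start
            n_images += x.size(0)
            preds += out.argmax(1).cpu().tolist()
            labels += y.tolist()
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds, average="macro", zero_division=0),
        "recall": recall_score(labels, preds, average="macro", zero_division=0),
        "f1": f1_score(labels, preds, average="macro", zero_division=0),
        "ms_per_image": 1000.0 * elapsed / max(n_images, 1),
    }
```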
Loss function
We utilize a streamlined classification layer comprising convolutional, pooling, and fully connected layers to differentiate among the various weed types. As illustrated in Fig. 4, the process begins by increasing the channel dimension of the input feature map through a 1 × 1 convolution. Subsequently, a feature encoding of length 384 is derived via global average pooling, which is then forwarded to a linear classification layer for final categorization. The lightweight weed identification model’s parameters are optimized using the cross-entropy loss function \(L(\cdot)\), expressed as \(L(X, Y) = -\frac{1}{n}\sum_{i=1}^{n} y_i \log M(x_i)\), where \(X\) is the set of all samples in the training set, \(Y\) is the set of all true labels of those samples, \(n\) is the total number of samples in the dataset, and \(M(x_i)\) is the network output for the \(i\)-th sample in the training set.
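The classification head and loss described above can be sketched as follows; the number of input channels (128) is an illustrative assumption, while the 384-dimensional encoding, global average pooling, linear classifier, and cross-entropy loss follow the text.

```python
import torch
from torch import nn

class ClassificationHead(nn.Module):
    """Sketch of the classification layer: 1x1 conv to widen channels, global average pooling
    to a 384-d encoding, then a linear classifier producing the class logits M(x_i)."""
    def __init__(self, in_channels, num_classes=5, encoding_dim=384):
        super().__init__()
        self.expand = nn.Conv2d(in_channels, encoding_dim, kernel_size=1)  # 1x1 channel expansion
        self.pool = nn.AdaptiveAvgPool2d(1)                                # global average pooling
        self.fc = nn.Linear(encoding_dim, num_classes)                     # linear classification layer

    def forward(self, x):                          # x: final feature map (B, C, H, W)
        x = self.pool(self.expand(x)).flatten(1)   # (B, 384) feature encoding
        return self.fc(x)

head = ClassificationHead(in_channels=128)         # in_channels = 128 is illustrative
logits = head(torch.randn(4, 128, 8, 8))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 5, (4,)))  # cross-entropy loss L()
```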
Evaluation of the color restoration multi-scale retinal enhancement algorithm
To evaluate the effectiveness of algorithmic preprocessing on weed identification, the test dataset included images featuring two weed species together, weeds of similar height to young corn plants, and smaller weeds coexisting with corn seedlings. Figure 2 presents the identification results of the two image sets using YOLOv5s, where the original-image panel of Fig. 2 depicts the outcomes for original images processed by MobileViT, and the enhanced-image panel shows the results for images preprocessed with the color restoration multi-scale retinal enhancement algorithm, also recognized by MobileViT. The results reveal that on the original images the MobileViT weed identification model exhibited missed identifications, with prediction-box confidence levels being relatively low, ranging from 0.64 to 0.79. Conversely, the enhanced images eliminated missed identifications, and the identification model showed higher confidence in prediction boxes for both corn seedlings and weed targets, achieving confidence levels between 0.64 and 0.97. These findings demonstrate that preprocessing images with the color restoration multi-scale retinal enhancement algorithm significantly improves the identification performance of the weed identification model.
Baseline model performance comparison
To validate the performance advantages of the lightweight weed identification method proposed in this paper, several CNN models were trained, including VGG-1634, ResNet505, and DenseNet-16135, which have shown excellent performance in previous weed identification research, as well as MobileNetv136, MobileNetv237, MobileNetv338, MobileNetV3-Large38, EfficientNet-Lite39, GhostNet40 and ShuffleNet41, which have demonstrated superior performance in lightweight image identification tasks. In the comparative experiments, all CNN models were fine-tuned using pre-trained models from the ImageNet dataset. Since the method proposed in this paper involves modifications to the original MobileViT, no pre-trained model was available; therefore, training started from scratch using randomly initialized model parameters. The model with the highest identification accuracy on the test set was selected as the final model. The comparative experimental results are shown in Table 1.
Table 1 indicates that in the weed identification task, which involves high inter-class morphological similarity, the identification accuracy of generic CNNs is significantly higher than that of lightweight CNNs. The method proposed in this paper, which incorporates global semantic information learning, outperforms all CNN models in identification accuracy, precision, recall, and F1-score. Its identification accuracy reaches 98.56%, approximately 0.8 percentage points higher than the generic CNN DenseNet-16135 and 1.56 percentage points higher than the lightweight CNN MobileNetv2, demonstrating a clear advantage in weed identification performance.
The identification accuracy of MobileNetv3 is only 87.90%, significantly lower than that of MobileNetv2. This may be attributed to the fact that its structure was optimized through neural architecture search on the ImageNet dataset, which may not be well-suited for the weed identification task in this paper.
From Table 1, our proposed method significantly reduced the misidentification rate among corn seedlings, sedge, and prickly sida. These results demonstrate that the proposed method can effectively learn fine-grained features with stronger discriminative capabilities, achieving higher identification accuracies in distinguishing highly similar crops and weeds in natural scenes. MobileNetV3-Large has been optimized to balance speed and accuracy, making it highly suitable for applications that require rapid inference with moderate precision. EfficientNet-Lite offers good accuracy and applicability, especially on resource-constrained devices, by simplifying its design to accommodate a wider range of hardware. GhostNet is renowned for its extremely low computational demands and parameter efficiency, making it ideal for operation on highly constrained hardware environments.
Comparison of weed identification efficiency on different models
Although generic CNNs exhibit commendable accuracy in weed identification, their complexity and substantial computational demand render them unsuitable for deployment in real-time field weed identification systems. Conversely, lightweight CNNs, while faster, typically offer lower identification accuracy. This paper aims to address these issues by enhancing the MobileViT model to maintain high identification accuracy with fewer model parameters. To validate its effectiveness, the identification efficiency of the proposed method was compared with mainstream CNNs, with the results presented in Table 1.
The comparison of model size, accuracy, and inference time demonstrates that the identification speed of the proposed method approaches that of lightweight CNNs, with an inference time of only 83 ms per image, meeting the real-time requirements for weeding operations. Moreover, the accuracy of this method not only surpasses that of the generic CNN DenseNet-16135, whose model size and inference time are several times larger than those of the proposed method, but also significantly exceeds that of the lightweight network MobileNetv2.
By ingeniously integrating convolution and Transformer architectures, the proposed method achieves a balanced trade-off between identification accuracy and speed, making it effectively applicable for field weed identification.
Comparison of our model with the MobileViT model
Considering that the original structure of MobileViT was designed for the ImageNet dataset, directly applying it to the specific tasks of this paper might lead to issues of model structure incompatibility. Consequently, this research introduces several modifications to the MobileViT network, specifically employing larger strides and kernel sizes in the initial convolutional layers and enhancing the attention to critical information in feature maps through the incorporation of an ECA module.
The original MobileViT model is categorized into three variants based on scale and parameter count: MobileViT-S32, MobileViT-XS32, and MobileViT-XXS32. Table 2 presents the comparative identification performance of the proposed method and these three original versions of MobileViT. From Table 2, the proposed method achieves an identification accuracy comparable to MobileViT-S but with a significantly reduced inference time per image. Compared to MobileViT-XXS, the proposed method improves accuracy by 0.39 percentage points. However, since the parameter settings of the proposed method are based on the MobileViT-XS model, there is a slight increase in inference time. Nonetheless, the inference speed of the proposed method remains sufficiently fast to meet the real-time requirements of field weed removal operations.
Experimental results show that the method proposed in this study achieves comparable weed recognition accuracy to the original MobileViT-S model, but performs better in inference time per image, which is crucial for field operations requiring rapid response. Although the inference time is slightly increased compared to the smaller MobileViT-XXS version, the improvement in accuracy demonstrates the effectiveness of the performance optimization. Additionally, this study also showcases the potential of the ECA module in enhancing network processing speed and efficiency, particularly in handling complex scenes.
In summary, by optimizing the MobileViT structure and integrating the ECA module, this study not only improves the accuracy of weed recognition but also optimizes the model’s real-time performance, making it more suitable for practical agricultural scenarios. This work not only advances the application of image recognition technology in the field of smart agriculture but also provides valuable references for achieving similar optimizations in other visual tasks in the future.
Visual analysis
In this paper, one image from each category in the CornWeed test set was selected for visual analysis using Gradient-weighted Class Activation Mapping (Grad-CAM)42 on MobileNetv2, DenseNet-16135, and the proposed method. To achieve better visualization, only the correct labels were used in generating the activation heatmaps. This involved computing the gradients of the last convolutional layer’s output feature maps to obtain the distribution of activations, which was then superimposed on the original images as heatmaps. The specific results are displayed in Fig. 5.
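A minimal Grad-CAM routine matching this procedure (gradients of the last convolutional layer for the correct class, averaged per channel and used to weight the activations) can be written with PyTorch hooks as below; this is a generic sketch, not the visualization code used in the paper.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Minimal Grad-CAM: weight the target layer's feature maps by the spatial mean of their
    gradients for the correct class, then ReLU and upsample to the input size."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    model.eval()
    logits = model(image)                          # image: (1, 3, H, W)
    model.zero_grad()
    logits[0, class_idx].backward()                # only the correct label, as in the text
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)            # per-channel gradient averages
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()             # normalized heatmap overlay
```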
The visual analysis reveals that the proposed algorithm effectively focuses on the areas of the image containing weeds and corn seedlings, with heatmap coverage that is more precise than that of the comparative CNNs. In visualizations of species such as foxtail, sedge, and corn, key parts such as leaves and stems show higher activation values, significantly aiding the distinction of morphologically similar weeds and crops.
Furthermore, the visualization of the quinoa images across different models demonstrates that, despite the presence of interference from other weed categories, the overall heatmap generated by the proposed method remains focused on the quinoa’s location. In contrast, DenseNet-161 and MobileNetv2 fail to accurately cover the target area. These visualization results indicate that the weed identification method proposed in this paper enhances the extraction of critical weed features and suppresses interference from background features, effectively resolving the problem of weed identification in agricultural environments.
Ablation study of modules
To verify the identification effects of combining image enhancement with the identification network, a series of ablation experiments was conducted; the results are presented in Table 3. After preprocessing with image enhancement, the proposed identification model demonstrates significant improvements in accuracy, recall, and mean average precision (mAP) compared to the baseline model. Specifically, the average precision of the model increased by 2.6 percentage points. Under complex conditions such as light occlusion and high similarity between weeds and crops, the average precision improved by 5.3% and 3.1%, respectively.
Further improvements were achieved by incorporating the ECA module, resulting in additional gains in accuracy, recall, and mAP. Compared to the baseline model, the overall mAP increased by 6.3 percentage points. From this analysis, it can be concluded that integrating the image enhancement module and the ECA module into the improved identification model effectively enhances its performance in detecting weeds in complex field environments.
Conclusion
To address the challenges of precision and efficiency in field weed detection, we devised a lightweight identification method using an enhanced MobileViT model. This innovative approach integrates advanced technologies to surpass the limitations of current methods in practical applications. We began by implementing a multi-scale retinex algorithm with color restoration to enhance features, supported by various data augmentation techniques including rotation, cropping, brightness adjustment, and noise injection. These strategies improved image quality, preserved vital color information, and boosted the model’s generalization and accuracy. We further tailored the MobileViT architecture for optimal performance on mobile devices by merging the local feature extraction capabilities of CNNs with the global insights of ViT through a self-attention mechanism. This adaptation reduces reliance on extensive datasets, making it ideal for weed detection with limited data availability. Moreover, we introduced a hybrid structure combining CNN and MobileViT modules with ECA modules to accentuate essential features in the images. The optimized loss function enhances the model’s sensitivity to subtle differences in weed images, minimizing computational demands and ensuring efficient, high-quality feature processing. Experimental results confirm that our method outperforms existing techniques in accuracy, speed, and resource efficiency across multiple public weed datasets. It offers significant improvements in real-time performance and deployment efficiency in farming environments, reducing computational and storage demands while maintaining excellent accuracy. This method represents a viable and effective solution for practical weed detection in field conditions.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Krähmer, H. et al. Weed surveys and weed mapping in Europe: State of the art and future tasks. Crop Prot. 129, 105010 (2020).
Coleman, G. R. et al. Weed detection to weed recognition: Reviewing 50 years of research to identify constraints and opportunities for large-scale cropping systems. Weed Technol.36(6), 741–757 (2022).
Tang, J. et al. Weed identification based on k-means feature learning combined with convolutional neural network. Comput. Electron. Agric.135, 63–70 (2017).
Espejo-Garcia, B., Panoutsopoulos, H., Anastasiou, E., Rodríguez-Rigueiro, F. J. & Fountas, S. Top-tuning on transformers and data augmentation transferring for boosting the performance of weed identification. Comput. Electron. Agric. 211, 108055 (2023).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. (2016).
Ghiasi, G., Lin, T. Y. & Le, Q. V. Dropblock: A regularization method for convolutional networks, Advances in neural information processing systems, vol. 31, (2018).
Zhu, Y. & Newsam, S. Densenet for dense flow, in 2017 IEEE international conference on image processing (ICIP). IEEE, pp. 790–794. (2017).
Ahmad, A., Saraswat, D., Aggarwal, V., Etienne, A. & Hancock, B. Performance of deep learning models for classifying and detecting common weeds in corn and soybean production systems. Comput. Electron. Agric.184, 106081 (2021).
Yang, Y., Li, Y., Yang, J. & Wen, J. Dissimilarity-based active learning for embedded weed identification. Turk. J. Agric. For.46(3), 390–401 (2022).
Wang, P. et al. Weed25: A deep learning dataset for weed identification. Front. Plant Sci.13, 1053329 (2022).
Vasileiou, M. et al. Transforming weed management in sustainable agriculture with artificial intelligence: A systematic literature review towards weed identification and deep learning. Crop Prot. 106522 (2023).
Rai, N. et al. Multi-format open-source weed image dataset for real-time weed identification in precision agriculture. Data Brief. 51, 109691 (2023).
Diao, Z. et al. Spatial-spectral attention-enhanced res-3d-octconv for corn and weed identification utilizing hyperspectral imaging and deep learning. Comput. Electron. Agric.212, 108092 (2023).
Yang, L. et al. A new model based on improved VGG16 for corn weed identification. Front. Plant Sci.14, 1205151 (2023).
Cai, Y. et al. Attention-aided semantic segmentation network for weed identification in pineapple field. Comput. Electron. Agric.210, 107881 (2023).
Visentin, F. et al. A mixed-autonomous robotic platform for intra-row and inter-row weed removal for precision agriculture. Comput. Electron. Agric.214, 108270 (2023).
Zhu, H. et al. Research on improved yolox weed detection based on lightweight attention module. Crop Prot.177, 106563 (2024).
Sapkota, R., Stenger, J., Ostlie, M. & Flores, P. Towards reducing chemical usage for weed control in agriculture using UAS imagery analysis and computer vision techniques. Sci. Rep.13(1), 6548 (2023).
Ronay, I., Lati, R. N. & Kizel, F. Spectral mixture analysis for weed traits identification under varying resolutions and growth stages. Comput. Electron. Agric. 220, 108859 (2024).
Li, Z., Liu, F., Yang, W., Peng, S. & Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Networks Learn. Syst.33(12), 6999–7019 (2021).
Peteinatos, G. G., Reichel, P., Karouta, J., Andujar, D. & Gerhards, R. Weed identification in maize, sunflower, and potatoes with the aid of convolutional neural networks. Remote Sens. 12(24), 4185 (2020).
Bakhshipour, A. & Jafari, A. Evaluation of support vector machine and artificial neural networks in weed detection using shape features. Comput. Electron. Agric.145, 153–160 (2018).
Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. Scaled-YOLOv4: Scaling cross stage partial network, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13029–13038. (2021).
Zeng, W., Li, H., Hu, G. & Liang, D. Lightweight dense-scale network (ldsnet) for corn leaf disease identification. Comput. Electron. Agric.197, 106943 (2022).
Wang, Z., Guo, J. & Zhang, S. Lightweight Convolution neural network based on multi-scale parallel fusion for weed identification. Int. J. Pattern Recognit. Artif. Intell. 36(07), 2250028 (2022).
Li, J., Li, J., Zhao, X., Su, X. & Wu, W. Lightweight detection networks for tea bud on complex agricultural environment via improved YOLO v4. Comput. Electron. Agric. 211, 107955 (2023).
Arnab, A. et al. Vivit: A video vision transformer, in Proceedings of the IEEE/CVF international conference on computer vision, pp. 6836–6846. (2021).
Liu, Z. et al. Swin Transformer: Hierarchical vision transformer using shifted windows, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. (2021).
Yang, J. et al. Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641 (2021).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Zhang, J. Weed recognition method based on hybrid CNN-transformer model. Front. Comput. Intell. Syst.4(2), 72–77 (2023).
Mehta, S. & Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178 (2021).
Jiang, H. et al. Cnn feature based graph convolutional network for weed and crop recognition in smart farming. Comput. Electron. Agric. 174, 105450 (2020).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. (2017).
Howard, A. G. et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861, (2017).
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L. C. MobileNetV2: Inverted residuals and linear bottlenecks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. (2018).
Howard, A. et al. Searching for MobileNetV3, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324. (2019).
Ab Wahab, M. et al. Efficientnet-lite and hybrid CNN-KNN implementation for facial expression recognition on raspberry Pi. IEEE Access. 9, 134065–134080 (2021).
Han, K. et al. Ghostnet: More features from cheap operations. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (2020).
Zhang, X., Zhou, X., Lin, M. & Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6848–6856. (2018).
Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization, in Proceedings of the IEEE international conference on computer vision, pp. 618–626. (2017).
Funding
This work is supported by the Science Technology Research Project of Jilin Provincial Department of Education–“Research on rice seedling growth monitoring based on computer vision” (No. JJKH20241706KJ).
Author information
Contributions
Jingru Sui’s contribution lies in data analysis, original draft preparation and sorting; Xiaoyan Liu and Zhihui Chen participated in the relevant revisions of the “Experimental results and analysis” section. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, X., Sui, Q. & Chen, Z. Real time weed identification with enhanced mobilevit model for mobile devices. Sci Rep 15, 27323 (2025). https://doi.org/10.1038/s41598-025-12036-0