Abstract
In large-scale rice cultivation, seedling deficiency is a common issue that significantly impacts timely replanting decisions. Traditional manual inspection methods are inefficient and labor-intensive, highlighting the need for an automated and accurate detection approach. This study proposes a rice seedling deficiency detection method based on a state space model, aiming to improve detection precision for small seedlings. A dual-branch feature extraction module built upon the State Space Model (Mamba), combined with a wavelet convolution transform, enhances detection accuracy on a self-constructed rice seedling deficiency dataset. Experiments show that the proposed optimized model achieves a mAP50 of 78%, outperforming other baseline models. The results indicate the effectiveness and practicality of the approach, offering a novel and efficient solution for detecting missing seedlings in rice fields.
Introduction
Rice is one of the most important staple crops in the world and attracts considerable attention. Rice seedling deficiency is a common issue in practical agricultural production. Currently, rice seedling deficiency detection technology in China lags behind, still relying on traditional manual visual inspection. This approach is not only labor-intensive but also inefficient and prone to subjective biases. With the development of computer hardware performance, vision-based detection systems are gradually replacing expensive and inefficient manual observation methods.
Rice seedling deficiency detection based on computer vision and unmanned aerial vehicle (UAV) remote sensing imagery has primarily focused on machine learning and deep learning approaches. Traditional machine learning approaches, coupled with image processing, rely on handcrafted features to count seedlings and thereby infer regions of deficiency in remote sensing images. Jin et al. proposed a method for estimating wheat plant density using high-resolution UAV imagery captured at low altitude. Crop rows were segmented using the Excess Green index and Hough transform, followed by plant counting with a support vector machine1. Shirzadifar et al. employed both the Excess Green index and k-means clustering to segment maize plant pixels and further explored the use of high-resolution UAV images for estimating seedling count and evaluating stand uniformity2. These studies demonstrate that seedling deficiency can be indirectly inferred by segmenting plant regions in UAV imagery and estimating actual seedling counts. Comparing these counts with expected planting densities allows for effective detection and quantification of deficient areas. However, conventional crop counting methods based on machine learning rely on low-level image descriptors and handcrafted features, making them suitable only for scenarios with simple backgrounds. In contrast, deep learning-based object detection algorithms overcome these limitations by automatically extracting robust features. Stavness et al. further validated this advantage, showing that deep learning models not only outperform conventional techniques in terms of accuracy but also offer greater robustness and scalability when applied to complex image-based plant phenotyping problems3.
Researchers have applied deep learning techniques to seedling detection across various crops. Jin et al. successfully detected dead trees using the YOLOv4-tiny object detection model and remote sensing data4. Zhang et al. enhanced YOLOv5s by integrating the Coordinate Attention (CA) module to improve feature representation in the channel dimension and preserve long-range dependencies, enabling precise missing seedling localization for sandalwood trees in remote sensing images5. Wu et al. incorporated the Efficient Channel Attention (ECA) module into an improved YOLOv5s model to enhance sugarcane seedling identification6. Cui et al. introduced Paddy-YOLOv5s-Prune, integrating a Transformer into the detection head to improve sensitivity to small rice seedling targets7.
Although the aforementioned deep learning algorithms offer different solutions for crop detection, they also have certain limitations. First, these methods primarily enhance global feature perception by incorporating attention mechanisms, but they lack effective control over shallow-layer local details. Second, while the Transformer-YOLO architecture overcomes the limitations of CNN models, which often neglect long-range dependencies due to their local receptive fields, its high computational complexity leads to significant resource consumption8,9. This hinders both model training efficiency and its practical application in UAV-based remote sensing detection.
Therefore, this study proposes an optimized YOLO-based detection framework tailored specifically for row-wise rice seedling deficiency detection across large-scale farmlands10. In contrast to conventional convolutional encoders, a State Space Model (SSM)-based Mamba backbone is adopted to efficiently capture long-range dependencies11. Furthermore, we design a dual-branch feature extraction module to fuse the global semantic context derived from the Mamba branch with local detail features captured by the convolutional attention branch, enhancing the feature modeling capability12. To mitigate the loss of high-frequency components during downsampling, we introduce a wavelet-based downsampling module that preserves structural and frequency-domain information, significantly improving multi-scale feature robustness under complex field conditions.
Our research contributes a novel perspective to the domain of seedling deficiency detection in precision agriculture by integrating a State Space Model (SSM)-based backbone into the YOLO framework. We present several key contributions in our study:
- A UAV-based rice seedling deficiency dataset was constructed and annotated, providing a reliable foundation for training and evaluating detection models.
- An SSM-based architecture (Mamba) with global modeling capability is introduced to expand the receptive field. Based on this, a dual-branch module is designed to enhance the model’s understanding of local information and improve its feature representation capability.
- To enhance feature representation and preserve the frequency-domain details of small seedling-deficient targets, a wavelet-based downsampling module is designed. This module integrates spatially interleaved partitioning with wavelet convolution to retain high-frequency and structural information during resolution reduction.
Materials and methods
Dataset collection and processing
This study was carried out in a systematically planted rice field spanning about 1 hectare in Guangdong Province, China. Due to the lack of publicly available datasets tailored to seedling deficiency detection during rice emergence, we constructed a custom dataset to support model training and evaluation. The data collection took place on August 16, 2024, corresponding to the 10th day after rice seedling emergence. At this stage, rice seedlings typically reach a height of 10–15 cm, and the leaves exhibit a tender green hue. At the pixel level, the seedlings appear as elongated green regions, facilitating the distinction between seedling-present and seedling-absent areas.
The image acquisition system consisted of an unmanned aerial vehicle (UAV) equipped with an M3M camera, which has a resolution of 5280 × 3956 pixels. During data collection, the UAV flew at a fixed altitude of 12.45 meters under natural lighting conditions, yielding a ground sampling distance (GSD) of 3.5 millimeters per pixel, which provided sufficient spatial resolution to capture small-scale seedling gaps.
The collected raw images were initially stored in TIF raster format at a resolution of 5280 × 3956 pixels and were later converted to JPG format. After excluding unusable images, the remaining images were pre-processed: irrelevant areas such as image borders and field edges were cropped out, and contrast was adjusted to enhance visual clarity. These corrections were performed prior to annotation, resulting in 1677 usable samples.
After preprocessing, the seedling-deficient regions in each image were manually annotated using the LabelImg tool, and the annotations were saved in TXT format according to the YOLO standard. To fully utilize the annotated dataset and enhance the robustness of model training, we applied data augmentation techniques such as Mixup, Mosaic, and HSV adjustments. Finally, the dataset consisted of 1677 usable images, which were split into a training set and a validation set, with 1377 images used for training and 300 for validation. The phenomenon of rice seedling deficiency in the field is illustrated in Fig. 1.
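For reference, each line of a YOLO-format TXT label encodes one annotated region as a class index followed by the box center coordinates and box dimensions, all normalized to fractions of the image width and height. The values below are illustrative only, not taken from the actual dataset:

```
0 0.4821 0.3310 0.0625 0.0212
0 0.5107 0.6894 0.0581 0.0198
```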
Optimized model
YOLOv8 demonstrates strong performance in seedling deficiency detection due to its preservation of detailed spatial features during initial feature extraction. However, YOLOv8 presents several limitations in complex field environments. Typical convolution and pooling layers in YOLOv8 often lose critical structural and edge information needed to detect missing seedlings in challenging backgrounds. The convolutional architecture lacks global modeling capability, which limits its ability to detect missing seedlings across planting rows13. In addition, the absence of frequency-domain information makes the model less sensitive to fine edge details and subtle planting gaps.
To address these limitations, we introduce three architectural components tailored for this task. The Init_Stem module enhances shallow-level feature preservation during initial downsampling. The Mamba_Conv block integrates a dual-branch mechanism to capture both global semantics and local details. The ClueMerge module incorporates spatial rearrangement and wavelet convolution to retain high-frequency and structural information during feature compression. These modules collectively enhance the model’s capacity to detect missing seedlings under challenging field conditions.
Building upon these enhancements, this study proposes an optimized YOLO model based on the Mamba architecture for detecting row-wise rice seedling deficiency across large-scale farmland14. In this section, we first introduce the relevant preliminaries, followed by an overview of the backbone structure. Then, we provide a detailed breakdown of its key components, including the Init_Stem block, 2D-Selective-Scan, Mamba_Conv block, and ClueMerge layer15,16.
Preliminaries
Contemporary SSM-based models, especially Structured State Space Models (S4) and the Mamba model, are typical continuous systems16,17,18. These systems utilize an implicit latent state \(h(t) \in \mathbb {R}^N\) to map a one-dimensional input function \(x(t) \in \mathbb {R}\) to an output \(y(t) \in \mathbb {R}\), as shown in Eq. (1):
$$\begin{aligned} h'(t)&= Ah(t) + Bx(t), \\ y(t)&= Ch(t) \end{aligned}$$(1)
Here, \(A \in \mathbb {R}^{N \times N}\) denotes the state matrix, which dictates the temporal dynamics of the hidden state, while \(B \in \mathbb {R}^{N \times 1}\) and \(C \in \mathbb {R}^{N \times 1}\) serve as projection parameters: B is a weight matrix that updates the hidden state based on the input, and C projects the intermediate hidden state onto the output space. The Mamba model applies such a continuous system to discrete-time sequential data by using discretization functions \(f_A\) and \(f_B\) to convert the parameters A and B into their discrete counterparts \(\bar{A}\) and \(\bar{B}\), so that they can be seamlessly integrated into deep learning architectures. A commonly used method for discretization is the Zero-Order Hold (ZOH), which introduces a time-scale parameter \(\Delta\) to adjust the model’s temporal resolution. The discretized state matrix and projection parameters are presented in Eq. (2):
$$\begin{aligned} \bar{A}&= \exp (\Delta A), \\ \bar{B}&= (\Delta A)^{-1}\left( \exp (\Delta A) - I\right) \cdot \Delta B \end{aligned}$$(2)
In Eq. (2), \(\Delta A\) and \(\Delta B\) denote the discrete-time equivalents of the continuous parameters over a specified time interval, and I denotes the identity matrix. The resulting discretized formulation is presented in Eq. (3):
$$\begin{aligned} h_t&= \bar{A} h_{t-1} + \bar{B} x_t, \\ y_t&= C h_t \end{aligned}$$(3)
The discretized SSM computes the output in the form of a global convolution, as shown in Eq. (4):
$$\begin{aligned} \bar{K}&= \left( C\bar{B},\, C\bar{A}\bar{B},\, \ldots ,\, C\bar{A}^{L-1}\bar{B}\right) , \\ y&= x * \bar{K} \end{aligned}$$(4)
Here, \(\bar{K} \in \mathbb {R}^{L}\) represents the structured convolution kernel, where L denotes the length of the input x19,20,21.
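To make the discretization concrete, the following minimal PyTorch sketch implements Eqs. (2) and (3) for a single channel. Variable names mirror the symbols above; this is our illustration, not the fused selective-scan kernel used by the actual S6 block.

```python
import torch

def zoh_discretize(A, B, delta):
    """Zero-Order Hold (Eq. (2)): A_bar = exp(dA), B_bar = (dA)^{-1}(exp(dA) - I) dB."""
    dA = delta * A                                            # (N, N), assumed invertible
    A_bar = torch.matrix_exp(dA)
    I = torch.eye(A.shape[0])
    B_bar = torch.linalg.solve(dA, A_bar - I) @ (delta * B)   # (N, 1)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Recurrence of Eq. (3): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = torch.zeros(A_bar.shape[0], 1)
    ys = []
    for x_t in x:                                             # x: (L,) input sequence
        h = A_bar @ h + B_bar * x_t
        ys.append((C @ h).squeeze())
    # Unrolling this recurrence over a length-L input reproduces the global
    # convolution of Eq. (4) with kernel K_bar = (C B_bar, ..., C A_bar^{L-1} B_bar).
    return torch.stack(ys)
```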
Overall architecture
Fig. 2 illustrates the overall architecture of the optimized model. Specifically, the backbone of the optimized model consists of the Init_Stem block, Mamba_Conv block, and a ClueMerge layer.
The input image is denoted as \(X \in \mathbb {R}^{H \times W \times C}\), where H, W, and C represent the height, width, and number of channels, respectively. Initially, the first stacked block reduces the resolution of X through the Init_Stem block. Its output is then fed into the first dual-branch feature extraction module to capture global dependencies:
$$\begin{aligned} X_1 = \text {Mamba\_Conv}_1\left( \text {Init\_Stem}(X)\right) \end{aligned}$$(5)
where \(\text {Mamba\_Conv}_1()\) denotes the first dual-branch feature extraction operation, and \(X_1 \in \mathbb {R}^{\frac{H}{4} \times \frac{W}{4} \times C}\) represents the output of the initial stacked block. Subsequently, we feed \(X_1\) into three sequentially stacked blocks, each composed of a ClueMerge layer followed by a Mamba_Conv block:
$$\begin{aligned} X_i = \text {Mamba\_Conv}_i\left( \text {ClueMerge}_i(X_{i-1})\right) , \quad i \in \{2, 3, 4\} \end{aligned}$$(6)
where i indicates the sequential index of the stacked block in the backbone. At each stage, the spatial resolution is reduced by half while the number of channels is doubled, progressively forming a hierarchical feature representation with rich multiscale semantics. The resulting feature maps \(X_2 \in \mathbb {R}^{\frac{H}{8} \times \frac{W}{8} \times 2C}\), \(X_3 \in \mathbb {R}^{\frac{H}{16} \times \frac{W}{16} \times 4C}\), and \(X_4 \in \mathbb {R}^{\frac{H}{32} \times \frac{W}{32} \times 8C}\) from the selected stages are then forwarded to the neck for multi-scale feature fusion.
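The stage-wise shapes can be traced with simple convolutional stand-ins. Assuming a 640 × 640 input and a base width of C = 64 (both illustrative, as the paper keeps these symbolic), only the strides and channel widths below follow the text, not the actual module internals:

```python
import torch
import torch.nn as nn

C = 64
init_stem = nn.Conv2d(3, C, kernel_size=4, stride=4)             # X1: (C, H/4, W/4)
stages = nn.ModuleList([
    nn.Conv2d(C * 2**i, C * 2**(i + 1), kernel_size=2, stride=2)
    for i in range(3)                                            # three ClueMerge + Mamba_Conv stages
])

x = torch.randn(1, 3, 640, 640)
feat = init_stem(x)
pyramid = []
for stage in stages:
    feat = stage(feat)                                           # halve resolution, double channels
    pyramid.append(feat)

print([tuple(f.shape) for f in pyramid])
# [(1, 128, 80, 80), (1, 256, 40, 40), (1, 512, 20, 20)] -> X2, X3, X4 for the neck
```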
Init_Stem block
The Init_Stem block is designed to downsample the input image and extract shallow feature representations for the subsequent backbone. This block adopts a dual-branch residual structure to perform spatial downsampling on the input \(X \in \mathbb {R}^{H \times W \times C}\) without altering the input channel dimension. The main branch consists of two sequential convolutional layers, each followed by a distinct activation function to enhance feature diversity. Meanwhile, the residual branch utilizes a \(1 \times 1\) convolution with a stride of 4 to achieve rapid spatial alignment. The outputs of both branches are aggregated via element-wise addition to form the final shallow feature representation. The process in Init_Stem can be delineated as follows:
$$\begin{aligned} X_1 = {{\textbf {MainConv}}}(X) + {{\textbf {SkipProj}}}(X) \end{aligned}$$(7)
where \(\textbf {MainConv}()\) represents the sequential convolutions in the main branch, and \(\textbf {SkipProj}()\) denotes the residual projection branch. \(X_1 \in \mathbb {R}^{\frac{H}{4} \times \frac{W}{4} \times C}\) serves as the shallow feature map for the subsequent stages.
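A minimal PyTorch sketch of Eq. (7) follows. The 3 × 3 kernels and the SiLU/GELU pairing are our assumptions; the stride-4 reduction, the two distinct activations, the unchanged channel width, and the 1 × 1 residual projection come from the text:

```python
import torch
import torch.nn as nn

class InitStem(nn.Module):
    """Dual-branch residual stem: MainConv(X) + SkipProj(X), stride-4 overall."""
    def __init__(self, ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1),
            nn.SiLU(),                                  # first activation
            nn.Conv2d(ch, ch, 3, stride=2, padding=1),
            nn.GELU(),                                  # a distinct second activation
        )
        self.skip = nn.Conv2d(ch, ch, 1, stride=4)      # rapid spatial alignment

    def forward(self, x):
        return self.main(x) + self.skip(x)              # element-wise aggregation

y = InitStem(64)(torch.randn(1, 64, 640, 640))          # -> (1, 64, 160, 160)
```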
2D-Selective-Scan
The 2D-Selective-Scan (SS2D) module is the core of the Mamba_Conv block and consists of three key components: a scan expansion operation, an S6 block, and a scan merging operation. The scan expansion operation splits the input image into sequences along four directions (top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right). This operation captures features from multiple perspectives while preserving spatial information. Next, the S6 block updates the parameters of the SSM via a selective mechanism, filtering irrelevant information to precisely extract useful image features. Finally, the scan merging operation reconstructs the output image by integrating the four transformed sequences processed by the S6 block, ensuring that the final 2D feature map matches the original input image size.
The S6 block, derived from the previously introduced Mamba model, extends the S4 with a selective mechanism22,23,24. This adjustment enables the S6 to selectively retain relevant information while discarding irrelevant data. The pseudo-code for the S6 block is presented in Algorithm 1. The diagram illustrating the SS2D operation is shown in Fig. 3:
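The scan expansion and merging can be sketched as follows; `seq_model` stands in for the S6 block, and we approximate the four directional traversals with forward/backward row-major and column-major scans, as in VMamba-style implementations (an assumption, since the exact traversal scheme is given only descriptively above):

```python
import torch

def ss2d(x, seq_model):
    """Expand a feature map into four 1D scans, process each, and merge back."""
    B, C, H, W = x.shape
    flat   = x.flatten(2)                        # (B, C, H*W), row-major scan
    flat_t = x.transpose(2, 3).flatten(2)        # (B, C, W*H), column-major scan
    seqs = [flat, flat.flip(-1),                 # forward and reversed row-major
            flat_t, flat_t.flip(-1)]             # forward and reversed column-major
    outs = [seq_model(s) for s in seqs]          # each: (B, C, L) -> (B, C, L)
    # undo the reversals/transpose, then merge the four directions by summation
    y  = (outs[0] + outs[1].flip(-1)).view(B, C, H, W)
    yt = (outs[2] + outs[3].flip(-1)).view(B, C, W, H).transpose(2, 3)
    return y + yt                                # (B, C, H, W), same size as input

out = ss2d(torch.randn(1, 8, 16, 16), seq_model=lambda s: s)   # identity stand-in
```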
Mamba_Conv block
The Mamba_Conv block plays a critical role in the optimized model, as shown in Fig. 2. Relying solely on convolutional feature extraction presents issues in rice seedling deficiency detection, such as the dilution of semantic information in deeper feature maps as the network deepens. To address this issue, we enhance long-range modeling capability by designing a Mamba-based feature extraction block that captures extensive contextual information.
While Mamba excels at capturing global context, rice seedling deficiency detection also demands precise localization of small gaps, an area where CNNs are inherently effective. Hence, a hybrid structure is adopted to combine Mamba’s semantic modeling with CNN-based local feature extraction.
First, channel splitting divides the input features, which are then fed into the module’s Mamba and Conv-Attention branches. In the Mamba branch, the sub-input features undergo layer normalization before entering the SS2D module, ultimately producing global features with enriched semantic information. In the Conv-Attention branch, the traditional CNN convolutional stacking structure is followed. However, to mitigate feature loss caused by convolution operations, it integrates the Convolutional Block Attention Module (CBAM)25. This mechanism adaptively refines input features by sequentially inferring attention weights along both channel and spatial dimensions. Finally, channel shuffling is employed to restore channel dimensions while ensuring feature reordering along the channel axis.
To illustrate the modeling process, the pseudo-code for the Mamba_Conv block is presented in Algorithm 2:
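The following sketch mirrors the split → (Mamba | Conv-Attention) → shuffle pipeline described above. Here `ss2d` and `cbam` are injected stand-ins (e.g., the SS2D sketch above and a CBAM implementation), and the 3 × 3 convolution in the local branch is an assumption:

```python
import torch
import torch.nn as nn

class MambaConv(nn.Module):
    def __init__(self, ch, ss2d, cbam):
        super().__init__()
        self.norm = nn.LayerNorm(ch // 2)           # pre-SS2D layer normalization
        self.ss2d = ss2d                            # global Mamba branch
        self.conv = nn.Sequential(                  # local Conv-Attention branch
            nn.Conv2d(ch // 2, ch // 2, 3, padding=1),
            nn.SiLU(),
            cbam,                                   # channel + spatial attention
        )

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                    # channel splitting
        a = self.norm(a.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        a = self.ss2d(a)                            # global semantic features
        b = self.conv(b)                            # local detail features
        y = torch.cat([a, b], dim=1)
        B, C, H, W = y.shape                        # channel shuffle: interleave branches
        return y.view(B, 2, C // 2, H, W).transpose(1, 2).reshape(B, C, H, W)

blk = MambaConv(64, ss2d=nn.Identity(), cbam=nn.Identity())
out = blk(torch.randn(1, 64, 80, 80))               # -> (1, 64, 80, 80)
```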
ClueMerge layer
Standard convolutional blocks often fail to explicitly preserve spatial structure and frequency diversity during downsampling, which limits the model’s ability to represent localized textures and high-frequency cues. To address this limitation, the proposed ClueMerge layer utilizes spatial partitioning, channel concatenation and wavelet convolution transform for efficient downsampling26,27.
Specifically, given an input feature map \(X \in \mathbb {R}^{H \times W \times C_1}\), the layer first partitions the input based on even and odd positions along the spatial dimension, resulting in four sub-feature maps. These sub-feature maps are then concatenated along the channel dimension to form an intermediate tensor \(X_1 \in \mathbb {R}^{\frac{H}{2} \times \frac{W}{2} \times 4C_1}\), as formulated below:
$$\begin{aligned} X_1 = \text {Concat} \left( X^{(ee)}, X^{(oe)}, X^{(eo)}, X^{(oo)} \right) \end{aligned}$$(8)
where \(X^{(ee)},\, X^{(oe)},\, X^{(eo)},\, X^{(oo)}\) denote the four sub-feature maps obtained by sampling along even and odd spatial positions. Finally, \(X_1\) is fed into the WTConv, followed by batch normalization and activation to produce the final output:
$$\begin{aligned} Y = \text {SiLU}(\text {BatchNorm}(\text {WTConv}(X_1))) \end{aligned}$$(9)
The final output \(Y \in \mathbb {R}^{\frac{H}{2} \times \frac{W}{2} \times C_2}\) retains key structural and high-frequency characteristics, which contributes to more effective extraction of seedling-deficient features in subsequent stages. The structure of the ClueMerge layer is shown in Fig. 4.
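A sketch of Eqs. (8)–(9) is given below; since the wavelet convolution of Finder et al.27 is an external module, a plain 3 × 3 convolution is substituted as a stand-in for WTConv:

```python
import torch
import torch.nn as nn

class ClueMerge(nn.Module):
    def __init__(self, c1, c2, wtconv=None):
        super().__init__()
        # WTConv stand-in: any module mapping 4*c1 -> c2 channels
        self.conv = wtconv or nn.Conv2d(4 * c1, c2, 3, padding=1)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        # even/odd spatial partitioning into four interleaved sub-maps (Eq. (8))
        x_ee = x[..., 0::2, 0::2]
        x_oe = x[..., 1::2, 0::2]
        x_eo = x[..., 0::2, 1::2]
        x_oo = x[..., 1::2, 1::2]
        x1 = torch.cat([x_ee, x_oe, x_eo, x_oo], dim=1)   # (B, 4*c1, H/2, W/2)
        return self.act(self.bn(self.conv(x1)))           # Eq. (9): (B, c2, H/2, W/2)

y = ClueMerge(64, 128)(torch.randn(1, 64, 160, 160))      # -> (1, 128, 80, 80)
```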
Experiments
Performance metrics
Different models can be quantitatively compared using specific evaluation metrics. For object detection models, the primary evaluation criterion is detection accuracy, which is typically assessed through Precision (Pre), Recall (Re), and Mean Average Precision (mAP). In this paper, these three metrics are adopted to evaluate the performance of the proposed model. Specifically, mAP at an Intersection over Union (IoU) threshold of 0.5 (mAP50) is used to emphasize detection accuracy, while mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP95) provides a more comprehensive evaluation of both localization precision and robustness. The equations for these metrics are defined as follows:
$$\begin{aligned} \text {Pre} = \frac{TP}{TP + FP} \end{aligned}$$(10)
$$\begin{aligned} \text {Re} = \frac{TP}{TP + FN} \end{aligned}$$(11)
$$\begin{aligned} \text {mAP} = \frac{1}{n} \sum _{i=1}^{n} AP_i \end{aligned}$$(12)
In Eqs. (10) and (11), TP (True Positive) denotes the number of correctly identified positive samples, FP (False Positive) denotes the number of negative samples incorrectly identified as positives, and FN (False Negative) denotes the number of positive samples incorrectly classified as negatives. Additionally, the Mean Average Precision (mAP) in Eq. (12) denotes the mean precision across all categories, calculated under the assumption of n distinct classes.
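As a minimal numeric illustration of Eqs. (10)–(12), with hypothetical TP/FP/FN counts for the single deficiency class:

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0     # Eq. (10)

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0     # Eq. (11)

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)  # Eq. (12), n = number of classes

print(precision(78, 30))   # 0.722...
print(recall(78, 24))      # 0.764...
```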
Software and hardware systems used in this study
The hardware and software configurations for this study are as follows. The central processing unit (CPU) is an Intel® \(\hbox {Core}^{\textrm{TM}}\) i7-13700K, equipped with 16 cores and 24 threads, with a performance core base frequency of 3.4 GHz and an efficiency core base frequency of 2.5 GHz. The GPU is an NVIDIA GeForce RTX 4070Ti with 12 GB of memory. The operating system is Microsoft Windows 11, and the software development environment is Visual Studio Code with the Windows Subsystem for Linux (WSL) extension. The deep learning framework used is PyTorch 2.1.1, executed in a Python 3.10 environment and accelerated by CUDA 11.8. During the training process, AdamW is selected as the optimizer, with an initial learning rate of 0.001, a learning rate decay factor of 0.001, and a momentum factor of 0.95. The maximum number of iterations is set to 400, and the number of worker threads for data loading is set to 4. Table 1 provides a detailed overview of the hardware and software employed in this study.
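The reported optimizer settings can be written down as below. Mapping the "momentum factor" to AdamW's first beta and the "learning rate decay factor" to weight decay are our interpretations, and `model` is a placeholder module:

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)                    # placeholder for the detector
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                                         # initial learning rate 0.001
    betas=(0.95, 0.999),                             # momentum factor 0.95 (beta1)
    weight_decay=1e-3,                               # decay factor 0.001 (interpretation)
)
EPOCHS = 400                                         # maximum number of iterations
NUM_WORKERS = 4                                      # data-loading worker threads
```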
Comparison experiments
To verify the effectiveness of the optimized model in detecting rice seedling deficiency, we conducted comparative experiments with mainstream object detection models, all evaluated on our rice seedling deficiency dataset.
- Comparison experiment 1: We conducted a comparative experiment against several SOTA object detection models that have been widely adopted in top-tier computer vision research, including InternImage, Co-DETR, BRSTD, YOLOv11, YOLO-MS, and Mamba-YOLO. These models are recognized for their strong detection accuracy and serve as reliable benchmarks. The evaluation metrics used in this experiment are mAP50 and mAP95. The results are summarized in Table 2.
As shown in Table 2, the proposed optimized model achieved the highest performance among all evaluated models, with a mAP50 of 78.0% and mAP95 of 38.9%, surpassing the mainstream SOTA detectors.
Compared to InternImage (CVPR2023) and Co-DETR (ICCV2023), which achieved mAP50 scores of 76.6% and 76.5% respectively, our model improves detection accuracy by 1.4% and 1.5%. Similarly, the mAP95 of our model surpasses InternImage and Co-DETR by a margin of 5.8% and 2.4%, indicating better precision in localizing small or subtle seedling gaps.
BRSTD (TGRS2024) achieved a mAP50 of 70.7% and mAP95 of 31.1%, falling behind our optimized model by 7.3% and 7.8% respectively. YOLOv11 (Arxiv2024) showed a mAP50 of 77.6% and mAP95 of 37.2%, slightly lower than those of our optimized model by 0.4% and 1.7%, respectively. These results indicate that our approach offers enhanced robustness and greater sensitivity in detecting seedling-deficient areas.
With a mAP50 of 77.1% and mAP95 of 37.0%, YOLO-MS (TPAMI2025) slightly underperformed our optimized model by 0.9% and 1.9% respectively. In contrast, Mamba-YOLO (AAAI2025) showed the lowest performance among all compared models, with a mAP50 of 35.6% and a mAP95 of 15.4%. These results demonstrate the advantages of the proposed model.
- Comparison experiment 2: To demonstrate the superiority of our proposed improvements over various official YOLO versions, we conducted a cross-version comparison experiment involving YOLOv5, YOLOv6, YOLOv8, and YOLOv11 under consistent training and evaluation settings. The evaluation metrics used in this experiment are Pre, Re and mAP50. The experimental results are summarized in Table 3.
As shown in Table 3, our optimized model achieved the best overall performance, with a mAP50 of 78.0%, surpassing YOLOv11 (77.6%), YOLOv8 (75.5%), YOLOv6 (76.6%), and YOLOv5 (74.5%). It also maintained a high recall of 71.8%, which was close to the highest recall of YOLOv11 (73.1%) and exceeded that of YOLOv5, YOLOv6 and YOLOv8. Although YOLOv8 achieved the highest precision (71.1%), its low recall (68.8%) indicated a reduced sensitivity to seedling-deficient regions. Therefore, our model achieved a better balance between precision and recall. These results confirmed the superiority and practicality of our improvements over standard YOLO variants in detecting rice seedling deficiency.
Ablation experiments
To clearly illustrate the performance improvements of the proposed model over the original YOLOv8 architecture, a series of ablation experiments were systematically conducted to quantify the impact of each modification. The core objective of these experiments is to evaluate the effectiveness of the three proposed enhancement modules: the Init_Stem block, Mamba_Conv block, and ClueMerge layer. We conducted a systematic analysis of different module combinations on the constructed rice seedling deficiency dataset to quantify each module’s contribution to the overall detection accuracy. All experiments were conducted under identical dataset conditions and training settings. The evaluation metrics include Pre, Re and mAP50, which quantitatively illustrate the impact of each module. The experimental results are summarized in Table 4.
As shown in Table 4, we incrementally introduced each module to independently assess its contribution to seedling deficiency detection.
Comparing Group 1 and Group 2 shows that the Init_Stem block (Group 2) improved mAP50 by 1.5% and recall by 2.6%, indicating its ability to retain shallow spatial features that are often lost during early downsampling. This improves the model’s responsiveness to small and weak seedling targets.
Furthermore, the comparison between Group 1 and Group 3 showed that the inclusion of the Mamba_Conv block led to a 1.2% improvement in mAP50 and a 2.1% increase in recall, demonstrating that this module significantly enhances the capability of capturing global features and improves the coverage of rice seedling deficiency detection.
Similarly, by comparing Group 1 and Group 4, it can be observed that the ClueMerge layer alone resulted in a 1.7% increase in mAP50, while also improving recall compared to Group 2, illustrating the benefits of preserving high-frequency structural information via spatial rearrangement and wavelet-domain convolution.
When Init_Stem and Mamba_Conv were combined (Group 5), the model achieved 76.9% mAP50, indicating a synergistic effect between shallow detail retention and global semantic encoding. Finally, Group 6, which integrates all three components, achieved the highest mAP50, along with simultaneous improvements in both precision and recall. This comprehensive gain strongly supports the complementary design of the proposed modules.
Thus, the proposed combination of the Init_Stem block, Mamba_Conv block and ClueMerge layer effectively improved the overall performance of the model in rice seedling deficiency detection, particularly contributing to a notable increase in mAP50.
Results and analysis
Results and analysis of the comparison experiments
As illustrated in Fig. 5, we provide a visual comparison of seedling deficiency detection results under varying field backgrounds to conduct a robustness analysis.
Fig. 5a–g,k focus on the comparison between SOTA detectors and our proposed optimized model in different backgrounds. It can be observed that our model yields the most complete and accurate detection results, closely aligning with the ground truth. This further confirms the effectiveness of our proposed innovations, which include shallow feature preservation through the Init_Stem block, dual-branch feature modeling via the Mamba_Conv block, and frequency-aware enhancement implemented by the ClueMerge layer.
As illustrated in Fig. 5g–k,d, we further conducted a robustness analysis by visually comparing our optimized model with multiple official YOLO versions under varying field conditions. The results showed that the optimized model achieves more accurate localization and more complete coverage of deficient areas compared to YOLOv5, YOLOv6, YOLOv8, and YOLOv11. These qualitative observations reinforce the model’s strong adaptability to different backgrounds and align well with the quantitative improvements reported earlier, thereby confirming the practical robustness and effectiveness of our proposed enhancements.
Results and analysis of the ablation experiments
To visually demonstrate the effectiveness of the proposed blocks, we compared the ablation detection results with the ground truth labels, as shown in Fig. 6.
In Fig. 6, our primary focus is on the collaborative effect of different blocks when combined with YOLOv8 for seedling deficiency detection. It can be observed that the original YOLOv8 model misses several missing seedling regions. Although it successfully detects some areas, the results remain unsatisfactory when compared to the ground truth. In contrast, adding Init_Stem block, Mamba_Conv block, or ClueMerge layer individually improves detection performance. Notably, when all three modules are combined, the model achieves the most accurate and robust detection results.
The ablation experiments on the three blocks show that enhancing shallow features, global dependencies, and high-frequency edge information significantly benefits seedling deficiency detection. This confirms that our model serves as an effective and targeted approach for identifying deficient regions in rice seedlings.
Conclusion
This study introduces a novel deep learning framework tailored to detect rice seedling deficiency in paddy fields ten days after transplanting. To support this task, we first construct a custom UAV-based dataset to address the lack of publicly available data in this domain. Building upon this foundation, the proposed framework incorporates several architectural innovations to enhance detection performance. Specifically, inspired by Mamba, we introduce a state space model (SSM)-based backbone and construct a dual-branch feature extraction block to capture both global semantic context and local spatial details. In addition, a wavelet-based downsampling module is designed to preserve high-frequency cues of small-scale seedling-deficient regions. Ablation studies on our dataset demonstrate the effectiveness of the designed modules. Compared with other advanced object detection models, our approach achieves superior performance.
However, this study still has several limitations. First, the current dataset, while sufficient for initial training and evaluation, remains relatively limited in size and diversity. This may constrain the model’s ability to generalize across different rice varieties, growth stages, and complex field conditions. Second, the convolutional branch in the dual-path module uses standard convolution and activation to extract local features; its fixed receptive field limits adaptability to complex field conditions. To address these limitations, future work will focus on two directions: expanding the dataset with more diverse and representative samples to improve the model’s robustness and generalization; and integrating multi-scale convolution into the dual-path module to enable spatial feature extraction at varying resolutions, thereby enhancing detection performance under diverse environmental settings. These efforts aim to expand the model’s application scope to broader seedling deficiency detection tasks, improving its generalization and adaptability in large-scale farming scenarios33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48.
Data availability
The datasets generated and/or analysed during the current study are not publicly available because they were obtained under a data-sharing agreement with a third party that restricts public dissemination, but they are available from the corresponding author on reasonable request.
References
Jin, X., Liu, S., Baret, F., Hemerlé, M. & Comar, A. Estimates of plant density of wheat crops at emergence from very low altitude uav imagery. Remote Sens. Environ. 198, 105–114 (2017).
Shirzadifar, A., Maharlooei, M., Bajwa, S. G., Oduor, P. G. & Nowatzki, J. F. Mapping crop stand count and planting uniformity using high resolution imagery in a maize crop. Biosyst. Eng. 200, 377–390 (2020).
Mostafa, S., Mondal, D., Panjvani, K., Kochian, L. & Stavness, I. Explainable deep learning in plant phenotyping. Front. Artif. Intell. 6, 1203546 (2023).
Yuanhang, J., Maolin, X. & Jiayuan, Z. A dead tree detection algorithm based on improved yolov4-tiny for uav images. Remote Sens. Nat. Resour. 35, 90–98 (2023).
Zhang, Y. et al. High-precision detection for sandalwood trees via improved yolov5s and stylegan. Agriculture 14, 452 (2024).
Wu, T. et al. An improved yolov5s model for effectively predict sugarcane seed replenishment positions verified by a field re-seeding robot. Comput. Electron. Agric. 214, 108280 (2023).
Cui, J. et al. Real-time missing seedling counting in paddy fields based on lightweight network and tracking-by-detection algorithm. Comput. Electron. Agric. 212, 108045 (2023).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929 (2020).
Chen, W., Huang, Z., Mu, Q. & Sun, Y. Pcb defect detection method based on transformer-yolo. IEEE Access 10, 129480–129489. https://doi.org/10.1109/ACCESS.2022.3228206 (2022).
Jiang, P., Ergu, D., Liu, F., Cai, Y. & Ma, B. A review of yolo algorithm developments. Procedia Comput. Sci. 199, 1066–1073 (2022).
Gu, A. et al. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. Neural Inf. Process. Syst. 34, 572–585 (2021).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Dao, T. & Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060. https://doi.org/10.48550/arXiv.2405.21060 (2024).
Yue, Y. & Li, Z. Medmamba: Vision mamba for medical image classification. arXiv preprint arXiv:2403.03849. https://doi.org/10.48550/arXiv.2403.03849 (2024).
Zhu, L. et al. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417. https://doi.org/10.48550/arXiv.2401.09417 (2024).
Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752. https://doi.org/10.48550/arXiv.2312.00752 (2023).
Gu, A., Goel, K. & Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396. https://doi.org/10.48550/arXiv.2111.00396 (2021).
Gu, A., Johnson, I., Timalsina, A., Rudra, A. & Ré, C. How to train your hippo: State space models with generalized orthogonal basis projections. arXiv preprint arXiv:2206.12037. https://doi.org/10.48550/arXiv.2206.12037 (2022).
Liu, Y. et al. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 37, 103031–103063 (2024).
Ruan, J., Li, J. & Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv preprint arXiv:2402.02491 (2024).
Wang, Z., Li, C., Xu, H. & Zhu, X. Mamba yolo: Ssms-based yolo for object detection. arXiv preprint arXiv:2406.05835. https://doi.org/10.48550/arXiv.2406.05835 (2024).
Xu, R., Yang, S., Wang, Y., Du, B. & Chen, H. A survey on vision mamba: Models, applications and challenges. arXiv preprint arXiv:2404.18861. https://doi.org/10.48550/arXiv.2404.18861 (2024).
Yang, C. et al. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv preprint arXiv:2403.17695. https://doi.org/10.48550/arXiv.2403.17695 (2024).
Xu, J. Hc-mamba: Vision mamba with hybrid convolutional techniques for medical image segmentation. arXiv preprint arXiv:2405.05007. https://doi.org/10.48550/arXiv.2405.05007 (2024).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV) 3–19. https://doi.org/10.48550/arXiv.1807.06521 (2018).
Finder, S. E., Amoyal, R., Treister, E. & Freifeld, O. Wavelet convolutions for large receptive fields. In European Conference on Computer Vision 363–380 (Springer, 2024).
Tan, J. et al. Wavelet-based mamba with fourier adjustment for low-light image enhancement. In Proceedings of the Asian Conference on Computer Vision 3449–3464. https://doi.org/10.48550/arXiv.2410.20314 (2024).
Zong, Z., Song, G. & Liu, Y. Detrs with collaborative hybrid assignments training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 6748–6758 (2023).
Wang, W. et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 14408–14419 (2023).
Huang, S., Lin, C., Jiang, X. & Qu, Z. Brstd: Bio-inspired remote sensing tiny object detection. IEEE Trans. Geosci. Remote Sens. (2024).
Khanam, R. & Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725 (2024).
Chen, Y. et al. Yolo-ms: rethinking multi-scale representation learning for real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. (2025).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012).
Jocher, G. et al. ultralytics/yolov5: v3.0. Zenodo (2020).
Li, C. et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976. https://doi.org/10.48550/arXiv.2209.02976 (2022).
Hatamizadeh, A. & Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. arXiv preprint arXiv:2407.08083. https://doi.org/10.48550/arXiv.2407.08083 (2024).
Khan, A., Asad, M., Benning, M., Roney, C. & Slabaugh, G. Convolution and attention-free mamba-based cardiac image segmentation. arXiv preprint (2024).
Sharma, A. S., Atkinson, D. & Bau, D. Locating and editing factual associations in mamba. arXiv preprint arXiv:2404.03646. https://doi.org/10.48550/arXiv.2404.03646 (2024).
Alhwaiti, Y. et al. Leveraging yolo deep learning models to enhance plant disease identification. Sci. Rep. 15, 7969 (2025).
Mora, J. J. et al. Digital framework for georeferenced multiplatform surveillance of banana wilt using human in the loop ai and yolo foundation models. Sci. Rep. 15, 3491 (2025).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
Ma, J., Li, F. & Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024).
Kanna, G. P. et al. Advanced deep learning techniques for early disease prediction in cauliflower plants. Sci. Rep. 13, 18475 (2023).
Bezabh, Y. A., Salau, A. O., Abuhayi, B. M., Mussa, A. A. & Ayalew, A. M. Cpd-ccnn: Classification of pepper disease using a concatenation of convolutional neural network models. Sci. Rep. 13, 15581 (2023).
Kalpana, P., Anandan, R., Hussien, A. G., Migdady, H. & Abualigah, L. Plant disease recognition using residual convolutional enlightened swin transformer networks. Sci. Rep. 14, 8660 (2024).
Faisal, H. M. et al. Detection of cotton crops diseases using customized deep learning model. Sci. Rep. 15, 10766 (2025).
Chimate, Y., Patil, S., Prathapan, K., Patil, J. & Khot, J. Optimized sequential model for superior classification of plant disease. Sci. Rep. 15, 3700 (2025).
Author information
Authors and Affiliations
Contributions
Y.X. and Z.Z. designed the study and performed the experiments and are the main contributing authors of the paper. X.L. contributed to the collection of UAV images. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Consent for publication
All authors agreed to publish this manuscript.
Research involving plant statement
The authors affirm that all requisite permissions and licenses for the collection of plant and all plant parts and their accompanying images, utilized in this study, have been duly obtained in adherence to relevant regulations and guidelines. Additionally, the authors confirm that the species utilized in this study are not endangered.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Xia, Y., Zhu, Z. & Liu, X. SSM-based detection of rice seedling deficiency. Sci Rep 15, 22605 (2025). https://doi.org/10.1038/s41598-025-06579-5