Introduction

Corn is an important crop and industrial raw material. Its breeding technology has developed rapidly, leading to a significant increase in the number of varieties; however, the resulting influx of counterfeit and substandard seeds has left the seed market in disorder, so an efficient method of corn variety identification is urgently needed to maintain market order and safeguard the stability of agricultural production. Corn is not only one of the crops with the largest planting area and highest yield in the world, but also occupies an important position in China’s food production, and its quality is directly related to national food security. Corn is usually threshed mechanically during harvest; inside the threshing drum, stalk-pulling roller, feed churning cage and other mechanical components, the kernels are subjected to random extrusion, shear, kneading and other external forces, which can damage them and, in severe cases, cause large-scale breakage. During storage, corn kernels are also susceptible to a combination of environmental factors such as temperature, humidity, air and moisture, leading to problems such as germination, mold and insect damage. These abnormal kernels are detrimental to corn breeding, and under agricultural mechanization, especially large-scale mechanized planting, supervision and harvesting face even greater challenges. Therefore, a real-time, automated and accurate technique and device for identifying and detecting abnormal corn kernels is urgently needed to address these problems.

Over the past decade, convolutional neural networks (CNNs) have been widely used for various segmentation tasks and have made significant progress in the field of image segmentation. Recently, Fully Convolutional Networks (FCN)1, UNet2 and their variants have performed particularly well. These architectures use an encoder-decoder structure that enhances segmentation by directly combining the high-level semantic features extracted along the encoder path with the fine-grained features provided along the decoder path through skip connections. However, the limited receptive field of the convolution operation and the inherent inductive bias of convolutional structures make it difficult for CNNs to capture long-range dependencies and global context in an image, which restricts further improvement of segmentation accuracy. In particular, when dealing with images of corn kernels, the variability of the kernels in the target image, including changes in size, texture and shape, makes it difficult for these CNN-based methods to adapt to individual differences. It is worth noting that automatic segmentation is crucial in the recognition of corn seed images, as it classifies the pixels belonging to corn seeds and determines the kind of anomaly of a seed by segmenting the size, color and shape of the anomalous key regions. Various U-structured networks3, especially UNet2, UNet++4, UNet3+5 and nnU-Net6, have become the standard techniques for achieving high-quality segmentation results. Attention mechanisms7 have also been integrated into these models to enhance feature mapping and improve pixel-level classification. Although attention-based models have shown improved performance, they still face significant challenges due to the high computational cost of the convolutional blocks typically used in conjunction with attention mechanisms.

Recently, Vision Transformers8 have shown remarkable potential in image segmentation tasks9, mainly thanks to the self-attention mechanism, which captures long-range dependencies between pixels. To further enhance segmentation performance, hierarchical vision transformers such as Swin10, ConvFormer11 and MetaFormer12 have been introduced. However, while self-attention excels at capturing global information, it is relatively weak at understanding local spatial context13,14. To address this issue, some approaches integrate local convolutional attention mechanisms into the decoder to better capture spatial details. Despite their theoretical potential, these methods typically require dense prediction at the pixel level, which is computationally expensive for high-resolution images, as they often rely on costly convolutional blocks. In addition, most of these methods operate at a fixed scale, making it difficult to cope with the diversity of scenes. Another challenge is that these segmentation methods usually divide the image into non-overlapping patches and convert each patch into a vector embedding at every stage. Existing patch-based methods, however, tend to ignore the pixel-level structural information and local topology inside each patch, so the resulting models cannot maintain local continuity across patch boundaries.

To address the above limitations, we propose VMUnet-MSADI, a novel efficient multiscale convolutional attention decoding network that incorporates a multiscale convolutional block attention mechanism and detail infusion. Specifically, VMUnet-MSADI enhances feature mapping through efficient multiscale convolution, while integrating complex spatial relations and local attention using channel attention, spatial attention and group-gated attention mechanisms. Our main contributions can be summarized as follows:

  • In this paper, we propose a novel efficient multiscale convolutional decoder: we introduce an efficient Visual Mamba UNet fused multi-scale attention mechanism and detail infusion decoder for anomalous corn kernel image segmentation; this takes full advantage of the multilevel nature of the VMUnet encoder, effectively fuses the advantages of the visual Mamba UNet architecture and enhances the functionality and flexibility of the traditional encoder-decoder architecture. The core idea is to combine VMUnet with a multi-scale attention mechanism and detail infusion structure to realize automatic control of corn seed image segmentation.

  • A well-designed coding mechanism for a multiscale deep convolutional attention module is proposed: we design a Multi-scale Convolutional Attention Module (MCAM) that performs deep convolution at multiple scales to improve the feature maps generated by the visual encoder. The MCAM captures salient features at multiple scales by suppressing the irrelevant regions and its high efficiency is attributed to the application of deep convolution.

  • Proposed Detail Infusion Block (DIB): evaluates the spatial and channel attention scores by computing different attention mechanisms and fully utilizes the result to generate discriminative feature representations of coarse-to-fine features at different scales of the encoder, ensuring semantic consistency between coarse and fine features.

  • Improved performance: extensive experiments on the corn kernel dataset provided by GaoZhe Technology show that the proposed VMUnet-MSADI consistently outperforms existing state-of-the-art methods, exceeding the leading method by 0.9%, which demonstrates the effectiveness and superiority of our approach.

The rest of the paper is organized as follows. Section 2 summarizes the work related to image segmentation, Sect. 3 gives the data acquisition device as well as the process and Sect. 4 describes our proposed VMUnet-MSADI network model. Section 5 explains our experimental setup as well as the results of our benchmarks on corn seed and non-corn seed image segmentation with comprehensive experiments and visualization. Finally, Sect. 6 summarizes the full paper.

Related work

In this section, we first summarize the most typical CNN-based methods used in corn image segmentation and then provide an overview of recent related work on the application of vision encoders in computer vision, especially in the field of segmentation.

Corn image segmentation

Image segmentation involves pixel-level classification to identify detailed structures and information within images. In the field of agriculture, significant progress has been made in applying computer vision techniques and traditional algorithms to the quality assessment of corn kernels15. The following summarizes the key research contributions. QUAN et al.16 employed image labeling techniques to process images containing numerous scattered corn kernels. They utilized a multi-scale wavelet analysis algorithm for logical assessment of the labeled images, achieving segmentation, localization and shape recognition of corn kernels. CHENG et al.17 focused on the embryonic features of corn kernels, using color space models to exploit differences in image components across different channels for segmentation. They applied an improved morphological opening and closing operation to refine the segmented images, ultimately identifying the characteristic regions of the seed’s endosperm. SHI et al.18 analyzed collected corn images using genetic algorithms and SPSS techniques. They assessed the ratio of white to yellow areas of damaged kernels in the HSI color space, using this information to identify corn variety attributes. K. Ding et al.19 distinguished between whole and broken corn kernels by employing parameters such as the continuous symmetric index, curvature index and radius index. Their approach effectively excluded broken kernels from the overall dataset. Ni B et al.20 implemented a grading system to detect whole versus broken kernels and examined the horizontal and vertical histogram distributions of corn kernels to analyze their concave and convex characteristics. Ng H. F21 utilized staining techniques on the damaged surfaces of broken corn seeds and applied single-grain and batch analysis methods to detect internal mechanical stress cracks. A three-layer neural network was employed to effectively identify moldy corn seeds. I. Zayas et al.22 analyzed twelve morphological parameters and selected seven significant parameters to establish a Mahalanobis distance discriminant function for distinguishing between whole and broken kernels. These studies demonstrate the application of various image processing and analysis techniques in corn kernel quality assessment, providing valuable references for research in this domain.

Recently, deep neural networks have demonstrated substantial capabilities and notable advantages in feature extraction for target object detection23. The application of deep learning techniques in object classification not only enhances adaptability to detected objects but also significantly improves the accuracy of detection results, with pronounced benefits observed in grain classification tasks24. The following provides an overview of relevant research. LIN et al.25 designed a multi-convolutional block feature detection network model for rice classification, achieving efficient identification of rice varieties. LIU et al.26 utilized improved YOLO series algorithms to address issues of grain breakage and counting with precise identification and detection capabilities. KHAKI et al.27 applied convolutional neural network-based techniques to achieve high-precision identification of damaged corn kernels under varying light conditions. ZHAO et al.28 developed a deep learning-based model for identifying and screening high-quality soybean seeds, which improved seed screening efficiency. JIN et al.29 proposed a soybean quality detection algorithm based on an enhanced U-Net model, increasing the accuracy of soybean quality assessment. TU et al.30 employed a VGG16 model with transfer learning to classify the JINGKE-968 variety, achieving a test accuracy of up to 98%. LV et al.31 trained an improved ResNet model, achieving a maximum recognition accuracy of 96.4%. These studies indicate that deep learning technologies, particularly advanced network models and transfer learning strategies, play a crucial role in enhancing the accuracy and efficiency of grain classification and detection.

Vision encoders

CNNs form the core of most base encoders, owing to their strength in handling spatial relationships in images. AlexNet23 and VGG32 laid a solid foundation for the development of CNNs through layer-by-layer deep convolutional feature extraction. GoogleNet33 introduced the Inception module, which allows more efficient computation of features at multiple scales. ResNet34 alleviated the vanishing-gradient problem through residual connections, making it possible to train deeper networks. R2U-Net35 further achieves better feature representation by fusing residual networks and U-Net. PraNet3 computes detailed features of segmented targets by combining a Parallel Partial Decoder with a Reverse Attention module. KiU-Net36 utilizes the under- and over-complete features of a new architecture to improve the segmentation of spatial structural features of the target. DoubleU-Net37 refines the baseline for image segmentation using two U-Net sequences and employs spatial pyramid pooling. FANet38 improves the unification of the previous epoch’s mask with the current epoch’s feature map during training. MobileNets39 successfully extended CNNs to mobile devices through lightweight depthwise separable convolutions, and EfficientNet40 further improved CNN performance through a scalable compound-scaling architecture design. Although CNNs perform well in numerous visual tasks, the limitation of their local receptive field makes it difficult for them to capture long-range dependencies in images.

Recently, Vision Transformers (ViTs), first proposed by Dosovitskiy et al., have been able to efficiently capture long-range dependencies between pixels through the Self-Attention (SA) mechanism. With the continuous development of ViTs, these models have achieved significant performance improvements by combining CNN features41,42, improving self-attention blocks and introducing innovative architectural designs43,44. For example, Swin Transformer10 introduces a sliding window attention mechanism, SegFormer44 implements a hierarchical structure through Mix-FFN blocks, PVT43 employs a spatial-reduction attention mechanism and PVTv27 is optimized with overlapping patch embedding and linear-complexity attention layers. MaxViT41 introduces multi-axis self-attention in the encoder to form a hierarchical CNN-Transformer structure. These advances further extend the application and performance of vision transformers in various vision tasks.

Although Vision Transformers (ViTs) perform well at capturing long-range pixel relationships, they still face certain challenges in capturing local spatial relationships between pixels. To address this problem, this paper proposes a multiscale attention mechanism decoder based on the Visual Mamba UNet. The decoder utilizes a multiscale convolutional attention module to refine the feature mapping and effectively incorporates a local DIB, which enhances the ability to model local spatial relationships and further improves the accurate capture of local details in image segmentation tasks.

In summary, while deep learning technologies have been extensively applied to the evaluation of crop quality, research specifically focusing on the classification of anomalous corn kernels remains relatively limited. Previous studies have addressed the problem in low-density and relatively ideal conditions, but there remains significant room for improvement. Our research team proposes an innovative approach that utilizes the Visual Mamba UNet network, incorporating a multi-scale attention mechanism to enhance the capture of features across different scales and achieve fine-grained segmentation of corn kernels. We introduce the VSS module to capture contextual information, thereby improving adaptability to complex backgrounds and environmental variations. Additionally, we integrate the DI Block to enhance the fusion of low-level and high-level features, thereby improving the detail and accuracy of feature representation. Through these novel techniques, we aim to achieve more accurate and effective recognition of corn kernels, providing valuable insights and new directions for research in corn kernel classification.

Materials

Sample preparation

Fig. 1

Examples of NOR and DU corn grains.

In this study, the dataset used to train the network model was mainly provided by Anhui GaoZhe Technology Co., LTD45. Three kinds of data collection devices are available: the P600, G600 and M600. The P600 consists of two industrial cameras with a light source and a conveyor belt that feeds the grain automatically; the G600 consists of one industrial camera with a light source and a conveyor belt; and the M600 consists of a vision terminal and a fixed bracket. The experimental data of this project were collected by the P600 and G600, each composed of industrial cameras, a grain placement platform and lighting sources. The corn grains in the dataset came from five countries and regions: the United States (USA), Canada (CAN), Australia (AU), Cambodia (KHM) and China (CHN). Corn seeds are divided into two categories, normal (NOR) seeds and abnormal (Damaged and Unsound, DU) seeds, and the abnormal seeds are further divided into six categories: FUSARIUM & SHRIVELLED (F&S) seeds, SPROUTED (SD) seeds, MOULDY (MY) seeds, BROKEN (BN) seeds, ATTACKED BY PESTS (AP) seeds and BLACK POINT (BP) seeds, as shown in Fig. 1.

In Fig. 1, the light blue part shows the normal corn grain maps, including round and long grains, while the orange part shows the six types of abnormal grain maps. The dataset contains a total of 11,460 images, which are randomly divided into training, validation and test sets at a ratio of about 10:1:1 to ensure their independence. The detailed distribution of the selected corn kernel image data is shown in Table 1.

Table 1 Corn kernel dataset.

Images acquisition system

The acquisition system for the actual test data in this study mainly consists of a test bench with a detection camera (HIKVISION MV-CA050-10C), fill-light strips, an image acquisition card and a computer. To ensure real-time and continuous data acquisition, the bottom of the acquisition device was designed with an angle-adjustable single-layer guide structure, which is covered with white matte stickers to reduce the effects of direct light and corn grain rolling. During testing, corn grains start from the feeding hopper, enter the single-layer guide structure through the trough wheel and the transition plate, and reach the uniform distribution plate. The uniform distribution plate forms an adjustable angle α with the horizontal plane, and a vibrator is mounted at its bottom to assist in flattening the corn grains. After passing over the uniform distribution plate, the corn grains form a discrete single layer and finally pass through the detection area, where images are acquired by an industrial camera equipped with an adjustable lens of 6 mm focal length, as shown in Fig. 2.

For each data-collection scan, a set of 64 grains from each category was taken out and spread flat in a single layer, arranged into 8 rows by 8 columns. To ensure that no two seeds come into contact, each seed was placed in an area of about 2 × 2 cm in the center of the platform. Ten groups were prepared for each of the 7 test categories per test cycle, so 64 × 10 × 7 = 4480 grains were processed in each test cycle.

Fig. 2

A schematic diagram of the hyperspectral imaging system.

Data preprocessing

Each image collected in the experiment contained multiple corn grains, but it is the individual grain that is ultimately identified as abnormal, so the images need to be processed. The main steps include binarizing the multi-grain image, performing contour detection, calculating the enclosing polygon based on the detected contour pixels and extracting the outline of each corn grain from the maximum and minimum coordinates of the enclosing polygon. Finally, the selected corn grains are cropped to obtain single-grain images. The image processing pipeline is shown in Fig. 3.

Fig. 3

Corn kernel image processing process.

In this study, photographs containing multiple corn grains were used as the raw data, and these images must be processed in order to identify abnormal kernels among the individual grains. The main processing steps are as follows. First, the original image is filtered and enhanced and the CIELAB channels are separated; analysis shows that the B-channel image has high contrast, which is conducive to image segmentation. A gamma transformation with λ = 0.16 is applied to the B-channel image, followed by binarization to obtain a binary image of the individual grains. Next, contour detection is carried out and the enclosing polygon of each corn kernel is calculated from the counted contour pixels. The bounding rectangle of each kernel is then extracted from the maximum and minimum coordinates of these enclosing polygons. Finally, each corn grain is cropped according to its bounding rectangle to obtain a single-grain image. Figure 3 shows the combined image of 8 × 8 corn grains and the whole image processing pipeline. After the above processing, the isolated single-kernel images are fed into the VMUnet-MSADI network model as the actual test data for the classification and recognition of corn grains.
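To make the pipeline concrete, the following is a minimal OpenCV/NumPy sketch of this single-kernel extraction step. The Gaussian smoothing, the Otsu thresholding, the minimum-area filter and the function name are illustrative assumptions; only the CIELAB B-channel selection, the gamma value λ = 0.16 and the bounding-rectangle cropping follow the description above.

```python
import cv2
import numpy as np

def crop_single_kernels(image_bgr, gamma=0.16, min_area=200):
    """Split a multi-kernel photograph into single-kernel crops (illustrative sketch)."""
    # 1. Denoise and keep only the CIELAB B channel, which shows high contrast.
    smoothed = cv2.GaussianBlur(image_bgr, (5, 5), 0)
    b_channel = cv2.cvtColor(smoothed, cv2.COLOR_BGR2LAB)[:, :, 2]

    # 2. Gamma transformation (lambda = 0.16) followed by binarization.
    normalized = b_channel.astype(np.float32) / 255.0
    gamma_img = np.uint8(255 * np.power(normalized, gamma))
    _, binary = cv2.threshold(gamma_img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 3. Contour detection and the bounding rectangle of each kernel.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    crops = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:    # skip small noise blobs
            continue
        x, y, w, h = cv2.boundingRect(contour)     # min/max coordinates of the enclosing polygon
        crops.append(image_bgr[y:y + h, x:x + w])  # 4. cut out the single-kernel image
    return crops
```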

Data flow

This study proposes the core framework of the VMUnet-MSADI model. Before passing through the network, the data are preprocessed, including image filtering and denoising, scale transformation, color correction, normalization and single-grain cropping. The data are then fed into the network model: the encoder stage extracts features from each single-grain image; the Visual State Space (VSS) module then captures contextual information over a wide area; and the Detail Infusion Block enhances the low-level feature details and infuses them into the high-level features of the network model. Before entering the main 2D Selective Scan (SS2D) module, the VSS module output is combined, through hierarchical feature mapping, with the outputs of the other information streams. Finally, the recognition result is produced by the decoder built on the VSS modules, as shown in Fig. 4.

Fig. 4

The overall architecture of the VMUnet-MSADI model, consists of an Encoder module, MSADI module, and Decoder module. (a) The overall architecture of VMUnet-MSADI. (b) The core processing units of MSADI consist of the CA Block, SA Block and MSC Block.

Methods

Preliminaries

The SSM-based components of VMUnet-MSADI, including the structured state space sequence model and the Mamba model, rely on a classical continuous system that maps a one-dimensional input function or sequence \(x(t) \in \mathcal{R}\) to an output \(y(t) \in \mathcal{R}\) via an intermediate implicit state \(h(t) \in {\mathcal{R}^N}\). This process can be expressed as a linear ordinary differential equation (ODE):

$$\begin{aligned} h^{\prime}(t) &= Ah(t)+Bx(t) \\ y(t) &= Ch(t) \end{aligned}$$
(1)

where \(A \in {\mathcal{R}^{N \times N}}\) represents the state matrix, and \(B \in {\mathcal{R}^{N \times 1}}\) and \(C \in {\mathcal{R}^{N \times 1}}\) are the projection vectors. Structured state space sequence models and Mamba discretize this continuous system to make it more suitable for deep learning scenarios. Specifically, they introduce a timescale parameter \(\delta\) and convert A and B into discrete parameters \(\bar {A}\) and \(\bar {B}\) using fixed discretization rules. The zero-order hold (ZOH) is usually used as the discretization rule, defined as:

$$\begin{aligned} \bar {A} &= \exp (\delta A) \\ \bar {B} &= {(\delta A)^{-1}}\left(\exp (\delta A) - I\right) \cdot \delta B \end{aligned}$$
(2)

After discretization, SSM-based models can be computed either by linear recursion or global convolution, defined as Eqs. 3 and 4, respectively.

$$\begin{aligned} h^{\prime}(t) &= \bar {A}h(t)+\bar {B}x(t) \\ y(t) &= Ch(t) \end{aligned}$$
(3)
$$\begin{aligned} \bar {K} &= \left(C\bar {B},\, C\bar {A}\bar {B},\, \ldots ,\, C{\bar {A}}^{L-1}\bar {B}\right) \\ y &= x * \bar {K} \end{aligned}$$
(4)

where \(\bar {K} \in {\mathcal{R}^L}\) represents a structured convolution kernel and L denotes the length of the input sequence x.
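For intuition, the following NumPy/SciPy sketch shows how the ZOH discretization of Eq. (2) and the recurrent and convolutional views of Eqs. (3) and (4) can be computed for a single input channel with a fixed time step δ. It is an illustrative toy implementation, not the hardware-efficient selective-scan kernel used by Mamba, and the diagonal state matrix is an assumption made only for simplicity.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization of the continuous SSM (Eq. 2)."""
    A_bar = expm(delta * A)
    B_bar = np.linalg.solve(delta * A, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """Recurrent view (Eq. 3): h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros((A_bar.shape[0], 1))
    y = []
    for x_k in x:
        h = A_bar @ h + B_bar * x_k
        y.append(float(C @ h))
    return np.array(y)

def ssm_convolution(A_bar, B_bar, C, x):
    """Convolutional view (Eq. 4): y = x * K_bar with K_bar = (C B_bar, C A_bar B_bar, ...)."""
    L = len(x)
    kernel = np.array([float(C @ np.linalg.matrix_power(A_bar, k) @ B_bar) for k in range(L)])
    return np.convolve(x, kernel)[:L]          # causal convolution, truncated to length L

# The two views produce identical outputs for the same discretized parameters.
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 1.5, 4))         # stable diagonal state matrix (toy example)
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
x = rng.standard_normal(32)
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
assert np.allclose(ssm_recurrence(A_bar, B_bar, C, x),
                   ssm_convolution(A_bar, B_bar, C, x), atol=1e-6)
```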

VMUnet-MSADI architecture

The overall structure of VMUnet-MSADI consists of three main modules: the Encoder, the DIB (Detail Infusion Block) and the Decoder. Given an input image \(I \in {R^{H \times W \times 3}}\), the encoder generates M levels of features. We denote the features of the i-th layer as \(f_{i}^{o}\), where \(1 \leqslant i \leqslant M\). The accumulated features \(\{ f_{1}^{o},f_{2}^{o},\ldots,f_{M}^{o}\}\) are then forwarded to the DIB for further enhancement. The encoder output \({f_i}\) has \({2^i} \times C\) channels; after the accumulated features \(\{ f_{1}^{o},f_{2}^{o},\ldots,f_{M}^{o}\}\) enter the DIB for feature fusion, each \({f_i}\) corresponds to an output \(f_{i}^{\prime }\) of the i-th stage with shape \(\frac{H}{{{2^{i+1}}}} \times \frac{W}{{{2^{i+1}}}} \times {2^i}C\). In our model, we use deep supervision to compute the loss on the \(f_{i}^{\prime }\) and \(f_{{i - 1}}^{\prime }\) features.

In this article, we use \([{N_1},{N_2},{N_3},{N_4}]\) VSS blocks in the four stages of the encoder, with channel counts of [C, 2C, 4C, 8C] for the respective stages. According to our observations on VMamba, the values of \({N_3}\) and C are the key factors that differentiate the Tiny, Small and Base framework specifications. Following the VMamba specification, we set C to 96, set \({N_1}\) and \({N_2}\) to 2 each, and let \({N_3}\) take the values in the set 46, reflecting our intention to use the tiny and small VMamba models as the backbones for our ablation experiments.

VSS module

The VSS block is the backbone of VMUnet-MSADI, and SS2D is the core of the VSS block. The Detail Infusion Block takes the trunk output features, passes them through a multi-scale attention module and produces output features of the same size as the trunk output features.

The VSS block, derived from VMamba, is the backbone of the VM-UNetV2 encoder and its structure is shown in Fig. 5a. The input is first processed by an initial linear embedding layer and then split into two separate information streams. One stream passes through a 3 × 3 depth-wise convolution layer and a subsequent SiLU activation function before entering the main SS2D module. The SS2D output is then processed by a layer normalization layer and combined with the output of the other information stream, which is also processed by a SiLU activation. The combined output forms the final result of the VSS block.

Fig. 5

VMUnet-MSADI network core functional modules, VSS and MSA modules. (a) The core structure of VSS Block. (b)Multi-Scale Attention is the core module of SS2D.

The DIB is shown in Fig. 5b. For the hierarchical feature mapping, the encoder generates feature maps \(f_{i}^{o}\) of size \(\frac{H}{{{2^{i+1}}}} \times \frac{W}{{{2^{i+1}}}} \times {2^i}C\), where \(1 \leqslant i \leqslant 4\) and i denotes the i-th layer.

Different attention mechanisms can be used in the DIB to calculate the spatial and channel attention scores. Following UNetV2, we use CBAM to realize the channel and spatial attention. The calculation is given below, where \(\phi _{i}^{{att}}\) denotes the attention computation at the i-th stage:

$$f_{i}^{1}=\phi _{i}^{{att}}\left( {f_{i}^{0}} \right)$$
(5)

We use a 1 × 1 convolution to align the channels of \(f_{i}^{1}\) to C, and the resulting feature map is denoted as \(f_{i}^{2} \in {R^{{H_i} \times {W_i} \times C}}\).

At the i-th DIB decoder stage, \(f_{i}^{2}\) serves as the target reference. We then adjust the size of the feature map at each j-th layer so that it matches the size of \(f_{i}^{2}\), as follows:

$$f_{{ij}}^{3}=\begin{cases} {\text{G}}_{\text{d}}\left( {f_{j}^{2},({H_i},{W_i})} \right) & \text{if } j<i \\ {\text{G}}_{I}\left( {f_{j}^{2}} \right) & \text{if } j=i \\ {\text{G}}_{u}\left( {f_{j}^{2},({H_i},{W_i})} \right) & \text{if } j>i \end{cases}$$
(6)

In Eq. (6), \(G_d\), \(G_I\) and \(G_u\) represent adaptive average pooling, identity mapping and bilinear interpolation, respectively. In Eq. (7), \({\theta _{ij}}\) is the parameter of the smoothing convolution and \(f_{{ij}}^{4}\) is the j-th smoothed feature map at the i-th level; \(H(\cdot)\) represents the Hadamard product. The resulting \(f_{i}^{5}\) is then forwarded to the decoder at the i-th layer for further resolution reconstruction and segmentation.

$$\begin{aligned} f_{{ij}}^{4} &= {\theta _{ij}}\left( {f_{{ij}}^{3}} \right) \\ f_{i}^{5} &= H\left( {\left[ {f_{{i1}}^{4},f_{{i2}}^{4},f_{{i3}}^{4},f_{{i4}}^{4}} \right]} \right) \end{aligned}$$
(7)
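The following PyTorch sketch illustrates how the alignment of Eq. (6) and the fusion of Eq. (7) could be realized. The 3 × 3 smoothing convolution standing in for θ_ij, treating the Hadamard product H(·) as an element-wise product over the aligned maps, and the module name are assumptions made for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailInfusion(nn.Module):
    """Illustrative Detail Infusion step for decoder stage i (Eq. 6 and Eq. 7)."""
    def __init__(self, channels: int, num_stages: int = 4):
        super().__init__()
        # theta_ij: one smoothing convolution per contributing stage j (assumed 3x3).
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_stages)
        )

    def forward(self, feats, i: int):
        """feats: list of channel-aligned maps f_j^2, one per stage; i: target stage (0-based)."""
        h_i, w_i = feats[i].shape[-2:]
        fused = None
        for j, (f_j, conv) in enumerate(zip(feats, self.smooth)):
            if j < i:                                    # G_d: adaptive average pooling (downsample)
                f3 = F.adaptive_avg_pool2d(f_j, (h_i, w_i))
            elif j == i:                                 # G_I: identity mapping
                f3 = f_j
            else:                                        # G_u: bilinear interpolation (upsample)
                f3 = F.interpolate(f_j, size=(h_i, w_i), mode="bilinear", align_corners=False)
            f4 = conv(f3)                                # Eq. 7: smoothing convolution theta_ij
            fused = f4 if fused is None else fused * f4  # Hadamard (element-wise) product
        return fused                                     # f_i^5, forwarded to decoder stage i

# usage sketch: four stages with C = 32 channels at different resolutions
feats = [torch.randn(1, 32, 64 // 2 ** k, 64 // 2 ** k) for k in range(4)]
out = DetailInfusion(channels=32)(feats, i=1)            # -> (1, 32, 32, 32)
```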

Multi-scale attention module (MSAM)

We introduce an efficient multi-scale convolutional attention module to refine feature maps. \(MSAM\) sequentially incorporates a channel attention block \(CA( \bullet )\) that emphasizes channel correlations, a spatial attention block \(SA( \bullet )\) designed to capture local contextual information, and an efficient multi-scale convolution block \(MSC( \bullet )\) that enhances the retention of contextual relationships in feature maps. \(MSAM( \bullet )\) is defined by Eq. (8):

$$MSAM({x_{tensor}})=MSC(SA(CA({x_{tensor}})))$$
(8)

Here, \({x_{tensor}}\) represents the input tensor. Owing to the use of depth-wise convolution across multiple scales, our \(MSAM\) module is more efficient than a standard convolutional attention module, with a significant reduction in computational cost.

Multi-scale convolution block (MSCB)

We introduce an efficient Multi-scale Convolution Block to enhance the features generated by our cascaded expansion path. In our MSCB, we follow the design principles of the Inverted Residual Block (IRB) from MobileNetV2. However, unlike the IRB, our MSCB performs depth-wise convolution at multiple scales and utilizes channel shuffle to mix channels across groups. Specifically, in our MSCB, we first use a pointwise (1 × 1) convolution layer \(PWC1( \bullet )\) to increase the number of channels (i.e., expansion factor = 2), followed by batch normalization \(BN( \bullet )\) and activation \(ReLU6( \bullet )\). Next, we apply multi-scale depth-wise convolution \(MSDC( \bullet )\) to capture contextual information at various scales and resolutions. Since depth-wise convolution ignores inter-channel relationships, we implement channel shuffle \(CS( \bullet )\) to facilitate cross-channel mixing. Subsequently, another pointwise convolution layer \(PWC2( \bullet )\) is used, followed by \(BN( \bullet )\), to revert to the original number of channels, thereby capturing channel dependencies. \(MSCB( \bullet )\) is defined by Eq. (9):

$$MSCB({x_{tensor}})=BN\left( PWC2\left( CS\left( MSDC\left( ReLU6\left( BN\left( PWC1({x_{tensor}}) \right) \right) \right) \right) \right) \right)$$
(9)

where the parallel depth-wise convolution operations with different kernel sizes (KS) are defined as follows in Eq. (10):

$$MSDC({x_{tensor}})=\sum\limits_{{k \in KS}} {DWC{B_k}({x_{tensor}})}$$
(10)

Here, \(DWC{B_k}({x_{tensor}})=ReLU6(BN(DW{C_k}({x_{tensor}})))\), where \(DW{C_k}( \bullet )\) refers to the depth-wise convolution with kernel size k, and \(BN( \bullet )\) and \(ReLU6( \bullet )\) denote batch normalization and ReLU6 activation, respectively. Additionally, the sequential variant of \(MSDC( \bullet )\) utilizes a recursively updated input \({x_{tensor}}\), which maintains a residual connection with the previous \(DWC{B_k}( \bullet )\) to enhance regularization, as defined in Eq. (11):

$${x_{tensor}}={x_{tensor}}+DWC{B_k}({x_{tensor}})$$
(11)
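A minimal PyTorch sketch of the MSCB described by Eqs. (9) and (10) (the parallel variant of MSDC) is given below; the kernel sizes {1, 3, 5}, the group count used for channel shuffle and the class name are assumptions chosen for illustration rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups: int = 4):
    """Mix channels across groups after the depth-wise convolutions (CS in Eq. 9)."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class MSCB(nn.Module):
    """Multi-Scale Convolution Block sketch (Eqs. 9-10)."""
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5), expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pwc1 = nn.Sequential(nn.Conv2d(channels, hidden, 1, bias=False),
                                  nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True))
        # one depth-wise convolution branch per kernel size (DWCB_k)
        self.dwcbs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden, bias=False),
                          nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True))
            for k in kernel_sizes
        )
        self.pwc2 = nn.Sequential(nn.Conv2d(hidden, channels, 1, bias=False),
                                  nn.BatchNorm2d(channels))

    def forward(self, x):
        x = self.pwc1(x)                                 # expansion: PWC1 -> BN -> ReLU6
        x = sum(branch(x) for branch in self.dwcbs)      # MSDC: sum over multi-scale DWCBs (Eq. 10)
        x = channel_shuffle(x)                           # CS: cross-channel mixing
        return self.pwc2(x)                              # PWC2 -> BN restores the channel count

# usage sketch
out = MSCB(channels=32)(torch.randn(1, 32, 64, 64))      # -> (1, 32, 64, 64)
```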

Channel attention block (CAB)

We employ the Channel Attention Block to assign different levels of importance to each channel, thereby emphasizing more relevant features while suppressing less useful ones. Essentially, the \(CA( \bullet )\) block determines which feature maps to focus on (and subsequently refines them). We first apply max pooling \({P_M}( \bullet )\) and average pooling \({P_A}( \bullet )\) across the spatial dimensions (i.e., height and width) to extract the most prominent features from the entire feature map for each channel. Then, for each pooled feature map, we use a pointwise convolution \(PWC1( \bullet )\) to reduce the number of channels, followed by \(ReLU6\) activation. Next, another pointwise convolution \(PWC2( \bullet )\) is applied to restore the original channel dimensions. We then add the two restored feature maps together and apply a Sigmoid activation \(\sigma\) to estimate the attention weights as adjustment factors. Finally, these weights are integrated into the input \({x_{tensor}}\) using the Hadamard product. The channel attention block \(CA( \bullet )\) is defined as follows in Eq. (12):

$$\begin{aligned} CAB({x_{tensor}}) &= \sigma \left(CAB1({x_{tensor}})+CAB2({x_{tensor}})\right) \otimes {x_{tensor}} \\ CAB1({x_{tensor}}) &= PWC2\left(ReLU6\left(PWC1({P_M}({x_{tensor}}))\right)\right) \\ CAB2({x_{tensor}}) &= PWC2\left(ReLU6\left(PWC1({P_A}({x_{tensor}}))\right)\right) \end{aligned}$$
(12)
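A compact PyTorch sketch of the channel attention block of Eq. (12) is shown below; the channel-reduction ratio of the shared pointwise convolutions and the class name are assumed for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel Attention Block sketch (Eq. 12)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # shared pointwise convolutions: PWC1 (reduce channels) -> ReLU6 -> PWC2 (restore channels)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU6(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        max_pool = torch.amax(x, dim=(2, 3), keepdim=True)   # P_M: spatial max pooling
        avg_pool = torch.mean(x, dim=(2, 3), keepdim=True)   # P_A: spatial average pooling
        weights = torch.sigmoid(self.mlp(max_pool) + self.mlp(avg_pool))
        return weights * x                                   # Hadamard product with the input

# usage sketch
out = ChannelAttention(64)(torch.randn(1, 64, 32, 32))       # -> (1, 64, 32, 32)
```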

Spatial attention block (SAB)

We employ the Spatial Attention Block to simulate the attention mechanism of the human brain, focusing on specific regions of the input image. Essentially, the \(SA( \bullet )\) module determines the focal points within the feature map. By enhancing these focal regions, the model’s ability to recognize and respond to relevant spatial features is improved, which is crucial for image segmentation, as the background and positioning of objects significantly influence the output. In \(SA( \bullet )\), we first aggregate the maximum \({C_M}( \bullet )\) and average \({C_A}( \bullet )\) values along the channel dimension to focus on local features. Next, we use a large-kernel convolution (such as a 7 × 7 kernel) to enhance the local contextual relationships between features. We then apply a Sigmoid activation \(\sigma\) to compute the attention weights. Finally, these weights are applied to the input \({x_{tensor}}\) using the Hadamard product, which allows for more targeted processing of the information. The spatial attention block \(SA( \bullet )\) is defined as follows in Eq. (13):

$$SAB({x_{tensor}})=\sigma (LKC([{C_M}({x_{tensor}}),{C_A}({x_{tensor}})])) \otimes {x_{tensor}}$$
(13)
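A matching PyTorch sketch of the spatial attention block of Eq. (13), using the 7 × 7 large-kernel convolution mentioned above; the class name is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial Attention Block sketch (Eq. 13)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # LKC: large-kernel convolution over the 2-channel [max, mean] map
        self.lkc = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        c_max = torch.amax(x, dim=1, keepdim=True)    # C_M: channel-wise max
        c_avg = torch.mean(x, dim=1, keepdim=True)    # C_A: channel-wise mean
        weights = torch.sigmoid(self.lkc(torch.cat([c_max, c_avg], dim=1)))
        return weights * x                            # Hadamard product with the input

# usage sketch
out = SpatialAttention()(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```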

Loss function

For our anomalous corn grain image segmentation task, we mainly use the basic cross-entropy and Dice losses as the loss function, because all of our dataset masks consist of two classes: a single target and the background.

$$\begin{aligned} {L_{{\text{BceDice}}}} &= {\lambda _1}{L_{{\text{Bce}}}}+{\lambda _2}{L_{{\text{Dice}}}} \\ {L_{{\text{Bce}}}} &= - \frac{1}{N}\sum\limits_{i=1}^{N} {\left[ {{y_i}\log \left( {{{\hat {y}}_i}} \right)+\left( {1 - {y_i}} \right)\log \left( {1 - {{\hat {y}}_i}} \right)} \right]} \\ {L_{{\text{Dice}}}} &= 1 - \frac{{2|X \cap Y|}}{{|X|+|Y|}} \end{aligned}$$
(14)

where \(({\lambda _1},{\lambda _2})\) are constants, with (1, 1) usually selected as the default.
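For reference, a minimal PyTorch implementation of this combined loss with λ1 = λ2 = 1 is given below; the small smoothing term added to the Dice ratio is a common numerical-stability assumption rather than part of Eq. (14).

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, lambda_bce=1.0, lambda_dice=1.0, eps=1e-6):
    """Combined BCE + Dice loss (Eq. 14) for binary masks; `logits` are raw model outputs."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)
    return lambda_bce * bce + lambda_dice * dice

# usage sketch
loss = bce_dice_loss(torch.randn(2, 1, 256, 256), torch.randint(0, 2, (2, 1, 256, 256)).float())
```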

Experiments

After studying the VMamba network operation, the images in the dataset were resized to 256 × 256 pixels. To avoid overfitting, data augmentation methods such as random flipping and random rotation were introduced. During training, the initialization parameters are as follows: the batch size is set to 64, the optimizer’s learning rate is set to 1e-3 and a Cosine Annealing LR scheduler is used over a maximum of 50 iterations, allowing the learning rate to decay to 1e-5. We trained for a total of 50 epochs. For the VMUnet-MSADI network, the initial weights of the encoder units are set to align with VMamba’s weights. The implementation was carried out on Ubuntu 20.04 using an NVIDIA RTX 3060 GPU with 8 GB of memory, Python 3.8, PyTorch 2.0.1 and CUDA 11.7.
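The training configuration described above can be sketched in PyTorch as follows. The model and dataset stand-ins, the AdamW optimizer choice and the plain BCE criterion are placeholders for illustration; only the batch size, learning-rate schedule and epoch count come from the description above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.lr_scheduler import CosineAnnealingLR

# stand-ins for the VMUnet-MSADI network and the 256 x 256 corn kernel dataset
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
data = TensorDataset(torch.randn(64, 3, 256, 256), torch.randint(0, 2, (64, 1, 256, 256)).float())
train_loader = DataLoader(data, batch_size=64, shuffle=True)

criterion = nn.BCEWithLogitsLoss()                        # in practice, the BCE + Dice loss of Eq. (14)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-5)  # cosine decay from 1e-3 to 1e-5

for epoch in range(50):                                   # 50 epochs in total
    model.train()
    for images, masks in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```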

Datasets

We used an open-source dataset provided by GaoZhe Tech to verify the validity of our framework. The corn kernel dataset contains a total of 11,460 images in 2 major categories: 4800 normal (NOR) grains and 6660 abnormal (DU) grains. The abnormal grains are further divided into 6 subcategories, namely 1200 fusarium & shrivelled (F&S) grains, 480 sprouted (SD) grains, 1600 mouldy (MY) grains, 1400 broken (BN) grains, 1200 pest-attacked (AP) grains and 780 black point (BP) grains. The normal and abnormal classes of this dataset are divided into training and test sets at a ratio of 7:3.

This dataset plays a crucial role in accurately identifying the categories of corn grains and serves as the basis for the model-driven learning process, enabling the model to identify the patterns and characteristics unique to each seed class. We collected 4480 corn kernel samples, among which 1500 were high-quality seed samples and the remaining 2800 were abnormal seed samples, further divided into 6 categories, as shown in Table 2. In order to verify the validity of the model, we divided the dataset into three groups with training/test split ratios of 7:3, 8:2 and 9:1 and observed their performance. The experimental analysis shows that the 8:2 ratio gives good results, so for further research we prefer the weights of the model trained with the 8:2 split.

Table 2 Dataset distribution.

Implementation details

In order to train an effective VMUnet-MSADI network on the corn dataset, the categorical cross-entropy loss function is used to reduce the influence of dataset diversity; it is paired with the activation function of the final output layer, Softmax, which converts the raw model outputs into probability values, and the categorical cross-entropy then measures the difference between these probabilities and the true values. In this way, the exact type of corn kernel is obtained. During training, we used the Adam optimizer with a learning rate of 0.0001, a weight decay of 0.0005, a momentum of 0.9 and a batch size of 16. The network was trained for 50 epochs to output a model. Figure 6 shows that, as the number of training rounds increases, the accuracy of the VMUnet-MSADI model increases and the loss gradually decreases during training and validation.

Fig. 6

Graphical representation of epoch-wise accuracy and loss curve of the VMUnet-MSADI model.

The following indicators are used: Precision (Pre), Specificity (Spe), Sensitivity (Sen, also known as Recall), Accuracy (Acc) and the F1 score (F1_score). These indicators were used to evaluate the influence of different dataset split ratios on the trained model before and after adding the preprocessing module. Table 3 shows the results of VMUnet-MSADI for dataset splits of 7:3, 8:2 and 9:1. The corn dataset is first passed through the preprocessing module, then the images are divided according to the data split ratios given in Table 2 above and the network is trained.

Table 3 The results of VMUnet-MSADI with and without a preprocessing block for dataset split ratios (bold indicates the best).

The experimental results show that the 8:2 data split gives good results: accuracy improved by 2.5% compared with the model without the preprocessing module. Therefore, to some extent, the preprocessing module helps to improve the classification accuracy of the model. Figure 7 shows a graphical representation of VMUnet-MSADI performance with and without the preprocessing module.

Fig. 7

Performance of VMUnet-MSADI. (a) Without preprocessing block. (b) With preprocessing block.

Main results

In this section, we first evaluate the performance of our proposed VMUnet-MSADI framework on corn kernel image segmentation tasks using the dataset provided by Gaozhe Technology. We compare our method against state-of-the-art approaches. Additionally, we conduct ablation studies to analyze the impact of each component used in VMUnet-MSADI.

Experimental settings

Baselines

In addition to the conventional UNet, our comparative experiments involve three broad categories of methods as baselines: CNN-based, Transformer-based and VMamba-based methods.

  • CNN-based methods: several advanced CNN-based models are introduced and compared with our proposed VMUnet-MSADI, including UNet, UNet++, Att-UNet, UNetV2, UTNetV2, SANet, MALUNet and DoubleU-Net.

  • Transformer-based methods: Several Transformer-based models are considered major contenders, including TransUNet, TransFuse, Swin-unet and MCTrans.

  • VMamba-based methods: these approaches have advantages in long-range interaction modeling and linear computational complexity, and include VMUNet and VMUNetV2.

Implementation details

In this study, we compare our proposed VMUnet-MSADI model with several state-of-the-art models using a dataset split of 80:20 for training and testing. To ensure a comprehensive evaluation, we include additional performance metrics beyond the traditional Precision (Pre), Specificity (Spe), Sensitivity (Sen) and Accuracy (Acc). Specifically, we also evaluate the Mean Intersection over Union (mIoU) and the Mean Dice Similarity Coefficient (mDSC). Table 4 presents the test results for different network models on the GaoZhe Technology corn grain dataset. The results indicate that VMUnet-MSADI outperforms other models in terms of mIoU, DSC, Pre and Acc. Our model also exceeds the performance of the leading model, UNetV2, with an improvement of up to 5% in the mIoU metric.

Table 4 Comparative experimental results on the GaoZhe Tech Corn datasets (bold indicates the best).

Corn experimental results

Results on corn segmentation

This section evaluates the performance of VMUnet-MSADI for corn grain segmentation under various conditions. To ensure a fair comparison, we initially conducted segmentation experiments using one set of normal corn grains and six sets of abnormal corn grains. Additionally, we perform cross-validation across all datasets to verify the effectiveness of the proposed VMUnet-MSADI. The comparison results with state-of-the-art methods are presented in Table 5 and the corresponding qualitative results are shown in Fig. 8.

Based on the experimental results, we make the following observations. Compared to the conventional U-Net, various FCN and U-Net variants, such as VMUnet and VMUnetV2, have shown varying degrees of success. For instance, on the abnormal grain datasets, HD and MY, VMUnet and VMUnetV2 achieved increases in the mDSC scores of 0.8% and 1.6%, respectively. This underscores the significant value in further enhancing the stability and flexibility of standard encoder-decoder architectures. In contrast, Transformer-based models, such as Swin-Unet and TransUNet, have shown clear improvements over the aforementioned variants when guided by standard Transformer principles. Notably, TransUNet achieved mDSC scores of 0.881 and 0.897 on the AP and BN datasets, respectively, which are encouraging results. However, due to the limitations inherent in Transformers, these models still lag behind advanced CNN-based methods in performance. It is evident that the proposed VMUnet-MSADI achieves the highest scores across nearly all evaluation metrics for the independent datasets used.

Specifically, our VMUnet-MSADI attains the highest mDSC scores of 0.843 and 0.911 on the BP and FM datasets, respectively, which still represent significant improvements over previous state-of-the-art competitors such as UNet++ and Att-Unet. Moreover, Fig. 8 illustrates that the network segmentation outputs from VMUnet-MSADI are more precise and detailed compared to existing baselines. This improvement not only demonstrates the effectiveness of the proposed VMUnet-MSADI but also suggests that the introduction of advanced mechanisms like multi-scale convolution and attention modules offers substantial potential to surpass traditional CNN approaches. Furthermore, as shown in Table 5, the evaluation results for VMUnet-MSADI consistently outperform previous competitors across various datasets, effectively validating its generalization capability. Compared to VMUnetV2, our VMUnet-MSADI shows an increase of 2.1% in average mDSC and 2.0% in average mIoU scores. As depicted in Fig. 8, VMUnet-MSADI generates high-quality segmentation masks for the corn kernel segmentation task. The promising ability of VMUnet-MSADI to identify and segment target regions of interest in corn kernel images underscores its advantages in automated corn segmentation. In summary, these comparative results confirm the superiority of the proposed VMUnet-MSADI in the automated segmentation of corn kernels.


Fig. 8

Comparison of qualitative results between VMUnet-MSADI and the existing models on the corn kernel segmentation task. To better visualize the differences between segmentation predictions and ground truths, we highlight the key region with appropriate boxes.


Table 5 Quantitative results of corn kernel segmentation task. (bold indicates the best)

To further assess the effectiveness and generalization of the proposed VMUnet-MSADI network, we evaluated the model on three types of medical image segmentation tasks: skin lesion segmentation on the ISIC2018 dataset, gland segmentation on the Gland Segmentation (GLAS) dataset and nucleus segmentation on the 2018 Data Science Bowl (Bowl) dataset. To ensure fairness in the evaluation, the input image sizes were standardized to 256 × 256 for the ISIC2018 and Bowl datasets and to 128 × 128 for the GLAS dataset. The ISIC2018 dataset, sourced from the ISIC-2018 challenge57, is used for skin lesion analysis and comprises 2,596 images with corresponding annotations. In this section, we conducted experiments using five-fold cross-validation to demonstrate the efficacy of our VMUnet-MSADI model. The GLAS dataset, collected from the 2015 Histology Image Gland Segmentation Challenge, provides Hematoxylin and Eosin (H&E) stained slide images. It contains 165 images, with 85 used for training and 80 for testing, as detailed in27. The Bowl dataset, part of the 2018 Data Science Bowl Challenge58, is used for nucleus detection in divergent images and consists of 670 images. We followed the same setup as17, allocating 80% of the dataset for training, 10% for validation and 10% for testing. The experimental results are illustrated in Fig. 9.

Results on 2018 data science bowl

We evaluated the proposed VMUnet-MSADI network model on the nucleus segmentation task of the 2018 Data Science Bowl dataset. The comparative results with state-of-the-art methods are summarized in Table 6 and the corresponding qualitative results are shown in Fig. 9a. From Table 6, it is evident that VMUnet-MSADI outperforms existing baselines in terms of self-attention computation and multi-scale context exploration. Specifically, our VMUnet-MSADI achieved the highest F1 score of 0.923 and a recall rate of 0.938. Compared to previous advanced methods such as TransUNet (F1 score of 0.918) and VMUNet (F1 score of 0.921), our VMUnet-MSADI shows improvements of 0.5% and 0.2%, respectively. As illustrated in Fig. 9a, the VMUnet-MSADI model provides significantly more accurate predictions of multiple nucleus boundaries compared to existing baselines. This demonstrates the strong nucleus segmentation capability of our VMUnet-MSADI model even in challenging divergent images. These experimental results further validate the generalization capability of VMUnet-MSADI across various medical image segmentation tasks.

Fig. 9

Qualitative results of VMUnet-MSADI on three medical image segmentation tasks compared with other models. (a) 2018 Data Science Bowl dataset (b) GLAS dataset and (c) ISIC 2018 dataset, respectively. To better visualize the differences between segmentation predictions and ground truths, we highlight the key region with appropriate boxes.

Results on ISIC 2018 dataset

In order to evaluate the validity of the proposed work, we also conducted a comparative experiment on the skin lesion segmentation task on the ISIC 2018 dataset.

Table 6 Quantitative results of the 2018 Data Science Bowl. (bold indicates the best)
Table 7 Quantitative results of the ISIC 2018 dataset. (bold indicates the best)
Table 8 Quantitative results of the GLAS dataset. (bold indicates the best)

From Table 7, we observe the following. Attention-guided models such as U-NetV2 (F1 score of 0.812) demonstrate improvements over the traditional U-Net (F1 score of 0.786) by incorporating additional attention mechanisms, suggesting that attention can effectively optimize the traditional U-Net model. CNN-based methods that utilize multi-scale context to enhance the U-Net structure, such as Att-Unet (F1 score of 0.813) and DoubleU-Net (F1 score of 0.836), confirm the effectiveness of multi-scale context fusion. In contrast, Transformer-based models, including Swin-Unet (F1 score of 0.821), TransUNet (F1 score of 0.815) and TransFuse (F1 score of 0.818), show superior performance compared to the aforementioned methods. Our VMUnet-MSADI consistently outperforms the Transformer-based competitors, improving the F1 score from 0.906 to 0.913. As shown in Fig. 9b, the VMUnet-MSADI effectively captures the boundaries of skin lesions and produces superior segmentation results. Thus, these comparative results further validate the strong capability of VMUnet-MSADI in skin lesion segmentation.

Results on GLAS dataset

We also evaluated the VMUnet-MSADI on the GLAS dataset for gland segmentation, focusing on automatic quantification of gland morphology. The results of this comparison with state-of-the-art methods are shown in Table 8 and the qualitative results are depicted in Fig. 9c. From Table 8, we observe the following. Performance of VMUnet-MSADI: the model obtains the highest mDSC and mIoU scores of 0.919 and 0.853, demonstrating its effectiveness in gland segmentation tasks. Comparison with state-of-the-art models: compared to the previous state-of-the-art model, VMUNetV2, our VMUnet-MSADI exceeds it by 14% in mDSC and 7% in mIoU scores. This significant improvement highlights the model’s ability to deliver high-quality segmentation even with a limited number of training samples. The VMUnet-MSADI also shows clear advantages over recent Transformer-based approaches, including TransFuse (mDSC of 0.806), TransUNet (mDSC of 0.801) and Swin-Unet (mDSC of 0.811), which further affirms the superiority of the proposed multi-scale attention mechanism and DI Block in gland segmentation. In Fig. 9c, the visualizations of the generated masks illustrate how VMUnet-MSADI effectively differentiates between the glands and surrounding tissue, highlighting the model’s capability to produce accurate and detailed segmentation results. Overall, these results confirm that VMUnet-MSADI delivers outstanding performance in gland segmentation, outperforming both CNN-based and Transformer-based methods.

Ablation studies

In this section, to illustrate the contributions of the DIB, the multiscale attention module and other components to the segmentation of specific corn types, four high-resolution corn kernel categories, namely AP, BN, BP and MY, are selected in turn for ablation experiments. The capability of each module in the proposed work is further evaluated and the experimental results are shown in Table 9. Here, U-Net is considered the common baseline. “U-T” denotes the U-shaped model based on a transformer encoder. “U-S” denotes a pure transformer encoder similar to the U-shaped model, and “U-V” denotes the asymmetric encoder-decoder structure based on the U-shaped model, i.e., the VMamba model. “U-V + DI” represents the VMamba model with detail infusion introduced into its multilevel representation. “U-V + MSA” represents the VMamba model fused with the multi-scale attention mechanism to enhance feature representation. “U-V + MSA + DI” is the complete VMUnet-MSADI architecture, which contains the proposed multiscale attention mechanism and detail infusion module.

Table 9 Quantitative results of corn kernel segmentation task. (bold indicates the best)

From Table 9, we make the following observations. When we replace the conventional encoder with a transformer-based encoder, the average mDSC and mIoU scores of “U-T” improve significantly, by 5.3% and 4.3%, respectively; compared with the ordinary U-Net, this demonstrates that the transformer can encode contextual information well. In comparison, “U-T” outperforms “U-S”, especially with a 0.7% higher average mDSC score, which confirms the advantage of the transformer in the encoder. Meanwhile, the work presented in this paper can effectively model long-range dependencies, so “U-V” and “U-V + DI” perform unequally, with “U-V + DI” 3.9% higher in average mIoU score. By incorporating the multiscale attention mechanism, “U-V + MSA” improves the average mDSC and mIoU scores by 1.2% and 3.7%, respectively, suggesting that the additional coding branches produce differentiated feature representations that improve segmentation performance. Although “U-V + DI” and “U-V + MSA” each improve the segmentation accuracy of corn kernels to different degrees, “U-V + MSA + DI” further improves the average mDSC score from 0.881 to 0.905 (by 2.4%) and the average mIoU score from 0.810 to 0.863 (by 5.3%). Such improvement clearly demonstrates that our proposed MSADI module ensures consistency among different features and improves segmentation performance.

Table 10 Performance comparison of model size (Params) between VMUnet-MSADI and other leading methods on the GaoZhe dataset.

The above experimental results show that all the designed components play an indispensable role in corn kernel image segmentation and also perform well in medical image segmentation tasks. To compare model size and computational complexity, we further conducted experiments on the corn kernel dataset. From Table 10, we can observe that, to expand the receptive field, CNN-based methods usually need to stack sufficiently deep convolutional layers, which leads to high computational costs, and that the self-attention mechanism requires more parameters than the convolution operation, which makes Transformer-based methods larger in scale. VMUnet-MSADI not only achieves a good trade-off between complexity and parameter count but also obtains the best segmentation performance.

Conclusions

In this paper, we presented the multi-scale attention mechanism and detail infusion VMUnet (VMUnet-MSADI), a U-shaped encoder-decoder-based framework for improving the segmentation quality of corn images. Our VMUnet-MSADI was designed based on the VSS module. Besides the encoder, we also innovatively added a multi-scale deep convolutional attention module to the decoder, allowing deep convolution to be performed at multiple scales to improve the feature maps generated by the visual encoder by suppressing uncorrelated regions and capturing multi-scale salient features. Moreover, we introduced a fused multi-scale attention mechanism within the encoder to extract features at multiple levels. We further proposed a novel DI block that leverages multi-scale attention mechanisms to evaluate spatial and channel attention scores. This module generates discriminative feature representations of both coarse and fine features across different scales, thereby ensuring semantic consistency between these features and enhancing the overall effectiveness of the encoder. Extensive experiments on corn image segmentation tasks, rigorously evaluated on GaoZhe Tech’s corn dataset, demonstrated that our VMUnet-MSADI significantly outperformed the previous state-of-the-art methods, achieving a segmentation accuracy of 95.96%, which is 0.9% higher than the leading method. Furthermore, the inclusion of the preprocessing module enhanced the segmentation accuracy to 96.23%, marking an additional improvement of 0.27%. These results underscore the high competitiveness of our model in the segmentation task. In future work, improvements can be made by dynamically adjusting the attention weights according to the characteristics of the input data using an adaptive attention mechanism59,60. Additional focus will be placed on designing a more lightweight VMUnet-based model and enhancing the model’s ability to learn pixel-level texture structure features in order to extend the model’s generalization ability. On the other hand, there is still room for improvement in handling highlight (specular) regions on corn seeds, and there is also room to improve the efficiency of the model by reducing the training time and memory consumption while ensuring the segmentation quality61. In addition, multimodal fast gating transformers have the potential to improve our segmentation task62.