Introduction

Corn is an important crop and industrial raw material. Its breeding technology has developed rapidly, leading to a significant increase in the number of varieties; however, the resulting influx of counterfeit and substandard seeds has left the seed market in disorder, so an efficient method of corn variety identification is urgently needed to maintain market order and safeguard the stability of agricultural production. Corn is not only one of the crops with the largest planting area and highest yield in the world, but also occupies an important position in China’s food production, and its quality is directly related to national food security. Corn is usually threshed mechanically during harvest; inside the threshing drum, stalk-pulling roller, feed churning cage and other mechanical components, the kernels are subjected to random extrusion, shear, kneading and other external forces, which can damage them and, in severe cases, cause large-scale breakage. During storage, corn kernels are also susceptible to a combination of environmental factors such as temperature, humidity, air and moisture, leading to problems such as germination, mold and insect damage. These abnormal kernels are detrimental to corn breeding, and under agricultural mechanization, especially large-scale mechanized planting, supervision and harvesting face even greater challenges. Therefore, a real-time, automated and accurate technique and device for identifying and detecting abnormal corn kernels is urgently needed to address these problems.

Over the past decade, convolutional neural networks (CNNs) have been widely used for various segmentation tasks and have made significant progress in the field of image segmentation. Recently, Fully Convolutional Networks (FCN)1, UNet2 and their variants have performed particularly well. These architectures use an encoder-decoder structure that enhances segmentation by directly combining the high-level semantic features extracted along the encoder path with the fine-grained features provided along the decoder path through skip connections. However, the limited receptive field of the convolution operation and the inherent inductive bias of convolutional structures make it difficult for CNNs to capture long-range dependencies and global context in an image, which restricts further improvement of segmentation accuracy. In particular, when dealing with images of corn kernels, the variability of the kernels in the target image, including changes in size, texture and shape, makes it difficult for these CNN-based methods to adapt to individual differences. It is worth noting that automatic segmentation is crucial in the recognition of corn seed images, as it classifies the pixels belonging to corn seeds and determines the kind of anomaly of a seed by segmenting the size, color and shape of the anomalous key regions. Various U-structured networks3, especially UNet2, UNet++4, UNet3+5 and nnU-Net6, have become the standard techniques for achieving high-quality segmentation results. Attention mechanisms7 have also been integrated into these models to enhance feature mapping and improve pixel-level classification. Although attention-based models have shown improved performance, they still face significant challenges due to the high computational cost of the convolutional blocks typically used in conjunction with attention mechanisms.

Recently, Vision Transformers8 have shown remarkable potential in image segmentation tasks9, mainly thanks to the self-attention mechanism, which captures long-range dependencies between pixels. To further enhance segmentation performance, hierarchical vision transformers such as Swin10, ConvFormer11 and MetaFormer12 have been introduced. However, while self-attention excels at capturing global information, it is relatively weak at understanding local spatial context13,14. To address this issue, some approaches integrate local convolutional attention mechanisms into the decoder to better capture spatial details. Despite their theoretical potential, these methods typically require dense prediction at the pixel level, which is computationally expensive for high-resolution images, as they often rely on costly convolutional blocks. In addition, most of these methods operate at a fixed scale, making it difficult to cope with the diversity of scenes. Another challenge is that these segmentation methods usually divide the image into non-overlapping patches and convert each patch into a vector embedding at every stage. Existing patch-based methods, however, tend to ignore the pixel-level structural information and local topology inside each patch, so the resulting models cannot maintain local continuity across patch boundaries.

To address the above limitations, we propose VMUnet-MSADI, a novel efficient multiscale convolutional attention decoding network that incorporates a multiscale convolutional block attention mechanism and detail infusion. Specifically, VMUnet-MSADI enhances feature mapping through efficient multiscale convolution, while integrating complex spatial relations and local attention using channel attention, spatial attention and group-gated attention mechanisms. Our main contributions can be summarized as follows:

  • In this paper, we propose a novel efficient multiscale convolutional decoder: we introduce an efficient Visual Mamba UNet fused multi-scale attention mechanism and detail infusion decoder for anomalous corn kernel image segmentation; this takes full advantage of the multilevel nature of the VMUnet encoder, effectively fuses the advantages of the visual Mamba UNet architecture and enhances the functionality and flexibility of the traditional encoder-decoder architecture. The core idea is to combine VMUnet with a multi-scale attention mechanism and detail infusion structure to realize automatic control of corn seed image segmentation.

  • A well-designed coding mechanism for a multiscale deep convolutional attention module is proposed: we design a Multi-scale Convolutional Attention Module (MCAM) that performs deep convolution at multiple scales to improve the feature maps generated by the visual encoder. The MCAM captures salient features at multiple scales by suppressing the irrelevant regions and its high efficiency is attributed to the application of deep convolution.

  • Proposed Detail Infusion Block (DIB): evaluates the spatial and channel attention scores by computing different attention mechanisms and fully utilizes the result to generate discriminative feature representations of coarse-to-fine features at different scales of the encoder, ensuring semantic consistency between coarse and fine features.

  • Improved performance: extensive experiments on the corn kernel dataset provided by GaoZhe Technology show that the proposed VMUnet-MSADI consistently outperforms existing state-of-the-art methods, exceeding the leading method by 0.9%, which demonstrates the effectiveness and superiority of our approach.

The rest of the paper is organized as follows. Section 2 summarizes the work related to image segmentation, Sect. 3 gives the data acquisition device as well as the process and Sect. 4 describes our proposed VMUnet-MSADI network model. Section 5 explains our experimental setup as well as the results of our benchmarks on corn seed and non-corn seed image segmentation with comprehensive experiments and visualization. Finally, Sect. 6 summarizes the full paper.

Related work

In this section, we first summarize the most typical CNN-based methods used in corn image segmentation and then provide an overview of recent related work on the application of vision encoders in computer vision, especially in the field of segmentation.

Corn image segmentation

Image segmentation involves pixel-level classification to identify detailed structures and information within images. In the field of agriculture, significant progress has been made in applying computer vision techniques and traditional algorithms to the quality assessment of corn kernels15. The following summarizes the key research contributions. QUAN et al.16 employed image labeling techniques to process images containing numerous scattered corn kernels. They utilized a multi-scale wavelet analysis algorithm for logical assessment of the labeled images, achieving segmentation, localization and shape recognition of corn kernels. CHENG et al.17 focused on the embryonic features of corn kernels, using color space models to exploit differences in image components across different channels for segmentation. They applied an improved morphological opening and closing operation to refine the segmented images, ultimately identifying the characteristic regions of the seed’s endosperm. SHI et al.18 analyzed collected corn images using genetic algorithms and SPSS techniques. They assessed the ratio of white to yellow areas of damaged kernels in the HSI color space, using this information to identify corn variety attributes. K. Ding et al.19 distinguished between whole and broken corn kernels by employing parameters such as the continuous symmetric index, curvature index and radius index. Their approach effectively excluded broken kernels from the overall dataset. Ni B et al.20 implemented a grading system to detect whole versus broken kernels and examined the horizontal and vertical histogram distributions of corn kernels to analyze their concave and convex characteristics. Ng H. F21 utilized staining techniques on the damaged surfaces of broken corn seeds and applied single-grain and batch analysis methods to detect internal mechanical stress cracks. A three-layer neural network was employed to effectively identify moldy corn seeds. I. Zayas et al.22 analyzed twelve morphological parameters and selected seven significant parameters to establish a Mahalanobis distance discriminant function for distinguishing between whole and broken kernels. These studies demonstrate the application of various image processing and analysis techniques in corn kernel quality assessment, providing valuable references for research in this domain.

Recently, deep neural networks have demonstrated substantial capabilities and notable advantages in feature extraction for target object detection23. The application of deep learning techniques in object classification not only enhances adaptability to detected objects but also significantly improves the accuracy of detection results, with pronounced benefits observed in grain classification tasks24. The following provides an overview of relevant research. LIN et al.25 designed a multi-convolutional block feature detection network model for rice classification, achieving efficient identification of rice varieties. LIU et al.26 utilized improved YOLO series algorithms to address issues of grain breakage and counting with precise identification and detection capabilities. KHAKI et al.27 applied convolutional neural network-based techniques to achieve high-precision identification of damaged corn kernels under varying light conditions. ZHAO et al.28 developed a deep learning-based model for identifying and screening high-quality soybean seeds, which improved seed screening efficiency. JIN et al.29 proposed a soybean quality detection algorithm based on an enhanced U-Net model, increasing the accuracy of soybean quality assessment. TU et al.30 employed a VGG16 model with transfer learning to classify the JINGKE-968 variety, achieving a test accuracy of up to 98%. LV et al.31 trained an improved ResNet model, achieving a maximum recognition accuracy of 96.4%. These studies indicate that deep learning technologies, particularly advanced network models and transfer learning strategies, play a crucial role in enhancing the accuracy and efficiency of grain classification and detection.

Vision encoders

CNNs form the core of most base encoders, owing to their strength in handling spatial relationships in images. AlexNet23 and VGG32 laid a solid foundation for the development of CNNs through layer-by-layer deep convolutional feature extraction. GoogleNet33 introduced the Inception module, which allows more efficient computation of features at multiple scales. ResNet34 alleviated the vanishing-gradient problem through residual connections, making it possible to train deeper networks. R2U-Net35 further achieves better feature representation by fusing residual networks and U-Net. PraNet3 computes detailed features of segmented targets by combining a Parallel Partial Decoder with a Reverse Attention module. KiU-Net36 utilizes the under- and over-complete features of a new architecture to improve the segmentation of spatial structural features of the target. DoubleU-Net37 refines the baseline for image segmentation using two U-Net sequences and employs spatial pyramid pooling. FANet38 improves the unification of the previous epoch’s mask with the current epoch’s feature map during training. MobileNets39 successfully extended CNNs to mobile devices through lightweight depthwise separable convolutions, and EfficientNet40 further improved CNN performance through a scalable compound-scaling architecture design. Although CNNs perform well in numerous visual tasks, the limitation of their local receptive field makes it difficult for them to capture long-range dependencies in images.

Recently, Vision Transformers (ViTs), first proposed by Dosovitskiy et al., have been able to efficiently capture long-range dependencies between pixels through the Self-Attention (SA) mechanism. With the continuous development of ViTs, these models have achieved significant performance improvements by combining CNN features41,42, improving self-attention blocks and introducing innovative architectural designs43,44. For example, Swin Transformer10 introduces a sliding window attention mechanism, SegFormer44 implements a hierarchical structure through Mix-FFN blocks, PVT43 employs a spatial-reduction attention mechanism and PVTv27 is optimized with overlapping patch embedding and linear-complexity attention layers. MaxViT41 introduces multi-axis self-attention in the encoder to form a hierarchical CNN-Transformer structure. These advances further extend the application and performance of vision transformers in various vision tasks.

Although Vision Transformers (ViTs) perform well at capturing long-range pixel relationships, they still face certain challenges in capturing local spatial relationships between pixels. To address this problem, this paper proposes a multiscale attention mechanism decoder based on the Visual Mamba UNet. The decoder utilizes a multiscale convolutional attention module to refine the feature mapping and effectively incorporates a local DIB, which enhances the ability to model local spatial relationships and further improves the accurate capture of local details in image segmentation tasks.

In summary, while deep learning technologies have been extensively applied to the evaluation of crop quality, research specifically focusing on the classification of anomalous corn kernels remains relatively limited. Previous studies have addressed the problem in low-density and relatively ideal conditions, but there remains significant room for improvement. Our research team proposes an innovative approach that utilizes the Visual Mamba UNet network, incorporating a multi-scale attention mechanism to enhance the capture of features across different scales and achieve fine-grained segmentation of corn kernels. We introduce the VSS module to capture contextual information, thereby improving adaptability to complex backgrounds and environmental variations. Additionally, we integrate the DI Block to enhance the fusion of low-level and high-level features, thereby improving the detail and accuracy of feature representation. Through these novel techniques, we aim to achieve more accurate and effective recognition of corn kernels, providing valuable insights and new directions for research in corn kernel classification.

Materials

Sample preparation

Fig. 1

Examples of NOR and DU corn grains.

In this study, the dataset used to train the network model was mainly provided by Anhui GaoZhe Technology Co., LTD45. Three kinds of data collection devices are available: the P600, G600 and M600. The P600 consists of two industrial cameras with a light source and a conveyor belt that feeds the grain automatically; the G600 consists of one industrial camera with a light source and a conveyor belt; and the M600 consists of a vision terminal and a fixed bracket. The experimental data of this project were collected by the P600 and G600, each composed of industrial cameras, a grain placement platform and lighting sources. The corn grains in the dataset came from five countries and regions: the United States (USA), Canada (CAN), Australia (AU), Cambodia (KHM) and China (CHN). Corn seeds are divided into two categories, normal (NOR) seeds and abnormal (Damaged and Unsound, DU) seeds, and the abnormal seeds are further divided into six categories: FUSARIUM & SHRIVELLED (F&S) seeds, SPROUTED (SD) seeds, MOULDY (MY) seeds, BROKEN (BN) seeds, ATTACKED BY PESTS (AP) seeds and BLACK POINT (BP) seeds, as shown in Fig. 1.

In Fig. 1, the light blue part shows the normal corn grain maps, including round and long grains, while the orange part shows the six types of abnormal grain maps. The dataset contains a total of 11,460 images, which are randomly divided into training, validation and test sets at a ratio of about 10:1:1 to ensure their independence. The detailed distribution of the selected corn kernel image data is shown in Table 1.

Table 1 Corn kernel dataset.

Images acquisition system

The acquisition system for the actual test data in this study mainly consists of a test bench with a detection camera (HIKVISION MV-CA050-10C), fill-light strips, an image acquisition card and a computer. To ensure real-time and continuous data acquisition, the bottom of the acquisition device was designed with an angle-adjustable single-layer guide structure, which is covered with white matte stickers to reduce the effects of direct light and corn grain rolling. During testing, corn grains start from the feeding hopper, enter the single-layer guide structure through the trough wheel and the transition plate, and reach the uniform distribution plate. The uniform distribution plate forms an adjustable angle α with the horizontal plane, and a vibrator is mounted at its bottom to assist in flattening the corn grains. After passing over the uniform distribution plate, the corn grains form a discrete single layer and finally pass through the detection area, where images are acquired by an industrial camera equipped with an adjustable lens of 6 mm focal length, as shown in Fig. 2.

For each data-collection scan, a set of 64 grains from each category was taken out and spread flat in a single layer, arranged into 8 rows by 8 columns. To ensure that no two seeds come into contact, each seed was placed in an area of about 2 × 2 cm in the center of the platform. Ten groups were prepared for each of the 7 test categories per test cycle, so 64 × 10 × 7 = 4480 grains were processed in each test cycle.

Fig. 2

A schematic diagram of the hyperspectral imaging system.

Data preprocessing

Each image collected in the experiment contained multiple corn grains, but it is the individual grain that is ultimately identified as abnormal, so the images need to be processed. The main steps include binarizing the multi-grain image, performing contour detection, calculating the enclosing polygon based on the detected contour pixels and extracting the outline of each corn grain from the maximum and minimum coordinates of the enclosing polygon. Finally, the selected corn grains are cropped to obtain single-grain images. The image processing pipeline is shown in Fig. 3.

Fig. 3

Corn kernel image processing process.

In this study, photographs containing multiple corn grains were used as the raw data, and these images must be processed in order to identify abnormal kernels among the individual grains. The main processing steps are as follows. First, the original image is filtered and enhanced and the CIELAB channels are separated; analysis shows that the B-channel image has high contrast, which is conducive to image segmentation. A gamma transformation with λ = 0.16 is applied to the B-channel image, followed by binarization to obtain a binary image of the individual grains. Next, contour detection is carried out and the enclosing polygon of each corn kernel is calculated from the counted contour pixels. The bounding rectangle of each kernel is then extracted from the maximum and minimum coordinates of these enclosing polygons. Finally, each corn grain is cropped according to its bounding rectangle to obtain a single-grain image. Figure 3 shows the combined image of 8 × 8 corn grains and the whole image processing pipeline. After the above processing, the isolated single-kernel images are fed into the VMUnet-MSADI network model as the actual test data for the classification and recognition of corn grains.
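To make the pipeline concrete, the following is a minimal OpenCV/NumPy sketch of this single-kernel extraction step. The Gaussian smoothing, the Otsu thresholding, the minimum-area filter and the function name are illustrative assumptions; only the CIELAB B-channel selection, the gamma value λ = 0.16 and the bounding-rectangle cropping follow the description above.

```python
import cv2
import numpy as np

def crop_single_kernels(image_bgr, gamma=0.16, min_area=200):
    """Split a multi-kernel photograph into single-kernel crops (illustrative sketch)."""
    # 1. Denoise and keep only the CIELAB B channel, which shows high contrast.
    smoothed = cv2.GaussianBlur(image_bgr, (5, 5), 0)
    b_channel = cv2.cvtColor(smoothed, cv2.COLOR_BGR2LAB)[:, :, 2]

    # 2. Gamma transformation (lambda = 0.16) followed by binarization.
    normalized = b_channel.astype(np.float32) / 255.0
    gamma_img = np.uint8(255 * np.power(normalized, gamma))
    _, binary = cv2.threshold(gamma_img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 3. Contour detection and the bounding rectangle of each kernel.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    crops = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:    # skip small noise blobs
            continue
        x, y, w, h = cv2.boundingRect(contour)     # min/max coordinates of the enclosing polygon
        crops.append(image_bgr[y:y + h, x:x + w])  # 4. cut out the single-kernel image
    return crops
```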

Data flow

This study proposes the core framework of the VMUnet-MSADI model. Before passing through the network, the data are preprocessed, including image filtering and denoising, scale transformation, color correction, normalization and single-grain cropping. The data are then fed into the network model: the encoder stage extracts features from each single-grain image; the Visual State Space (VSS) module then captures contextual information over a wide area; and the Detail Infusion Block enhances the low-level feature details and infuses them into the high-level features of the network model. Before entering the main 2D Selective Scan (SS2D) module, the VSS module output is combined, through hierarchical feature mapping, with the outputs of the other information streams. Finally, the recognition result is produced by the decoder built on the VSS modules, as shown in Fig. 4.

Fig. 4

The overall architecture of the VMUnet-MSADI model, consists of an Encoder module, MSADI module, and Decoder module. (a) The overall architecture of VMUnet-MSADI. (b) The core processing units of MSADI consist of the CA Block, SA Block and MSC Block.

Methods

Preliminaries

The SSM-based components of VMUnet-MSADI, including the structured state space sequence model and the Mamba model, rely on a classical continuous system that maps a one-dimensional input function or sequence \(x(t) \in \mathcal{R}\) to an output \(y(t) \in \mathcal{R}\) via an intermediate implicit state \(h(t) \in {\mathcal{R}^N}\). This process can be expressed as a linear ordinary differential equation (ODE):

$$\begin{aligned} h^{\prime}(t) &= Ah(t)+Bx(t) \\ y(t) &= Ch(t) \end{aligned}$$
(1)

where \(A \in {\mathcal{R}^{N \times N}}\) represents the state matrix, and \(B \in {\mathcal{R}^{N \times 1}}\) and \(C \in {\mathcal{R}^{N \times 1}}\) are the projection vectors. Structured state space sequence models and Mamba discretize this continuous system to make it more suitable for deep learning scenarios. Specifically, they introduce a timescale parameter \(\delta\) and convert A and B into discrete parameters \(\bar {A}\) and \(\bar {B}\) using fixed discretization rules. The zero-order hold (ZOH) is usually used as the discretization rule, defined as:

$$\begin{aligned} \bar {A} &= \exp (\delta A) \\ \bar {B} &= {(\delta A)^{-1}}\left(\exp (\delta A) - I\right) \cdot \delta B \end{aligned}$$
(2)

After discretization, SSM-based models can be computed either by linear recursion or global convolution, defined as Eqs. 3 and 4, respectively.

$$\begin{aligned} h^{\prime}(t) &= \bar {A}h(t)+\bar {B}x(t) \\ y(t) &= Ch(t) \end{aligned}$$
(3)
$$\begin{aligned} \bar {K} &= \left(C\bar {B},\, C\bar {A}\bar {B},\, \ldots ,\, C{\bar {A}}^{L-1}\bar {B}\right) \\ y &= x * \bar {K} \end{aligned}$$
(4)

where \(\bar {K} \in {\mathcal{R}^L}\) represents a structured convolution kernel and L denotes the length of the input sequence x.
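For intuition, the following NumPy/SciPy sketch shows how the ZOH discretization of Eq. (2) and the recurrent and convolutional views of Eqs. (3) and (4) can be computed for a single input channel with a fixed time step δ. It is an illustrative toy implementation, not the hardware-efficient selective-scan kernel used by Mamba, and the diagonal state matrix is an assumption made only for simplicity.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization of the continuous SSM (Eq. 2)."""
    A_bar = expm(delta * A)
    B_bar = np.linalg.solve(delta * A, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, x):
    """Recurrent view (Eq. 3): h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k."""
    h = np.zeros((A_bar.shape[0], 1))
    y = []
    for x_k in x:
        h = A_bar @ h + B_bar * x_k
        y.append(float(C @ h))
    return np.array(y)

def ssm_convolution(A_bar, B_bar, C, x):
    """Convolutional view (Eq. 4): y = x * K_bar with K_bar = (C B_bar, C A_bar B_bar, ...)."""
    L = len(x)
    kernel = np.array([float(C @ np.linalg.matrix_power(A_bar, k) @ B_bar) for k in range(L)])
    return np.convolve(x, kernel)[:L]          # causal convolution, truncated to length L

# The two views produce identical outputs for the same discretized parameters.
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 1.5, 4))         # stable diagonal state matrix (toy example)
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
x = rng.standard_normal(32)
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
assert np.allclose(ssm_recurrence(A_bar, B_bar, C, x),
                   ssm_convolution(A_bar, B_bar, C, x), atol=1e-6)
```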

VMUnet-MSADI architecture

The overall structure of VMUnet-MSADI consists of three main modules: the Encoder, the DIB (Detail Infusion Block) and the Decoder. Given an input image \(I \in {R^{H \times W \times 3}}\), the encoder generates M levels of features. We denote the features of the i-th layer as \(f_{i}^{o}\), where \(1 \leqslant i \leqslant M\). The accumulated features \(\{ f_{1}^{o},f_{2}^{o},\ldots,f_{M}^{o}\}\) are then forwarded to the DIB for further enhancement. The encoder output \({f_i}\) has \({2^i} \times C\) channels; after the accumulated features \(\{ f_{1}^{o},f_{2}^{o},\ldots,f_{M}^{o}\}\) enter the DIB for feature fusion, each \({f_i}\) corresponds to an output \(f_{i}^{\prime }\) of the i-th stage with shape \(\frac{H}{{{2^{i+1}}}} \times \frac{W}{{{2^{i+1}}}} \times {2^i}C\). In our model, we use deep supervision to compute the loss on the \(f_{i}^{\prime }\) and \(f_{{i - 1}}^{\prime }\) features.

In this article, we use \([{N_1},{N_2},{N_3},{N_4}]\) VSS blocks in the four stages of the encoder, with channel counts of [C, 2C, 4C, 8C] for the respective stages. According to our observations on VMamba, the values of \({N_3}\) and C are the key factors that differentiate the Tiny, Small and Base framework specifications. Following the VMamba specification, we set C to 96, set \({N_1}\) and \({N_2}\) to 2 each, and let \({N_3}\) take the values in the set 46, reflecting our intention to use the tiny and small VMamba models as the backbones for our ablation experiments.

VSS module

The VSS block is the backbone of VMUnet-MSADI, and SS2D is the core of the VSS block. The Detail Infusion Block takes the trunk output features, passes them through a multi-scale attention module and produces output features of the same size as the trunk output features.

The VSS block, derived from VMamba, is the backbone of the VM-UNetV2 encoder and its structure is shown in Fig. 5a. The input is first processed by an initial linear embedding layer and then split into two separate information streams. One stream passes through a 3 × 3 depth-wise convolution layer and a subsequent SiLU activation function before entering the main SS2D module. The SS2D output is then processed by a layer normalization layer and combined with the output of the other information stream, which is also processed by a SiLU activation. The combined output forms the final result of the VSS block.

Fig. 5

VMUnet-MSADI network core functional modules, VSS and MSA modules. (a) The core structure of VSS Block. (b)Multi-Scale Attention is the core module of SS2D.

The DIB is shown in Fig. 5b. For the hierarchical feature mapping, the encoder generates feature maps \(f_{i}^{o}\) of size \(\frac{H}{{{2^{i+1}}}} \times \frac{W}{{{2^{i+1}}}} \times {2^i}C\), where \(1 \leqslant i \leqslant 4\) and i denotes the i-th layer.

Different attention mechanisms can be used in the DIB to calculate the spatial and channel attention scores. Following UNetV2, we use CBAM to realize the channel and spatial attention. The calculation is given below, where \(\phi _{i}^{{att}}\) denotes the attention computation at the i-th stage:

$$f_{i}^{1}=\phi _{i}^{{att}}\left( {f_{i}^{0}} \right)$$
(5)

We use a 1 × 1 convolution to align the channels of \(f_{i}^{1}\) to C, and the resulting feature map is denoted as \(f_{i}^{2} \in {R^{{H_i} \times {W_i} \times C}}\).

At the i-th DIB decoder stage, \(f_{i}^{2}\) serves as the target reference. We then adjust the size of the feature map at each j-th layer so that it matches the size of \(f_{i}^{2}\), as follows:

$$f_{{ij}}^{3}=\begin{cases} {\text{G}}_{\text{d}}\left( {f_{j}^{2},({H_i},{W_i})} \right) & \text{if } j<i \\ {\text{G}}_{I}\left( {f_{j}^{2}} \right) & \text{if } j=i \\ {\text{G}}_{u}\left( {f_{j}^{2},({H_i},{W_i})} \right) & \text{if } j>i \end{cases}$$
(6)

In Eq. (6), \(G_d\), \(G_I\) and \(G_u\) represent adaptive average pooling, identity mapping and bilinear interpolation, respectively. In Eq. (7), \({\theta _{ij}}\) is the parameter of the smoothing convolution and \(f_{{ij}}^{4}\) is the j-th smoothed feature map at the i-th level; \(H(\cdot)\) represents the Hadamard product. The resulting \(f_{i}^{5}\) is then forwarded to the decoder at the i-th layer for further resolution reconstruction and segmentation.

$$\begin{aligned} f_{{ij}}^{4} &= {\theta _{ij}}\left( {f_{{ij}}^{3}} \right) \\ f_{i}^{5} &= H\left( {\left[ {f_{{i1}}^{4},f_{{i2}}^{4},f_{{i3}}^{4},f_{{i4}}^{4}} \right]} \right) \end{aligned}$$
(7)
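The following PyTorch sketch illustrates how the alignment of Eq. (6) and the fusion of Eq. (7) could be realized. The 3 × 3 smoothing convolution standing in for θ_ij, treating the Hadamard product H(·) as an element-wise product over the aligned maps, and the module name are assumptions made for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailInfusion(nn.Module):
    """Illustrative Detail Infusion step for decoder stage i (Eq. 6 and Eq. 7)."""
    def __init__(self, channels: int, num_stages: int = 4):
        super().__init__()
        # theta_ij: one smoothing convolution per contributing stage j (assumed 3x3).
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_stages)
        )

    def forward(self, feats, i: int):
        """feats: list of channel-aligned maps f_j^2, one per stage; i: target stage (0-based)."""
        h_i, w_i = feats[i].shape[-2:]
        fused = None
        for j, (f_j, conv) in enumerate(zip(feats, self.smooth)):
            if j < i:                                    # G_d: adaptive average pooling (downsample)
                f3 = F.adaptive_avg_pool2d(f_j, (h_i, w_i))
            elif j == i:                                 # G_I: identity mapping
                f3 = f_j
            else:                                        # G_u: bilinear interpolation (upsample)
                f3 = F.interpolate(f_j, size=(h_i, w_i), mode="bilinear", align_corners=False)
            f4 = conv(f3)                                # Eq. 7: smoothing convolution theta_ij
            fused = f4 if fused is None else fused * f4  # Hadamard (element-wise) product
        return fused                                     # f_i^5, forwarded to decoder stage i

# usage sketch: four stages with C = 32 channels at different resolutions
feats = [torch.randn(1, 32, 64 // 2 ** k, 64 // 2 ** k) for k in range(4)]
out = DetailInfusion(channels=32)(feats, i=1)            # -> (1, 32, 32, 32)
```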

Multi-scale attention module (MSAM)

We introduce an efficient multi-scale convolutional attention module to refine feature maps. \(MSAM\) sequentially incorporates a channel attention block \(CA( \bullet )\) that emphasizes channel correlations, a spatial attention block \(SA( \bullet )\) designed to capture local contextual information, and an efficient multi-scale convolution block \(MSC( \bullet )\) that enhances the retention of contextual relationships in feature maps. \(MSAM( \bullet )\) is defined by Eq. (8):

$$MSAM({x_{tensor}})=MSC(SA(CA({x_{tensor}})))$$
(8)

Here, \({x_{tensor}}\) represents the input tensor. Owing to the use of depth-wise convolution across multiple scales, our \(MSAM\) module is more efficient than a standard convolutional attention module, with a significant reduction in computational cost.

Multi-scale convolution block (MSCB)

We introduce an efficient Multi-scale Convolution Block to enhance the features generated by our cascaded expansion path. In our MSCB, we follow the design principles of the Inverted Residual Block (IRB) from MobileNetV2. However, unlike the IRB, our MSCB performs depth-wise convolution at multiple scales and utilizes channel shuffle to mix channels across groups. Specifically, in our MSCB, we first use a pointwise (1 × 1) convolution layer \(PWC1( \bullet )\) to increase the number of channels (i.e., expansion factor = 2), followed by batch normalization \(BN( \bullet )\) and activation \(ReLU6( \bullet )\). Next, we apply multi-scale depth-wise convolution \(MSDC( \bullet )\) to capture contextual information at various scales and resolutions. Since depth-wise convolution ignores inter-channel relationships, we implement channel shuffle \(CS( \bullet )\) to facilitate cross-channel mixing. Subsequently, another pointwise convolution layer \(PWC2( \bullet )\) is used, followed by \(BN( \bullet )\), to revert to the original number of channels, thereby capturing channel dependencies. \(MSCB( \bullet )\) is defined by Eq. (9):

$$MSCB({x_{tensor}})=BN\left( PWC2\left( CS\left( MSDC\left( ReLU6\left( BN\left( PWC1({x_{tensor}}) \right) \right) \right) \right) \right) \right)$$
(9)

where the parallel depth-wise convolution operations with different kernel sizes (KS) are defined as follows in Eq. (10):

$$MSDC({x_{tensor}})=\sum\limits_{{k \in KS}} {DWC{B_k}({x_{tensor}})}$$
(10)

Here, \(DWC{B_k}({x_{tensor}})=ReLU6(BN(DW{C_k}({x_{tensor}})))\), where \(DW{C_k}( \bullet )\) refers to the depth-wise convolution with kernel size k, and \(BN( \bullet )\) and \(ReLU6( \bullet )\) denote batch normalization and ReLU6 activation, respectively. Additionally, the sequential variant of \(MSDC( \bullet )\) utilizes a recursively updated input \({x_{tensor}}\), which maintains a residual connection with the previous \(DWC{B_k}( \bullet )\) to enhance regularization, as defined in Eq. (11):

$${x_{tensor}}={x_{tensor}}+DWC{B_k}({x_{tensor}})$$
(11)
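A minimal PyTorch sketch of the MSCB described by Eqs. (9) and (10) (the parallel variant of MSDC) is given below; the kernel sizes {1, 3, 5}, the group count used for channel shuffle and the class name are assumptions chosen for illustration rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups: int = 4):
    """Mix channels across groups after the depth-wise convolutions (CS in Eq. 9)."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class MSCB(nn.Module):
    """Multi-Scale Convolution Block sketch (Eqs. 9-10)."""
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5), expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pwc1 = nn.Sequential(nn.Conv2d(channels, hidden, 1, bias=False),
                                  nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True))
        # one depth-wise convolution branch per kernel size (DWCB_k)
        self.dwcbs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden, bias=False),
                          nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True))
            for k in kernel_sizes
        )
        self.pwc2 = nn.Sequential(nn.Conv2d(hidden, channels, 1, bias=False),
                                  nn.BatchNorm2d(channels))

    def forward(self, x):
        x = self.pwc1(x)                                 # expansion: PWC1 -> BN -> ReLU6
        x = sum(branch(x) for branch in self.dwcbs)      # MSDC: sum over multi-scale DWCBs (Eq. 10)
        x = channel_shuffle(x)                           # CS: cross-channel mixing
        return self.pwc2(x)                              # PWC2 -> BN restores the channel count

# usage sketch
out = MSCB(channels=32)(torch.randn(1, 32, 64, 64))      # -> (1, 32, 64, 64)
```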

Channel attention block (CAB)

We employ the Channel Attention Block to assign different levels of importance to each channel, thereby emphasizing more relevant features while suppressing less useful ones. Essentially, the \(CA( \bullet )\) block determines which feature maps to focus on (and subsequently refines them). We first apply max pooling \({P_M}( \bullet )\) and average pooling \({P_A}( \bullet )\) across the spatial dimensions (i.e., height and width) to extract the most prominent features from the entire feature map for each channel. Then, for each pooled feature map, we use a pointwise convolution \(PWC1( \bullet )\) to reduce the number of channels, followed by \(ReLU6\) activation. Next, another pointwise convolution \(PWC2( \bullet )\) is applied to restore the original channel dimensions. We then add the two restored feature maps together and apply a Sigmoid activation \(\sigma\) to estimate the attention weights as adjustment factors. Finally, these weights are integrated into the input \({x_{tensor}}\) using the Hadamard product. The channel attention block \(CA( \bullet )\) is defined as follows in Eq. (12):

$$\begin{aligned} CAB({x_{tensor}}) &= \sigma \left(CAB1({x_{tensor}})+CAB2({x_{tensor}})\right) \otimes {x_{tensor}} \\ CAB1({x_{tensor}}) &= PWC2\left(ReLU6\left(PWC1({P_M}({x_{tensor}}))\right)\right) \\ CAB2({x_{tensor}}) &= PWC2\left(ReLU6\left(PWC1({P_A}({x_{tensor}}))\right)\right) \end{aligned}$$
(12)
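A compact PyTorch sketch of the channel attention block of Eq. (12) is shown below; the channel-reduction ratio of the shared pointwise convolutions and the class name are assumed for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel Attention Block sketch (Eq. 12)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # shared pointwise convolutions: PWC1 (reduce channels) -> ReLU6 -> PWC2 (restore channels)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU6(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        max_pool = torch.amax(x, dim=(2, 3), keepdim=True)   # P_M: spatial max pooling
        avg_pool = torch.mean(x, dim=(2, 3), keepdim=True)   # P_A: spatial average pooling
        weights = torch.sigmoid(self.mlp(max_pool) + self.mlp(avg_pool))
        return weights * x                                   # Hadamard product with the input

# usage sketch
out = ChannelAttention(64)(torch.randn(1, 64, 32, 32))       # -> (1, 64, 32, 32)
```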

Spatial attention block (SAB)

We employ the Spatial Attention Block to simulate the attention mechanism of the human brain, focusing on specific regions of the input image. Essentially, the \(SA( \bullet )\) module determines the focal points within the feature map. By enhancing these focal regions, the model’s ability to recognize and respond to relevant spatial features is improved, which is crucial for image segmentation, as the background and positioning of objects significantly influence the output. In \(SA( \bullet )\), we first aggregate the maximum \({C_M}( \bullet )\) and average \({C_A}( \bullet )\) values along the channel dimension to focus on local features. Next, we use a large-kernel convolution (such as a 7 × 7 kernel) to enhance the local contextual relationships between features. We then apply a Sigmoid activation \(\sigma\) to compute the attention weights. Finally, these weights are applied to the input \({x_{tensor}}\) using the Hadamard product, which allows for more targeted processing of the information. The spatial attention block \(SA( \bullet )\) is defined as follows in Eq. (13):

$$SAB({x_{tensor}})=\sigma (LKC([{C_M}({x_{tensor}}),{C_A}({x_{tensor}})])) \otimes {x_{tensor}}$$
(13)
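A matching PyTorch sketch of the spatial attention block of Eq. (13), using the 7 × 7 large-kernel convolution mentioned above; the class name is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial Attention Block sketch (Eq. 13)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # LKC: large-kernel convolution over the 2-channel [max, mean] map
        self.lkc = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        c_max = torch.amax(x, dim=1, keepdim=True)    # C_M: channel-wise max
        c_avg = torch.mean(x, dim=1, keepdim=True)    # C_A: channel-wise mean
        weights = torch.sigmoid(self.lkc(torch.cat([c_max, c_avg], dim=1)))
        return weights * x                            # Hadamard product with the input

# usage sketch
out = SpatialAttention()(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```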

Loss function

For our anomalous corn grain image segmentation task, we mainly use the basic cross-entropy and Dice losses as the loss function, because all of our dataset masks consist of two classes: a single target and the background.

$$\begin{aligned} {L_{{\text{BceDice}}}} &= {\lambda _1}{L_{{\text{Bce}}}}+{\lambda _2}{L_{{\text{Dice}}}} \\ {L_{{\text{Bce}}}} &= - \frac{1}{N}\sum\limits_{i=1}^{N} {\left[ {{y_i}\log \left( {{{\hat {y}}_i}} \right)+\left( {1 - {y_i}} \right)\log \left( {1 - {{\hat {y}}_i}} \right)} \right]} \\ {L_{{\text{Dice}}}} &= 1 - \frac{{2|X \cap Y|}}{{|X|+|Y|}} \end{aligned}$$
(14)

where \(({\lambda _1},{\lambda _2})\) are constants, with (1, 1) usually selected as the default.
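For reference, a minimal PyTorch implementation of this combined loss with λ1 = λ2 = 1 is given below; the small smoothing term added to the Dice ratio is a common numerical-stability assumption rather than part of Eq. (14).

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, lambda_bce=1.0, lambda_dice=1.0, eps=1e-6):
    """Combined BCE + Dice loss (Eq. 14) for binary masks; `logits` are raw model outputs."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)
    return lambda_bce * bce + lambda_dice * dice

# usage sketch
loss = bce_dice_loss(torch.randn(2, 1, 256, 256), torch.randint(0, 2, (2, 1, 256, 256)).float())
```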

Experiments

After studying the VMamba network operation, the images in the dataset were resized to 256 × 256 pixels. To avoid overfitting, data augmentation methods such as random flipping and random rotation were introduced. During training, the initialization parameters are as follows: the batch size is set to 64, the optimizer’s learning rate is set to 1e-3 and a Cosine Annealing LR scheduler is used over a maximum of 50 iterations, allowing the learning rate to decay to 1e-5. We trained for a total of 50 epochs. For the VMUnet-MSADI network, the initial weights of the encoder units are set to align with VMamba’s weights. The implementation was carried out on Ubuntu 20.04 using an NVIDIA RTX 3060 GPU with 8 GB of memory, Python 3.8, PyTorch 2.0.1 and CUDA 11.7.
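The training configuration described above can be sketched in PyTorch as follows. The model and dataset stand-ins, the AdamW optimizer choice and the plain BCE criterion are placeholders for illustration; only the batch size, learning-rate schedule and epoch count come from the description above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.lr_scheduler import CosineAnnealingLR

# stand-ins for the VMUnet-MSADI network and the 256 x 256 corn kernel dataset
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
data = TensorDataset(torch.randn(64, 3, 256, 256), torch.randint(0, 2, (64, 1, 256, 256)).float())
train_loader = DataLoader(data, batch_size=64, shuffle=True)

criterion = nn.BCEWithLogitsLoss()                        # in practice, the BCE + Dice loss of Eq. (14)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-5)  # cosine decay from 1e-3 to 1e-5

for epoch in range(50):                                   # 50 epochs in total
    model.train()
    for images, masks in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```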

Datasets

We used an open-source dataset provided by GaoZhe Tech to verify the validity of our framework. The corn kernel dataset contains a total of 11,460 images in 2 major categories: 4800 normal (NOR) grains and 6660 abnormal (DU) grains. The abnormal grains are further divided into 6 subcategories, namely 1200 fusarium & shrivelled (F&S) grains, 480 sprouted (SD) grains, 1600 mouldy (MY) grains, 1400 broken (BN) grains, 1200 pest-attacked (AP) grains and 780 black point (BP) grains. The normal and abnormal classes of this dataset are divided into training and test sets at a ratio of 7:3.

This dataset plays a crucial role in accurately identifying the categories of corn grains and serves as the basis for the model-driven learning process, enabling the model to identify the patterns and characteristics unique to each seed class. We collected 4480 corn kernel samples, among which 1500 were high-quality seed samples and the remaining 2800 were abnormal seed samples, further divided into 6 categories, as shown in Table 2. In order to verify the validity of the model, we divided the dataset into three groups with training/test split ratios of 7:3, 8:2 and 9:1 and observed their performance. The experimental analysis shows that the 8:2 ratio gives good results, so for further research we prefer the weights of the model trained with the 8:2 split.

Table 2 Dataset distribution.

Implementation details

In order to train an effective VMUnet-MSADI network on the corn dataset, the categorical cross-entropy loss function is used to reduce the influence of dataset diversity; it is paired with the activation function of the final output layer, Softmax, which converts the raw model outputs into probability values, and the categorical cross-entropy then measures the difference between these probabilities and the true values. In this way, the exact type of corn kernel is obtained. During training, we used the Adam optimizer with a learning rate of 0.0001, a weight decay of 0.0005, a momentum of 0.9 and a batch size of 16. The network was trained for 50 epochs to output a model. Figure 6 shows that, as the number of training rounds increases, the accuracy of the VMUnet-MSADI model increases and the loss gradually decreases during training and validation.

Fig. 6

Graphical representation of epoch-wise accuracy and loss curve of the VMUnet-MSADI model.

The following indicators are used: Precision (Pre), Specificity (Spe), Sensitivity (Sen, also known as Recall), Accuracy (Acc) and the F1 score (F1_score). These indicators were used to evaluate the influence of different dataset split ratios on the trained model before and after adding the preprocessing module. Table 3 shows the results of VMUnet-MSADI for dataset splits of 7:3, 8:2 and 9:1. The corn dataset is first passed through the preprocessing module, then the images are divided according to the data split ratios given in Table 2 above and the network is trained.

Table 3 The results of VMUnet-MSADI with and without a preprocessing block for dataset split ratios (bold indicates the best).

The experimental results show that the 8:2 data split gives good results: accuracy improved by 2.5% compared with the model without the preprocessing module. Therefore, to some extent, the preprocessing module helps to improve the classification accuracy of the model. Figure 7 shows a graphical representation of VMUnet-MSADI performance with and without the preprocessing module.

Fig. 7

Performance of VMUnet-MSADI. (a) Without preprocessing block. (b) With preprocessing block.

Main results

In this section, we first evaluate the performance of our proposed VMUnet-MSADI framework on corn kernel image segmentation tasks using the dataset provided by Gaozhe Technology. We compare our method against state-of-the-art approaches. Additionally, we conduct ablation studies to analyze the impact of each component used in VMUnet-MSADI.

Experimental settings

Baselines

In addition to the conventional UNet, our comparative experiments involve three broad categories of methods as baselines: CNN-based, Transformer-based and VMamba-based methods.

  • CNN-based methods: several advanced CNN-based models are introduced and compared with our proposed VMUnet-MSADI, including UNet, UNet++, Att-UNet, UNetV2, UTNetV2, SANet, MALUNet and DoubleU-Net.

  • Transformer-based methods: Several Transformer-based models are considered major contenders, including TransUNet, TransFuse, Swin-unet and MCTrans.

  • VMamba-based methods: these approaches have advantages in long-range interaction modeling and linear computational complexity, and include VMUNet and VMUNetV2.

Implementation details

In this study, we compare our proposed VMUnet-MSADI model with several state-of-the-art models using a dataset split of 80:20 for training and testing. To ensure a comprehensive evaluation, we include additional performance metrics beyond the traditional Precision (Pre), Specificity (Spe), Sensitivity (Sen) and Accuracy (Acc). Specifically, we also evaluate the Mean Intersection over Union (mIoU) and the Mean Dice Similarity Coefficient (mDSC). Table 4 presents the test results for different network models on the GaoZhe Technology corn grain dataset. The results indicate that VMUnet-MSADI outperforms other models in terms of mIoU, DSC, Pre and Acc. Our model also exceeds the performance of the leading model, UNetV2, with an improvement of up to 5% in the mIoU metric.

Table 4 Comparative experimental results on the GaoZhe Tech Corn datasets (bold indicates the best).

Corn experimental results

Results on corn segmentation

This section evaluates the performance of VMUnet-MSADI for corn grain segmentation under various conditions. To ensure a fair comparison, we initially conducted segmentation experiments using one set of normal corn grains and six sets of abnormal corn grains. Additionally, we perform cross-validation across all datasets to verify the effectiveness of the proposed VMUnet-MSADI. The comparison results with state-of-the-art methods are presented in Table 5 and the corresponding qualitative results are shown in Fig. 8.

Based on the experimental results, we make the following observations. Compared to the conventional U-Net, various FCN and U-Net variants, such as VMUnet and VMUnetV2, have shown varying degrees of success. For instance, on the abnormal grain datasets, HD and MY, VMUnet and VMUnetV2 achieved increases in the mDSC scores of 0.8% and 1.6%, respectively. This underscores the significant value in further enhancing the stability and flexibility of standard encoder-decoder architectures. In contrast, Transformer-based models, such as Swin-Unet and TransUNet, have shown clear improvements over the aforementioned variants when guided by standard Transformer principles. Notably, TransUNet achieved mDSC scores of 0.881 and 0.897 on the AP and BN datasets, respectively, which are encouraging results. However, due to the limitations inherent in Transformers, these models still lag behind advanced CNN-based methods in performance. It is evident that the proposed VMUnet-MSADI achieves the highest scores across nearly all evaluation metrics for the independent datasets used.

Specifically, our VMUnet-MSADI attains the highest mDSC scores of 0.843 and 0.911 on the BP and FM datasets, respectively, which still represent significant improvements over previous state-of-the-art competitors such as UNet++ and Att-Unet. Moreover, Fig. 8 illustrates that the network segmentation outputs from VMUnet-MSADI are more precise and detailed compared to existing baselines. This improvement not only demonstrates the effectiveness of the proposed VMUnet-MSADI but also suggests that the introduction of advanced mechanisms like multi-scale convolution and attention modules offers substantial potential to surpass traditional CNN approaches. Furthermore, as shown in Table 5, the evaluation results for VMUnet-MSADI consistently outperform previous competitors across various datasets, effectively validating its generalization capability. Compared to VMUnetV2, our VMUnet-MSADI shows an increase of 2.1% in average mDSC and 2.0% in average mIoU scores. As depicted in Fig. 8, VMUnet-MSADI generates high-quality segmentation masks for the corn kernel segmentation task. The promising ability of VMUnet-MSADI to identify and segment target regions of interest in corn kernel images underscores its advantages in automated corn segmentation. In summary, these comparative results confirm the superiority of the proposed VMUnet-MSADI in the automated segmentation of corn kernels.


Fig. 8

Comparison of qualitative results between VMUnet-MSADI and the existing models on the corn kernel segmentation task. To better visualize the differences between segmentation predictions and ground truths, we highlight the key region with appropriate boxes.


Table 5 Quantitative results of corn kernel segmentation task. (bold indicates the best)

To further assess the effectiveness and generalization of the proposed VMUnet-MSADI network, we evaluated the model on three types of medical image segmentation tasks: skin lesion segmentation on the ISIC2018 dataset, gland segmentation on the Gland Segmentation (GLAS) dataset and nucleus segmentation on the 2018 Data Science Bowl (Bowl) dataset. To ensure fairness in the evaluation, the input image sizes were standardized to 256 × 256 for the ISIC2018 and Bowl datasets and to 128 × 128 for the GLAS dataset. The ISIC2018 dataset, sourced from the ISIC-2018 challenge57, is used for skin lesion analysis and comprises 2,596 images with corresponding annotations. In this section, we conducted experiments using five-fold cross-validation to demonstrate the efficacy of our VMUnet-MSADI model. The GLAS dataset, collected from the 2015 Histology Image Gland Segmentation Challenge, provides Hematoxylin and Eosin (H&E) stained slide images. It contains 165 images, with 85 used for training and 80 for testing, as detailed in27. The Bowl dataset, part of the 2018 Data Science Bowl Challenge58, is used for nucleus detection in divergent images and consists of 670 images. We followed the same setup as17, allocating 80% of the dataset for training, 10% for validation and 10% for testing. The experimental results are illustrated in Fig. 9.

Results on 2018 data science bowl

We evaluated the proposed VMUnet-MSADI network model on the nucleus segmentation task of the 2018 Data Science Bowl dataset. The comparative results with state-of-the-art methods are summarized in Table 6 and the corresponding qualitative results are shown in Fig. 9a. From Table 6, it is evident that VMUnet-MSADI outperforms existing baselines in terms of self-attention computation and multi-scale context exploration. Specifically, our VMUnet-MSADI achieved the highest F1 score of 0.923 and a recall rate of 0.938. Compared to previous advanced methods such as TransUNet (F1 score of 0.918) and VMUNet (F1 score of 0.921), our VMUnet-MSADI shows improvements of 0.5% and 0.2%, respectively. As illustrated in Fig. 9a, the VMUnet-MSADI model provides significantly more accurate predictions of multiple nucleus boundaries compared to existing baselines. This demonstrates the strong nucleus segmentation capability of our VMUnet-MSADI model even in challenging divergent images. These experimental results further validate the generalization capability of VMUnet-MSADI across various medical image segmentation tasks.

Fig. 9

Qualitative results of VMUnet-MSADI on three medical image segmentation tasks compared with other models. (a) 2018 Data Science Bowl dataset (b) GLAS dataset and (c) ISIC 2018 dataset, respectively. To better visualize the differences between segmentation predictions and ground truths, we highlight the key region with appropriate boxes.

Results on ISIC 2018 dataset

In order to evaluate the validity of the proposed work, we also conducted a comparative experiment on the skin lesion segmentation task on the ISIC 2018 dataset.

Table 6 Quantitative results of the 2018 Data Science Bowl. (bold indicates the best)
Table 7 Quantitative results of the ISIC 2018 dataset. (bold indicates the best)
Table 8 Quantitative results of the GLAS dataset. (bold indicates the best)

From Table 7, we observe the following. Attention-guided models such as U-NetV2 (F1 score of 0.812) demonstrate improvements over the traditional U-Net (F1 score of 0.786) by incorporating additional attention mechanisms, suggesting that attention can effectively optimize the traditional U-Net model. CNN-based methods that utilize multi-scale context to enhance the U-Net structure, such as Att-Unet (F1 score of 0.813) and DoubleU-Net (F1 score of 0.836), confirm the effectiveness of multi-scale context fusion. In contrast, Transformer-based models, including Swin-Unet (F1 score of 0.821), TransUNet (F1 score of 0.815) and TransFuse (F1 score of 0.818), show superior performance compared to the aforementioned methods. Our VMUnet-MSADI consistently outperforms the Transformer-based competitors, improving the F1 score from 0.906 to 0.913. As shown in Fig. 9b, the VMUnet-MSADI effectively captures the boundaries of skin lesions and produces superior segmentation results. Thus, these comparative results further validate the strong capability of VMUnet-MSADI in skin lesion segmentation.

Results on GLAS dataset

We also evaluated the VMUnet-MSADI on the GLAS dataset for gland segmentation, focusing on automatic quantification of gland morphology. The results of this comparison with state-of-the-art methods are shown in Table 8 and the qualitative results are depicted in Fig. 9c. From Table 8, we observe the following. Performance of VMUnet-MSADI: the model obtains the highest mDSC and mIoU scores of 0.919 and 0.853, demonstrating its effectiveness in gland segmentation tasks. Comparison with state-of-the-art models: compared to the previous state-of-the-art model, VMUNetV2, our VMUnet-MSADI exceeds it by 14% in mDSC and 7% in mIoU scores. This significant improvement highlights the model’s ability to deliver high-quality segmentation even with a limited number of training samples. The VMUnet-MSADI also shows clear advantages over recent Transformer-based approaches, including TransFuse (mDSC of 0.806), TransUNet (mDSC of 0.801) and Swin-Unet (mDSC of 0.811), which further affirms the superiority of the proposed multi-scale attention mechanism and DI Block in gland segmentation. In Fig. 9c, the visualizations of the generated masks illustrate how VMUnet-MSADI effectively differentiates between the glands and surrounding tissue, highlighting the model’s capability to produce accurate and detailed segmentation results. Overall, these results confirm that VMUnet-MSADI delivers outstanding performance in gland segmentation, outperforming both CNN-based and Transformer-based methods.

Ablation studies

In this section, to illustrate the contributions of the DIB, the multiscale attention module and other components to the segmentation of specific corn types, four high-resolution corn kernel categories, namely AP, BN, BP and MY, are selected in turn for ablation experiments. The capability of each module in the proposed work is further evaluated and the experimental results are shown in Table 9. Here, U-Net is considered the common baseline. “U-T” denotes the U-shaped model based on a transformer encoder. “U-S” denotes a pure transformer encoder similar to the U-shaped model, and “U-V” denotes the asymmetric encoder-decoder structure based on the U-shaped model, i.e., the VMamba model. “U-V + DI” represents the VMamba model with detail infusion introduced into its multilevel representation. “U-V + MSA” represents the VMamba model fused with the multi-scale attention mechanism to enhance feature representation. “U-V + MSA + DI” is the complete VMUnet-MSADI architecture, which contains the proposed multiscale attention mechanism and detail infusion module.

Table 9 Quantitative results of corn kernel segmentation task. (bold indicates the best)

From Table 9, we make the following observations. When we replace the conventional encoder with a transformer-based encoder, the average mDSC and mIoU scores of “U-T” improve significantly, by 5.3% and 4.3%, respectively; compared with the ordinary U-Net, this demonstrates that the transformer can encode contextual information well. In comparison, “U-T” outperforms “U-S”, especially with a 0.7% higher average mDSC score, which confirms the advantage of the transformer in the encoder. Meanwhile, the work presented in this paper can effectively model long-range dependencies, so “U-V” and “U-V + DI” perform unequally, with “U-V + DI” 3.9% higher in average mIoU score. By incorporating the multiscale attention mechanism, “U-V + MSA” improves the average mDSC and mIoU scores by 1.2% and 3.7%, respectively, suggesting that the additional coding branches produce differentiated feature representations that improve segmentation performance. Although “U-V + DI” and “U-V + MSA” each improve the segmentation accuracy of corn kernels to different degrees, “U-V + MSA + DI” further improves the average mDSC score from 0.881 to 0.905 (by 2.4%) and the average mIoU score from 0.810 to 0.863 (by 5.3%). Such improvement clearly demonstrates that our proposed MSADI module ensures consistency among different features and improves segmentation performance.

Table 10 Performance comparison of model size (Params) between VMUnet-MSADI and other leading methods on the GaoZhe dataset.

The above experimental results show that all the designed components play an indispensable role in corn kernel image segmentation and also perform well in medical image segmentation tasks. To compare model size and computational complexity, we further conducted experiments on the corn kernel dataset. From Table 10, we can observe that, to expand the receptive field, CNN-based methods usually need to stack sufficiently deep convolutional layers, which leads to high computational costs, and that the self-attention mechanism requires more parameters than the convolution operation, which makes Transformer-based methods larger in scale. VMUnet-MSADI not only achieves a good trade-off between complexity and parameter count but also obtains the best segmentation performance.

Conclusions

In this paper, we presented the multi-scale attention mechanism and detail infusion VMUnet (VMUnet-MSADI), a U-shaped encoder-decoder-based framework for improving the segmentation quality of corn images. Our VMUnet-MSADI was designed based on the VSS module. Besides the encoder, we also innovatively added a multi-scale deep convolutional attention module to the decoder, allowing deep convolution to be performed at multiple scales to improve the feature maps generated by the visual encoder by suppressing uncorrelated regions and capturing multi-scale salient features. Moreover, we introduced a fused multi-scale attention mechanism within the encoder to extract features at multiple levels. We further proposed a novel DI block that leverages multi-scale attention mechanisms to evaluate spatial and channel attention scores. This module generates discriminative feature representations of both coarse and fine features across different scales, thereby ensuring semantic consistency between these features and enhancing the overall effectiveness of the encoder. Extensive experiments on corn image segmentation tasks, rigorously evaluated on GaoZhe Tech’s corn dataset, demonstrated that our VMUnet-MSADI significantly outperformed the previous state-of-the-art methods, achieving a segmentation accuracy of 95.96%, which is 0.9% higher than the leading method. Furthermore, the inclusion of the preprocessing module enhanced the segmentation accuracy to 96.23%, marking an additional improvement of 0.27%. These results underscore the high competitiveness of our model in the segmentation task. In future work, improvements can be made by dynamically adjusting the attention weights according to the characteristics of the input data using an adaptive attention mechanism59,60. Additional focus will be placed on designing a more lightweight VMUnet-based model and enhancing the model’s ability to learn pixel-level texture structure features in order to extend the model’s generalization ability. On the other hand, there is still room for improvement in handling highlight (specular) regions on corn seeds, and there is also room to improve the efficiency of the model by reducing the training time and memory consumption while ensuring the segmentation quality61. In addition, multimodal fast gating transformers have the potential to improve our segmentation task62.