A lightweight crack segmentation network based on the importance-enhanced Mamba model

Wang, Yunfeng; Jin, Jie; Chen, Xiong; Wu, Ze; Zhang, Lu

doi:10.1038/s41598-025-25504-4


Article
Open access
Published: 24 November 2025

Scientific Reports volume 15, Article number: 41549 (2025)

Abstract

In the maintenance of transportation infrastructure, crack segmentation is critical for ensuring road safety and prolonging the service life of bridges and other facilities. Existing methods struggle with complex background interference and intricate crack morphologies (e.g., mesh-like or tree-like morphologies). Meanwhile, high-precision models frequently suffer from excessive computational costs. To address these limitations, this study proposes IEM-UNet, a lightweight crack segmentation network based on an Importance-Enhanced Mamba model. Built upon the U-Net architecture, IEM-UNet features a dual-branch design that integrates CNN and Mamba modules for synergistic feature extraction. Within the Mamba branch, we design an importance-enhanced dynamic scanning module, which adaptively adjusts scanning paths according to actual crack geometries, thereby enhancing the perception of global key crack features. Concurrently, the CNN branch specializes in capturing fine-grained local features such as edges and textures. These complementary features are fused via an attention-guided module, which assigns adaptive weights to enable pixel-wise integration of local and global information, thus preserving both microstructural details and macroscopic relationships of cracks. Comprehensive experiments conducted on three public datasets (Crack500, CrackTree260, and CrackForest) demonstrate that IEM-UNet outperforms other advanced methods in segmentation accuracy while achieving significant reductions in model parameters and computational complexity.

Introduction

In modern transportation infrastructure construction and maintenance, crack detection serves as a pivotal task for safeguarding road safety and enhancing structural durability. Timely and precise identification of cracks in key facilities such as bridges and roads plays a critical role in rationalizing maintenance scheduling, extending service life, and elevating driving safety^1,2. Traditional detection approaches have primarily relied on manual visual inspection, a method fraught with inherent limitations, including substantial labor intensity, prolonged detection cycles, and non-uniform detection outcomes stemming from variations in subjective judgment. These drawbacks collectively render traditional methods inadequate for meeting the operational and maintenance demands of large-scale infrastructure projects³.

With the advancement of automation technology, image processing methods such as edge detection and texture feature analysis have been widely applied in crack detection tasks⁴. However, the segmentation performance of these traditional approaches depends strongly on the calibration of predefined parameters, leading to significant generalization constraints in complex environments (e.g., uneven lighting and texture interference). Leveraging breakthroughs in deep learning and computer vision, automated crack detection has achieved transformative progress^5,6,7. For example, Zhong et al.⁸ proposed a lightweight dual-encoder framework with cross-feature fusion mechanisms to improve detection accuracy. Additionally, Quan et al.⁹ integrated transformer architectures into their pipelines, enabling high-precision crack segmentation even in challenging operational scenarios. Despite this progress, current models still struggle with segmentation efficiency. In particular, crack segmentation methods based on the Transformer architecture often have a large number of parameters, which makes them difficult to deploy in scenarios with limited computing resources.

In contrast, the Mamba model has gained widespread application in medical image processing due to its significant advantage of linear computational complexity¹⁰. Additionally, it can effectively establish long-range dependencies through a state space model (SSM), which is crucial for segmentation tasks involving dense predictions. However, the Mamba model typically employs a static scanning method, performing sequential scans only in horizontal or vertical directions. This approach struggles with irregular complex shapes such as tree-like or network-like structures, limiting Mamba’s performance in crack segmentation tasks.

To achieve an optimal balance between computational resource consumption and segmentation accuracy, we draw inspiration from previous related studies^11,12,13 and propose a lightweight crack segmentation network based on the importance-enhanced Mamba model—IEM-UNet. Specifically, IEM-UNet adopts the symmetric encoder-decoder architecture of U-Net as its backbone network, and innovatively uses a dual-branch module that combines convolutional neural networks (CNN) and Mamba as a hybrid feature extraction unit. In the Mamba branch, we designed an importance-enhanced dynamic scanning module that adaptively adjusts the scanning path based on the actual crack shape, enhancing the model’s sensitivity to critical crack regions. Additionally, we employed the SSM module to establish contextual dependencies of cracks, significantly improving the model’s perception of global critical crack features. The CNN branch is used to extract fine-grained local features, such as crack edges and textures. To further enhance segmentation performance, a multi-scale feature aggregation module is introduced to integrate local and global information, ultimately achieving precise and fast crack segmentation. In summary, the contributions of this paper are as follows:

(1)
We introduce a lightweight crack segmentation network based on the Importance-Enhanced Mamba model. Extensive experimental results on three public datasets demonstrate that IEM-UNet not only attains segmentation accuracy on par with or even surpassing that of other advanced models, but also effectively manages to keep the model size and computational complexity in check.
(2)
We propose a novel dynamic scanning strategy grounded in importance-enhancement, which effectively overcomes the limitations of traditional sequential scanning methods in accommodating complex, variable crack morphologies.
(3)
We introduce a dual-branch architecture integrating CNN and Mamba, which facilitates effective extraction of fine-grained local features and global features of cracks. Moreover, we embed a multi-scale feature integration module to merge these features, thus further boosting segmentation performance.

Related work

Within the realm of computer vision, CNNs have found extensive use in crack segmentation tasks^14,15. As a classic segmentation architecture, U-Net¹⁶ has laid a robust foundation for subsequent research. U-Net++¹⁷ incorporates a nested skip connection mechanism to further enhance segmentation accuracy. DeepCrack¹⁸ effectively integrates multi-level features via a multi-scale feature fusion mechanism, enabling pixel-level crack segmentation. CrackSegAN¹⁹ innovatively introduces an elastic deformation data augmentation strategy, which enhances model stability under complex background interference. Furthermore, to mitigate the gradient vanishing problem in traditional U-Net architectures, RUC-Net²⁰ restores feature propagation paths through residual connections, thereby improving the training stability of deep networks.

Nonetheless, CNNs have an inherent limitation in their ability to capture long-range dependencies because of their localized receptive fields. This limitation is particularly detrimental to dense prediction tasks such as object segmentation, where global contextual information is critical for accurately identifying object boundaries and maintaining the topological consistency of segmentation results²¹. In comparison, the Transformer architecture utilizes self-attention mechanisms to establish global pixel-level dependencies, which greatly improves the topological accuracy of crack modeling due to its strong contextual understanding^22,23. CrackFormer²⁴ first achieved cross-channel global information extraction, and subsequent work²⁵ refined this methodology. FAT-Net²⁶ pioneered a feature-adaptive Transformer module, whereas DTrC-Net²⁷ further optimized intermediate-level feature fusion within this framework. PCTNet²⁸ introduced a dual-attention network and conducted a systematic study on the impact of camera pose on detection performance. TransUnet²⁹ improves segmentation accuracy and efficiency by introducing a self-attention mechanism and global context information transfer. SwinCrack³⁰ achieves adaptive spatial aggregation through Swin-Transformer blocks, enabling more accurate and continuous modeling of cracks.

While the self-attention mechanism in Transformer models allows for the creation of global contextual relationships, its computational complexity increases quadratically with the size of the image. This incurs substantial computational overhead and memory consumption when processing high-resolution images^31,32, thereby posing significant challenges to their practical deployment in resource-constrained scenarios. Consequently, researchers have been exploring efficient alternatives to Transformer models. The Mamba model, which is based on the state space model (SSM), has gained significant interest because of its linear computational complexity³³. It was initially applied to medical image segmentation tasks^34,35. Chen et al.³⁶ later pioneered the visual Mamba framework, which effectively establishes global dependencies in crack segmentation while significantly enhancing computational efficiency compared to Transformer architectures. Qi et al.³⁷ further proposed the Channel Parallel Mamba Module (CPM), which decomposes input images along channel dimensions and processes them in parallel via Mamba to further optimize runtime efficiency.

While Mamba demonstrates remarkable advantages in computational efficiency, traditional static scanning mechanisms (horizontal or vertical) struggle to adapt to intricate crack structures. This reduces segmentation accuracy, particularly in regions featuring blurred boundaries and small crack branches.

Proposed method

We propose IEM-UNet, a lightweight crack segmentation network grounded in the importance-enhanced Mamba model. This framework adds a dynamic scanning module to the original Mamba model and combines it with a CNN network to form a dual-branch structure. This enables accurate segmentation of complex cracks while significantly improving computational efficiency. In this section, we first elaborate on the fundamental concepts and principles of Mamba. We then provide an overview of the IEM-UNet framework, as illustrated in Fig. 1a. Finally, we elaborate on each key component of the framework.

Preliminaries

As a novel neural architecture grounded in the state-space model (SSM), Mamba exhibits remarkable advantages in modeling long-range temporal dependencies. Its core resides in the dynamic modeling of sequence information via continuous-time state-space equations. Moreover, its computational complexity scales linearly, maintaining efficiency even for large input sequences. Formally, the fundamental SSM equation is defined as follows:

$$h^{\prime}(t) = Ah(t) + Bx(t)$$

(1)

$$y(t) = Ch(t) + Dx(t)$$

(2)

where $x(t)\in {\mathbb{R}}^{L}$ and $y(t)\in {\mathbb{R}}^{L}$ represent the continuous input and output sequences, respectively, and L denotes the sequence dimension. $h(t)\in {\mathbb{R}}^{N}$ represents the hidden state vector, and N denotes its length. A, B, C, and D represent parameter matrices.

Similar to the attention mechanism, the Selective State Space Model (S6) assumes that all input data x have different levels of importance for the output result y. Therefore, S6 further extends SSM by using the input sequence to initialize the parameter matrix. B and C are initialized directly using the linear projection of the input sequence.

$$B = Linear(x)$$

(3)

$$C = Linear(x)$$

(4)

where Linear denotes linear projection. Mamba combines SSM and S6. This allows for varying degrees of attention to be paid to the input sequence during training, thereby filtering out irrelevant information and retaining more important information.

For discrete input signals, the zero-order hold (ZOH) technique is used to make them compatible with the continuous-time SSM. Specifically, for a discrete time step k, the input sequence is transformed into a piecewise constant signal by holding the current value constant until the next sampling instant. This mapping renders SSM compatible with the grid structure of images. The resulting discretized formulation is given as:

$$h_{k} = \overline{A} h_{k - 1} + \overline{B} x_{k}$$

(5)

$$y_{k} = Ch_{k} + Dx_{k}$$

(6)

$$\overline{A} = \exp (\Delta A)$$

(7)

$$\overline{B} = (\Delta A)^{ - 1} (\exp (\Delta A) - {\mathbf{I}}) \cdot \Delta B$$

(8)

where $\Delta \in {\mathbb{R}}^{L}$ represents the learnable time step parameter, which is one of the core parameters of Mamba, and I represents the identity matrix.
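The discretization in Eqs. (5) to (8) can be illustrated with a minimal scalar sketch. It assumes a one-dimensional state with scalar A, B, C, D and a fixed time step, which is a simplification of the matrix-valued, input-dependent parameters used in Mamba; the function names `zoh_discretize` and `ssm_scan` are hypothetical and not taken from the paper.

```python
import math

def zoh_discretize(a, b, delta):
    """Scalar zero-order-hold discretization, mirroring Eqs. (7) and (8).

    Assumption: a, b and delta are plain floats and a != 0.
    """
    a_bar = math.exp(delta * a)
    # (delta * a)^-1 * (exp(delta * a) - 1) * (delta * b)
    b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)
    return a_bar, b_bar

def ssm_scan(xs, a, b, c, d, delta):
    """Run the discrete recurrence of Eqs. (5) and (6) over a 1-D sequence."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h = 0.0  # initial hidden state, assumed to be zero
    ys = []
    for x in xs:
        h = a_bar * h + b_bar * x   # Eq. (5)
        ys.append(c * h + d * x)    # Eq. (6)
    return ys
```

Each step costs a constant amount of work, so the recurrence runs in time linear in the sequence length, which is the property the text attributes to Mamba.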

IEM-UNet

The network architecture of IEM-UNet is depicted in Fig. 1(a). This network adopts the symmetric encoder-decoder framework of U-Net, comprising three core components: an encoder, a decoder, and a bottleneck layer.

The encoder comprises an initial layer and four cascaded processing blocks. Given an input image I ∈ ${\mathbb{R}}^{H\times W\times C}$, where H, W, and C represent the height, width, and number of channels of the input image, respectively, the initial layer employs DepthWise convolution (DW-conv) to extract low-level features F_L ∈ ${\mathbb{R}}^{H\times W\times 32}$ from I, thereby retaining edge and texture details of the target to be segmented. Subsequently, a four-layer cascaded structure is employed, with each layer containing two IE-VSS modules and one Down-Sampling module. The IE-VSS module functions as the primary unit for feature extraction, aimed at capturing long-range relationships between objects in the image.

The Down-Sampling module employs max pooling operations to reduce the spatial resolution of feature maps. Subsequently, the feature map undergoes sequential processing through 1 × 1 convolutions, normalization, and the SiLU activation function, doubling the number of feature channels.

After each processing step, the feature map resolution halves and the number of channels doubles. The feature map dimensions evolve sequentially as $\frac{H}{2}\times \frac{W}{2}\times 64$, $\frac{H}{4}\times \frac{W}{4}\times 128$, $\frac{H}{8}\times \frac{W}{8}\times 256$, $\frac{H}{16}\times \frac{W}{16}\times 512$. Finally, deep semantic feature encoding is achieved via a single IE-VSS module in the bottleneck layer.
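The shape progression described above can be checked with a small sketch. The 256 × 256 input size in the usage note is an assumption chosen for illustration (the text does not state the input resolution), and `encoder_shapes` is a hypothetical helper name.

```python
def encoder_shapes(height, width, base_channels=32, levels=4):
    """Return (H, W, C) after the initial layer and after each encoder level.

    Each level halves the spatial resolution and doubles the channel count,
    as described for the Down-Sampling module.
    """
    h, w, c = height, width, base_channels
    shapes = [(h, w, c)]
    for _ in range(levels):
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes
```

For a 256 × 256 input, the last entry is (16, 16, 512), matching the $\frac{H}{16}\times \frac{W}{16}\times 512$ feature map that enters the bottleneck layer.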

The decoder adopts a mirrored architecture symmetric to the encoder, with its core being the restoration of spatial detail information through a multi-level skip connection mechanism. Each cascaded structure within the decoder comprises two IE-VSS modules and one Up-Sampling module. Transposed convolution is used for Up-Sampling, doubling the resolution of the feature map while halving the number of channels. Subsequently, the upsampled features are concatenated along the channel axis with the corresponding encoder-layer features and fed into the IE-VSS module for feature calibration.

Importance-enhanced visual state space model (IE-VSS)

The IE-VSS module, a parallel fusion architecture integrating CNN and Visual State Space (VSS) (as illustrated in Fig. 1(b)), achieves synergistic extraction of local and global information through a dual-branch parallel design. The CNN branch (left) employs 3 × 3 DepthWise convolutions to construct lightweight convolutional modules, which sequentially extract texture, edge, and other fine-grained details from the spatial neighborhood of cracks, generating local features F₁ ∈ ${\mathbb{R}}^{h\times w\times c}$. The right branch of the module is an improved VSS component used to extract global features F₂ ∈ ${\mathbb{R}}^{h\times w\times c}$ from images. The AG module is a feature fusion module based on the attention mechanism. It fuses the local features F₁ obtained by the CNN branch with the global features F₂ obtained by the improved VSS component.

Importance-enhanced scan (IE-scan)

Most existing Mamba architectures employ fixed horizontal-vertical scanning paths, a static strategy that is inconsistent with the spatial anisotropy of cracks. Specifically, real-world cracks often exhibit irregular tree-like branching or mesh-like topological patterns, whereas the predetermined scanning paths of the SS2D module fail to adaptively align with these dynamically changing geometric features. More critically, this static strategy may perform unnecessary processing on non-crack regions, exacerbating background interference and thereby compromising segmentation accuracy.

To address these limitations, we developed an importance-enhanced dynamic scanning strategy (IE-scan). Instead of applying uniform processing across all spatial locations, our approach dynamically filters key regions containing crack features to reconstruct the scanning path.

Specifically, recent advances in medical image segmentation^38,39 have emphasized the importance of differentiating key target features to enhance segmentation accuracy. Inspired by this, we explore dynamically adjusting the image scanning order based on differences in feature importance. Intuitively, this should enhance the model’s adaptability to the irregular shapes of cracks. Furthermore, according to the Class Activation Mapping (CAM) theory⁴⁰, in a neural network, the activation magnitude reflects the extent to which a neuron responds to input features. Regions with high activation magnitudes typically contain more discriminative information relevant to the task objective. Therefore, we adopt activation intensity as a measure of feature importance.

Consistent with the traditional Mamba model processing method, assume that the feature F ∈ ${\mathbb{R}}^{h\times w\times c}$ is uniformly divided into n rows and n columns, resulting in N = n² non-overlapping blocks ${f}_{i}\in {\mathbb{R}}^{p\times q\times c}$, where i ∈ [1,N], $p=\frac{h}{n}$, and $q=\frac{w}{n}$. The importance of feature block ${f}_{i}$ is calculated as follows:

$$\Gamma_{act} (f_{i} ) = \sqrt {\sum\limits_{j = 1}^{p} {\sum\limits_{k = 1}^{q} {f_{i} (j,k)^{2} } } }$$

(9)

where ${\Gamma }_{\text{act}}(\bullet )$ represents the importance of the feature block ${f}_{i}$, p and q represent the height and width of ${f}_{i}$, and ${f}_{i}(j,k)$ represents the sum of the values across all channels at the j-th row and k-th column of ${f}_{i}$. This score reflects the response intensity and distribution across all channels in the spatial dimension, while also indicating the relevance of the feature block to the segmentation task.
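A minimal sketch of the block importance in Eq. (9) is given below. It assumes the feature map is stored as nested Python lists of shape h × w × c and that h and w are divisible by n; the helper name `block_importance` is hypothetical.

```python
import math

def block_importance(feature, n):
    """Importance score of Eq. (9) for each of the n x n blocks.

    feature: nested lists with shape h x w x c (assumed layout).
    Returns N = n * n scores in row-major block order.
    """
    h, w = len(feature), len(feature[0])
    p, q = h // n, w // n  # block height and width
    scores = []
    for block_row in range(n):
        for block_col in range(n):
            total = 0.0
            for j in range(p):
                for k in range(q):
                    pixel = feature[block_row * p + j][block_col * q + k]
                    channel_sum = sum(pixel)   # f_i(j, k): sum over channels
                    total += channel_sum ** 2
            scores.append(math.sqrt(total))
    return scores
```

In this toy layout, a block with strong activations in any channel direction receives a large score, while a near-zero background block scores close to zero.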

We sort all feature blocks in descending order based on their corresponding importance scores, thereby generating a dynamic scanning sequence.

$$Scan_{idx} = sort(\Gamma_{act} (f_{i} )),i \in [1,N]$$

(10)

where, ${Scan}_{idx}$ denotes the dynamic scanning sequence derived based on importance.

As shown in Fig. 2, when using a static scanning strategy, the sequence of feature blocks obtained remains fixed. For example, if sequential scanning is performed horizontally, the resulting sequence is {1,2,3,4,5,6,7,8,9}. If sequential scanning is performed vertically, the sequence obtained is {1,4,7,2,5,8,3,6,9}. However, when employing a descending-order scanning method based on importance, the sequence obtained in the example of Fig. 2 could be {4,6,1,5,8,…}. Such scanning sequences can be dynamically adjusted according to the crack morphology.
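The three scanning orders in this example can be reproduced with a short sketch. The importance scores below are invented purely so that the resulting order starts with {4,6,1,5,8}; they are not values from the paper, and the function names are hypothetical.

```python
def horizontal_order(n):
    """Row-major (horizontal) scan over an n x n grid, 1-based block ids."""
    return list(range(1, n * n + 1))

def vertical_order(n):
    """Column-major (vertical) scan over an n x n grid, 1-based block ids."""
    return [row * n + col + 1 for col in range(n) for row in range(n)]

def importance_order(scores):
    """Block ids sorted by descending importance score, as in Eq. (10)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i + 1 for i in ranked]

# Hypothetical scores for blocks 1..9, chosen only for illustration.
example_scores = [0.7, 0.1, 0.2, 0.9, 0.6, 0.8, 0.05, 0.5, 0.0]
```

With these scores, `importance_order(example_scores)` visits block 4 first and block 9 last, whereas the two static orders never change with the image content.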

Improved VSS component

The structure of the improved VSS component is shown in the dotted box in Fig. 1(b). First, input features undergo layer normalization, adjusting the input data to a standard distribution with a mean of 0 and variance of 1. Subsequently, in the first path, feature maps sequentially pass through a linear layer, a depthwise convolutional layer with SiLU activation, an SS2D module, and layer normalization.

The structure of SS2D is depicted in Fig. 3. To enhance the model’s capability to characterize key semantic regions, the importance-enhanced dynamic scanning strategy is employed during the scanning phase, which adaptively adjusts spatial sampling priorities to emphasize salient regions. An input sequence X(t), t ∈ {1} is obtained through IE-scan. As shown in Fig. 1(a), the IE-VSS module is employed across four levels of the IEM-UNet encoder. The block size $\text{p}\times \text{q}$ (a hyperparameter) for the IE-scan process in each level is set sequentially from top to bottom as $16\times 16$, $8\times 8$, $4\times 4$, and $2\times 2$, with identical $\text{p}\times \text{q}$ values across the two IE-VSS modules in each level. The decoder section corresponds layer-by-layer to the encoder.

Additionally, to preserve both horizontal and vertical crack features, the cross-scan method from the base Mamba model is retained. Specifically, cross-scan involves sequential scanning along the horizontal and vertical directions of the input feature map, yielding a total of four input sequences X(t), t ∈ {2,3,4,5}. This branch integrates both scanning strategies (IE-scan and Cross-scan) to enhance the network’s adaptability to complex crack morphologies.

For each input sequence X(t), a shared-weight S6 module is used to model long-range dependencies, thereby generating the output sequence Y(t). Specifically, the projection matrices B and C, along with parameters such as A and D, are shared across time steps within each sequence. For a single sequence, the same set of parameters is used across all time steps. However, weights are not shared between different sequences. Each sequence possesses an independent set of parameters. Without the weight sharing mechanism, each time step would have its own distinct parameters. This would directly lead to an explosion in model parameters, contradicting Mamba’s lightweight nature. Moreover, this approach cannot generalize to sequence lengths not encountered during training.

Then, an inverse scan operation is performed on each output sequence Y(t), restoring it to its initial structure in reverse scan order. Finally, the five restored structures are merged by summation. The merged feature map undergoes layer normalization processing once more. It is worth noting that the computational pipeline of this module maintains linear complexity throughout the entire process, ensuring high computational efficiency. As shown in Table 1, we have used pseudocode to further detail the computational process of SS2D.
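The scan, process, inverse-scan, and merge steps can be sketched as follows. This is a simplified illustration, not the authors' pseudocode: each block is reduced to a single number, scan orders are 0-based index lists, and `running_sum` is a hypothetical order-dependent stand-in for the S6 module.

```python
def scan(blocks, order):
    """Reorder blocks into a sequence following a 0-based scan order."""
    return [blocks[i] for i in order]

def inverse_scan(sequence, order):
    """Put each sequence element back at the position it was taken from."""
    restored = [None] * len(sequence)
    for position, i in enumerate(order):
        restored[i] = sequence[position]
    return restored

def running_sum(sequence):
    """Stand-in for S6 (assumption): a causal, order-dependent operation."""
    total, out = 0.0, []
    for value in sequence:
        total += value
        out.append(total)
    return out

def ss2d_merge(blocks, orders, process=running_sum):
    """Scan along each order, process, restore, and merge by summation."""
    merged = [0.0] * len(blocks)
    for order in orders:
        restored = inverse_scan(process(scan(blocks, order)), order)
        merged = [m + r for m, r in zip(merged, restored)]
    return merged
```

In the module described above, `orders` would hold five entries (one IE-scan order and four cross-scan orders); each pass is a single sweep over the sequence, so the total cost stays linear in the number of blocks.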

Table 1 Pseudocode for the SS2D calculation process.

The second path adopts a simplified processing pipeline, transforming features solely via a linear projection layer with an activation function. Finally, the output feature maps from both paths are element-wise multiplied to generate global features F₂ ∈ ${\mathbb{R}}^{h\times w\times c}$.

Attention-gated (AG) module

The local features extracted by the CNN branch excel at capturing fine details of cracks but are prone to incorporating irrelevant textures, artifacts, and other background noise. Conversely, the global features obtained by the improved Mamba model effectively compensate for the shortcomings of local features in extracting the overall structure and spatial relationships of cracks. It is important to note that indiscriminate fusion of these two feature types may dilute or even obscure core features due to redundant information, thereby compromising the model’s segmentation performance.

To realize the effective fusion of local feature F₁ and global feature F₂, an Attention-Gated (AG) module is constructed, as shown in Fig. 4. By automatically assigning feature weights, the fusion ratio of local and global features is controlled pixel-wise to better preserve the crack detail information and global relationship.

Specifically, F₁ and F₂ are first concatenated along the channel dimension to obtain the feature F_concat ∈ ${\mathbb{R}}^{h\times w\times 2c}$. Then, the number of channels is compressed by a 3 × 3 convolution to generate the intermediate feature F₃ ∈ ${\mathbb{R}}^{h\times w\times c}$. In the channel branch, F₃ is subjected to Global Average Pooling (GAP) to generate a channel descriptor, which is then passed through fully connected layers and a Sigmoid function to generate the channel weight ω₁ ∈ ${\mathbb{R}}^{1\times 1\times c}$. In the spatial branch, a 3 × 3 convolution is applied to F₃, followed by a Sigmoid function, to generate the spatial weight ω₂ ∈ ${\mathbb{R}}^{h\times w\times 1}$. The weights ω₁ and ω₂ obtained from the channel branch and the spatial branch are multiplied together to obtain the joint attention weight ω_f ∈ ${\mathbb{R}}^{h\times w\times c}$. Finally, local and global features are fused based on the joint attention weight ω_f:

$$F_{fusion} = \omega_{f} \cdot F_{1} + (1 - \omega_{f} ) \cdot F_{2}$$

(11)

where F_fusion ∈ ${\mathbb{R}}^{h\times w\times c}$ denotes the fused features.
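The pixel-wise fusion in Eq. (11) can be sketched as below. The sketch assumes the channel and spatial weights have already been computed and are passed in as plain lists (the convolution, pooling, and fully connected layers that produce them are omitted), and it forms the joint weight as their product at each position and channel. The name `ag_fuse` is hypothetical.

```python
def ag_fuse(f1, f2, channel_w, spatial_w):
    """Fuse local (f1) and global (f2) features following Eq. (11).

    f1, f2:     nested lists of shape h x w x c (assumed layout)
    channel_w:  list of c weights in [0, 1]      (omega_1)
    spatial_w:  nested list h x w of weights     (omega_2)
    """
    fused = []
    for y, row in enumerate(f1):
        fused_row = []
        for x, pixel in enumerate(row):
            fused_pixel = []
            for ch, local_value in enumerate(pixel):
                w = channel_w[ch] * spatial_w[y][x]  # joint weight omega_f
                fused_pixel.append(w * local_value + (1.0 - w) * f2[y][x][ch])
            fused_row.append(fused_pixel)
        fused.append(fused_row)
    return fused
```

A joint weight near 1 keeps the local CNN feature at that pixel and channel, while a weight near 0 falls back to the global feature, which is how the fusion ratio is controlled pixel-wise.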

Experiments

Dataset

To ensure the reliability and comprehensiveness of the experiment, we employed three publicly available crack segmentation datasets for testing and evaluation—specifically Crack500⁴¹, CrackTree260⁴², and CrackForest⁴³. These datasets encompass a variety of challenging samples, including different lighting conditions, shadow interference, similar background interference, complex crack morphology, and fine cracks.

Crack500 encompasses images gathered from road surfaces constructed with diverse materials, such as asphalt and concrete. These images capture a wide spectrum of lighting conditions and crack degradation levels, and are accompanied by highly accurate pixel-wise semantic annotations. The Crack500 dataset was randomly divided into a training set and a test set, containing 2,244 and 1,124 images, respectively.

CrackTree260 and CrackForest consist of 260 and 118 images with pixel-wise annotations, respectively. Primarily, these datasets feature images of small cracks and those with intricate background noise. Given the modest size of these two datasets, we combined them into a unified dataset and standardized the image dimensions to eradicate resolution disparities. As shown in Fig. 5, to reduce the risk of overfitting caused by the limited number of images and improve the network’s generalization performance, we enhanced the original data by reducing image brightness, adding Gaussian noise, rotating, and cropping the images. We named the merged and enhanced dataset Crack756, which consists of 529 training samples and 227 test samples.

Evaluation indicators

In line with related work in the field¹, we assess the model’s segmentation accuracy using four metrics: Precision, Recall, F1, and mIoU. Furthermore, we assess the model’s efficiency by considering the number of parameters and the number of floating-point operations (FLOPs):

$$\text{Precision} = \frac{TP}{TP + FP}$$

(12)

$$\text{Recall} = \frac{TP}{TP + FN}$$

(13)

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

(14)

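$$mIoU = \frac{1}{N}\sum\limits_{i = 1}^{N} {\frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}}$$

(15)

where TP, FP, and FN represent the numbers of true positive, false positive, and false negative pixels, respectively, the subscript i indexes the classes, and N denotes the number of classes.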
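The four accuracy metrics can be computed from binary masks as in the following sketch. It assumes flattened masks with 1 for crack pixels and 0 for background, and it averages IoU over these two classes; the two-class averaging is an assumption, since the text does not state which classes enter the mean. The function names are hypothetical.

```python
def class_iou(pred, target, cls):
    """IoU for one class: TP / (TP + FP + FN)."""
    tp = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, target) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, target) if p != cls and t == cls)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def segmentation_metrics(pred, target):
    """Precision, Recall, F1 for the crack class, and two-class mIoU."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Assumption: mean over the background (0) and crack (1) classes.
    miou = (class_iou(pred, target, 0) + class_iou(pred, target, 1)) / 2
    return {"precision": precision, "recall": recall, "f1": f1, "miou": miou}
```

The zero-denominator guards return 0.0 for empty masks; that convention is a choice made for this sketch rather than something specified in the paper.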

Experimental setup

The project was implemented using Python 3.11 and the PyTorch 2.3.1 deep learning framework. For the training process, the Adam optimizer was deployed with an initial learning rate set to 0.01, a batch size of 2, and a total of 150 training epochs. Detailed configurations are outlined in Table 2:

Table 2 Experimental platform.

Analysis of experiments results

The training loss curves for the proposed method and the comparison methods are shown in Fig. 6, where (a) shows the loss curve on Crack500 and (b) shows the loss curve on Crack756. All models tend to stabilize after approximately 50 epochs of training. As shown in Fig. 6, our method achieves smaller training losses on both datasets.

Results on the Crack500

We first evaluated the segmentation performance of various models on the Crack500 dataset, which comprises a large number of highly challenging segmentation instances. These instances place higher demands on the segmentation capabilities of the models, and the results are presented in Table 3.

Table 3 Experimental results of different models on the Crack500.

Compared to models based on Mamba and Transformer architectures, CNN-based models generally performed less well. Among these, DeepCrack improved the F1-score to 68.37% via multi-scale fusion techniques. However, the inherent limitations of CNNs in modeling global context hinder further improvements in segmentation performance when dealing with complex background interference.

Among Transformer-based models, SwinCrack effectively captures long-range dependencies through self-attention mechanisms and achieves the highest Precision (73.38%). Nevertheless, IEM-UNet excels across all models in overall performance: it outperforms all other models in three key metrics—Recall, F1-score, and mIoU—with values of 73.45%, 73.01%, and 72.93% respectively. Moreover, compared to other Mamba-based models, IEM-UNet yields performance gains of at least 2.29%, 0.74%, 0.88%, and 1.02% across the four metrics.

Results on the Crack756

The performance evaluation results on the Crack756 dataset are shown in Table 4. The IEM-UNet model delivered outstanding results across four key metrics: Precision (82.64%), Recall (81.53%), F1-score (81.615%), and mIoU (80.47%). Notably, IEM-UNet exhibited exceptional performance in Recall and mIoU, outperforming all comparative models by at least 1.06% and 0.7%, respectively. While it did not claim the top spot in Precision and F1-score—where the SwinCrack model achieved the highest values of 83.81% and 81.73%, respectively—IEM-UNet still secured second place through its robust performance.

Table 4 Experimental results of different models on the Crack756.

The performance enhancement supports the efficacy of the importance-enhanced dynamic scanning strategy employed in IEM-UNet. This strategy adapts the scanning sequence to the characteristics of the cracks, accentuating critical crack information while suppressing interference from background noise. Combined with the local detail extraction capabilities of CNNs, it substantially improves segmentation performance.

Computational efficiency analysis

Regarding computational efficiency, this study conducted experiments on the Crack756 dataset. The corresponding experimental results are presented in Table 5. On the one hand, models based on the Mamba architecture are significantly more efficient than those based on CNN and Transformer architectures: on both key efficiency metrics, the number of parameters and the number of floating-point operations (FLOPs), the Mamba series models outperform traditional architectures. On the other hand, IEM-UNet, with only 10.6 M parameters and 6.2 G FLOPs, is the most efficient of all tested models. Compared with the other models, its number of parameters and FLOPs are reduced by at least 30.26% and 27.06%, respectively.

Table 5. Comparison of the parameters and FLOPs of different models on Crack756.

Although IEM-UNet did not achieve the best result on every segmentation accuracy metric, this reflects the trade-off between model simplicity and segmentation accuracy. The experimental evaluations show that our method achieves a good balance between the two.

Visualization results

The performance differences among the methods are most visible when segmenting cracks in complex scenes. We selected the segmentation results of several complex samples for visual analysis, as shown in Fig. 7.

In the first row of the figure, where abnormal objects such as manhole covers interfere, traditional methods mainly show missed detections and broken topology in dense crack regions (highlighted by the yellow box). Rows 2 and 3 show samples whose backgrounds contain elements that closely resemble cracks; in such cases, false detections along crack edges increase significantly. Rows 4 and 5 present a sample with lane-line interference and a sample under normal conditions, respectively. Row 6 shows segmentation results under shadow interference: for the other segmentation models, when crack edges overlap with shadow edges, errors such as discontinuous segmentation tend to occur in the overlapping areas, which lowers segmentation accuracy.

In contrast, our proposed method effectively extracts both local and global features of cracks through CNN and Mamba fusion, successfully restoring more comprehensive crack details while ensuring good continuity. It outperforms traditional counterparts in these challenging scenarios.

Ablation experiment

To evaluate the individual impact of the IE-Scan module, the CNN module, and the feature fusion modules (AG modules) on the overall performance of the model, we designed six ablation experiments on the Crack500 dataset. The experimental results are shown in Table 6.

Table 6. Experimental results of different modules on Crack500.

Compared with the initial Mamba model (Scheme ①), the model incorporating the proposed IE-Scan module (Scheme ②) improves on multiple performance metrics. Both Precision and Recall improved substantially, indicating that the module reduces false positives and false negatives. Moreover, F1 and mIoU increased by 2.56% and 3.67%, respectively, suggesting that the importance scanning strategy makes the model more sensitive to global key features.

Building upon Scheme ②, Scheme ③ further incorporates a CNN branch for local feature extraction. We fuse local and global features by concatenating them along the channel dimension. On one hand, experiments comparing Scheme ② and Scheme ③ demonstrate that the CNN branch enhances segmentation performance.

On the other hand, to evaluate the effectiveness of the feature fusion strategy, we replaced the channel concatenation used in Scheme ③ with the feature fusion approach adopted in this paper (labeled as Scheme ⑥). The comparison shows that Scheme ⑥ achieves higher segmentation accuracy than Scheme ③, which supports the effectiveness of the proposed feature fusion method.

Furthermore, in Scheme ⑤, we directly added local and global features for fusion. Compared to Scheme ⑤, Scheme ④ achieved improvements of 2.64%, 0.59%, 1.21%, and 1.70% across four key metrics. These experiments collectively validate the significant advantages of our feature fusion strategy from multiple dimensions.
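
The three fusion variants compared in this ablation can be contrasted with a small Python sketch. It is a simplified illustration, not the AG module itself: each feature map is represented as a list of pixels, each pixel as a list of channel values, and the gated variant assumes that the adaptive weight is a single sigmoid gate per pixel computed from a linear projection of both features and applied as a convex combination. The function names and the parameters w_local, w_global and bias are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fuse_concat(local, global_):
    """Channel concatenation (as in Scheme ③): each pixel keeps both channel sets."""
    return [l + g for l, g in zip(local, global_)]


def fuse_add(local, global_):
    """Direct element-wise addition (as in Scheme ⑤): equal, fixed weights."""
    return [[a + b for a, b in zip(l, g)] for l, g in zip(local, global_)]


def fuse_gated(local, global_, w_local, w_global, bias=0.0):
    """Pixel-wise adaptive weighting (assumed form): gate * local + (1 - gate) * global."""
    fused = []
    for l, g in zip(local, global_):
        # One scalar gate per pixel, predicted from both feature vectors.
        logit = bias
        logit += sum(w * a for w, a in zip(w_local, l))
        logit += sum(w * b for w, b in zip(w_global, g))
        gate = sigmoid(logit)
        fused.append([gate * a + (1.0 - gate) * b for a, b in zip(l, g)])
    return fused
```

Concatenation and addition treat every pixel identically, whereas the gated form lets the weight given to local detail versus global context vary from pixel to pixel, which is the property the ablation attributes to the proposed fusion strategy.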

Discussion

Extensive experiments across three public datasets demonstrate that IEM-UNet achieves segmentation accuracy on par with other advanced models while maintaining a lightweight architecture. The strength of the Mamba module lies in its capability to model long-range dependencies, which is advantageous for dense prediction tasks such as crack segmentation. However, real-world cracks exhibit irregular morphologies (e.g., tree-like or mesh-like structures), rendering Mamba’s conventional horizontal or vertical sequential scanning suboptimal. To address this, IEM-UNet introduces an importance-enhanced dynamic scanning mechanism, which adaptively adjusts the scanning order based on crack features instead of relying solely on a fixed scanning method. Nevertheless, capturing detailed crack information with Mamba alone remains challenging. Thus, we reintroduce a CNN branch, which supplies supplementary local features to enhance IEM-UNet’s segmentation accuracy for cracks.
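
The contrast between a fixed scan and a content-dependent scan can be sketched in a few lines of Python. This is a conceptual illustration only and does not reproduce the IE-Scan module: it assumes that each image patch has already been assigned a scalar importance score and that patches are visited from most to least important, and all function names are hypothetical.

```python
def importance_scan_order(scores):
    """Patch indices sorted from highest to lowest score (ties keep raster order)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def apply_order(tokens, order):
    """Rearrange raster-ordered tokens into the given scan order."""
    return [tokens[i] for i in order]


def restore_order(scanned, order):
    """Invert apply_order so outputs line up with the original patch positions."""
    restored = [None] * len(order)
    for position, index in enumerate(order):
        restored[index] = scanned[position]
    return restored
```

A fixed horizontal scan corresponds to the identity order 0, 1, 2, and so on regardless of image content. With a score-dependent order, the sequence fed to the state space model changes with the image, and restore_order maps the per-token outputs back to their spatial positions before decoding.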

Nonetheless, IEM-UNet’s element-wise adaptive fusion module incurs additional computational overhead. Future work will focus on designing more efficient fusion modules or exploring lightweight alternatives to further optimize the framework’s computational efficiency without compromising accuracy.

Conclusion

To address the issue that traditional Mamba models struggle to adapt to the complex morphology of cracks, resulting in reduced segmentation accuracy in intricate scenarios, and the high computational costs associated with conventional methods, we propose a lightweight crack segmentation network, IEM-UNet. This network delivers effective improvements in crack detection accuracy while achieving significant reductions in model parameters by optimizing the scanning path of the Mamba model and enhancing mechanisms for extracting and fusing global and local features. Experimental validation on three public datasets (Crack500, CrackTree260, CrackForest) demonstrates that IEM-UNet achieves state-of-the-art segmentation accuracy while reducing model parameters by 23.68% compared to comparable methods, showcasing its superiority in both precision and efficiency. Therefore, the following conclusions are drawn:

(1) The proposed importance-enhanced dynamic scanning mechanism effectively helps the model perceive the complex morphology of cracks, thereby enhancing its ability to capture global features.
(2) The fusion structure of CNN and improved Mamba can more efficiently extract fine-grained local features of cracks and organically combine them with global context modeling. These feature components are fused through a pixel-level adaptive weight module, achieving efficient integration of multi-scale features. This will effectively enhance the model’s segmentation performance for complex cracks and promote the further deployment and application of this model on mobile devices.

Future research will focus on three primary directions: (1) further exploring Mamba-based architectures to handle complex scenarios such as occlusion and variable lighting conditions; (2) constructing large-scale, high-precision crack datasets with diverse environmental backgrounds to enhance model generalization; and (3) optimizing IEM-UNet for edge computing deployment to enable real-time infrastructure inspection in resource-constrained settings. These extensions aim to solidify the practical utility of the proposed framework in real-world transportation infrastructure maintenance.

Data availability

The data used in this study can be obtained from the corresponding authors.

Funding

This research was funded by the Pilot Project for Building a Country with Strong Transportation Network of the Research Institute of Highway, Ministry of Transport (grant number QG 2021-3-15-5), and the Key Science and Technology Project of the Transportation Industry (grant number 2021-ZD1-032).

Author information

Authors and Affiliations

Research Institute of Highway, Ministry of Transport, Beijing, 100088, China
Yunfeng Wang, Jie Jin & Lu Zhang
Key Laboratory of Operation Safety Technology for Transportation Vehicles of Transport Industry, Beijing, 100088, China
Yunfeng Wang, Jie Jin & Lu Zhang
Research and Development Center of Automated Construction Technology of Transport Industry, Beijing, 100088, China
Yunfeng Wang, Jie Jin & Lu Zhang
China Automotive Engineering Research Institute Co.,Ltd, Chongqing, 401122, China
Xiong Chen & Ze Wu

Contributions

J.J. and Y.W. mainly proposed the concept and methods. Y.W. and Z.W. were responsible for the algorithm implementation. J.J., L.Z., and X.C. verified the overall content. X.C. and Z.W. preprocessed the experimental data. Y.W. was responsible for writing, editing, and reviewing the draft.

Corresponding author

Correspondence to Jie Jin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, Y., Jin, J., Chen, X. et al. A lightweight crack segmentation network based on the importance-enhanced Mamba model. Sci Rep 15, 41549 (2025). https://doi.org/10.1038/s41598-025-25504-4

Received: 22 August 2025
Accepted: 21 October 2025
Published: 24 November 2025
Version of record: 24 November 2025
DOI: https://doi.org/10.1038/s41598-025-25504-4
