Abstract
Camouflaged object segmentation (COS) is a challenging task in computer vision where the objective is to recognize and precisely separate objects that blend in with their environment. Traditional models, including the standard UNet architecture, struggle with this task due to ambiguous object boundaries, texture similarity between object and background, and over-segmentation or under-segmentation caused by redundant skip connections. CAMO-UNet addresses these issues by including residual blocks, which improve feature learning by easing gradient flow and enabling deeper architectures. The attention mechanism focuses on ‘what’ is important and ‘where’ important features are located in the spatial domain, and captures long-range dependencies across the image. A depth-aware triangular cyclic learning rate (CLR) dynamically adjusts learning rates at different network depths to enhance training efficiency. CAMO-UNet achieved 93.8% accuracy on benchmark datasets and outperformed state-of-the-art models like SINet, BGNet, PFNet, etc., in metrics including S-measure, F-measure, MAE, and accuracy.
Introduction
In the animal kingdom, prey frequently adapt to appear like their surroundings to trick predators1. Some animals use natural camouflage2 to blend in with their environment and manipulate their body color and structure to avoid being seen by predators. Researchers have used Camouflaged Object Detection (COD)3 techniques to find and examine hidden objects. Modern COD approaches employ advanced instance segmentation techniques to precisely identify and outline hidden objects. State-of-the-art frameworks like Mask R-CNN4 and its enhanced variant MS R-CNN5 utilize region-based convolutional networks to achieve accurate segmentation. More recent innovations, such as CenterMask6 and BlendMask7, incorporate attention mechanisms and feature blending to better handle complex camouflage patterns. These models typically leverage powerful backbone architectures like ResNet50 and ResNet101-FPN8 for robust feature extraction. The task proves significantly more demanding than conventional salient object detection9,10, as camouflaged targets often exhibit minimal contrast with their surroundings and may appear fragmented due to occlusion. An example is shown in Fig. 1, where the occluded part of the object gives the impression of two separate objects in the image; the proposed work also considers and attempts to overcome this issue. This complexity has driven the development of specialized solutions like SOLO11, which implements a novel grid-based segmentation approach, and transformer-based models that capture long-range visual dependencies. Evaluation of COD systems requires careful consideration, with metrics like the structure measure12 providing a nuanced assessment of segmentation quality. As the field progresses, researchers are exploring innovative directions, including self-supervised learning and multi-modal data fusion, to further improve detection capabilities in real-world scenarios.
The renowned UNet model13 used for medical image segmentation failed to produce satisfactory results for camouflaged or occluded images. The UNet architecture comprises encoder and decoder stages, where the encoder extracts the features necessary for describing the image, and the decoder uses this extracted information to reconstruct the image. The encoder progressively reduces the spatial dimensions, which also makes such models appropriate for classification problems. UNet adds skip connections between the corresponding encoder and decoder levels to alleviate this information bottleneck. UNet is a cutting-edge segmentation model owing to its elegant design; however, its skip connections can lead to over- and under-segmentation.
Camouflage can occur in almost every field, such as medical diagnosis14, detection of defective items15, identification of agricultural pests, etc., and it demands extensive examination to avoid hazardous situations. The near-identical color, texture, brightness, etc., between camouflaged items and their surroundings makes Camouflaged Object Detection (COD) a challenging task. Further, low-level features like edge, brightness, color, gradient, and texture extraction don’t always seem enough because well-done camouflage is adept at obscuring them16. The authors in Ref.17 proposed a model that can classify and segment camouflaged images and also came up with a dataset, CAMO, for classifying an image as camouflaged or not. The work in Ref.18 introduces a camouflage fusion learning (CFL) framework to segment different instances of camouflaged objects. In numerous applications, including image segmentation19,20, adding an attention mechanism to the network has proved fruitful. The attention mechanism is a process that can draw attention to information that is relevant to the job at hand in both the channel and spatial directions while ignoring irrelevant data. The researchers in Ref.8,21 demonstrate how to segment medical images using the U-Net while paying attention spatially to the context. Hu et al.19 proposed a block composed of channel-wise attention for segmentation. The block significantly improved performance in the classification job and could draw attention to significant dimensions. A model given by Li et al.22 uses channel attention for sentence-level training and spatial attention for the word level. The authors have produced a UNet architecture with a Self-attention Module for Retinal Vessel segmentation. An extended version of the UNet model is also used in Ref.23 to segment parasites from the background. In Ref.24, the authors describe a technique to optimize object detection in camouflaged image tasks and present a comparative analysis of existing studies. Applications of UNet can also be found in Ref.25, where a modified U-Net is used for semantic segmentation of satellite imagery data. The attention mechanism of the proposed approach is used to extract and manipulate the spatial and channel-wise information to enhance the representation and capture relevant features for a given task. Traditional models such as UNet, while effective for general segmentation tasks, often struggle to accurately delineate camouflaged or occluded regions. Consequently, there is a growing need for advanced architectures capable of learning complex features and emphasizing subtle cues that distinguish target regions from the background. The motivation behind developing the CAMO-UNet model stems from the limitations observed in existing approaches when applied to intricate segmentation tasks. By integrating attention mechanisms and refined connectivity patterns into the foundational UNet architecture, CAMO-UNet is designed to selectively enhance relevant spatial and contextual information. CAMO-UNet goes beyond the standard UNet by incorporating three types of attention: Spatial Attention (SA), which focuses on “where” in the image to look; Channel Attention (CA), which highlights the most informative feature channels; and Self-Attention, which captures long-range dependencies and global context within feature maps. Most UNet variants use either spatial or channel attention, but this work uses all three synergistically, improving feature discrimination for subtle camouflage patterns.
Moreover, the escalating complexity and parameter density of modern deep learning models pose additional concerns, especially in specialized tasks like camouflaged image segmentation, where overfitting and computational overhead can hinder performance. CAMO-UNet aims to strike a balance between model expressiveness and efficiency, offering a focused solution to these challenges through targeted architectural innovations. CAMO-UNet introduces a layer-wise dynamic learning rate schedule based on network depth: shallow layers receive smaller updates to retain basic features, while deeper layers learn high-level abstractions more aggressively.
Related work and research questions
This section examines previous research on salient and camouflaged object segmentation. To segment salient objects, we must concentrate on finding such exceptional and discriminative patches. However, camouflaged items typically blend into the backdrop environment by lowering their level of discrimination. Early studies of camouflage segmentation aimed to identify the foreground even when some of its texture matches the background26,27. However, owing to the considerable resemblance between the foreground and background, none of these approaches perform satisfactorily in segmenting camouflaged objects in non-uniform background images. Researchers have created deep neural networks28,29 to separate camouflaged objects from the background and have applied them to massive amounts of data. This has resulted in superior performance in a variety of computer vision applications. Le et al.17 presented a network that includes a double path, one for classification and the other for segmentation, for segmenting camouflaged objects. In addition, a camouflaged object (CAMO) image dataset was created in the process. The authors in Ref.26,30 presented a framework to enhance and refine feature maps, as well as an optimized algorithm based on the best feature nodes to reduce the complexity of the model. The authors in Refs.31,32 have shown that boundary information is helpful in improving the performance of camouflaged image processing. The study by Sun et al.33 proposes a context-aware cross-level fusion network that effectively captures semantic features across multiple levels using attention mechanisms. By integrating context-awareness, the model refines object boundaries, which is essential for accurately detecting subtly camouflaged regions. This approach achieves strong segmentation performance by balancing high-level contextual understanding with low-level spatial details. Uesugi et al.34 introduce an edge-aware and reversible recalibration network that adaptively enhances spatial features and emphasizes structural information, enabling accurate and efficient camouflaged object detection suitable for real-time applications, particularly in low-contrast scenarios. Jing et al.35 propose a gradient-based learning approach using implicit supervision within a coarse-to-fine architecture, which enhances edge prediction and achieves competitive camouflaged object detection performance with fewer parameters by leveraging gradient cues over dense annotations. Properly segmented camouflaged images can also be obtained if noise can be reduced, as mentioned in Ref.36, or by removing shadows from the images37. A properly segmented image can lead to proper detection of objects38. The authors in Ref.39 propose a novel 3D imaging system that integrates generative deep learning into asynchronous structured light. This can improve depth estimation and feature recovery, which are conceptually linked to the attention and residual mechanisms used in CAMO-UNet to resolve hidden or ambiguous visual cues. The research in Ref.40 concerns prompt-based emotion classification and deals with learning diversified prompts that help models extract emotionally salient features from images, recognizing non-obvious cues through learned guidance, similar to CAMO-UNet.
In Ref.41, the embedded cross framework aims to preserve fine details and structural features in salient object detection, similar to how CAMO-UNet tackles low-contrast and ambiguous boundaries in camouflaged images. It also enhances robustness under complex visual conditions, a shared challenge with camouflaged object detection. In the proposed scenario, the relationship between pixels over short and long ranges is considered, and pattern changes are also taken into account42. The different segmentation techniques related to camouflaged images used by researchers are presented in Table 1.
Research questions
The intrinsic complexity of camouflaged images, where the foreground fades nearly seamlessly into the background, makes it difficult for traditional segmentation models like U-Net to perform well.
To solve this problem, several models based on deep learning have been proposed, such as feature refinement methods and attention mechanisms. Nevertheless, over-segmentation, loss of fine features, and sensitivity to background noise are some of the drawbacks of current techniques. Even though methods like residual learning and channel-wise attention have been studied recently, an optimal strategy that increases segmentation accuracy while preserving computational economy is still required. By combining spatial, channel, and self-attention mechanisms and improving feature extraction through residual connections, the CAMO-UNet model seeks to address these problems. We establish the following research questions in order to methodically examine these contributions.
-
[RQ1] How can deep learning techniques, particularly U-Net and attention mechanisms, be optimized to improve camouflaged image segmentation? (Discussed in “Proposed CAMO-UNet framework” section, which covers the integration of residual blocks and attention mechanisms into U-Net to improve segmentation.) To address this, we propose CAMO-UNet, a modified version of the traditional U-Net architecture that incorporates residual blocks and multilevel attention mechanisms (spatial, channel, and self-attention). These additions significantly enhance the model’s ability to extract subtle features and delineate ambiguous object boundaries in camouflaged scenes.
-
[RQ2] How does incorporating spatial and channel-wise attention mechanisms improve the accuracy and robustness of camouflaged object detection? (Discussed in “Attention blocks” section) To address this, the attention modules are used to reduce segmentation errors and make the model more robust to visual ambiguity in camouflaged scenes. Increasing the number of convolution blocks ensures that low-level characteristics such as boundaries are highlighted.
-
[RQ3] How does CAMO-UNet compare to other similar works in terms of accuracy, computational efficiency, and performance across different datasets? (Discussed in “Comparison results” section, Tables 3, 4) To address this, note that existing works typically use uniform learning rate schedules, whereas this approach matches the learning behavior to the complexity of the features in each layer. The model also incorporates an attention weight factor into the learning rate decay formula, which ensures that layers with higher attention receive more refined updates, improving precision. The proposed model achieves higher accuracy, F-measure, and lower MAE than existing models like SINet, BGNet, PFNet, etc. Extensive ablation studies confirm the value of each proposed module.
-
[RQ4] What role do residual connections and connectivity patterns play in enhancing segmentation performance in complex camouflaged images? (Discussed in “Residual block” section) To achieve this, the CAMO-UNet model incorporates connectivity patterns similar to UNet, including skip connections that enable the model to leverage features from earlier stages of the encoding process. These skip connections can help the model capture and retain critical information regarding camouflaged and occluded objects. By combining features from different scales and levels of abstraction, the CAMO-UNet model can improve the segmentation results.
It is advantageous to train the CAMO-UNet model with a varied data set that contains camouflaged image samples to increase the efficiency and performance of the model47,48. The model can adapt and generalize to unseen camouflage during inference by being exposed to a variety of camouflaging patterns and situations during training. Therefore, different data sets were used to perform the experiments. The remainder of this paper is organized as follows: “Related work and research questions” section highlights the work done by various researchers related to camouflaged images and image segmentation. “Results and discussions” section describes the datasets, preprocessing, experimental settings, and comparison results of the proposed approach. “Discussion” section analyses the findings and limitations. “Proposed CAMO-UNet framework” section describes the structure of the proposed CAMO-UNet model. Finally, “Conclusion and future work” section concludes the study.
Results and discussions
This section discusses the popular camouflaged image datasets and the data preprocessing step of the proposed approach.
Datasets
Camouflaged image segmentation has relatively few datasets, which makes image collection and the preparation of Ground Truth (GT) annotations difficult. The images used in this study were obtained from the COD10K and CAMO datasets. The COD10K3 dataset comprises 10,000 images exhibiting natural camouflage, whereas the CAMO17 dataset comprises 1250 images showing camouflage in the ecological environment. As camouflage pattern complexity increases, the visible surface area of the object inside the image is frequently challenging for humans to recognize. Thus, camouflaged object detection has garnered the attention of computer vision experts. A glimpse of the CAMO and COD10K dataset images and their ground truths is presented in Fig. 2.
Data preprocessing
The input image is passed through a few preprocessing steps. A resize operation is first applied to scale the image to 224 × 224 pixels while maintaining the aspect ratio. Normalization was then used to scale the pixel values to fall between 0 and 1, accomplished by dividing them by 255.0. Both operations are commonly used in image preprocessing pipelines to standardize the input data, such as resizing images to an acceptable size and normalizing pixel values for better numerical stability during model training or inference. The resultant normalized image was presented for augmentation. The practice of expanding the input data into a larger sample space is known as data augmentation. Classification models permit the alteration of images when the dataset is small; for segmentation, however, every transformation must be applied to both the image and its mask so that they undergo the same change. The flip and affine functions introduce variations in the dataset by flipping images horizontally and applying random translations. These changes were added to the dataset during the preprocessing stage, which improved the robustness and generalization capabilities of the model. For the COD10K dataset, the number of images was increased to 12,000 for training and 8,000 for testing. The shadow detection process37 was also performed as part of preprocessing.
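The following is a minimal sketch of such a preprocessing pipeline in TensorFlow. The function name, the padding-based resize, and the ±10-pixel translation range are illustrative assumptions rather than the authors’ exact implementation; the key point is that every random transformation is applied identically to the image and its mask.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # target size from the preprocessing description

def preprocess_pair(image, mask, training=True):
    """Eager-mode sketch: resize, normalize, and apply identical augmentation to image and mask."""
    # Resize with padding so the aspect ratio is preserved; nearest keeps mask labels crisp.
    image = tf.image.resize_with_pad(image, *IMG_SIZE)
    mask = tf.image.resize_with_pad(mask, *IMG_SIZE, method="nearest")
    image = tf.cast(image, tf.float32) / 255.0        # scale pixel values to [0, 1]

    if training:
        # Horizontal flip, applied to both tensors so they stay aligned.
        if tf.random.uniform(()) > 0.5:
            image = tf.image.flip_left_right(image)
            mask = tf.image.flip_left_right(mask)
        # Small random shift as a simple stand-in for an affine translation
        # (tf.roll wraps pixels around, which is acceptable for small offsets).
        shift = tf.random.uniform((2,), minval=-10, maxval=10, dtype=tf.int32)
        image = tf.roll(image, shift=shift, axis=[0, 1])
        mask = tf.roll(mask, shift=shift, axis=[0, 1])
    return image, mask
```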
Evaluation settings
The experiments on the CAMO-UNet model were conducted on various datasets, with all images resized to 256 \(\times\) 256 pixels and converted to grayscale. Grayscale images were used because the UNet model structure, which was used as a reference, performs well on grayscale images. Table 2 shows the experimental setup for the proposed method.
The model was implemented using TensorFlow with the Keras backend in Python. For each epoch, measurements of accuracy and the Dice coefficient were used to track the training process. To avoid overfitting, early stopping was performed based on the validation loss value, with a patience value of 2. To facilitate faster deep learning model execution, experiments were run on the Google Colab Pro platform, which provides 25 GB RAM, a 147 GB hard disk, and an on-demand GPU. Experiments were also performed in the PyCharm IDE on a platform with a 1920-core graphics processor. Overall, these experimental settings and infrastructure were chosen to ensure the efficient training and evaluation of the CAMO-UNet model on the selected datasets.
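As a small illustration of the early stopping setup described above (the exact training script is not reproduced in the paper, so this is only a hedged sketch using a standard Keras callback):

```python
from tensorflow import keras

# Stop training when the validation loss stops improving, with a patience of 2 epochs,
# and roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                           restore_best_weights=True)
# During training this would be passed as model.fit(..., callbacks=[early_stop]).
```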
Performance metrics
Comparing the anticipated segmentation with the actual segmentation is one way to evaluate a segmentation algorithm. The CAMO-UNet model was evaluated using performance metrics such as accuracy, Cross-Entropy Loss, Focal Loss, MAE, and Dice Coefficient.
From Fig. 3, to obtain accuracy, the model was trained with focal and cross-entropy losses; the red line indicates the validation loss and the green line indicates the training loss. Cross-entropy loss works well when the classes are balanced and there are no extreme class imbalances. The unstable nature of the graph is owing to the imbalanced nature of the dataset. In the context of the CAMO-UNet model, MAE was used as a metric to evaluate the accuracy of the predicted segmentations compared to the ground truth segmentations49,50. The lower the MAE, the closer the predicted segmentations are to the ground truth, indicating better performance of the model. The MAE results of CAMO-UNet are listed in Tables 3 and 4, where they are compared with other existing models. The F2 measure has also been considered, which is commonly used to evaluate the performance of binary classification models, particularly when the focus is on optimizing recall (sensitivity) rather than precision. It is an extension of the F1 measure that puts more emphasis on recall. S-measure, or Structure Similarity Measure51, is a metric used to evaluate the structural similarity between two images. It is based on the concept of the Structural Similarity Index (SSIM), which measures the similarity in terms of luminance, contrast, and structure between two images52,53. In the CAMO-UNet model, the F2 measure is used to evaluate the performance of the model in segmenting camouflaged objects in images54. It takes into account both precision and recall, providing an overall assessment of the effectiveness of the model.
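For concreteness, the sketch below shows one plausible TensorFlow formulation of the Dice coefficient, MAE, and F2 measures used to track training; the exact definitions in the authors’ code may differ (for instance in smoothing constants or thresholds), so these functions should be read as assumptions.

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1e-6):
    """Dice overlap between a predicted mask and the ground truth (values in [0, 1])."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def mae(y_true, y_pred):
    """Mean absolute error between predicted and ground-truth masks."""
    return tf.reduce_mean(tf.abs(tf.cast(y_true, tf.float32) - tf.cast(y_pred, tf.float32)))

def f_beta(y_true, y_pred, beta=2.0, threshold=0.5, eps=1e-6):
    """F-beta score; beta = 2 weights recall more heavily than precision (the F2 measure)."""
    y_true = tf.cast(y_true > threshold, tf.float32)
    y_pred = tf.cast(y_pred > threshold, tf.float32)
    tp = tf.reduce_sum(y_true * y_pred)
    precision = tp / (tf.reduce_sum(y_pred) + eps)
    recall = tp / (tf.reduce_sum(y_true) + eps)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall + eps)
```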
Adaptive learning rate with depth-aware triangular cyclic learning rate (CLR)
Since CAMO-UNet consists of an encoder–decoder structure with residual and attention mechanisms, the Cyclic Learning Rate (CLR) is used to account for feature extraction at different depths. Instead of a basic cyclic pattern, the effect of multiple stages is integrated into the learning rate schedule. The following sections describe the stages of optimization55,56.
Depth-aware triangular CLR
Static attention mechanisms may not always adapt efficiently to varying feature complexities in camouflaged images, so the Convolutional Block Attention Module (CBAM) is introduced to dynamically select features57,58. It recalibrates the channel and spatial attention weights based on feature importance. For a specific layer l in the encoder-decoder network, the cyclic learning rate is defined in terms of the following quantities, with the resulting schedule written out after the list:
-
\(\eta _t^l\) is the learning rate at iteration t for layer l.
-
\(C^l = \frac{\text {iteration} \mod (2 \times \text {step size}_l)}{2 \times \text {step size}_l}\) represents the cycle progress for layer l.
-
\(\text {step size}_l = \alpha \cdot 2^l\), where \(\alpha\) is a scaling factor that increases with deeper layers.
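Assuming the standard triangular CLR form, these quantities combine into the following layer-wise schedule, where \(\eta _{\text {min}}\) and \(\eta _{\text {max}}\) are the minimum and maximum learning rates; this is a reconstruction consistent with the definitions above rather than the authors’ exact equation:

$$\begin{aligned} \eta _t^l = \eta _{\text {min}} + \left( \eta _{\text {max}} - \eta _{\text {min}}\right) \max \left( 0,\; 1 - \left| 2C^l - 1\right| \right) , \end{aligned}$$

so that the learning rate of layer l rises linearly from \(\eta _{\text {min}}\) to \(\eta _{\text {max}}\) over the first half of each cycle and falls back over the second half.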
The deeper the layer, the larger the step size, ensuring that shallower layers update faster while deeper layers stabilize. The implementation of CLR was carried out in the following steps; a minimal code sketch follows the list:
-
Applied the depth-aware triangular CLR in the training stage to dynamically adjust learning rates at different depths of CAMO-UNet.
-
It was used with AdamW optimizer to improve weight decay handling and optimize training convergence.
-
It was integrated into encoder layers, where lower layers have lower learning rates to retain low-level features, while deeper layers benefit from higher learning rates for complex pattern learning.
-
Also applied to fine-tuning & transfer learning by gradually refining feature maps while preventing overfitting.
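A minimal, framework-agnostic sketch of this depth-aware schedule is given below. The function name, learning-rate bounds, and scaling constant are hypothetical; in practice the per-layer rate would be fed to the AdamW optimizer (for example via per-layer parameter groups) during training.

```python
def depth_aware_triangular_lr(iteration: int, layer_depth: int,
                              eta_min: float = 1e-5, eta_max: float = 1e-3,
                              alpha: int = 200) -> float:
    """Triangular cyclic learning rate whose cycle length grows with layer depth."""
    step_size = alpha * (2 ** layer_depth)                            # step size_l = alpha * 2^l
    cycle_progress = (iteration % (2 * step_size)) / (2 * step_size)  # C^l in [0, 1)
    scale = 1.0 - abs(2.0 * cycle_progress - 1.0)                     # triangular wave in [0, 1]
    return eta_min + (eta_max - eta_min) * scale

# Example: shallow layers (small depth) cycle quickly, deeper layers cycle slowly.
for depth in (0, 3):
    rates = [depth_aware_triangular_lr(t, depth) for t in range(0, 2000, 500)]
    print(depth, [round(r, 6) for r in rates])
```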
The overall learning rate schedule is defined in terms of the following quantities, with the resulting schedule written out after the list:
-
\(\eta _t\) is the learning rate at epoch t.
-
\(T_{\text {cur}}\) is the current iteration.
-
\(T_{\text {max}}\) is the maximum iteration count.
-
\(\eta _{\text {min}}\) and \(\eta _{\text {max}}\) are the minimum and maximum learning rates.
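These terms match the standard cosine annealing schedule, so a plausible reconstruction of the referenced formula (an assumption rather than a verbatim reproduction from the paper) is:

$$\begin{aligned} \eta _t = \eta _{\text {min}} + \frac{1}{2}\left( \eta _{\text {max}} - \eta _{\text {min}}\right) \left( 1 + \cos \left( \frac{T_{\text {cur}}}{T_{\text {max}}}\,\pi \right) \right) . \end{aligned}$$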
Exponential decay with attention scaling
Since self-attention modules help in refining segmentation maps, they require stable learning rates. The exponential decay CLR is therefore modified to include an attention factor \(A_l\), defined through the following terms; the resulting schedule is written out after the list:
-
\(A_l = 1 + \lambda \sum _{i=1}^{L} \text {Attention}_i^l\)
-
\(\lambda\) is a weight factor for attention influence.
-
\(\sum _{i=1}^{L} \text {Attention}_i^l\) represents the cumulative contribution of attention across all layers i for a specific layer l.
-
\(e^{-k t}\) ensures gradual decay of the learning rate.
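One plausible way to combine these terms, assuming that a larger attention factor should yield finer (smaller) updates as stated in the research questions, and with \(\eta _0\) an assumed initial learning rate, is:

$$\begin{aligned} \eta _t^l = \frac{\eta _0\, e^{-k t}}{A_l}, \end{aligned}$$

where the exponential term drives the gradual decay and \(A_l \ge 1\) scales the rate down for layers with stronger cumulative attention; this is a reconstruction consistent with the listed definitions rather than the authors’ exact equation.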
Comparison results
This section compares performance metrics such as recall, specificity, accuracy, F-measure, precision and S-measure with existing methods.
The prediction results are shown in Fig. 4. From the comparison results, it can be seen that CAMO-UNet predicts the shape of the object close to the GT. The results of CAMO-UNet were compared with other state-of-the-art models in terms of S-measure, F-measure, MAE, E-measure, and accuracy. The results indicate that CAMO-UNet works better than the other models. Experiments were done on the COD10K and CAMO datasets. Tables 3 and 4 present the results when the models are applied individually to the CAMO and COD10K datasets, respectively. Here, it can be seen clearly that CAMO-UNet works remarkably better than existing models. Figure 5 shows how CAMO-UNet consistently achieves superior accuracy and F-measure while maintaining a low MAE, indicating precise segmentation. Bar colors distinguish performance across datasets and models, with an inverted MAE axis to emphasize that lower error means better performance.
Figures 6 and 7 compare performance metrics such as accuracy, recall, precision, and specificity of different models across two datasets: (b) COD10K and (c) CAMO, where CAMO-UNet performs far better than state-of-the-art models. On the CAMO dataset (c), the proposed model exhibited approximately 97% accuracy, 95% recall, 98% precision, and 96% specificity, surpassing models such as JCNet [7], FSL [8], LINet [9], and DGNet [14]. For COD10K (b), the proposed model achieved approximately 97% accuracy, 97% recall, 99% precision, and 98% specificity, outperforming models such as JCNet [7], DGNet [14], MRR-Net [15], and BGNet [16]. Models like PFNet and BGNet suffer from over-segmentation due to over-reliance on skip connections. CAMO-UNet integrates attention-controlled skip connections and residual blocks, which preserve important features while suppressing noise.
Ablation study
Experiments on the COD10K and CAMO datasets were carried out, comparing the optimized CAMO-UNet with the standard CAMO-UNet.
Table 5 presents a performance comparison between the standard CAMO-UNet and the optimized CAMO-UNet (which includes the adaptive learning rate and the depth-aware triangular cyclic learning rate). The optimized CAMO-UNet consistently outperforms the standard version across all metrics.
Discussion
The experimental results of the extensive performance evaluation and comparative analysis underscored the robustness and superiority of the proposed CAMO-UNet model across multiple datasets. CAMO-UNet’s integration of spatial and channel attention significantly enhances its ability to delineate ambiguous borders that often confuse baseline models. The inclusion of attention-controlled skip connections and residual blocks helps reduce redundancy, a common issue in traditional UNet-based architectures. CAMO-UNet achieves the highest accuracy (93–94%) and F-measure (0.87–0.89), and matches the lowest MAE (0.03–0.04) on both the CAMO and COD10K datasets. The use of depth-aware CLR enables dynamic learning across network depths, leading to faster convergence and better generalization. CAMO-UNet consistently ranks at or near the top in all metrics on both the CAMO and COD10K datasets, especially in precision (0.93–0.94), F-measure (0.87–0.89), and S-measure and E-measure (0.91–0.92), with low MAE (0.03–0.04). Models like PFNet and BGNet suffer from over-segmentation due to over-reliance on skip connections. CAMO-UNet integrates attention-controlled skip connections and residual blocks, which preserve important features while suppressing noise. Most prior models use fixed learning schedules, which do not adapt well to feature complexity in deeper layers. CAMO-UNet employs a Depth-Aware Cyclic Learning Rate (CLR) to better train shallow and deep layers with tailored learning rates. However, despite its strengths, CAMO-UNet is not without limitations. The added attention modules and residual blocks introduce more parameters and computational overhead. Performance may be sensitive to the diversity and quality of training samples, especially in the case of rare camouflage patterns.
Proposed CAMO-UNet framework
This section describes the architecture of the proposed model.
The CAMO-UNet network overview
Achieving success with any segmentation algorithm on camouflaged images is challenging. The proposed CAMO-UNet network combines the benefits of the UNet, residual networks, and attention ideas to make the UNet architecture more easily understood and to concentrate on the required notion. It contains phases for the encoder and decoder, similar to UNet. In addition, attention and residual blocks were used to improve the accuracy of the model. The major feature of the CAMO-UNet model is the skip connections that allow the decoder to receive features from the earlier stages of the encoder, enabling fine-grained localization. The intricacy of the camouflage patterns, the quality and variety of the training data, and the precise implementation details of the CAMO-UNet algorithm are only a few of the variables that affect how well the algorithm performs on camouflaged images.
The model architecture is illustrated in Fig. 8. The proposed CAMO-UNet comprises an encoder, a decoder, Channel Attention (CA), Spatial Attention (SA), Self-Attention (Self-A), residual blocks, and skip connections. CAMO-UNet utilizes self-attention, channel attention, and spatial attention mechanisms to obtain both local and global dependencies between pixels within an input image. Spatial attention can enhance the discriminative power of a model by highlighting important regions while downplaying less informative regions. By applying channel attention, the model can learn to emphasize the channels that carry the most discriminative information to segment the camouflaged object. Self-attention is applied to the last layer to obtain long-range dependencies, which refer to the relationships between elements that are far apart in a sequence or spatial context.
Stages of encoder and decoder
The various stages of the Encoder and Decoder are discussed here.
-
Encoder: The input image was passed through the first convolutional block, where 32 filters were applied. This can be expressed as a convolution of the input tensor \(X_0\) using a kernel \(K_1\) of size \(3 \times 3 \times C \times 32\). The output, \(X_1\), is then passed through a max pooling operation, reducing its spatial dimensions.
$$\begin{aligned} X_1 = \text {Conv}(X_0, K_1), \end{aligned}$$ (4)
where \(K_1\) represents the set of 32 filters, followed by:
$$\begin{aligned} X_1' = \text {MaxPool}(X_1). \end{aligned}$$ (5)
The output from the first block, \(X_1'\), was fed into the second convolutional block, where 64 filters were applied. This operation is represented as the convolution of \(X_1'\) with kernel \(K_2\) of size \(3 \times 3 \times 32 \times 64\), producing \(X_2\). A max pooling operation downsamples \(X_2\), yielding \(X_2'\):
$$\begin{aligned} X_2 = \text {Conv}(X_1', K_2), \end{aligned}$$ (6)
$$\begin{aligned} X_2' = \text {MaxPool}(X_2). \end{aligned}$$ (7)
The process continues for two additional layers to obtain \(X_4\) and \(X_4'\). After the fourth block, two additional convolutional layers are applied. First, the output \(X_4'\) is convolved with kernel \(K_5\) of size \(3 \times 3 \times 256 \times 256\), producing \(X_5\). This is followed by another convolution using kernel \(K_6\) of size \(3 \times 3 \times 256 \times 256\), resulting in the final output, \(X_6\).
$$\begin{aligned} X_5 = \text {Conv}(X_4', K_5), \end{aligned}$$ (8)
$$\begin{aligned} X_6 = \text {Conv}(X_5, K_6). \end{aligned}$$ (9)
-
Decoder: The feature map is denoted as F, which is a tensor of size \(C \times R \times M \times N\), where C and R denote the channel dimensions of the feature map, and M and N are its height and width, respectively. Each element in F corresponds to a specific location in the spatial dimensions and channels, and the stored values represent activations or responses at those locations. Skip connection and upsampling block: The output from the previous blocks, denoted as a feature map F, and the feature map received from the encoder are connected through a skip connection and passed into the upsampling block. Upsampling improves the spatial resolution of the feature map to better match the original input size.
$$\begin{aligned} F_1 = \text {SkipConnection}(F_{\text {Encoder}}, F_{\text {Decoder}}), \end{aligned}$$ (10)
where \(F_{\text {Encoder}}\) represents the feature map from the encoder, and \(F_{\text {Decoder}}\) is the output from the decoder. After upsampling, the skip connection and the upsampled feature maps were concatenated. The concatenated feature map was processed through convolutional blocks to refine and extract the necessary features. The feature maps that pass through the network can be represented as:
$$\begin{aligned} F \in \mathbb {R}^{C \times R \times M \times N}, \end{aligned}$$ (11)
where C and R denote the channel dimensions, and M and N represent the height and width of the spatial dimensions, respectively. Each element in the feature map F represents the activation or response of specific patterns at particular spatial locations and channels. Upsampling block and resizing: The output from the final upsampling block was passed to the subsequent layer, where it was resized to match the input image dimensions. This can be written as:
$$\begin{aligned} F_{\text {Upsampled}} = \text {Resize}(F_{\text {FinalUpsampling}}, M_{\text {Input}}, N_{\text {Input}}), \end{aligned}$$ (12)
where \(M_{\text {Input}}\) and \(N_{\text {Input}}\) represent the height and width of the input image, respectively, and \(F_{\text {FinalUpsampling}}\) is the feature map from the last upsampled block. The upsampled feature map and the input image are combined to generate the final output. This can be represented as:
$$\begin{aligned} F_{\text {Combined}} = \text {Combine}(F_{\text {Upsampled}}, X_{\text {Input}}). \end{aligned}$$ (13)
Finally, a \(1 \times 1\) convolution layer was applied to the combined output to generate a segmentation mask.
$$\begin{aligned} F_{\text {SegmentationMask}} = \text {Conv}_{1 \times 1}(F_{\text {Combined}}). \end{aligned}$$ (14)
-
Residual block: Each residual block contains two convolutional blocks, each with three layers: a convolution layer, a batch normalization layer, and a ReLU activation layer. The output of each residual block is linked to that of the previous block through a residual (skip) connection (an illustrative sketch of the residual and attention blocks follows this list). Let X represent the input to the residual block and let \(W_1\) and \(W_2\) represent the weights of the convolution filters in the first and second convolution layers, respectively. The output of the residual block can be written as:
$$\begin{aligned} Y = \text {ReLU}(\text {BatchNorm}(\text {Conv}(X, W_2))) + \text {ReLU}(\text {BatchNorm}(\text {Conv}(X, W_1))) + X. \end{aligned}$$ (15)
This formulation shows that input X is added element-wise to the final output of the block through a skip connection. A demonstration of this is shown in Fig. 9. Skip connections are used between the downsampled blocks in the encoder and the upsampled blocks in the decoder. The output of each downsample block is added element-wise to the output of the corresponding upsample block to preserve the important features and facilitate information flow. Let \(D_i\) be the output of the i-th downsample block in the encoder and \(U_i\) be the output of the corresponding i-th upsample block in the decoder. The skip connection between them is mathematically expressed as follows:
$$\begin{aligned} U_i' = U_i + D_i, \end{aligned}$$ (16)
where \(U_i'\) is the new output of the upsample block after adding the output of the corresponding downsample block.
-
Attention blocks: Attention mechanisms capture crucial spatial and channel information at various levels, and the connectedness between layers enables the flow of information across the network. To ensure that the initial feature maps adequately captured the regional patterns and features, spatial attention was applied only to the top layer of the model. Applying channel attention to deeper layers allows the model to selectively emphasize or suppress specific channels in the feature maps at those layers. This section discusses the attention mechanisms that are added to the output of the encoder layers through a skip connection. The spatial attention layer in the CAMO-UNet model is shown in Fig. 10, highlighting the significant spatial regions for the task at hand. This helps the model consider relevant areas of the given image while suppressing less informative regions. Channel attention (CA), as shown in Fig. 11, is the channel attention block in the CAMO-UNet model that benefits the model by allowing it to selectively emphasize or suppress different channels of the feature maps. This helps the model to focus on the most informative and discriminative channels while downplaying less relevant channels. By learning the attention weights for each channel, the model can dynamically adapt its feature representation based on the importance of each channel to a given task. Self-attention (Self-A), in the CAMO-UNet model, is a key component used to capture spatial relationships within feature maps. It aids the model’s ability to concentrate on key areas and discover contextual relationships among various locations on feature maps. Mathematically, the self-attention mechanism in CAMO-UNet can be described as follows: the input to the self-attention mechanism is a feature map, denoted as
$$\begin{aligned} F \in \mathbb {R}^{C \times U \times V}, \end{aligned}$$
where U and V represent the spatial dimensions of the feature map (height and width) and C is the number of channels. Input feature maps F with dimensions \(C \times U \times V\) undergo operations such as global average pooling, global max pooling, concatenation, and fully connected layers to obtain channel-wise recalibrated feature maps. The output feature maps are reshaped and projected to obtain the query (Q), key (K), and value (L) tensors. The reshaped feature maps have dimensions \(C' \times U' \times V'\), and the projection matrices (\(W_Q\), \(W_K\), \(W_L\)) transform them into \(d \times U' \times V'\) shape. Self-attention is computed using the query (Q), key (K), and value (L) tensors. The attention scores were obtained by taking the dot product between Q and K transposed, followed by normalization using a softmax function. Normalized attention scores were applied to L to obtain self-attended feature maps. The self-attended feature maps were projected back onto the original channel dimensions using a projection matrix (\(W_S\)). The output is a self-attended feature map with dimensions \(C \times U' \times V'\).
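The sketch below illustrates, in Keras, how the residual block and the channel and spatial attention gates described above could be written. It is a simplified rendering under common conventions (a squeeze-and-excitation-style channel gate and a CBAM-style spatial gate); layer sizes, the channel-matching 1 × 1 convolution, and the reduction ratio are assumptions rather than the authors’ exact configuration.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two Conv-BN-ReLU stages with an identity shortcut (illustrative, not the exact paper code)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    if shortcut.shape[-1] != filters:                 # match channel count before the addition
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Add()([y, shortcut])

def channel_attention(x, reduction=8):
    """Squeeze-and-excitation style gate that re-weights feature channels."""
    channels = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)
    w = layers.Dense(channels // reduction, activation="relu")(w)
    w = layers.Dense(channels, activation="sigmoid")(w)
    return layers.Multiply()([x, layers.Reshape((1, 1, channels))(w)])

def spatial_attention(x):
    """Per-pixel gate built from pooled channel statistics (CBAM-style spatial attention)."""
    avg_pool = tf.reduce_mean(x, axis=-1, keepdims=True)
    max_pool = tf.reduce_max(x, axis=-1, keepdims=True)
    w = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        layers.Concatenate()([avg_pool, max_pool]))
    return layers.Multiply()([x, w])

# Tiny usage example: one encoder stage with both attention gates applied.
inputs = keras.Input(shape=(256, 256, 1))            # grayscale input, per the evaluation settings
features = residual_block(inputs, 32)
features = channel_attention(spatial_attention(features))
demo = keras.Model(inputs, features)
```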
Conclusion and future work
The outcome of the CAMO-UNet algorithm is good because it is trained on a diverse and representative dataset that includes a wide range of camouflaged images. In addition, incorporating data augmentation techniques specific to camouflage patterns and carefully tuning the hyperparameters of the model enhances its effectiveness. Overall, while CAMO-UNet provides a framework that incorporates attention mechanisms to handle complex image segmentation tasks, its effectiveness on camouflaged images ultimately depends on the specific challenges posed by camouflage patterns. This study compares its findings with those of previous studies and proposes an optimization approach for COD tasks. In certain investigations, preprocessing techniques, including transformation, image enhancement, and shadow removal, have been applied. The proposed approach effectively addresses the issue of camouflaged objects and offers the best solution compared with previous approaches. Testing on publicly accessible datasets against benchmark models, existing models, and the proposed CAMO-UNet model revealed that it performs optimally and correctly. In the future, we will develop a method to determine the optimum loss function for a given model to improve accuracy. In addition, we aim to design a model for run-time inference and sample-to-sample comparison analysis.
Data availability
The data that support the findings of this study are available from the corresponding author, Isha Padhy, upon reasonable request.
References
How, M. J. & Santon, M. Cuttlefish camouflage: Blending in by matching background features. Curr. Biol. 32, R523–R525. https://doi.org/10.1016/j.cub.2022.04.042 (2022).
Soofi, M. et al. Lichens and animal camouflage: Some observations from central Asian ecoregions. J. Threat. Taxa 14, 20672–20676 (2022).
Fan, D.-P., Ji, G.-P., Sun, G., Cheng, M.-M. & Shen, S. L. Camouflaged object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2774–2784. https://doi.org/10.1109/CVPR42600.2020.00285 (2020).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV) 2980–2988 (2017).
Huang, Z., Huang, L., Gong, Y., Huang, C. & Wang, X. Mask Scoring R-CNN. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 6402–6411. https://doi.org/10.1109/CVPR.2019.00657 (IEEE Computer Society, 2019).
Lee, Y. & Park, J. Centermask: Real-time anchor-free instance segmentation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 13903–13912. https://doi.org/10.1109/CVPR42600.2020.01392 (2020).
Chen, H. et al. BlendMask: Top-down meets bottom-up for instance segmentation . In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8570–8578. https://doi.org/10.1109/CVPR42600.2020.00860 (IEEE Computer Society, 2020).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778. https://doi.org/10.1109/CVPR.2016.90 (2016).
Borji, C. M. Salient object detection: A survey. Comp. Visual Media 5, 117–150 (2019).
Wei, Y. et al. High efficiency Wiener filter-based point cloud quality enhancement for MPEG G-PCC. IEEE Trans. Circuits Syst. Video Technol. 1, 2049. https://doi.org/10.1109/TCSVT.2025.3552049 (2025).
Wang, X. et al. Solo: Segmenting objects by locations. In Computer Vision—ECCV 2020 649–665 (Springer, 2020).
Cai, Z. & Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 6154–6162. https://doi.org/10.1109/CVPR.2018.00644 (2018).
Yao, F., Zhang, H., Gong, Y., Zhang, Q. & Xiao, P. A study of enhanced visual perception of marine biology images based on diffusion-GAN. Complex Intell. Syst. 11, 227. https://doi.org/10.1007/s40747-025-01832-w (2025).
Kumari, K. & Barpanda, S. S. Residual unet with dual attention—An ensemble residual unet with dual attention for multi-modal and multi-class brain mri segmentation. Int. J. Imaging Syst. Technol. 33, 644–658 (2022).
Jiang, C. et al. Magnet: A camouflaged object detection network simulating the observation effect of a magnifier. Entropy 24, 4. https://doi.org/10.3390/e24121804 (2022).
Xue, C. X. et al. Camouflage performance analysis and evaluation framework based on features fusion. Multimedia Tools Appl. 75, 1. https://doi.org/10.1007/s11042-015-2946-1 (2016).
Chen, J., Pan, S., Peng, W. & Xu, W. Bilinear spatiotemporal fusion network: An efficient approach for traffic flow prediction. Neural Netw. 187, 107382. https://doi.org/10.1016/j.neunet.2025.107382 (2025).
Wang, H., Li, Y. F., Men, T. & Li, L. Physically interpretable wavelet-guided networks with dynamic frequency decomposition for machine intelligence fault prediction. IEEE Trans. Syst. Man Cybern. Syst. 54, 4863–4875. https://doi.org/10.1109/TSMC.2024.3389068 (2024).
Hu, S. L. et al. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 7132–7141. https://doi.org/10.1109/CVPR.2018.00745 (2018).
Liu, K. et al. On image transformation for partial discharge source identification in vehicle cable terminals of high-speed trains. High Voltage 9, 1090–1100. https://doi.org/10.1049/hve2.12487 (2024).
Shi, G. et al. One for all: A unified generative framework for image emotion classification. IEEE Trans. Circuits Syst. Video Technol. 34, 7057–7068. https://doi.org/10.1109/TCSVT.2023.3341840 (2024).
Wu, S. J. et al. Encoding–decoding network with pyramid self-attention module for retinal vessel segmentation. Int. J. Autom. Comput. 18, 973–980 (2021).
Libouga, I. O., Bitjoka, L., Gwet, D. L. L., Boukar, O. & Nlôga, A. M. N. A supervised u-net based color image semantic segmentation for detection and classification of human intestinal parasites. Adv. Electr. Eng. Electron. Energy 2, 100069. https://doi.org/10.1016/j.prime.2022.100069 (2022).
Deng, S. et al. Learning to compose diversified prompts for image emotion classification. Comput. Visual Media 10, 1169–1183 (2024).
Wang, Z., Zhang, Z., Qi, W., Yang, F. & Xu, J. FreqGAN: Infrared and visible image fusion via unified frequency adversarial learning. IEEE Trans. Circuits Syst. Video Technol. 35, 728–740. https://doi.org/10.1109/TCSVT.2024.3460172 (2024).
Mei, H. et al. Camouflaged object segmentation with distraction mining. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8768–8777. https://doi.org/10.1109/CVPR46437.2021.00866 (2021).
Sun, Y., Chen, G., Zhou, T., Zhang, Y. & Liu, N. Context-aware cross-level fusion network for camouflaged object detection. In International Joint Conference on Artificial Intelligence (2021).
Fan, D., Ji, G., Cheng, M. & Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6024–6042 (2022).
Zhou, G. et al. True2 orthoimage map generation. Remote Sens. 14, 4396. https://doi.org/10.3390/rs14174396 (2022).
Li, Z. et al. MonoAux: Fully exploiting auxiliary information and uncertainty for monocular 3D object detection. Cyborg Bionic Syst. 5, 97 (2024).
Chen, T., Xiao, J., Hu, X., Zhang, G. & Wang, S. Boundary-guided network for camouflaged object detection. Knowl.-Based Syst. 248, 108901 (2022).
Liu, X., Li, Z., Zhou, Y., Peng, Y. & Luo, J. Camera-radar fusion with modality interaction and radar gaussian expansion for 3D object detection. Cyborg Bionic Syst. 5, 79 (2024).
Sun, Y., Chen, G., Zhou, T., Zhang, Y. & Liu, N. Context-Aware Cross-Level Fusion Network for Camouflaged Object Detection 1025–1031. https://doi.org/10.24963/ijcai.2021/142 (2021).
Uesugi, K., Mayama, H. & Morishima, K. Analysis of rowing force of the water strider middle leg by direct measurement using a bio-appropriating probe and by indirect measurement using image analysis. Cyborg Bionic Syst. 4, 61 (2023).
Chen, J., Ye, H., Ying, Z., Sun, Y. & Xu, W. Dynamic trend fusion module for traffic flow prediction. Appl. Soft Comput. 174, 112979. https://doi.org/10.1016/j.asoc.2025.112979 (2025).
Padhy, I., Kanungo, P. & Sahoo, S. Multiclass classification of camouflage images using combined wld and lpq feature set using a ann classifier. In Advances in Signal Processing and Communication Engineering 85–97 (Springer, 2024).
Padhy, I. et al. \({YC}_b{C}_r\) model based shadow detection and removal approach on camouflaged images. In 2022 OITS International Conference on Information Technology (OCIT) 574–579. https://doi.org/10.1109/OCIT56763.2022.00112 (IEEE, 2022).
Padhy, J. Camouflaged object detection using hybrid-deep learning model. Multimed Tools and Applications (2024).
Lu, L. et al. Generative deep-learning-embedded asynchronous structured light for three-dimensional imaging. Adv. Photon. 6, 4. https://doi.org/10.1117/1.AP.6.4.046004 (2024).
Deng, W. Learning to compose diversified prompts for image emotion classification. Comp. Visual Media 10, 1169–1183. https://doi.org/10.1007/s41095-023-0389-6 (2024).
Wang, B., Yang, M., Cao, P. & Liu, Y. A novel embedded cross framework for high-resolution salient object detection: A novel embedded cross framework for high-resolution salient object detection. Appl. Intell. 55, 1. https://doi.org/10.1007/s10489-024-06073-x (2025).
Liao, H. et al. Meta-learning based domain prior with application to optical-ISAR image translation. IEEE Trans. Circuits Syst. Video Technol. 34, 7041–7056. https://doi.org/10.1109/TCSVT.2023.3318401 (2024).
Zhu, J., Zhang, X., Zhang, S. & Liu, J. Inferring camouflaged objects by texture-aware interactive guidance network. Proc. AAAI Conf. Artif. Intell. 35, 3599–3607. https://doi.org/10.1609/aaai.v35i4.16475 (2021).
Wang, Q. et al. Depth-aided camouflaged object detection. In Proceedings of the 31st ACM International Conference on Multimedia, MM ’23 3297–3306. https://doi.org/10.1145/3581783.3611874 (Association for Computing Machinery, 2023).
Kamran, M., Rehman, S. U., Meraj, T., Alnowibet, K. A. & Rauf, H. T. Camouflage object segmentation using an optimized deep-learning approach. Mathematics 10, 219. https://doi.org/10.3390/math10224219 (2022).
Ji, G.-P., Zhu, L., Zhuge, M. & Fu, K. Fast camouflaged object detection via edge-based reversible re-calibration network. Pattern Recogn. 123, 108414. https://doi.org/10.1016/j.patcog.2021.108414 (2022).
Shi, J., Liu, C. & Liu, J. Hypergraph-based model for modelling multi-agent Q-learning dynamics in public goods games. IEEE Trans. Netw. Sci. Eng. 11, 6169–6179. https://doi.org/10.1109/TNSE.2024.3473941 (2024).
Zhang, R. et al. MvMRL: A multi-view molecular representation learning method for molecular property prediction. Brief. Bioinform. 25, 298. https://doi.org/10.1093/bib/bbae298 (2024).
Cai, J. et al. Broken ice circumferential crack estimation via image techniques. Ocean Eng. 259, 111735. https://doi.org/10.1016/j.oceaneng.2022.111735 (2022).
Huang, C. et al. Correlation information enhanced graph anomaly detection via hypergraph transformation. IEEE Trans. Cybern. 55, 2865–2878. https://doi.org/10.1109/TCYB.2025.3558941 (2025).
Wang, Z., Bovik, A. & Sheikh, H. Structural similarity based image quality assessment. In Digital Video Image Quality and Perceptual Coding, Series in Signal Processing and Communications. https://doi.org/10.1201/9781420027822.ch7 (2005).
Zhou, Z. et al. Resource-saving and high-robustness image sensing based on binary optical computing. Laser Photon. Rev. 19, 2400936. https://doi.org/10.1002/lpor.202400936 (2024).
Xu, X. et al. Three-dimensional reconstruction and geometric morphology analysis of lunar small craters within the patrol range of the Yutu-2 Rover. Remote Sens. 15, 4251. https://doi.org/10.3390/rs15174251 (2023).
Wang, W. et al. Low-light image enhancement based on virtual exposure. Signal Process. Image Commun. 118, 117016. https://doi.org/10.1016/j.image.2023.117016 (2023).
Yu, Y. et al. Optimization of 3D reconstruction of granular systems based on refractive index matching scanning. Opt. Laser Technol. 186, 112662. https://doi.org/10.1016/j.optlastec.2025.112662 (2025).
Lu, L. et al. Generative deep-learning-embedded asynchronous structured light for three-dimensional imaging. Adv. Photon. 6, 46004. https://doi.org/10.1117/1.AP.6.4.046004 (2024).
Zhuang, J., Zheng, Y., Guo, B. & Yan, Y. Globally deformable information selection transformer for underwater image enhancement. IEEE Trans. Circuits Syst. Video Technol. 35, 19–32. https://doi.org/10.1109/TCSVT.2024.3451553 (2024).
Zeng, Y. et al. GCCNet: A novel network leveraging gated cross-correlation for multi-view classification. IEEE Trans. Multimedia 27, 1086–1099. https://doi.org/10.1109/TMM.2024.3521733 (2024).
Zhai, Q. et al. Mgl: Mutual graph learning for camouflaged object detection. IEEE Trans. Image Process. 32, 1897–1910. https://doi.org/10.1109/TIP.2022.3223216 (2023).
Zhuge, M., Lu, X., Guo, Y., Cai, Z. & Chen, S. Cubenet: X-shape connection for camouflaged object detection. Pattern Recogn. 127, 44. https://doi.org/10.1016/j.patcog.2022.108644 (2022).
Funding
Open access funding provided by Siksha 'O' Anusandhan (Deemed To Be University)
Author information
Authors and Affiliations
Contributions
Isha Padhy: Conceptualization, Methodology, Investigation, Writing—Original Draft, Writing—Review & Editing (Lead Author). Prabhat Dansena: Methodology, Supervision, Data Curation, Writing—Review & Editing. Sampa Sahoo: Conceptualization, Supervision, Writing—Review & Editing. Rahul Priyadarshi: Supervision, Data Curation, Writing—Review & Editing. All authors have read and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Padhy, I., Dansena, P., Sahoo, S. et al. An efficient camouflaged image segmentation with modified UNet and attention techniques. Sci Rep 15, 21086 (2025). https://doi.org/10.1038/s41598-025-07571-9
DOI: https://doi.org/10.1038/s41598-025-07571-9