Abstract
Camouflaged object segmentation (COS) is a challenging task in computer vision where the objective is to recognize and precisely separate objects that blend in with their environment. Traditional models, including the standard UNet architecture, struggle with this task due to ambiguous object boundaries, texture similarity between object and background, and over-segmentation or under-segmentation caused by redundant skip connections. CAMO-UNet addresses these issues by including residual blocks, which improve feature learning by easing gradient flow and enabling deeper architectures. The attention mechanism focuses on ‘what’ is important and ‘where’ important features are located in the spatial domain, and captures long-range dependencies across the image. A depth-aware triangular cyclic learning rate (CLR) dynamically adjusts learning rates at different network depths to enhance training efficiency. CAMO-UNet achieved 93.8% accuracy on benchmark datasets and outperformed state-of-the-art models like SINet, BGNet, PFNet, etc., in metrics including S-measure, F-measure, MAE, and accuracy.
Introduction
In the animal kingdom, prey frequently adapt to appear like their surroundings to trick predators1. Some animals use natural camouflage2 to blend in with their environment and manipulate their body color and structure to avoid being seen by predators. Researchers have used Camouflaged Object Detection (COD)3 techniques to find and examine hidden objects. Modern COD approaches employ advanced instance segmentation techniques to precisely identify and outline hidden objects. State-of-the-art frameworks like Mask R-CNN4 and its enhanced variant MS R-CNN5 utilize region-based convolutional networks to achieve accurate segmentation. More recent innovations, such as CenterMask6 and BlendMask7, incorporate attention mechanisms and feature blending to better handle complex camouflage patterns. These models typically leverage powerful backbone architectures like ResNet50 and ResNet101-FPN8 for robust feature extraction. The task proves significantly more demanding than conventional salient object detection9,10, as camouflaged targets often exhibit minimal contrast with their surroundings and may appear fragmented due to occlusion. An example is shown in Fig. 1, where the occluded part of the object gives the impression of two separate objects in the image; the proposed work also considers and attempts to overcome this issue. This complexity has driven the development of specialized solutions like SOLO11, which implements a novel grid-based segmentation approach, and transformer-based models that capture long-range visual dependencies. Evaluation of COD systems requires careful consideration, with metrics like the structure measure12 providing a nuanced assessment of segmentation quality. As the field progresses, researchers are exploring innovative directions, including self-supervised learning and multi-modal data fusion, to further improve detection capabilities in real-world scenarios.
The renowned UNet model13 used for medical image segmentation failed to produce satisfactory results for camouflaged or occluded images. The UNet architecture comprises encoder and decoder stages, where the encoder extracts the features necessary for describing the image, and the decoder uses this extracted information to reconstruct the image. The encoder progressively reduces the spatial dimensions, which also makes such models appropriate for classification problems. UNet adds skip connections between the corresponding encoder and decoder levels to alleviate this information bottleneck. UNet is a cutting-edge segmentation model owing to its elegant design; however, its skip connections can lead to over- and under-segmentation.
Camouflage can occur in almost every field, such as medical diagnosis14, detection of defective items15, identification of agricultural pests, etc., and it demands extensive examination to avoid hazardous situations. The near-identical color, texture, brightness, etc., between camouflaged items and their surroundings makes Camouflaged Object Detection (COD) a challenging task. Further, low-level features like edge, brightness, color, gradient, and texture extraction don’t always seem enough because well-done camouflage is adept at obscuring them16. The authors in Ref.17 proposed a model that can classify and segment camouflaged images and also came up with a dataset, CAMO, for classifying an image as camouflaged or not. The work in Ref.18 introduces a camouflage fusion learning (CFL) framework to segment different instances of camouflaged objects. In numerous applications, including image segmentation19,20, adding an attention mechanism to the network has proved fruitful. The attention mechanism is a process that can draw attention to information that is relevant to the job at hand in both the channel and spatial directions while ignoring irrelevant data. The researchers in Ref.8,21 demonstrate how to segment medical images using the U-Net while paying attention spatially to the context. Hu et al.19 proposed a block composed of channel-wise attention for segmentation. The block significantly improved performance in the classification job and could draw attention to significant dimensions. A model given by Li et al.22 uses channel attention for sentence-level training and spatial attention for the word level. The authors have produced a UNet architecture with a Self-attention Module for Retinal Vessel segmentation. An extended version of the UNet model is also used in Ref.23 to segment parasites from the background. In Ref.24, the authors describe a technique to optimize object detection in camouflaged image tasks and present a comparative analysis of existing studies. Applications of UNet can also be found in Ref.25, where a modified U-Net is used for semantic segmentation of satellite imagery data. The attention mechanism of the proposed approach is used to extract and manipulate the spatial and channel-wise information to enhance the representation and capture relevant features for a given task. Traditional models such as UNet, while effective for general segmentation tasks, often struggle to accurately delineate camouflaged or occluded regions. Consequently, there is a growing need for advanced architectures capable of learning complex features and emphasizing subtle cues that distinguish target regions from the background. The motivation behind developing the CAMO-UNet model stems from the limitations observed in existing approaches when applied to intricate segmentation tasks. By integrating attention mechanisms and refined connectivity patterns into the foundational UNet architecture, CAMO-UNet is designed to selectively enhance relevant spatial and contextual information. CAMO-UNet goes beyond the standard UNet by incorporating three types of attention: Spatial Attention (SA), which focuses on “where” in the image to look; Channel Attention (CA), which highlights the most informative feature channels; and Self-Attention, which captures long-range dependencies and global context within feature maps. Most UNet variants use either spatial or channel attention, but this work uses all three synergistically, improving feature discrimination for subtle camouflage patterns.
Moreover, the escalating complexity and parameter density of modern deep learning models pose additional concerns, especially in specialized tasks like camouflaged image segmentation, where overfitting and computational overhead can hinder performance. CAMO-UNet aims to strike a balance between model expressiveness and efficiency, offering a focused solution to these challenges through targeted architectural innovations. CAMO-UNet introduces a layer-wise dynamic learning rate schedule based on network depth: shallow layers receive smaller updates to retain basic features, while deeper layers learn high-level abstractions more aggressively.
Related work and research questions
This section examines previous research on salient and camouflaged object segmentation. To segment salient objects, we must concentrate on finding such exceptional and discriminative patches. However, camouflaged items typically blend into the backdrop environment by lowering their level of discrimination. Early studies of camouflage segmentation aimed to identify the foreground even when some of its texture matches the background26,27. However, owing to the considerable resemblance between the foreground and background, none of these approaches perform satisfactorily in segmenting camouflaged objects in non-uniform background images. Researchers have created deep neural networks28,29 to separate camouflaged objects from the background and have applied them to massive amounts of data. This has resulted in superior performance in a variety of computer vision applications. Le et al.17 presented a network that includes a double path, one for classification and the other for segmentation, for segmenting camouflaged objects. In addition, a camouflaged object (CAMO) image dataset was created in the process. The authors in Ref.26,30 presented a framework to enhance and refine feature maps, as well as an optimized algorithm based on the best feature nodes to reduce the complexity of the model. The authors in Refs.31,32 have shown that boundary information is helpful in improving the performance of camouflaged image processing. The study by Sun et al.33 proposes a context-aware cross-level fusion network that effectively captures semantic features across multiple levels using attention mechanisms. By integrating context-awareness, the model refines object boundaries, which is essential for accurately detecting subtly camouflaged regions. This approach achieves strong segmentation performance by balancing high-level contextual understanding with low-level spatial details. Uesugi et al.34 introduce an edge-aware and reversible recalibration network that adaptively enhances spatial features and emphasizes structural information, enabling accurate and efficient camouflaged object detection suitable for real-time applications, particularly in low-contrast scenarios. Jing et al.35 propose a gradient-based learning approach using implicit supervision within a coarse-to-fine architecture, which enhances edge prediction and achieves competitive camouflaged object detection performance with fewer parameters by leveraging gradient cues over dense annotations. Properly segmented camouflaged images can also be obtained if noise can be reduced, as mentioned in Ref.36, or by removing shadows from the images37. A properly segmented image can lead to proper detection of objects38. The authors in Ref.39 propose a novel 3D imaging system that integrates generative deep learning into asynchronous structured light. This can improve depth estimation and feature recovery, which are conceptually linked to the attention and residual mechanisms used in CAMO-UNet to resolve hidden or ambiguous visual cues. The research in Ref.40 concerns prompt-based emotion classification and deals with learning diversified prompts that help models extract emotionally salient features from images, recognizing non-obvious cues through learned guidance, similar to CAMO-UNet.
In Ref.41, the embedded cross framework aims to preserve fine details and structural features in salient object detection, similar to how CAMO-UNet tackles low-contrast and ambiguous boundaries in camouflaged images. It also enhances robustness under complex visual conditions, a shared challenge with camouflaged object detection. In the proposed scenario, the relationship between pixels over short and long ranges is considered, and pattern changes are also taken into account42. The different segmentation techniques related to camouflaged images used by researchers are presented in Table 1.
Research questions
The intrinsic complexity of camouflaged images, where the foreground fades nearly seamlessly into the background, makes it difficult for traditional segmentation models like U-Net to perform well.
To solve this problem, several models based on deep learning have been proposed, such as feature refinement methods and attention mechanisms. Nevertheless, over-segmentation, loss of fine features, and sensitivity to background noise are some of the drawbacks of current techniques. Even though methods like residual learning and channel-wise attention have been studied recently, an optimal strategy that increases segmentation accuracy while preserving computational economy is still required. By combining spatial, channel, and self-attention mechanisms and improving feature extraction through residual connections, the CAMO-UNet model seeks to address these problems. We establish the following research questions in order to methodically examine these contributions.
-
[RQ1] How can deep learning techniques, particularly U-Net and attention mechanisms, be optimized to improve camouflaged image segmentation? (Discussed in “Proposed CAMO-UNet framework” section, which covers the integration of residual blocks and attention mechanisms into U-Net to improve segmentation.) To address this, we propose CAMO-UNet, a modified version of the traditional U-Net architecture that incorporates residual blocks and multilevel attention mechanisms (spatial, channel, and self-attention). These additions significantly enhance the model’s ability to extract subtle features and delineate ambiguous object boundaries in camouflaged scenes.
-
[RQ2] How does incorporating spatial and channel-wise attention mechanisms improve the accuracy and robustness of camouflaged object detection? (Discussed in “Attention blocks” section) To address this, the attention modules are used to reduce segmentation errors and make the model more robust to visual ambiguity in camouflaged scenes. Increasing the number of convolution blocks ensures that low-level characteristics such as boundaries are highlighted.
-
[RQ3] How does CAMO-UNet compare to other similar works in terms of accuracy, computational efficiency, and performance across different datasets? (Discussed in “Comparison results” section, Tables 3, 4) To address this, note that existing works typically use uniform learning rate schedules, whereas this approach matches the learning behavior to the complexity of the features in each layer. The model also incorporates an attention weight factor into the learning rate decay formula, which ensures that layers with higher attention receive more refined updates, improving precision. The proposed model achieves higher accuracy, F-measure, and lower MAE than existing models like SINet, BGNet, PFNet, etc. Extensive ablation studies confirm the value of each proposed module.
-
[RQ4] What role do residual connections and connectivity patterns play in enhancing segmentation performance in complex camouflaged images? (Discussed in “Residual block” section) To achieve this, the CAMO-UNet model incorporates connectivity patterns similar to UNet, including skip connections that enable the model to leverage features from earlier stages of the encoding process. These skip connections can help the model capture and retain critical information regarding camouflaged and occluded objects. By combining features from different scales and levels of abstraction, the CAMO-UNet model can improve the segmentation results.
It is advantageous to train the CAMO-UNet model with a varied data set that contains camouflaged image samples to increase the efficiency and performance of the model47,48. The model can adapt and generalize to unseen camouflage during inference by being exposed to a variety of camouflaging patterns and situations during training. Therefore, different data sets were used to perform the experiments. The remainder of this paper is organized as follows: “Related work and research questions” section highlights the work done by various researchers related to camouflaged images and image segmentation. “Results and discussions” section describes the datasets, preprocessing, experimental settings, and comparison results of the proposed approach. “Discussion” section analyses the findings and limitations. “Proposed CAMO-UNet framework” section describes the structure of the proposed CAMO-UNet model. Finally, “Conclusion and future work” section concludes the study.
Results and discussions
This section discusses the popular camouflaged image datasets and the data preprocessing step of the proposed approach.
Datasets
Camouflaged image segmentation has relatively few datasets, which makes image collection and the preparation of Ground Truth (GT) annotations difficult. The images used in this study were obtained from the COD10K and CAMO datasets. The COD10K3 dataset comprises 10,000 images exhibiting natural camouflage, whereas the CAMO17 dataset comprises 1250 images showing camouflage in the ecological environment. As camouflage pattern complexity increases, the visible surface area of the object inside the image is frequently challenging for humans to recognize. Thus, camouflaged object detection has garnered the attention of computer vision experts. A glimpse of the CAMO and COD10K dataset images and their ground truths is presented in Fig. 2.
Data preprocessing
The input image is passed through a few preprocessing steps. A resize operation is first applied to scale the image to 224 × 224 pixels while maintaining the aspect ratio. Normalization was then used to scale the pixel values to fall between 0 and 1, accomplished by dividing them by 255.0. Both operations are commonly used in image preprocessing pipelines to standardize the input data, such as resizing images to an acceptable size and normalizing pixel values for better numerical stability during model training or inference. The resultant normalized image was presented for augmentation. The practice of expanding the input data into a larger sample space is known as data augmentation. Classification models permit the alteration of images when the dataset is small; for segmentation, however, every transformation must be applied to both the image and its mask so that they undergo the same change. The flip and affine functions introduce variations in the dataset by flipping images horizontally and applying random translations. These changes were added to the dataset during the preprocessing stage, which improved the robustness and generalization capabilities of the model. For the COD10K dataset, the number of images was increased to 12,000 for training and 8,000 for testing. The shadow detection process37 was also performed as part of preprocessing.
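The following is a minimal sketch of such a preprocessing pipeline in TensorFlow. The function name, the padding-based resize, and the ±10-pixel translation range are illustrative assumptions rather than the authors’ exact implementation; the key point is that every random transformation is applied identically to the image and its mask.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # target size from the preprocessing description

def preprocess_pair(image, mask, training=True):
    """Eager-mode sketch: resize, normalize, and apply identical augmentation to image and mask."""
    # Resize with padding so the aspect ratio is preserved; nearest keeps mask labels crisp.
    image = tf.image.resize_with_pad(image, *IMG_SIZE)
    mask = tf.image.resize_with_pad(mask, *IMG_SIZE, method="nearest")
    image = tf.cast(image, tf.float32) / 255.0        # scale pixel values to [0, 1]

    if training:
        # Horizontal flip, applied to both tensors so they stay aligned.
        if tf.random.uniform(()) > 0.5:
            image = tf.image.flip_left_right(image)
            mask = tf.image.flip_left_right(mask)
        # Small random shift as a simple stand-in for an affine translation
        # (tf.roll wraps pixels around, which is acceptable for small offsets).
        shift = tf.random.uniform((2,), minval=-10, maxval=10, dtype=tf.int32)
        image = tf.roll(image, shift=shift, axis=[0, 1])
        mask = tf.roll(mask, shift=shift, axis=[0, 1])
    return image, mask
```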
Evaluation settings
The experiments on the CAMO-UNet model were conducted on various datasets, with all images resized to 256 \(\times\) 256 pixels and converted to grayscale. Grayscale images were used because the UNet model structure, which was used as a reference, performs well on grayscale images. Table 2 shows the experimental setup for the proposed method.
The model was implemented using TensorFlow with the Keras backend in Python. For each epoch, measurements of accuracy and the Dice coefficient were used to track the training process. To avoid overfitting, early stopping was performed based on the validation loss value, with a patience value of 2. To facilitate faster deep learning model execution, experiments were run on the Google Colab Pro platform, which provides 25 GB RAM, a 147 GB hard disk, and an on-demand GPU. Experiments were also performed in the PyCharm IDE on a platform with a 1920-core graphics processor. Overall, these experimental settings and infrastructure were chosen to ensure the efficient training and evaluation of the CAMO-UNet model on the selected datasets.
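As a small illustration of the early stopping setup described above (the exact training script is not reproduced in the paper, so this is only a hedged sketch using a standard Keras callback):

```python
from tensorflow import keras

# Stop training when the validation loss stops improving, with a patience of 2 epochs,
# and roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                           restore_best_weights=True)
# During training this would be passed as model.fit(..., callbacks=[early_stop]).
```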
Performance metrics
Comparing the anticipated segmentation with the actual segmentation is one way to evaluate a segmentation algorithm. The CAMO-UNet model was evaluated using performance metrics such as accuracy, Cross-Entropy Loss, Focal Loss, MAE, and Dice Coefficient.
From Fig. 3, to obtain accuracy, the model was trained with focal and cross-entropy losses; the red line indicates the validation loss and the green line indicates the training loss. Cross-entropy loss works well when the classes are balanced and there are no extreme class imbalances. The unstable nature of the graph is owing to the imbalanced nature of the dataset. In the context of the CAMO-UNet model, MAE was used as a metric to evaluate the accuracy of the predicted segmentations compared to the ground truth segmentations49,50. The lower the MAE, the closer the predicted segmentations are to the ground truth, indicating better performance of the model. The MAE results of CAMO-UNet are listed in Tables 3 and 4, where they are compared with other existing models. The F2 measure has also been considered, which is commonly used to evaluate the performance of binary classification models, particularly when the focus is on optimizing recall (sensitivity) rather than precision. It is an extension of the F1 measure that puts more emphasis on recall. S-measure, or Structure Similarity Measure51, is a metric used to evaluate the structural similarity between two images. It is based on the concept of the Structural Similarity Index (SSIM), which measures the similarity in terms of luminance, contrast, and structure between two images52,53. In the CAMO-UNet model, the F2 measure is used to evaluate the performance of the model in segmenting camouflaged objects in images54. It takes into account both precision and recall, providing an overall assessment of the effectiveness of the model.
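For concreteness, the sketch below shows one plausible TensorFlow formulation of the Dice coefficient, MAE, and F2 measures used to track training; the exact definitions in the authors’ code may differ (for instance in smoothing constants or thresholds), so these functions should be read as assumptions.

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1e-6):
    """Dice overlap between a predicted mask and the ground truth (values in [0, 1])."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

def mae(y_true, y_pred):
    """Mean absolute error between predicted and ground-truth masks."""
    return tf.reduce_mean(tf.abs(tf.cast(y_true, tf.float32) - tf.cast(y_pred, tf.float32)))

def f_beta(y_true, y_pred, beta=2.0, threshold=0.5, eps=1e-6):
    """F-beta score; beta = 2 weights recall more heavily than precision (the F2 measure)."""
    y_true = tf.cast(y_true > threshold, tf.float32)
    y_pred = tf.cast(y_pred > threshold, tf.float32)
    tp = tf.reduce_sum(y_true * y_pred)
    precision = tp / (tf.reduce_sum(y_pred) + eps)
    recall = tp / (tf.reduce_sum(y_true) + eps)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall + eps)
```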
Adaptive learning rate with depth-aware triangular cyclic learning rate (CLR)
Since CAMO-UNet consists of an encoder–decoder structure with residual and attention mechanisms, the Cyclic Learning Rate (CLR) is used to account for feature extraction at different depths. Instead of a basic cyclic pattern, the effect of multiple stages is integrated into the learning rate schedule. The following sections describe the stages of optimization55,56.
Depth-aware triangular CLR
Static attention mechanisms may not always adapt efficiently to varying feature complexities in camouflaged images, so the Convolutional Block Attention Module (CBAM) is introduced to dynamically select features57,58. It recalibrates the channel and spatial attention weights based on feature importance. For a specific layer l in the encoder-decoder network, the cyclic learning rate is defined in terms of the following quantities, with the resulting schedule written out after the list:
-
\(\eta _t^l\) is the learning rate at iteration t for layer l.
-
\(C^l = \frac{\text {iteration} \mod (2 \times \text {step size}_l)}{2 \times \text {step size}_l}\) represents the cycle progress for layer l.
-
\(\text {step size}_l = \alpha \cdot 2^l\), where \(\alpha\) is a scaling factor that increases with deeper layers.
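Assuming the standard triangular CLR form, these quantities combine into the following layer-wise schedule, where \(\eta _{\text {min}}\) and \(\eta _{\text {max}}\) are the minimum and maximum learning rates; this is a reconstruction consistent with the definitions above rather than the authors’ exact equation:

$$\begin{aligned} \eta _t^l = \eta _{\text {min}} + \left( \eta _{\text {max}} - \eta _{\text {min}}\right) \max \left( 0,\; 1 - \left| 2C^l - 1\right| \right) , \end{aligned}$$

so that the learning rate of layer l rises linearly from \(\eta _{\text {min}}\) to \(\eta _{\text {max}}\) over the first half of each cycle and falls back over the second half.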
The deeper the layer, the larger the step size, ensuring that shallower layers update faster while deeper layers stabilize. The implementation of CLR was carried out in the following steps; a minimal code sketch follows the list:
-
Applied the depth-aware triangular CLR in the training stage to dynamically adjust learning rates at different depths of CAMO-UNet.
-
It was used with AdamW optimizer to improve weight decay handling and optimize training convergence.
-
It was integrated into encoder layers, where lower layers have lower learning rates to retain low-level features, while deeper layers benefit from higher learning rates for complex pattern learning.
-
Also applied to fine-tuning & transfer learning by gradually refining feature maps while preventing overfitting.
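A minimal, framework-agnostic sketch of this depth-aware schedule is given below. The function name, learning-rate bounds, and scaling constant are hypothetical; in practice the per-layer rate would be fed to the AdamW optimizer (for example via per-layer parameter groups) during training.

```python
def depth_aware_triangular_lr(iteration: int, layer_depth: int,
                              eta_min: float = 1e-5, eta_max: float = 1e-3,
                              alpha: int = 200) -> float:
    """Triangular cyclic learning rate whose cycle length grows with layer depth."""
    step_size = alpha * (2 ** layer_depth)                            # step size_l = alpha * 2^l
    cycle_progress = (iteration % (2 * step_size)) / (2 * step_size)  # C^l in [0, 1)
    scale = 1.0 - abs(2.0 * cycle_progress - 1.0)                     # triangular wave in [0, 1]
    return eta_min + (eta_max - eta_min) * scale

# Example: shallow layers (small depth) cycle quickly, deeper layers cycle slowly.
for depth in (0, 3):
    rates = [depth_aware_triangular_lr(t, depth) for t in range(0, 2000, 500)]
    print(depth, [round(r, 6) for r in rates])
```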
The overall learning rate schedule is defined in terms of the following quantities, with the resulting schedule written out after the list:
-
\(\eta _t\) is the learning rate at epoch t.
-
\(T_{\text {cur}}\) is the current iteration.
-
\(T_{\text {max}}\) is the maximum iteration count.
-
\(\eta _{\text {min}}\) and \(\eta _{\text {max}}\) are the minimum and maximum learning rates.
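These terms match the standard cosine annealing schedule, so a plausible reconstruction of the referenced formula (an assumption rather than a verbatim reproduction from the paper) is:

$$\begin{aligned} \eta _t = \eta _{\text {min}} + \frac{1}{2}\left( \eta _{\text {max}} - \eta _{\text {min}}\right) \left( 1 + \cos \left( \frac{T_{\text {cur}}}{T_{\text {max}}}\,\pi \right) \right) . \end{aligned}$$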
Exponential decay with attention scaling
Since self-attention modules help in refining segmentation maps, they require stable learning rates. The exponential decay CLR is therefore modified to include an attention factor \(A_l\), defined through the following terms; the resulting schedule is written out after the list:
-
\(A_l = 1 + \lambda \sum _{i=1}^{L} \text {Attention}_i^l\)
-
\(\lambda\) is a weight factor for attention influence.
-
\(\sum _{i=1}^{L} \text {Attention}_i^l\) represents the cumulative contribution of attention across all layers i for a specific layer l.
-
\(e^{-k t}\) ensures gradual decay of the learning rate.
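One plausible way to combine these terms, assuming that a larger attention factor should yield finer (smaller) updates as stated in the research questions, and with \(\eta _0\) an assumed initial learning rate, is:

$$\begin{aligned} \eta _t^l = \frac{\eta _0\, e^{-k t}}{A_l}, \end{aligned}$$

where the exponential term drives the gradual decay and \(A_l \ge 1\) scales the rate down for layers with stronger cumulative attention; this is a reconstruction consistent with the listed definitions rather than the authors’ exact equation.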
Comparison results
This section compares performance metrics such as recall, specificity, accuracy, F-measure, precision and S-measure with existing methods.
The prediction results are shown in Fig. 4. From the comparison results, it can be seen that CAMO-UNet predicts the shape of the object close to the GT. The results of CAMO-UNet were compared with other state-of-the-art models in terms of S-measure, F-measure, MAE, E-measure, and accuracy. The results indicate that CAMO-UNet works better than the other models. Experiments were done on the COD10K and CAMO datasets. Tables 3 and 4 present the results when the models are applied individually to the CAMO and COD10K datasets, respectively. Here, it can be seen clearly that CAMO-UNet works remarkably better than existing models. Figure 5 shows how CAMO-UNet consistently achieves superior accuracy and F-measure while maintaining a low MAE, indicating precise segmentation. Bar colors distinguish performance across datasets and models, with an inverted MAE axis to emphasize that lower error means better performance.
Figures 6 and 7 compare performance metrics such as accuracy, recall, precision, and specificity of different models across two datasets: (b) COD10K and (c) CAMO, where CAMO-UNet performs far better than state-of-the-art models. On the CAMO dataset (c), the proposed model exhibited approximately 97% accuracy, 95% recall, 98% precision, and 96% specificity, surpassing models such as JCNet [7], FSL [8], LINet [9], and DGNet [14]. For COD10K (b), the proposed model achieved approximately 97% accuracy, 97% recall, 99% precision, and 98% specificity, outperforming models such as JCNet [7], DGNet [14], MRR-Net [15], and BGNet [16]. Models like PFNet and BGNet suffer from over-segmentation due to over-reliance on skip connections. CAMO-UNet integrates attention-controlled skip connections and residual blocks, which preserve important features while suppressing noise.
Ablation study
Experiments on the COD10K and CAMO datasets were carried out, comparing the optimized CAMO-UNet with the standard CAMO-UNet.
Table 5 presents a performance comparison between the standard CAMO-UNet and the optimized CAMO-UNet (which includes the adaptive learning rate and the depth-aware triangular cyclic learning rate). The optimized CAMO-UNet consistently outperforms the standard version across all metrics.
Discussion
The experimental results of the extensive performance evaluation and comparative analysis underscored the robustness and superiority of the proposed CAMO-UNet model across multiple datasets. CAMO-UNet’s integration of spatial and channel attention significantly enhances its ability to delineate ambiguous borders that often confuse baseline models. The inclusion of attention-controlled skip connections and residual blocks helps reduce redundancy, a common issue in traditional UNet-based architectures. CAMO-UNet achieves the highest accuracy (93–94%) and F-measure (0.87–0.89), and matches the lowest MAE (0.03–0.04) on both the CAMO and COD10K datasets. The use of depth-aware CLR enables dynamic learning across network depths, leading to faster convergence and better generalization. CAMO-UNet consistently ranks at or near the top in all metrics on both the CAMO and COD10K datasets, especially in precision (0.93–0.94), F-measure (0.87–0.89), and S-measure and E-measure (0.91–0.92), with low MAE (0.03–0.04). Models like PFNet and BGNet suffer from over-segmentation due to over-reliance on skip connections. CAMO-UNet integrates attention-controlled skip connections and residual blocks, which preserve important features while suppressing noise. Most prior models use fixed learning schedules, which do not adapt well to feature complexity in deeper layers. CAMO-UNet employs a Depth-Aware Cyclic Learning Rate (CLR) to better train shallow and deep layers with tailored learning rates. However, despite its strengths, CAMO-UNet is not without limitations. The added attention modules and residual blocks introduce more parameters and computational overhead. Performance may be sensitive to the diversity and quality of training samples, especially in the case of rare camouflage patterns.
Proposed CAMO-UNet framework
This section describes the architecture of the proposed model.
The CAMO-UNet network overview
Achieving success with any segmentation algorithm on camouflaged images is challenging. The proposed CAMO-UNet network combines the benefits of the UNet, residual networks, and attention ideas to make the UNet architecture more easily understood and to concentrate on the required notion. It contains phases for the encoder and decoder, similar to UNet. In addition, attention and residual blocks were used to improve the accuracy of the model. The major feature of the CAMO-UNet model is the skip connections that allow the decoder to receive features from the earlier stages of the encoder, enabling fine-grained localization. The intricacy of the camouflage patterns, the quality and variety of the training data, and the precise implementation details of the CAMO-UNet algorithm are only a few of the variables that affect how well the algorithm performs on camouflaged images.
The model architecture is illustrated in Fig. 8. The proposed CAMO-UNet comprises an encoder, a decoder, Channel Attention (CA), Spatial Attention (SA), Self-Attention (Self-A), residual blocks, and skip connections. CAMO-UNet utilizes self-attention, channel attention, and spatial attention mechanisms to obtain both local and global dependencies between pixels within an input image. Spatial attention can enhance the discriminative power of a model by highlighting important regions while downplaying less informative regions. By applying channel attention, the model can learn to emphasize the channels that carry the most discriminative information to segment the camouflaged object. Self-attention is applied to the last layer to obtain long-range dependencies, which refer to the relationships between elements that are far apart in a sequence or spatial context.
Stages of encoder and decoder
The various stages of the Encoder and Decoder are discussed here.
-
Encoder: The input image was passed through the first convolutional block, where 32 filters were applied. This can be expressed as a convolution of the input tensor \(X_0\) using a kernel \(K_1\) of size \(3 \times 3 \times C \times 32\). The output, \(X_1\), is then passed through a max pooling operation, reducing its spatial dimensions.
$$\begin{aligned} X_1 = \text {Conv}(X_0, K_1), \end{aligned}$$ (4)
where \(K_1\) represents the set of 32 filters, followed by:
$$\begin{aligned} X_1' = \text {MaxPool}(X_1). \end{aligned}$$ (5)
The output from the first block, \(X_1'\), was fed into the second convolutional block, where 64 filters were applied. This operation is represented as the convolution of \(X_1'\) with kernel \(K_2\) of size \(3 \times 3 \times 32 \times 64\), producing \(X_2\). A max pooling operation downsamples \(X_2\), yielding \(X_2'\):
$$\begin{aligned} X_2 = \text {Conv}(X_1', K_2), \end{aligned}$$ (6)
$$\begin{aligned} X_2' = \text {MaxPool}(X_2). \end{aligned}$$ (7)
The process continues for two additional layers to obtain \(X_4\) and \(X_4'\). After the fourth block, two additional convolutional layers are applied. First, the output \(X_4'\) is convolved with kernel \(K_5\) of size \(3 \times 3 \times 256 \times 256\), producing \(X_5\). This is followed by another convolution using kernel \(K_6\) of size \(3 \times 3 \times 256 \times 256\), resulting in the final output, \(X_6\).
$$\begin{aligned} X_5 = \text {Conv}(X_4', K_5), \end{aligned}$$ (8)
$$\begin{aligned} X_6 = \text {Conv}(X_5, K_6). \end{aligned}$$ (9)
-
Decoder: The feature map is denoted as F, which is a tensor of size \(C \times R \times M \times N\), where C and R denote the channel dimensions of the feature map, and M and N are its height and width, respectively. Each element in F corresponds to a specific location in the spatial dimensions and channels, and the stored values represent activations or responses at those locations. Skip connection and upsampling block: The output from the previous blocks, denoted as a feature map F, and the feature map received from the encoder are connected through a skip connection and passed into the upsampling block. Upsampling improves the spatial resolution of the feature map to better match the original input size.
$$\begin{aligned} F_1 = \text {SkipConnection}(F_{\text {Encoder}}, F_{\text {Decoder}}), \end{aligned}$$ (10)
where \(F_{\text {Encoder}}\) represents the feature map from the encoder, and \(F_{\text {Decoder}}\) is the output from the decoder. After upsampling, the skip connection and the upsampled feature maps were concatenated. The concatenated feature map was processed through convolutional blocks to refine and extract the necessary features. The feature maps that pass through the network can be represented as:
$$\begin{aligned} F \in \mathbb {R}^{C \times R \times M \times N}, \end{aligned}$$ (11)
where C and R denote the channel dimensions, and M and N represent the height and width of the spatial dimensions, respectively. Each element in the feature map F represents the activation or response of specific patterns at particular spatial locations and channels. Upsampling block and resizing: The output from the final upsampling block was passed to the subsequent layer, where it was resized to match the input image dimensions. This can be written as:
$$\begin{aligned} F_{\text {Upsampled}} = \text {Resize}(F_{\text {FinalUpsampling}}, M_{\text {Input}}, N_{\text {Input}}), \end{aligned}$$ (12)
where \(M_{\text {Input}}\) and \(N_{\text {Input}}\) represent the height and width of the input image, respectively, and \(F_{\text {FinalUpsampling}}\) is the feature map from the last upsampled block. The upsampled feature map and the input image are combined to generate the final output. This can be represented as:
$$\begin{aligned} F_{\text {Combined}} = \text {Combine}(F_{\text {Upsampled}}, X_{\text {Input}}). \end{aligned}$$ (13)
Finally, a \(1 \times 1\) convolution layer was applied to the combined output to generate a segmentation mask.
$$\begin{aligned} F_{\text {SegmentationMask}} = \text {Conv}_{1 \times 1}(F_{\text {Combined}}). \end{aligned}$$ (14)
-
Residual block: Each residual block contains two convolutional blocks, each with three layers: a convolution layer, a batch normalization layer, and a ReLU activation layer. The output of each residual block is linked to that of the previous block through a residual (skip) connection (an illustrative sketch of the residual and attention blocks follows this list). Let X represent the input to the residual block and let \(W_1\) and \(W_2\) represent the weights of the convolution filters in the first and second convolution layers, respectively. The output of the residual block can be written as:
$$\begin{aligned} Y = \text {ReLU}(\text {BatchNorm}(\text {Conv}(X, W_2))) + \text {ReLU}(\text {BatchNorm}(\text {Conv}(X, W_1))) + X. \end{aligned}$$ (15)
This formulation shows that input X is added element-wise to the final output of the block through a skip connection. A demonstration of this is shown in Fig. 9. Skip connections are used between the downsampled blocks in the encoder and the upsampled blocks in the decoder. The output of each downsample block is added element-wise to the output of the corresponding upsample block to preserve the important features and facilitate information flow. Let \(D_i\) be the output of the i-th downsample block in the encoder and \(U_i\) be the output of the corresponding i-th upsample block in the decoder. The skip connection between them is mathematically expressed as follows:
$$\begin{aligned} U_i' = U_i + D_i, \end{aligned}$$ (16)
where \(U_i'\) is the new output of the upsample block after adding the output of the corresponding downsample block.
-
Attention blocks: Attention mechanisms capture crucial spatial and channel information at various levels, and the connectedness between layers enables the flow of information across the network. To ensure that the initial feature maps adequately captured the regional patterns and features, spatial attention was applied only to the top layer of the model. Applying channel attention to deeper layers allows the model to selectively emphasize or suppress specific channels in the feature maps at those layers. This section discusses the attention mechanisms that are added to the output of the encoder layers through a skip connection. The spatial attention layer in the CAMO-UNet model is shown in Fig. 10, highlighting the significant spatial regions for the task at hand. This helps the model consider relevant areas of the given image while suppressing less informative regions. Channel attention (CA), as shown in Fig. 11, is the channel attention block in the CAMO-UNet model that benefits the model by allowing it to selectively emphasize or suppress different channels of the feature maps. This helps the model to focus on the most informative and discriminative channels while downplaying less relevant channels. By learning the attention weights for each channel, the model can dynamically adapt its feature representation based on the importance of each channel to a given task. Self-attention (Self-A), in the CAMO-UNet model, is a key component used to capture spatial relationships within feature maps. It aids the model’s ability to concentrate on key areas and discover contextual relationships among various locations on feature maps. Mathematically, the self-attention mechanism in CAMO-UNet can be described as follows: the input to the self-attention mechanism is a feature map, denoted as
$$\begin{aligned} F \in \mathbb {R}^{C \times U \times V}, \end{aligned}$$
where U and V represent the spatial dimensions of the feature map (height and width) and C is the number of channels. Input feature maps F with dimensions \(C \times U \times V\) undergo operations such as global average pooling, global max pooling, concatenation, and fully connected layers to obtain channel-wise recalibrated feature maps. The output feature maps are reshaped and projected to obtain the query (Q), key (K), and value (L) tensors. The reshaped feature maps have dimensions \(C' \times U' \times V'\), and the projection matrices (\(W_Q\), \(W_K\), \(W_L\)) transform them into \(d \times U' \times V'\) shape. Self-attention is computed using the query (Q), key (K), and value (L) tensors. The attention scores were obtained by taking the dot product between Q and K transposed, followed by normalization using a softmax function. Normalized attention scores were applied to L to obtain self-attended feature maps. The self-attended feature maps were projected back onto the original channel dimensions using a projection matrix (\(W_S\)). The output is a self-attended feature map with dimensions \(C \times U' \times V'\).
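The sketch below illustrates, in Keras, how the residual block and the channel and spatial attention gates described above could be written. It is a simplified rendering under common conventions (a squeeze-and-excitation-style channel gate and a CBAM-style spatial gate); layer sizes, the channel-matching 1 × 1 convolution, and the reduction ratio are assumptions rather than the authors’ exact configuration.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two Conv-BN-ReLU stages with an identity shortcut (illustrative, not the exact paper code)."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    if shortcut.shape[-1] != filters:                 # match channel count before the addition
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.Add()([y, shortcut])

def channel_attention(x, reduction=8):
    """Squeeze-and-excitation style gate that re-weights feature channels."""
    channels = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)
    w = layers.Dense(channels // reduction, activation="relu")(w)
    w = layers.Dense(channels, activation="sigmoid")(w)
    return layers.Multiply()([x, layers.Reshape((1, 1, channels))(w)])

def spatial_attention(x):
    """Per-pixel gate built from pooled channel statistics (CBAM-style spatial attention)."""
    avg_pool = tf.reduce_mean(x, axis=-1, keepdims=True)
    max_pool = tf.reduce_max(x, axis=-1, keepdims=True)
    w = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        layers.Concatenate()([avg_pool, max_pool]))
    return layers.Multiply()([x, w])

# Tiny usage example: one encoder stage with both attention gates applied.
inputs = keras.Input(shape=(256, 256, 1))            # grayscale input, per the evaluation settings
features = residual_block(inputs, 32)
features = channel_attention(spatial_attention(features))
demo = keras.Model(inputs, features)
```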
Conclusion and future work
The outcome of the CAMO-UNet algorithm is good because it is trained on a diverse and representative dataset that includes a wide range of camouflaged images. In addition, incorporating data augmentation techniques specific to camouflage patterns and carefully tuning the hyperparameters of the model enhances its effectiveness. Overall, while CAMO-UNet provides a framework that incorporates attention mechanisms to handle complex image segmentation tasks, its effectiveness on camouflaged images ultimately depends on the specific challenges posed by camouflage patterns. This study compares its findings with those of previous studies and proposes an optimization approach for COD tasks. In certain investigations, preprocessing techniques, including transformation, image enhancement, and shadow removal, have been applied. The proposed approach effectively addresses the issue of camouflaged objects and offers the best solution compared with previous approaches. Testing on publicly accessible datasets against benchmark models, existing models, and the proposed CAMO-UNet model revealed that it performs optimally and correctly. In the future, we will develop a method to determine the optimum loss function for a given model to improve accuracy. In addition, we aim to design a model for run-time inference and sample-to-sample comparison analysis.
Data availability
The data that support the findings of this study are available from the corresponding author, Isha Padhy, upon reasonable request.
References
How, M. J. & Santon, M. Cuttlefish camouflage: Blending in by matching background features. Curr. Biol. 32, R523–R525. https://doi.org/10.1016/j.cub.2022.04.042 (2022).
Soofi, M. et al. Lichens and animal camouflage: Some observations from central Asian ecoregions. J. Threat. Taxa 14, 20672–20676 (2022).
Fan, D.-P., Ji, G.-P., Sun, G., Cheng, M.-M. & Shen, S. L. Camouflaged object detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2774–2784. https://doi.org/10.1109/CVPR42600.2020.00285 (2020).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV) 2980–2988 (2017).
Huang, Z., Huang, L., Gong, Y., Huang, C. & Wang, X. Mask Scoring R-CNN. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 6402–6411. https://doi.org/10.1109/CVPR.2019.00657 (IEEE Computer Society, 2019).
Lee, Y. & Park, J. Centermask: Real-time anchor-free instance segmentation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 13903–13912. https://doi.org/10.1109/CVPR42600.2020.01392 (2020).
Chen, H. et al. BlendMask: Top-down meets bottom-up for instance segmentation . In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8570–8578. https://doi.org/10.1109/CVPR42600.2020.00860 (IEEE Computer Society, 2020).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778. https://doi.org/10.1109/CVPR.2016.90 (2016).
Borji, C. M. Salient object detection: A survey. Comp. Visual Media 5, 117–150 (2019).
Wei, Y. et al. High efficiency Wiener filter-based point cloud quality enhancement for MPEG G-PCC. IEEE Trans. Circuits Syst. Video Technol. 1, 2049. https://doi.org/10.1109/TCSVT.2025.3552049 (2025).
Wang, X. et al. Solo: Segmenting objects by locations. In Computer Vision—ECCV 2020 649–665 (Springer, 2020).
Cai, Z. & Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 6154–6162. https://doi.org/10.1109/CVPR.2018.00644 (2018).
Yao, F., Zhang, H., Gong, Y., Zhang, Q. & Xiao, P. A study of enhanced visual perception of marine biology images based on diffusion-GAN. Complex Intell. Syst. 11, 227. https://doi.org/10.1007/s40747-025-01832-w (2025).
Kumari, K. & Barpanda, S. S. Residual unet with dual attention—An ensemble residual unet with dual attention for multi-modal and multi-class brain mri segmentation. Int. J. Imaging Syst. Technol. 33, 644–658 (2022).
Jiang, C. et al. Magnet: A camouflaged object detection network simulating the observation effect of a magnifier. Entropy 24, 4. https://doi.org/10.3390/e24121804 (2022).
Xue, C. X. et al. Camouflage performance analysis and evaluation framework based on features fusion. Multimedia Tools Appl. 75, 1. https://doi.org/10.1007/s11042-015-2946-1 (2016).
Chen, J., Pan, S., Peng, W. & Xu, W. Bilinear spatiotemporal fusion network: An efficient approach for traffic flow prediction. Neural Netw. 187, 107382. https://doi.org/10.1016/j.neunet.2025.107382 (2025).
Wang, H., Li, Y. F., Men, T. & Li, L. Physically interpretable wavelet-guided networks with dynamic frequency decomposition for machine intelligence fault prediction. IEEE Trans. Syst. Man Cybern. Syst. 54, 4863–4875. https://doi.org/10.1109/TSMC.2024.3389068 (2024).
Hu, S. L. et al. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 7132–7141. https://doi.org/10.1109/CVPR.2018.00745 (2018).
Liu, K. et al. On image transformation for partial discharge source identification in vehicle cable terminals of high-speed trains. High Voltage 9, 1090–1100. https://doi.org/10.1049/hve2.12487 (2024).
Shi, G. et al. One for all: A unified generative framework for image emotion classification. IEEE Trans. Circuits Syst. Video Technol. 34, 7057–7068. https://doi.org/10.1109/TCSVT.2023.3341840 (2024).
Wu, S. J. et al. Encoding–decoding network with pyramid self-attention module for retinal vessel segmentation. Int. J. Autom. Comput. 18, 973–980 (2021).
Libouga, I. O., Bitjoka, L., Gwet, D. L. L., Boukar, O. & Nlôga, A. M. N. A supervised u-net based color image semantic segmentation for detection and classification of human intestinal parasites. Adv. Electr. Eng. Electron. Energy 2, 100069. https://doi.org/10.1016/j.prime.2022.100069 (2022).
Deng, S. et al. Learning to compose diversified prompts for image emotion classification. Comput. Visual Media 10, 1169–1183 (2024).
Wang, Z., Zhang, Z., Qi, W., Yang, F. & Xu, J. FreqGAN: Infrared and visible image fusion via unified frequency adversarial learning. IEEE Trans. Circuits Syst. Video Technol. 35, 728–740. https://doi.org/10.1109/TCSVT.2024.3460172 (2024).
Mei, H. et al. Camouflaged object segmentation with distraction mining. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 8768–8777. https://doi.org/10.1109/CVPR46437.2021.00866 (2021).
Sun, Y., Chen, G., Zhou, T., Zhang, Y. & Liu, N. Context-aware cross-level fusion network for camouflaged object detection. In International Joint Conference on Artificial Intelligence (2021).
Fan, D., Ji, G., Cheng, M. & Shao, L. Concealed object detection. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6024–6042 (2022).
Zhou, G. et al. True2 orthoimage map generation. Remote Sens. 14, 4396. https://doi.org/10.3390/rs14174396 (2022).
Li, Z. et al. MonoAux: Fully exploiting auxiliary information and uncertainty for monocular 3D object detection. Cyborg Bionic Syst. 5, 97 (2024).
Chen, T., Xiao, J., Hu, X., Zhang, G. & Wang, S. Boundary-guided network for camouflaged object detection. Knowl.-Based Syst. 248, 108901 (2022).
Liu, X., Li, Z., Zhou, Y., Peng, Y. & Luo, J. Camera-radar fusion with modality interaction and radar gaussian expansion for 3D object detection. Cyborg Bionic Syst. 5, 79 (2024).
Sun, Y., Chen, G., Zhou, T., Zhang, Y. & Liu, N. Context-Aware Cross-Level Fusion Network for Camouflaged Object Detection 1025–1031. https://doi.org/10.24963/ijcai.2021/142 (2021).
Uesugi, K., Mayama, H. & Morishima, K. Analysis of rowing force of the water strider middle leg by direct measurement using a bio-appropriating probe and by indirect measurement using image analysis. Cyborg Bionic Syst. 4, 61 (2023).
Chen, J., Ye, H., Ying, Z., Sun, Y. & Xu, W. Dynamic trend fusion module for traffic flow prediction. Appl. Soft Comput. 174, 112979. https://doi.org/10.1016/j.asoc.2025.112979 (2025).
Padhy, I., Kanungo, P. & Sahoo, S. Multiclass classification of camouflage images using combined wld and lpq feature set using a ann classifier. In Advances in Signal Processing and Communication Engineering 85–97 (Springer, 2024).
Padhy, I. et al. \({YC}_b{C}_r\) model based shadow detection and removal approach on camouflaged images. In 2022 OITS International Conference on Information Technology (OCIT) 574–579. https://doi.org/10.1109/OCIT56763.2022.00112 (IEEE, 2022).
Padhy, J. Camouflaged object detection using hybrid-deep learning model. Multimed Tools and Applications (2024).
Lu, L. et al. Generative deep-learning-embedded asynchronous structured light for three-dimensional imaging. Adv. Photon. 6, 4. https://doi.org/10.1117/1.AP.6.4.046004 (2024).
Deng, W. Learning to compose diversified prompts for image emotion classification. Comp. Visual Media 10, 1169–1183. https://doi.org/10.1007/s41095-023-0389-6 (2024).
Wang, B., Yang, M., Cao, P. & Liu, Y. A novel embedded cross framework for high-resolution salient object detection: A novel embedded cross framework for high-resolution salient object detection. Appl. Intell. 55, 1. https://doi.org/10.1007/s10489-024-06073-x (2025).
Liao, H. et al. Meta-learning based domain prior with application to optical-ISAR image translation. IEEE Trans. Circuits Syst. Video Technol. 34, 7041–7056. https://doi.org/10.1109/TCSVT.2023.3318401 (2024).
Zhu, J., Zhang, X., Zhang, S. & Liu, J. Inferring camouflaged objects by texture-aware interactive guidance network. Proc. AAAI Conf. Artif. Intell. 35, 3599–3607. https://doi.org/10.1609/aaai.v35i4.16475 (2021).
Wang, Q. et al. Depth-aided camouflaged object detection. In Proceedings of the 31st ACM International Conference on Multimedia, MM ’23 3297–3306. https://doi.org/10.1145/3581783.3611874 (Association for Computing Machinery, 2023).
Kamran, M., Rehman, S. U., Meraj, T., Alnowibet, K. A. & Rauf, H. T. Camouflage object segmentation using an optimized deep-learning approach. Mathematics 10, 219. https://doi.org/10.3390/math10224219 (2022).
Ji, G.-P., Zhu, L., Zhuge, M. & Fu, K. Fast camouflaged object detection via edge-based reversible re-calibration network. Pattern Recogn. 123, 108414. https://doi.org/10.1016/j.patcog.2021.108414 (2022).
Shi, J., Liu, C. & Liu, J. Hypergraph-based model for modelling multi-agent Q-learning dynamics in public goods games. IEEE Trans. Netw. Sci. Eng. 11, 6169–6179. https://doi.org/10.1109/TNSE.2024.3473941 (2024).
Zhang, R. et al. MvMRL: A multi-view molecular representation learning method for molecular property prediction. Brief. Bioinform. 25, 298. https://doi.org/10.1093/bib/bbae298 (2024).
Cai, J. et al. Broken ice circumferential crack estimation via image techniques. Ocean Eng. 259, 111735. https://doi.org/10.1016/j.oceaneng.2022.111735 (2022).
Huang, C. et al. Correlation information enhanced graph anomaly detection via hypergraph transformation. IEEE Trans. Cybern. 55, 2865–2878. https://doi.org/10.1109/TCYB.2025.3558941 (2025).
Wang, Z., Bovik, A. & Sheikh, H. Structural similarity based image quality assessment. In Digital Video Image Quality and Perceptual Coding, Series in Signal Processing and Communications. https://doi.org/10.1201/9781420027822.ch7 (2005).
Zhou, Z. et al. Resource-saving and high-robustness image sensing based on binary optical computing. Laser Photon. Rev. 19, 2400936. https://doi.org/10.1002/lpor.202400936 (2024).
Xu, X. et al. Three-dimensional reconstruction and geometric morphology analysis of lunar small craters within the patrol range of the Yutu-2 Rover. Remote Sens. 15, 4251. https://doi.org/10.3390/rs15174251 (2023).
Wang, W. et al. Low-light image enhancement based on virtual exposure. Signal Process. Image Commun. 118, 117016. https://doi.org/10.1016/j.image.2023.117016 (2023).
Yu, Y. et al. Optimization of 3D reconstruction of granular systems based on refractive index matching scanning. Opt. Laser Technol. 186, 112662. https://doi.org/10.1016/j.optlastec.2025.112662 (2025).
Lu, L. et al. Generative deep-learning-embedded asynchronous structured light for three-dimensional imaging. Adv. Photon. 6, 46004. https://doi.org/10.1117/1.AP.6.4.046004 (2024).
Zhuang, J., Zheng, Y., Guo, B. & Yan, Y. Globally deformable information selection transformer for underwater image enhancement. IEEE Trans. Circuits Syst. Video Technol. 35, 19–32. https://doi.org/10.1109/TCSVT.2024.3451553 (2024).
Zeng, Y. et al. GCCNet: A novel network leveraging gated cross-correlation for multi-view classification. IEEE Trans. Multimedia 27, 1086–1099. https://doi.org/10.1109/TMM.2024.3521733 (2024).
Zhai, Q. et al. Mgl: Mutual graph learning for camouflaged object detection. IEEE Trans. Image Process. 32, 1897–1910. https://doi.org/10.1109/TIP.2022.3223216 (2023).
Zhuge, M., Lu, X., Guo, Y., Cai, Z. & Chen, S. Cubenet: X-shape connection for camouflaged object detection. Pattern Recogn. 127, 44. https://doi.org/10.1016/j.patcog.2022.108644 (2022).
Funding
Open access funding provided by Siksha 'O' Anusandhan (Deemed To Be University)
Author information
Authors and Affiliations
Contributions
Isha Padhy: Conceptualization, Methodology, Investigation, Writing—Original Draft, Writing—Review & Editing (Lead Author). Prabhat Dansena: Methodology, Supervision, Data Curation, Writing—Review & Editing. Sampa Sahoo: Conceptualization, Supervision, Writing—Review & Editing. Rahul Priyadarshi: Supervision, Data Curation, Writing—Review & Editing. All authors have read and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Padhy, I., Dansena, P., Sahoo, S. et al. An efficient camouflaged image segmentation with modified UNet and attention techniques. Sci Rep 15, 21086 (2025). https://doi.org/10.1038/s41598-025-07571-9
DOI: https://doi.org/10.1038/s41598-025-07571-9