Introduction

Water bodies, including rivers, lakes, reservoirs, and wetlands, play a critical role in sustaining ecosystems, supporting biodiversity, and meeting the water needs of human societies. Accurate mapping and monitoring of these water bodies are essential for effective water resource management, climate change adaptation, and disaster preparedness1. Satellite imagery, with its ability to cover large spatial extents and provide repeatable observations, has emerged as a powerful tool for identifying and analyzing surface water features. Advances in remote sensing technology and data availability, such as high-resolution imagery from Sentinel-2 satellites, have further enhanced the precision and efficiency of water body segmentation.

Sentinel-2, part of the Copernicus program initiated by the European Space Agency (ESA), offers multi-spectral imagery with a spatial resolution ranging from 10 to 60 m. Its spectral bands, spanning the visible, near-infrared (NIR), and shortwave infrared (SWIR) regions, provide valuable data for distinguishing water bodies from other land covers2. These bands make it possible to analyze surface reflectance properties and accurately delineate water features. However, the complexity of real-world conditions, such as the presence of vegetation, shadows, and atmospheric disturbances, poses significant challenges for accurate water body extraction from satellite imagery.

Conventional techniques for water body segmentation often rely on spectral indices. Among these, the Normalized Difference Water Index (NDWI)3 and its modified versions, such as the Modified NDWI (MNDWI)4 and Automated Water Extraction Index (AWEI), have been widely used5. These indices exploit the differences in reflectance between water and other surfaces in specific spectral bands, providing a straightforward method for water detection. While effective under controlled conditions, these methods are sensitive to external factors such as cloud cover, turbidity, and mixed pixels, leading to inaccuracies in complex or heterogeneous environments6.

To overcome these limitations, researchers have increasingly turned to machine learning approaches, which leverage data-driven algorithms to enhance classification and segmentation accuracy7. Techniques such as Support Vector Machines (SVM), Random Forests (RF), and Gradient Boosting classifiers have been successfully applied to combine spectral, spatial, and textural features for water body segmentation. These models, which do not depend solely on pre-defined indices, demonstrate greater adaptability to varying environmental conditions and datasets. However, traditional machine learning approaches often require manual feature engineering, which can be time-consuming and limits scalability8.

In recent years, advancements in artificial intelligence (AI) and deep learning (DL) have introduced a paradigm shift in remote sensing applications. Deep learning models, such as convolutional neural networks (CNNs), have shown remarkable performance in image analysis tasks, including semantic segmentation. These models learn hierarchical features directly from raw data, eliminating the need for manual feature design and enabling high precision across diverse scenarios. Furthermore, the flexibility of deep learning architectures allows them to adapt to complex patterns, such as the dynamic nature of water boundaries, seasonal changes, and environmental disturbances.

This study highlights the significance of utilizing advanced remote sensing techniques and deep learning models to tackle the challenges of water body segmentation. By incorporating multi-spectral satellite imagery and cutting-edge algorithms, the aim is to improve the accuracy, efficiency, and scalability of water body monitoring systems, ultimately contributing to a deeper understanding of global water dynamics and better resource management. In line with this objective, recent research has increasingly focused on deep learning-based models. In this context, we have reviewed several methodologies for extracting water body areas from satellite images.

Dmytro and Ghulam9 proposed a U-Net-based model for segmenting water bodies from satellite images, achieving an Intersection over Union (IoU) score of 0.60. Tin Moh and Zin Mar10 implemented a block attention-based U-Net approach to distinguish water and non-water regions in remote satellite images, achieving an IoU of 0.61. Silpalatha and Jayadeva11 introduced a ResNet-based method for segmenting water bodies in satellite images with an IoU of 0.75.

Harika et al.12 applied a DeepLabV3+ model to extract water-contaminated areas from color-based satellite imagery, achieving an IoU of 0.72. The semantic segmentation network (SegNet) model, used by Badrinarayanan et al.13, was also employed to segment water bodies in satellite images, resulting in an IoU of 0.77. Finally, Paszke et al.14 utilized the efficient neural network (ENet) model to segment water bodies, yielding an IoU of 0.79. Table 1 summarizes these state-of-the-art models.

Table 1 Different state-of-the-art models.

From the above-mentioned literature, we address the following issues:

  1. Despite the use of advanced models like U-Net, ResNet, and DeepLabV3+, the Intersection over Union (IoU) scores achieved by these models remain relatively modest, with values ranging from 0.60 to 0.79. This suggests that while the models are effective to some degree, they still struggle to achieve high levels of precision in complex satellite imagery.

  2. Several models, including the DeepLabV3+ and ResNet-based approaches, may be sensitive to the quality and resolution of the satellite images. Variations in lighting conditions, weather patterns, or cloud cover can degrade the model’s ability to accurately segment water bodies.

These challenges underscore the need for continued refinement and adaptation of deep learning models to achieve more reliable and efficient water body segmentation. To address these issues, we incorporate attention mechanisms and residual (ResNet-style) blocks into the U-Net architecture. This integration enhances the model’s ability to focus on key features, improves its generalization across diverse datasets, and boosts overall performance in water body segmentation tasks. The resulting model, AER U-Net (Attention-Enhanced Multi-Scale Residual U-Net), is a fully convolutional network for semantic segmentation. Among the many adaptations of existing neural networks to image segmentation, AER U-Net stands out by offering strong overall performance with minimal information loss. Prior work has refined semantic segmentation results, particularly edge and boundary recognition, to improve baseline predictions. To obtain high-resolution predictions, we employ long-distance residual connections over multi-scale features throughout the downsampling path. The segmentation outputs are post-processed by a residual refinement module, an independent encoder-decoder, and a refinement residual block is employed to improve the feature maps. This special-purpose refinement network applies both global and local refinement to increase prediction accuracy23. Nevertheless, many existing methods struggle to detect water reliably across wide areas, which frequently leads to unreliable or incomplete identification of water bodies in large-scale assessments. The primary contributions of our work are:

  • We propose AER U-Net, a multi-layered residual framework for segmentation. By combining multi-scale residual blocks, multi-dilated convolutions, and skip connections, AER U-Net captures complex structure in the data and increases segmentation precision.

  • The model was trained with the Adam optimiser, using effective training methods and network-architecture optimisation strategies to improve performance. Transfer learning is incorporated to accelerate convergence, and data augmentation during training exposes the model to a variety of scenarios, improving its robustness. Regularisation strategies, careful hyperparameter selection, and thorough validation on a representative dataset further increase robustness.

Materials and models

Figure 1 presents a streamlined workflow for the segmentation of water bodies from Sentinel-2 satellite imagery, illustrating the sequence of key processes. It begins with the acquisition of Sentinel-2 imagery, capturing high-resolution multi-spectral data over the target regions. This raw data is then divided into training and testing datasets to facilitate the development and evaluation of the segmentation model. In the next stage, the data undergoes pre-processing to enhance its quality and ensure consistency. This step typically includes operations such as normalization and resizing to standard dimensions, which prepare the imagery for subsequent analysis.

Following pre-processing, the refined data is input into a modified U-Net model for segmentation. This advanced deep learning architecture is tailored to accurately identify and delineate water bodies within the imagery, leveraging both spatial and spectral information to deliver precise results. Finally, the segmented output is subjected to quantitative analysis, where performance metrics such as accuracy, precision, recall, and IoU are computed. These metrics provide a rigorous evaluation of the model’s performance, ensuring reliability and applicability in real-world scenarios. This comprehensive workflow highlights the integration of remote sensing, data preparation, advanced modeling, and evaluation to achieve accurate water body segmentation from satellite imagery.

Fig. 1

Working process of the implemented approach.

Materials

The dataset utilized for model training and evaluation focuses on identifying and segmenting water bodies from satellite imagery. It is obtained from the Kaggle dataset titled Satellite Images of Water Bodies, featuring satellite data captured by the Sentinel-2 satellite24. The dataset is structured into two primary folders: Images and Masks. Here, masks are generated using the NDWI, a standard method employed to detect and map water bodies in satellite imagery. The NDWI exploits the spectral differences between water and non-water surfaces by comparing the reflectance in the green and NIR spectral bands. Water typically absorbs infrared wavelengths while reflecting green light, making it easier to detect using this index. The mathematical formula used for computing the NDWI is as follows:

$${\text{NDWI}} = \frac{{\text{Band3}} - {\text{Band8}}}{{\text{Band3}} + {\text{Band8}}}$$
(1)

where Band 3 is Sentinel-2’s green channel and Band 8 is Sentinel-2’s near-infrared (NIR) channel. Figure 2 illustrates the workflow of AER U-Net, and Fig. 3 shows sample water body images with their corresponding masks.
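
As a minimal illustration, the NDWI in Eq. (1) can be computed per pixel from the green and NIR bands with NumPy. The function names, the epsilon guard against division by zero, and the zero threshold for deriving a binary mask are assumptions for this sketch, not details of the dataset's own mask-generation pipeline.

```python
# Minimal sketch of Eq. (1); the epsilon guard and the zero threshold for the
# binary water mask are illustrative assumptions.
import numpy as np

def ndwi(green: np.ndarray, nir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """NDWI = (Band3 - Band8) / (Band3 + Band8) for Sentinel-2 reflectance arrays."""
    green = green.astype(np.float32)
    nir = nir.astype(np.float32)
    return (green - nir) / (green + nir + eps)

def water_mask(green: np.ndarray, nir: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Pixels with NDWI above the threshold are labelled water (1), all others 0."""
    return (ndwi(green, nir) > threshold).astype(np.uint8)
```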

Fig. 2

Workflow of AER U-Net.

Pre-processing

Data preprocessing is an essential first stage in preparing data for DL model training. It involves cleaning, organizing, and converting raw data into a suitable format that allows seamless interaction with the model. The preprocessing phase focuses on discarding irrelevant information and addressing inconsistencies in the dataset. Furthermore, it standardizes the data to ensure uniformity across all inputs, facilitating efficient and effective model training. The following sequence of operations is carried out during the preprocessing phase:

  1. Image Resizing: The first stage involves resizing all images in the dataset to a uniform size. Working with smaller, uniformly sized images speeds up the training process and reduces computational time compared to processing larger and varied image sizes. The resizing is accomplished using the resize function from the cv2 module, ensuring each image adheres to the same dimensions for consistency.

  2. Pixel Scaling: After resizing, each image is passed through a function called mask_split_threshold, which scales the pixel values of the images to a range between 0 and 1. This normalization ensures that the input data has a standard range, making it easier for the model to learn patterns without being influenced by varying raw value ranges. Scaling accelerates the convergence of models during the training process and improves overall performance.

  3. Padding Removal: The final step removes unnecessary padding pixels that may exist around the images. Such pixels can introduce inconsistencies during both training and testing, and all images must be identical in size for the model to perform optimally. This step therefore retains only the relevant portions of the images, discarding extraneous pixels that may introduce noise or distortions. A minimal sketch of these three preprocessing steps follows this list.
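
The following sketch illustrates how these three preprocessing steps might be implemented with OpenCV and NumPy. The target size, the binarisation threshold, and the body of mask_split_threshold (only its name appears in the text above) are assumptions for illustration.

```python
# Sketch of the three preprocessing steps described above; sizes and thresholds
# are assumptions, not the exact settings used in the study.
import cv2
import numpy as np

TARGET_SIZE = (256, 256)  # assumed uniform width x height

def preprocess_image(img: np.ndarray) -> np.ndarray:
    img = cv2.resize(img, TARGET_SIZE, interpolation=cv2.INTER_AREA)  # 1. resize
    return img.astype(np.float32) / 255.0                             # 2. scale to [0, 1]

def mask_split_threshold(mask: np.ndarray, threshold: int = 127) -> np.ndarray:
    """Assumed behaviour: resize the mask and binarise it to {0, 1}."""
    mask = cv2.resize(mask, TARGET_SIZE, interpolation=cv2.INTER_NEAREST)
    return (mask > threshold).astype(np.float32)

def strip_padding(img: np.ndarray, pad: int = 0) -> np.ndarray:
    """3. Crop away a border of irrelevant padding pixels, if present."""
    return img[pad:img.shape[0] - pad, pad:img.shape[1] - pad] if pad > 0 else img
```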

Fig. 3

Sample water body images: (a) Original; (b) Mask.

AER U-Net

U-Net25 is a well-known deep-learning approach for image segmentation due to its precision, robustness, and adaptability. Therefore, based on this idea, we proposed a modified U-Net by incorporating advanced building blocks like residual connections and attention mechanisms, making it more robust and capable of focusing on critical regions in satellite images. The basic building blocks of the suggested U-Net are described below, and its specifications are tabulated in Table 2.

  1. Convolution block: The convolutional block is the core unit for feature extraction. It applies a convolutional layer followed by batch normalization and a non-linear activation function. Batch normalization ensures stability and accelerates training, while the activation function introduces non-linearity to model complex mappings. This block is strategically used throughout the network for feature extraction and transformation26.

  2. Residual Block: Residual blocks address the vanishing gradient problem by introducing skip connections. They preserve essential features across layers and allow deeper networks to train effectively. Each residual block aligns the input channels to match the output using a 1 × 1 convolution, followed by two convolutional blocks interleaved with a dropout layer for regularization. Finally, the shortcut connection adds the input back to the output. This design facilitates better gradient flow and ensures that critical information is not lost as the network depth increases.

  3. Attention Block: The attention block refines skip connections by focusing on relevant spatial regions. It aligns the skip connection and gating signal dimensions using 1 × 1 convolutions. The feature maps are summed and activated with ReLU, followed by a sigmoid function to generate an attention map. This attention map highlights important regions and is multiplied with the skip connection to refine the input to the decoder. This mechanism is particularly useful in tasks like water body segmentation, where distinguishing between the foreground (water) and background is crucial. A minimal sketch of the residual and attention blocks is given after this list.

  4. Encoder: The encoder captures hierarchical feature representations from the input image. Each level consists of a residual block for feature extraction, followed by a max-pooling layer to reduce spatial dimensions. The number of filters doubles with each level, enabling the network to learn increasingly abstract patterns. The encoder’s role is to transform the input into a compressed, high-dimensional representation that retains essential features for segmentation.

  5. Bottleneck (Center): The bottleneck serves as the transition between the encoder and the decoder. It is designed to process the compressed features obtained from the encoder, extracting the deepest and most abstract representations of the input. This module consists of a residual block with 256 filters and an additional dropout layer for regularization. By capturing high-level contextual information, the bottleneck ensures that the decoder has access to features representing both global and local patterns in the input image.

  6. Decoder: The decoder reconstructs the segmentation mask by upsampling feature maps and merging them with attention-refined skip connections. Each level begins with a transposed convolution to upsample the feature maps, followed by an attention block that combines the upsampled features with those from the encoder. A residual block processes the combined features to enhance detail and accuracy. This process is repeated at each level, reducing the number of filters and progressively restoring the image’s original resolution26.

  7. Output: The output layer produces the final segmentation mask, representing the probability of each pixel belonging to the target class. It uses a 1 × 1 convolution to reduce the feature maps to a single channel, followed by an activation function to normalize the predictions. The sigmoid activation ensures that the output values are in the range [0, 1], suitable for binary segmentation tasks. This design enables the network to produce precise segmentation masks with well-defined boundaries.
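
As a rough illustration of the convolution, residual, and attention blocks described above, a minimal Keras sketch follows. Filter counts, the dropout rate, and initialisers are assumptions rather than the exact settings listed in Table 2.

```python
# Minimal Keras sketch of the building blocks described above; hyperparameters
# are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Convolution -> batch normalization -> ReLU."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def residual_block(x, filters, dropout=0.1):
    """Two conv blocks with dropout, plus a 1x1-aligned shortcut connection."""
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)   # align input channels
    y = conv_block(x, filters)
    y = layers.Dropout(dropout)(y)
    y = conv_block(y, filters)
    return layers.Add()([shortcut, y])                         # skip connection

def attention_gate(skip, gating, inter_channels):
    """Refine a skip connection with an attention map derived from the gating signal.

    Assumes skip and gating already share the same spatial size (the gating signal
    is the upsampled decoder feature at that level).
    """
    theta = layers.Conv2D(inter_channels, 1)(skip)             # project skip features
    phi = layers.Conv2D(inter_channels, 1)(gating)             # project gating signal
    att = layers.Activation("relu")(layers.Add()([theta, phi]))
    att = layers.Conv2D(1, 1, activation="sigmoid")(att)       # attention map in [0, 1]
    return layers.Multiply()([skip, att])                      # weighted skip connection
```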

  • Attention-Enhanced U-Net Framework: The attention-enhanced U-Net is an extended U-Net architecture that incorporates attention mechanisms to refine feature selection and improve segmentation performance. A standard U-Net transfers encoder features straight to the decoder via skip connections, which can carry extraneous information. By using attention gates (AGs) or self-attention mechanisms, the network suppresses background noise and selectively focuses on the most relevant regions, increasing segmentation accuracy. Variants such as Transformer-based U-Net (TransUNet) and Attention U-Net with SE blocks further improve spatial and channel-wise feature learning. This approach works especially well where accurate segmentation is essential, such as in autonomous systems, medical imaging, and remote sensing. In complex segmentation tasks, the attention-enhanced model performs better because it reduces false positives, increases robustness, and makes more effective use of the extracted features.

  • Residual Blocks for Enhanced Feature Extraction: Residual blocks help deep neural networks learn complex features more effectively by addressing the vanishing gradient problem. A residual block consists of shortcut (skip) connections that bypass one or more layers, allowing the network to learn residual mappings rather than direct transformations. This preserves important features and enables deeper architectures without degradation. Combined with batch normalisation and ReLU activation, residual blocks stabilise training and improve gradient flow. Widely used in architectures such as ResNet, U-Net variants, and Transformer models, they are particularly effective in deep segmentation, image processing, and medical imaging tasks because they enhance feature representation and capture fine details.

  • Adaptive Adam Learning Optimization: The proposed approach trains the model with the adaptive Adam optimiser at a learning rate of 0.001, leveraging the strengths of several optimisation techniques to improve training efficiency and convergence (see the sketch after this list). Adam builds on stochastic gradient descent (SGD) by combining RMSprop-style adaptive learning rates with momentum. The RMSprop component adjusts the learning rate for each parameter according to recent gradients, stabilising optimisation in the presence of noisy gradients or shifting objectives. The momentum term speeds up convergence by smoothing updates across iterations, reducing oscillations, and helping the optimiser pass quickly over shallow regions of the loss surface. The stochastic aspect of SGD allows the model to update parameters from mini-batches, escaping local minima and often yielding more general solutions. Together, these elements provide a flexible and effective optimiser that converges quickly and stably, particularly in challenging deep learning problems. Keeping the learning rate at 0.001 strikes a balance between rapid convergence and stability, avoiding overfitting while making steady progress towards a good solution.
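
A minimal sketch of the assumed training configuration is shown below. The learning rate of 0.001 follows the text, while the binary cross-entropy loss and the monitored accuracy metric are assumptions; IoU and Dice can be computed post hoc, for example with the metric sketch in the next section.

```python
# Assumed training configuration: Adam with learning rate 0.001 as stated;
# loss and metric choices are illustrative assumptions.
import tensorflow as tf

def compile_aer_unet(model: tf.keras.Model) -> tf.keras.Model:
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",   # binary water / non-water segmentation
        metrics=["accuracy"],
    )
    return model
```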

Table 2 Block-wise configurations of the proposed modified U-Net.
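
For orientation only, the sketch below shows one way the blocks described above could be assembled into the encoder-bottleneck-decoder structure. It reuses the residual_block and attention_gate helpers from the earlier sketch, and the depth, base filter count, and input shape are assumptions that need not match Table 2 exactly.

```python
# Rough illustration of assembling the blocks into an AER U-Net-style network;
# assumes residual_block and attention_gate from the previous sketch are defined.
import tensorflow as tf
from tensorflow.keras import layers

def build_aer_unet(input_shape=(256, 256, 3), base_filters=32, depth=3):
    inputs = layers.Input(input_shape)
    skips, x, filters = [], inputs, base_filters
    for _ in range(depth):                                    # encoder: residual block + pooling
        x = residual_block(x, filters)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
        filters *= 2                                           # filters double at each level
    x = residual_block(x, filters)                             # bottleneck
    x = layers.Dropout(0.2)(x)                                 # extra regularization
    for skip in reversed(skips):                               # decoder: upsample + attention + residual
        filters //= 2
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, attention_gate(skip, x, filters)])
        x = residual_block(x, filters)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)     # per-pixel water probability
    return tf.keras.Model(inputs, outputs)
```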

Quantitative measures

Quantitative measures for image segmentation provide metrics that allow the quality of a segmentation model to be evaluated objectively. These measures are essential for assessing how well the model partitions an image into meaningful regions. In this work, we considered Intersection over Union (IoU), Dice coefficient, recall, precision, F1-score, and accuracy28; a simple NumPy sketch of these metrics is provided after their definitions below. The entire process of the proposed model is illustrated in Algorithm 1.

Evaluation metrics

  • Accuracy: The fraction of correctly classified instances out of the total number of instances.

    $${\text{Accuracy}} = \frac{{\text{Tp}} + {\text{Tn}}}{{\text{Tp}} + {\text{Tn}} + {\text{Fp}} + {\text{Fn}}}$$

    where Tp = true positives, Tn = true negatives, Fp = false positives, and Fn = false negatives.

  • Precision: The ratio of correctly classified positive instances to the total number of instances predicted as positive.

    $${\text{Precision}} = \frac{{\text{Tp}}}{{\text{Tp}} + {\text{Fp}}}$$
  • Recall: The fraction of correctly classified positive instances out of all actual positive instances.

    $${\text{Recall}} = \frac{{\text{Tp}}}{{\text{Tp}} + {\text{Fn}}}$$
  • F1-score: The harmonic mean of precision and recall.

    $${\text{F1-score}} = \frac{2 \times {\text{Precision}} \times {\text{Recall}}}{{\text{Precision}} + {\text{Recall}}}$$
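
The metrics above can be computed directly from binary prediction and ground-truth masks. The NumPy sketch below is one possible implementation, treating water (class 1) as the positive class; the small epsilon added to avoid division by zero is an implementation assumption.

```python
# NumPy sketch of the segmentation metrics above for binary masks.
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> dict:
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)        # true positives
    tn = np.sum(~pred & ~truth)      # true negatives
    fp = np.sum(pred & ~truth)       # false positives
    fn = np.sum(~pred & truth)       # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall + eps),
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
    }
```
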
Algorithm 1

Water Body Segmentation

Results and discussion

Table 3 describes the dataset, and this section outlines the experimental findings of the proposed model, emphasizing its performance relative to leading methods in the field. A thorough examination of the results is provided, illustrating why our approach delivers superior outcomes compared to current techniques.

Table 3 Data set description.

To predict water body areas from satellite images, we began by pre-processing the images through resizing, scaling, and padding removal. Next, we applied the modified U-Net model to detect water regions in the processed images29. The model learned relevant features automatically through multiple hidden layers and was trained using the backpropagation algorithm. The model’s performance was then evaluated using various metrics, including IoU, Dice, precision, recall, F1-score, and accuracy, which are presented in Table 7. For the experiments, the dataset was divided into 80% for training and 20% for testing. The experiments were conducted on a desktop featuring an 11th Gen Intel(R) Core(TM) i7-11700 processor (2.50 GHz) with 32 GB of RAM and a 1 TB SSD, using Google Colab.
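
A minimal sketch of the assumed training procedure around the 80/20 split described above is given below; the batch size, number of epochs, and random seed are illustrative assumptions.

```python
# Sketch of the assumed training loop; batch size, epochs, and seed are assumptions.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

def train_model(model: tf.keras.Model, images: np.ndarray, masks: np.ndarray):
    X_train, X_test, y_train, y_test = train_test_split(
        images, masks, test_size=0.2, random_state=42)         # 80% train / 20% test
    history = model.fit(
        X_train, y_train,
        validation_data=(X_test, y_test),
        batch_size=16,
        epochs=50,
    )
    return history, (X_test, y_test)
```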

Fig. 4

Performance of the proposed and existing models.

Ablation analysis

Table 4 reports the performance of the proposed model, and Table 5 compares the numbers of trainable parameters. To demonstrate the contribution of individual elements within the proposed AER U-Net for waterbody segmentation, ablation studies were conducted using the Kaggle dataset. The results presented in the tables illustrate the segmentation effectiveness in a sequential manner: starting from the plain U-Net, followed by U-Net with enhanced attention, U-Net with residuals, and finally the proposed AER U-Net (U-Net with both enhanced attention and residuals).

Table 4 Performance of the proposed model.

The results uncovered several valuable insights:

Utilizing the U-Net architecture alone yielded metrics of 0.831 for precision, 0.83 for recall, 0.832 for F1-Score, and 0.834 for IoU.

The integration of U-Net + Enhanced Attention connections led to significant improvements. Precision, recall, F1-Score, and IoU values rose to 0.89, 0.88, 0.894, and 0.892, respectively, with IoU reaching 0.907.

Further, incorporating U-Net + Enhanced Attention + Residual mechanisms enabled the model to identify crucial parameters while eliminating unnecessary ones. Consequently, precision, recall, F1-score, and IoU saw substantial improvements, reaching 0.943, 0.940, 0.946, and 0.948, respectively. Table 6 compares configurations with different layers, and Fig. 4 visualizes the IoU scores of the proposed and existing models.

Table 5 Comparison with trainable parameters.

Based on the outcomes, several valuable observations were made:

  • For U-Net with a depth of n = 3, the precision, recall, F1-score, and IoU metrics were 0.841, 0.87, 0.823, and 0.827, respectively. Only slight variations were observed for n = 4 and n = 5. Considering the parameter count and complexity, n = 3 is the preferable choice.

  • The integration of U-Net with Enhanced Attention connections using dilated convolution 1 and dilated convolution 2 resulted in noteworthy improvements: precision, recall, F1-score, and IoU were 0.88, 0.884, 0.889, and 0.882 with dilated convolution 1, and 0.891, 0.899, 0.897, and 0.893 with dilated convolution 2.

Introducing U-Net with Enhanced Attention connections alongside multi-scale residual blocks led to the identification of essential parameters while removing unnecessary ones. As a result, precision, recall, F1-score, and IoU were enhanced to 0.943, 0.940, 0.946, and 0.947, respectively.

Table 6 Comparison with different layers.
Table 7 Comparison of the proposed and existing models.

Table 7 presents a comparison of the metrics between the proposed method and several state-of-the-art models, including U-Net, ResNet, DeepLabV3, SegNet, and ENet. According to the statistics reported in Table 7, the modified U-Net model30 demonstrates a higher IoU, a key metric for evaluating semantic segmentation performance. This suggests that the proposed model outperforms the others overall in terms of segmentation accuracy. The main reasons behind the success of the proposed model are:

  1. By using attention layers, the model can focus on the most relevant features of the image (such as water bodies) and suppress less informative regions. This helps improve segmentation accuracy, especially in complex or cluttered images where distinguishing water from non-water regions is challenging31.

  2. Due to the multi-scale feature extraction, the model can accurately segment both large and small water bodies, capturing fine boundaries and irregular shapes that other models might miss.

  3. ResNet introduces residual connections, which allow the network to learn more effectively by addressing the vanishing gradient problem and enabling deeper networks without losing performance. This leads to better feature extraction and improved segmentation accuracy30,31,32.

Conclusion

This study presents a robust and efficient deep learning approach for water body detection from satellite imagery, leveraging a modified U-Net architecture. The proposed model incorporates advanced features such as residual blocks, attention mechanisms, and dropout layers to improve segmentation accuracy and enhance generalizability. By employing a contracting-expanding path design with optimized activation functions, kernel initializers, and multi-channel feature maps, the model effectively captures and processes complex spatial features. Key architectural enhancements, including attention-refined skip connections and dropout regularization, address challenges such as overfitting and the vanishing gradient problem, while the Adam optimizer accelerates training and ensures efficient convergence. Data preprocessing techniques such as resizing, scaling, and padding removal further contribute to the model’s precision and performance. The model performs well even on small and hazy satellite images, and the AER U-Net architecture delivers accurate results for water bodies located near land boundaries. The proposed approach achieves an IoU score of 0.94, demonstrating superior performance compared to existing methods. Its adaptability to high-resolution imagery and its capability to accurately delineate water bodies make it a valuable tool for environmental monitoring, resource management, and disaster assessment.