Abstract
Underwater object detection plays a crucial role in applications such as marine ecological monitoring and underwater rescue operations. However, challenges such as limited underwater data availability and low scene diversity hinder detection accuracy. In this paper, we propose the Underwater Layout-Guided Diffusion Framework (ULGF), a diffusion model-based framework for augmenting underwater detection datasets. Unlike conventional methods that generate underwater images by converting in-air imagery, ULGF operates exclusively on a small set of underwater images and their corresponding labels, requiring no external data. Our approach enables the generation of high-fidelity, diverse, and theoretically unlimited underwater images, substantially enhancing object detection performance in real-world underwater scenarios. Furthermore, we evaluate the quality of the generated underwater images and show that ULGF produces images with a smaller domain gap. We have publicly released the ULGF source code and the generated dataset to support further research.
Introduction
Underwater object detection is a fundamental technology in underwater robotics, with applications spanning underwater resource collection, marine scientific research, underwater archaeology, and underwater search and rescue operations1,2,3,4. It involves the application of object detection techniques in underwater environments, with the primary objective of accurately identifying and localizing target objects in challenging aquatic conditions.
Currently, most object detection algorithms perform poorly in underwater environments. This is primarily due to the challenges in image acquisition and the complex, dynamic conditions of the underwater world5,6,7,8. During dataset collection, it is difficult to flexibly control the relationships between various targets, which limits the controllability and diversity of the dataset. As a result, the lack of rich and diverse training data prevents models from effectively learning relevant underwater features, leading to poor generalization and low detection accuracy in real-world underwater object detection tasks.
Current mainstream underwater image generation methods face major challenges in practical applications. Physics-based approaches rely on mathematical derivations but struggle to model complex underwater scenes, limiting their applicability; they fail to accurately capture environmental variability and fine detail. Data-driven methods, which use deep learning techniques, have shown promise but require large, high-quality underwater datasets. Moreover, these methods often rely on establishing correspondences between underwater and in-air images, which are difficult to obtain. Most existing approaches generate underwater images by converting in-air ones, yet this fails to accurately replicate the unique optical properties and complexities of underwater environments, resulting in low-fidelity synthetic images.
In summary, current mainstream underwater image generation methods have notable limitations. Physics-based approaches struggle to model complex underwater scenes, while data-driven deep learning methods, though promising, rely on large, high-quality datasets and correspondences between underwater and in-air images, which are inherently difficult to obtain. The difficulty in acquiring rich and diverse underwater data, in turn, limits the performance of object detection algorithms in underwater environments. Therefore, underwater image generation for object detection remains a key and challenging research direction, requiring methods that are more efficient and less dependent on data.
In this study, we propose the Underwater Layout-Guided Diffusion Framework (ULGF) for underwater image generation, aiming to achieve theoretically unlimited and flexible image generation for specific underwater scenes, thereby enhancing the performance of underwater object detection models (Fig. 1E). The key innovation of ULGF lies in its ability to produce high-fidelity underwater images without requiring external datasets. Experimental evaluations on multiple mainstream object detection frameworks demonstrate that ULGF substantially improves detection accuracy. Additionally, we showcase how ULGF can generate high-detail underwater images from limited datasets without requiring manual annotations.
Fig. 1 | ULGF is an image generation framework specifically designed for underwater environments; YOLO is an advanced single-stage object detection algorithm. A Illustrates the process of adding noise to underwater images. Every image input to ULGF has random noise added; assuming the original image has a noise level of 0 and the random noise level is 10, we present several typical noisy images. B Shows the training process of ULGF. The denoising U-Net33 model is trained by measuring the difference between the added noise and the predicted noise; note that the image encoder remains frozen during this process. C Describes the inference process of ULGF. Through T iterations of denoising the original noise, an image with the specified layout is generated. During this process, a randomly initialized noisy image is transformed into an underwater image, guided by prompts containing target and location information, and the resulting image can be used to train detectors. D Illustrates the feature extraction process for different targets. The underwater image is encoded to extract feature information, the label corresponds to the target's position annotation in the image, and PKE records the feature distribution of each target and the environment. By sampling the feature distributions recorded in PKE, a prior feature distribution map with a specified layout is obtained. E Visualizes the process of generating denoised underwater images.
A comparative analysis with mainstream underwater image generation methods demonstrates the superior image quality and diversity of our approach. To facilitate further research in this field, we have open-sourced the code of the underwater framework and the generated underwater image dataset, allowing researchers to seamlessly apply it to underwater object detection tasks and further advance the development of underwater image generation and detection technologies.
Results
ULGF provides a more reliable dataset for underwater object detection
We used images covering ten target categories from the RUOD9 underwater dataset (Fig. 2A): fish, divers, starfish, coral, turtles, echinus, holothurian, scallops, cuttlefish, and jellyfish. The original dataset uses the YOLO label format, which records the target class and bounding-box coordinates for each image. To meet our task requirements, we reorganized the RUOD dataset and converted it into {image, description, mask} triplets for more efficient training and inference (Fig. 1C).
Fig. 2 | A Illustrates the dataset used for training the ULGF model. This dataset consists of ten distinct types of underwater targets. B Presents the labeling process for underwater images. The YOLO labels from the original dataset were re-annotated using Geo encoding and, combined with the scene descriptions from (C), textual prompts corresponding to the underwater images were generated. C Shows the scene-description process for a given image using the Qwen vision-language model, which can analyze the scene depicted in an input image. Additionally, C indicates that the model effectively filters out underwater images generated from in-air images, providing evidence that the Qwen vision-language model can be used to assess the quality of generated images.
In the data preprocessing process, we first resize the images to ensure consistent input dimensions. Then, we randomly flip the input images with a probability of 0.5 to enhance the model’s robustness. During training, we randomly remove the corresponding prompt words with a probability of 0.1. These operations help improve the model’s generalization ability during training and prevent overfitting.
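A minimal sketch of this preprocessing, assuming PyTorch/torchvision inputs; the 512-pixel target size is illustrative, and in the full pipeline the layout labels are flipped together with the image.

```python
import random
import torchvision.transforms.functional as TF

def preprocess_sample(image, prompt, image_size=512,
                      flip_p=0.5, prompt_drop_p=0.1):
    """Resize, randomly flip, and randomly drop the text prompt."""
    # Resize so every training sample has the same spatial dimensions
    # (512 is illustrative; the released code may use another size).
    image = TF.resize(image, [image_size, image_size])

    # Horizontal flip with probability 0.5. In the full pipeline the
    # layout prompt / boxes must be mirrored consistently with the image.
    flipped = random.random() < flip_p
    if flipped:
        image = TF.hflip(image)

    # Drop the prompt with probability 0.1 so the model also learns
    # unconditional generation.
    if random.random() < prompt_drop_p:
        prompt = ""

    return image, prompt, flipped
```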
The input image is described as the combination of a global description and an object layout description. The global description, “An image of an underwater scene,” provides a concise summary for each image. The object layout description records the objects in the image and their positions. For example, image 8252 corresponds to the layout description ‘cuttlefish <L16><L16> diver <L18><L34>’. We utilized Qwen2.5-VL to verify the accuracy of the global descriptions, as most existing methods generate underwater images from in-air images, and the vision-language model can identify and exclude those deemed non-typical underwater scenes (Fig. 2C).
The layout description of the objects records the position of each target using Geo encoding10, specifically including the pixel coordinates of the top-left and bottom-right corners of the target. This precise spatial mapping facilitates mask generation in subsequent steps. The generated masks, derived from the layout description, play a crucial role in training by guiding loss function calculations, ensuring that the model prioritizes actual target regions. By integrating images, descriptions, and masks, the training targets can more accurately focus on the target areas, improving detection and recognition accuracy.
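A sketch of how such a mask might be rasterized onto the latent grid from normalized bounding boxes; the grid size and the specific foreground/background weights are illustrative, not ULGF's actual values.

```python
import numpy as np

def build_target_mask(bboxes, latent_h, latent_w,
                      background_weight=1.0, foreground_weight=2.0):
    """Rasterize normalized [x1, y1, x2, y2] boxes into a latent-space
    weight mask that up-weights target regions in the training loss.

    The weight values here are illustrative placeholders.
    """
    mask = np.full((latent_h, latent_w), background_weight, dtype=np.float32)
    for x1, y1, x2, y2 in bboxes:
        # Round box corners onto the latent grid; coordinates are
        # assumed normalized to [0, 1] as in YOLO-style labels.
        c1, r1 = int(round(x1 * latent_w)), int(round(y1 * latent_h))
        c2, r2 = int(round(x2 * latent_w)), int(round(y2 * latent_h))
        mask[r1:max(r2, r1 + 1), c1:max(c2, c1 + 1)] = foreground_weight
    return mask

# Example: one box covering the centre of a 64 x 64 latent grid.
m = build_target_mask([[0.25, 0.25, 0.75, 0.75]], 64, 64)
```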
The trained ULGF model performs inference using only the image’s overall description and the layout description of objects. This simplified input reduces computational load while maximizing the semantic information learned during training. The overall description provides a high-level semantic understanding, while the object layout description specifies precise object locations, enabling efficient and accurate inference.
To ensure the validity and diversity of the generated data for object detection, we rigorously filtered the validation set images to ensure that they do not overlap with those in the training set. Specifically, no validation image was reused from training, preventing data leakage and ensuring accurate and fair model evaluation.
To maintain label consistency and accuracy, we exclusively used underwater image labels from the training set. To enhance data richness and diversity, we limited the number of objects per image and applied data augmentation techniques, such as random flipping of bounding boxes. These strategies increase dataset variability and mitigate overfitting to specific images or scenes. The detailed methodology is provided in the Methods section.
In this study, we evaluated several mainstream object detection models to assess their performance in underwater image object detection tasks. The YOLO series11 is among the most widely used object detection models, known for its excellent detection accuracy and real-time performance. The design concept of YOLO is to transform the object detection task into a regression problem, allowing object localization and classification in a single forward pass. YOLO has been widely applied in underwater object detection12. The YOLO series models have been continuously optimized with each version update, improving both detection accuracy and speed. We evaluated different versions of the YOLO model (YOLOv5, YOLOv613, YOLOv11, YOLOv12) to examine whether the generated underwater images consistently improve detection tasks as the detection model evolves.
RT-DETR14 is a recently proposed object detection model based on the Transformer architecture. Unlike traditional Convolutional Neural Network (CNN) models, the Transformer can model global information when processing image features, enabling it to capture long-range dependencies between objects in the image. RT-DETR places particular emphasis on real-time performance, achieving efficient inference while maintaining accuracy, making it an important model for our evaluation.
ULGF substantially enhances target detection performance through dataset expansion (Fig. 3D). We define mean Average Precision (mAP) as the average precision over IoU thresholds from 0.5 to 0.95, and mAP50 as the average precision at IoU = 0.5. As shown in Fig. 3D, ULGF improves the detection accuracy of all tested models on both the UDD and RUOD datasets. On the RUOD dataset, ULGF increases YOLOv5’s mAP50 by 1.2% and mAP by 0.9%; YOLOv6 by 1.4% and 0.7%; YOLOv11 by 1.5% and 1.4%; YOLOv12 by 2.0% and 1.8%; and RT-DETR by 1.4% and 0.9%. On the UDD dataset, compared to the original data, YOLOv5 with ULGF-augmented data achieves a 1.0% improvement in mAP50 and a 2.1% improvement in mAP; YOLOv6 improves by 1.7% and 1.5%, respectively; YOLOv11 by 3.0% and 1.0%; YOLOv12 by 0.7% and 1.1%; and RT-DETR by 0.7% and 0.4%. These results confirm that ULGF consistently enhances underwater object detection performance. Additionally, ULGF is compatible with other detection enhancement strategies, making it a versatile augmentation approach.
Fig. 3 | A Illustrates the process of generating images in various styles using ULGF and evaluates them based on three metrics: color bias, clarity, and contrast. During the training process, we investigate the impact of the weights assigned to occluded regions on the quality of the generated images. B Presents a method for generating underwater images using both underwater and land images: by obtaining depth information from land images and light-field information from underwater images, a U-Net network is used to render images with an underwater style. C Follows a similar approach to generating underwater images from land images, but it guides the generator by determining whether the generated underwater images resemble real underwater images. D Provides a detailed demonstration of the effectiveness of the dataset augmented by ULGF during training. We selected the classic YOLO series and RT-DETR as representative detectors to observe specific training details. The experiments were conducted on both the RUOD and UDD datasets. ULGF improves the accuracy of all tested detectors. Metrics marked with * indicate results trained on the augmented dataset.
We conducted a detailed analysis of the per-category accuracy improvements in object detection using ULGF. Taking YOLOv12 as an example, we trained the detection model with the ULGF-augmented dataset and observed accuracy improvements across all categories. Specifically, the mAP50 for the “jellyfish” category increased by 3.3%, while mAP improved by 3.2%. For “holothurian”, mAP50 increased by 4.3% and mAP improved by 2.6% (Fig. 4D). These results demonstrate the positive effect of ULGF on different types of underwater targets.
Fig. 4 | A Shows the prior distributions of different target types. We randomly selected 30 samples from each category in the RUOD dataset for visualization, roughly simulating the prior features of each target. B Compares the evaluation metrics of ULGF with other mainstream underwater image generation methods. FID measures the distribution similarity between generated and real images, MSE and PSNR calculate image distortion based on error, IS evaluates the clarity and diversity of generated images, while SSIM perceives image quality in terms of structure, luminance, and contrast. We selected the UWCNNtype1 method as UWCNN and UWCNNtype9 as UWCNN*. C Presents the count and percentage of each target category in the RUOD dataset. D Visualizes the accuracy improvements of ULGF for various target categories (based on YOLOv12). E Shows the variation in layout accuracy of the generated images under different occlusion strategies. F Illustrates the impact of the level of prior knowledge fusion on the generated image’s FID score and detection task accuracy.
ULGF is an underwater image generation method that does not depend on in-air images
Our framework verifies the quality and authenticity of the generated underwater images by comparing them with the real underwater images used during training. This comparison helps evaluate whether the generated images successfully reproduce the visual characteristics of underwater environments.
To demonstrate the superiority of ULGF, we designed two sets of comparison experiments. The first set compares the underwater images generated by other methods with their corresponding in-air images. The goal is to test whether these methods can accurately preserve the spatial structure of the in-air images while effectively incorporating the unique visual changes of the underwater environment during generation.
The second set compares the generated underwater images with the real underwater images used in training. This experiment aims to evaluate whether these methods can maintain consistency with the training data when generating underwater images, thus verifying the stability and reliability of their generation capabilities. Since ULGF does not rely on land images, we only used real underwater images for the metric calculations in our experiments. Additionally, for methods that do not use underwater images for training, we performed metric calculations with the UIEB dataset15.
In addition to the above metrics, we further employ the vision-language model Qwen2.5-VL to describe the image scenes16, in order to validate the realism and superiority of our generated underwater images. To eliminate any human interference or bias, no guiding prompts are provided in the experiment. Qwen2.5-VL is asked to directly describe our generated underwater images as well as the underwater images generated by other methods (such as aerial image generation methods), allowing for an objective evaluation of the differences in scene realism and underwater characteristics across different generation methods (Fig. 2C).
Experimental analysis shows that ULGF outperforms mainstream underwater image generation methods in Fréchet Inception Distance17 (FID), a metric measuring the similarity between generated images and real images by comparing their distribution differences in feature space. ULGF achieves an FID of 23, while UWCNN18, UWGAN19, UWNR20, and UWCNN* have FID values of 218.09, 168.79, 229.863, and 225, respectively. Traditional underwater image generation methods learn the style of underwater scenes and generate underwater images based on aerial images (Fig. 4B). This approach results in generated images carrying many features of the aerial images, leading to poor FID scores. In contrast, ULGF generates images solely based on real underwater images, giving it an advantage in terms of image authenticity. Although ULGF performs slightly worse than UWNR in terms of Inception Score21 (IS), it outperforms UWCNN, UWGAN, and UWCNN*. This is because the IS metric measures the diversity of generated images through classification22. Since ULGF generates images of underwater scenes and focuses on generating targets within individual images, its performance on the IS metric is relatively modest.
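As a reference for reproducing this comparison, FID can be computed with the torchmetrics implementation; the 2048-dimensional Inception feature setting and the uint8 input format are library defaults, not choices confirmed by the paper.

```python
from torchmetrics.image.fid import FrechetInceptionDistance

def compute_fid(real_batches, generated_batches):
    """FID between real and generated underwater images.

    Both arguments are iterables of uint8 tensors shaped (N, 3, H, W),
    e.g. produced by a torchvision DataLoader. Requires the
    torch-fidelity backend used by torchmetrics' Inception features.
    """
    fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pool features
    for batch in real_batches:
        fid.update(batch, real=True)       # real underwater images
    for batch in generated_batches:
        fid.update(batch, real=False)      # generated images
    return float(fid.compute())            # lower means closer to the real distribution
```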
In other common evaluation metrics, such as PSNR, MSE, and SSIM23, our method does not show a marked advantage over mainstream underwater image generation algorithms. This is because PSNR and SSIM primarily focus on pixel-level differences between generated and real images. Other underwater image generation methods, which better preserve the original image’s layout and structure, tend to achieve superior results in metrics like PSNR and SSIM. In contrast, ULGF learns the feature distribution of underwater scenes rather than replicating a specific image, which explains its performance in the FID metric.
We also discussed the impact of introducing prior information on image quality. We estimated the data distribution of different target categories in the training set after VAE encoding and incorporated it into the inference process to enhance the quality of generated underwater images. For instance, a value of 0.99 indicates that we introduced the prior data distribution during inference and performed 99 denoising steps, while 0.95 corresponds to 95 denoising steps. Experiments on the RUOD dataset demonstrate that introducing an appropriate amount of noise during the early stages of the denoising process can generate more realistic underwater images (Fig. 4F). On the UDD dataset, this approach enabled us to reduce the FID metric from 89.9 to 60.6. Although ULGF is designed for underwater object detection, researchers who require higher image realism can consider introducing prior information into the image denoising process (Fig. 5B).
Fig. 5 | A Shows the powerful generation capabilities of ULGF. Unlike mainstream underwater image generation methods (UWCNN, UWNR, and UWGAN), ULGF does not require in-air images as input. ULGF is capable of generating underwater targets with specified layouts and types. Additionally, by providing different random seeds to ULGF, diverse images can be generated. In the demonstration, we randomly generate three sets of images for each layout label. B Illustrates the effect of fusing prior knowledge with noise for image generation, as well as the improvement in detection performance on the test dataset after augmenting the dataset with ULGF. After incorporating prior knowledge, the textures of the targets appear more natural. With the augmented dataset from ULGF, the detector shows a substantial reduction in both false negatives and false positives.
ULGF can generate underwater images with specific styles and layouts
We conducted a detailed evaluation and analysis of the styles of the images input to the ULGF model. First, we assessed the images in the existing dataset, focusing on key attributes such as color bias24, blur, and contrast (Fig. 3A). Through these evaluations, we were able to accurately identify the style characteristics of each image, thereby providing style labels for ULGF. Building on this, we successfully generated underwater images with diverse stylistic features, achieving varied visual effects. During this process, we used ULGF's weighted-loss design for target occlusion regions to train the model and generate underwater images.
During the model generation process, we specified arbitrary sizes and types of targets as conditions and, in combination with the image style information, successfully generated realistic underwater images (Fig. 5A). These images closely resemble real-world underwater environments in terms of visual effect, demonstrating the powerful capabilities of the ULGF model in underwater image generation tasks.
Discussion
Underwater object detection aims to identify and locate target objects, such as aquatic products, personnel, or other targets, by analyzing image features. With advances in underwater detection technology, underwater object detection has broad application prospects in fields such as marine research, environmental protection, and aquaculture25. However, dataset scarcity remains a major bottleneck in underwater object detection. Due to the unique challenges of underwater environments, acquiring sufficient labeled data is both costly and difficult, limiting the diversity of training data and hindering the generalization and accuracy of detection models. Additionally, underwater images are affected by factors such as water quality, lighting, and biological movement, often resulting in substantial variations in image quality. In practical data collection, it is difficult to control the impact of these factors on underwater images, further exacerbating the limitations of insufficient datasets on model performance and leading to poor results in real-world detection26.
In this study, we propose a framework called ULGF for underwater image generation, aiming to improve underwater object detection performance. The framework has been validated on mainstream detectors and datasets. Currently, there is limited work on utilizing layout-guided diffusion models to construct training datasets for underwater object detection27. We have open-sourced the framework and provided a generated underwater dataset with accurate annotations for use by other researchers. This framework supports custom image generation, providing a powerful tool for further development in underwater object detection.
The improvement in detection accuracy brought by ULGF may appear limited in magnitude, but it holds positive significance in practical underwater detection scenarios. The consistency of these improvements across diverse object detectors (from YOLOv5 to RT-DETR) underscores the core value of our ULGF framework: it provides a robust and model-agnostic data augmentation solution. This is particularly valuable for practitioners who need to deploy reliable systems without being constrained by a specific model architecture. In critical missions such as underwater search and rescue, missing even a single target can lead to severe consequences. Any minor enhancement in detection accuracy signifies a markedly higher probability of locating missing divers or wreckage within extensive search areas. Similarly, in large-scale underwater ecological monitoring, where thousands of images or video frames must be processed for species population statistics, an improvement in detection accuracy can substantially enhance the long-term accuracy of population estimates, which is crucial for effective conservation and management strategies.
We evaluate the proposed underwater image generation framework from three aspects: underwater image generation quality, layout accuracy, and detection accuracy. To minimize the impact of parameter settings on the results in detection tasks, we only modify the input underwater dataset during the experiments without changing other parameter configurations. To consider the specific impact of generated data on detection tasks, we maintain consistent detector parameter configurations across different tests to ensure fairness in evaluation. In the layout accuracy assessment, we employ a trained YOLO detector to test all methods in order to eliminate any bias introduced by the selection of the detector. For the evaluation of underwater image generation quality, we compare the images generated by the proposed framework with those generated by other mainstream open-source underwater image generation methods, thereby comprehensively assessing the performance of the framework.
Currently, non-deep learning-based underwater image generation methods do not rely on large amounts of labeled data for training. Therefore, they can effectively generate underwater images in situations where data is scarce or labeled data is difficult to obtain. Furthermore, because these methods are typically based on physical models or rule-based systems, they have low computational overhead and fast processing speed, making them suitable for real-time or resource-constrained applications. However, these methods also have certain limitations, mainly due to their high dependence on condition design. Non-deep learning methods often struggle to adapt effectively to changes across different underwater domains, lacking sufficient adaptability. As a result, the generated underwater images are relatively uniform, lacking diversity and flexibility, and struggle to represent complex, dynamic underwater environments.
Deep learning-based underwater image generation methods can be divided into supervised learning and unsupervised learning approaches. Supervised learning methods rely on the alignment between underwater and aerial images to generate and transform images. However, in practical underwater object detection scenarios, obtaining corresponding aerial images for underwater images is often not feasible. In contrast, unsupervised learning methods, while not requiring corresponding aerial image data, still require a large amount of underwater data to learn the features of underwater scenes (Fig. 3B, C). Within the framework of unsupervised learning, Fig. 3B employs a U-Net network and integrates underwater feature information to render underwater images of diverse styles; whereas Fig. 3C utilizes depth information obtained from real images to generate underwater scenes, with the authenticity of the generated results evaluated by a discriminator. Although such methods reduce the reliance on paired data, ultimately, they still need to learn features from a large volume of underwater data. Consequently, they fail to fundamentally address the bottleneck of data scarcity, and the content of the generated images substantially differs from real underwater scenes.
From a technical implementation perspective, mainstream underwater image generation methods convert in-air images into underwater images to expand the dataset. However, these methods suffer from domain shift during the conversion process, resulting in poor image quality. There are substantial visual differences between in-air and underwater images, making it difficult for the converted underwater images to accurately reflect the characteristics of the underwater environment, thereby affecting the generation quality. In underwater object detection tasks, the content of in-air images cannot effectively guide underwater object detection. In-air images fail to capture the relationship between underwater objects and their environment, with their layout and background being obviously inconsistent with actual underwater scenes. Additionally, generating underwater images based on in-air images requires collecting a large number of real images, which increases the cost of image collection and annotation.
ULGF only requires input text and layout descriptions during inference, avoiding the reliance on in-air images and their corresponding datasets. This approach not only reduces the complexity of data collection but also allows for flexible generation of underwater images that meet specific needs. It effectively mitigates the negative impact of domain shift, improving the quality and applicability of underwater image generation.
We also investigated the generation of images with different styles. To this end, we evaluated attributes such as color bias, blurriness, and contrast in images from the existing dataset and annotated each image with relevant labels. To avoid subjective factors influencing the evaluation results, we invited 8 volunteers to participate in the image assessment and discarded any images with controversial evaluations. We generated images in different styles, demonstrating the framework’s adaptability to various scenes.
Although our model shows good performance in underwater image generation tasks, there is still room for improvement. It must be acknowledged that, despite the theoretically unlimited number of generated images, the data-driven training approach is heavily reliant on the dataset. In particular, when the underwater dataset is small in scale, the model’s performance is affected, leading to a lack of diversity in the generated image styles. Furthermore, the image generation speed slows down as the image size increases. While this has a minimal impact on real-time underwater object detection tasks, it remains a concern. Another area for improvement is that the current model still struggles to generate realistic texture details for complex objects. Specifically, when handling fine details of underwater object surfaces or textures in complex environments, the generated results do not fully replicate the actual scene.
In addition, the training set used in our evaluation contains a relatively small amount of available underwater data, which somewhat limits the performance of the detection model. Our framework has a high requirement for label accuracy in the underwater dataset because when handling unannotated underwater objects, the generated image treats them as background and generates them randomly. Such generated data may interfere with the model’s effective extraction of object features during underwater object detection training, potentially leading to missed detections in real-world scenarios. Therefore, we recommend that when selecting an underwater detection dataset, it is essential to ensure that the dataset has sufficient scene diversity and accurate labels to improve the model’s training effectiveness and detection accuracy.
Future research can focus on customizing diverse underwater generation environments by incorporating factors such as lighting conditions, suspended particle distribution, and other environmental variables, thereby reducing reliance on existing datasets. Although ULGF enhances the feature similarity between generated and real underwater images through underwater prior feature fusion, it does not explicitly incorporate designs tailored to underwater imaging characteristics, which limits further improvements in the quality of the generated underwater images. Currently, research on integrating image generation with real underwater scenes is still in its early stages, offering vast exploration opportunities. One possible direction is to develop a multi-modal training framework that integrates factors such as underwater lighting, visibility, and light scattering, and supports designing underwater images in different styles. This approach could reduce dependence on specific datasets and enhance the detection model's adaptability and detection performance in unknown underwater scenarios. For example, combining physical lighting models with diffusion generation techniques (IC-Light28) could generate more physically realistic underwater images, improving the authenticity and reliability of the data. Additionally, combining targets not included in the dataset with underwater scenes could effectively enhance the robustness and generalization ability of the detection system when facing unknown objects, providing stronger support for the practical application of underwater object detection.
In summary, by leveraging the advantages of layout-guided diffusion in image generation, our method exclusively relies on underwater images to produce high-quality synthetic data, eliminating the limitations of traditional aerial-to-underwater conversion methods. Compared to using aerial images, the ULGF-generated underwater images are more realistic, flexible, and controllable, allowing scene and object layouts to be adjusted as needed. This approach removes the need for manual aerial image collection and annotation, providing a solution to address dataset scarcity in underwater object detection.
Methods
Selection and preprocessing of underwater datasets
When selecting an underwater dataset, it is essential to consider the specific needs of the detection task, ensuring that the dataset includes annotations for diverse object categories. Some underwater datasets focus on image enhancement, aiming to improve image quality through denoising and contrast adjustment. However, these datasets contain fewer samples and are not directly applicable to object detection tasks, making them unsuitable for our needs. Our framework is designed for object recognition and localization within an image, differing from certain underwater datasets that rely on paired reference images for processing. Consequently, datasets specifically designed for underwater object detection are better suited to our task requirements.
In this study, we trained our model using the object detection dataset RUOD (Fig. 4C), which contains 9800 training images and 4200 test images, with a total of 74,903 detection targets across 10 underwater categories. In the first step of this study, we constructed an underwater image generation dataset based on the object detection dataset. This dataset consists of triplets in the format {image, description, mask}. The image component consists of underwater training images featuring diverse underwater scenes and target objects. The description provides a text-based account of the scene and the target locations. We generate descriptions from the object detection labels and evaluate them with Qwen2.5-VL to demonstrate their reliability. The target positions are encoded using Geo's bounding-box method, enabling text-driven image generation with precise control over target placement (Fig. 2B). The mask is used in the framework's training loss, and in the latent space the target box position is adjusted by rounding to ensure accurate alignment between the generated image and the target box. We divide the image into a grid of sub-regions and assign a unique location token to each, so that the top-left and bottom-right corners of every target can be anchored by their corresponding tokens.
To systematically evaluate the generalization performance of the ULGF model in complex underwater scenarios, this study introduces the Underwater Open-sea Farm Object Detection Dataset (UDD). This dataset is specifically designed to enhance object recognition capabilities in open aquaculture environments, covering three key economic species: sea cucumber, sea urchin, and scallop, with a total of 2227 high-definition images (1827 for training and 400 for validation). All data were collected in real open-sea aquaculture farms using 4K cameras mounted on underwater robots or carried by divers, fully capturing the complex environmental characteristics of underwater grasping operations. The experiments in this study adopt parameter settings and evaluation procedures entirely consistent with the RUOD dataset to ensure the validity and consistency of cross-dataset comparisons.
We designed a standard data processing pipeline for training. Input images are horizontally flipped with a probability of 0.5 to enhance model robustness. The pixel values of the images are then normalized, and images with incorrect aspect ratios are padded to the required size. For each image, a maximum of 22 boxes are annotated, with any excess boxes discarded. During training, some unlabeled images are used to assist in training ULGF.
Image generation strategy
The generation strategy is designed based on the actual scene. In this framework, we reference the label distribution of the training set to design the types and layout of the generated images. However, this does not mean the generation scheme is fixed. In practice, our layout is flexible. If small or dense target detection is required, our framework can generate a large number of relevant and realistic images. The image generation scheme is as follows: targets with a box area smaller than 0.0001 are discarded, as such small areas cannot effectively represent the target’s features. Additionally, a horizontal flip with a probability of 0.5 is applied to create detection box labels, ensuring image diversity. The number of generated images in detection tasks matches that of the training set. To ensure reproducibility, a fixed random seed is set for each image.
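A minimal sketch of this generation scheme, assuming YOLO-style normalized coordinates; the whole-layout flip and per-image seed handling illustrate the described strategy rather than reproduce the released code.

```python
import random

MIN_BOX_AREA = 0.0001  # normalized-area threshold from the generation scheme

def make_generation_layout(label_boxes, seed):
    """Build the layout for one synthetic image from training-set labels.

    label_boxes: list of (class_name, x1, y1, x2, y2) with coordinates
    normalized to [0, 1]. Flipping the whole layout at once is one
    reading of the description; per-box flipping would also be possible.
    """
    rng = random.Random(seed)     # fixed seed per image for reproducibility
    flip = rng.random() < 0.5     # horizontal flip with probability 0.5
    layout = []
    for cls, x1, y1, x2, y2 in label_boxes:
        # Discard boxes too small to carry useful target features.
        if (x2 - x1) * (y2 - y1) < MIN_BOX_AREA:
            continue
        if flip:
            x1, x2 = 1.0 - x2, 1.0 - x1
        layout.append((cls, x1, y1, x2, y2))
    return layout

# One layout per training-set label file, each with its own seed.
layout = make_generation_layout([("echinus", 0.10, 0.20, 0.30, 0.40)], seed=0)
```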
Network architecture and training
Our layout-guided diffusion-based underwater image generation framework consists of three parts: underwater description encoding, image information compression, and image denoising generation. This framework is based on the Latent Diffusion Model (LDM), with the underwater text description module using CLIP29. CLIP has gained multimodal learning ability through training on large-scale image-text pairs, enabling it to understand complex natural language and efficiently match it with image content. Image information compression addresses the issue of slow training and generation speed for large images. We use a Variational Autoencoder30 (VAE) to compress the original images into latent space, reducing the image dimensions while preserving key features. We use the encoded underwater description as a generation condition, with the compressed image in latent space as input, and train the image generation model using a U-Net network (Fig. 1B).
We trained for 60 epochs. During training, we froze the VAE components and used pre-trained weights to focus on fine-tuning other parts of the model, reducing unnecessary computation. With the VAE fixed, we fine-tuned all parameters of the U-Net using the AdamW optimizer and a cosine learning rate schedule. The learning rate for U-Net was set to 1.5e−4, with a linear warm-up of 3000 steps to gradually increase the learning rate. This approach helped stabilize the training process and improve convergence. The batch size was set to 16, which was the optimal value that could be handled by our NVIDIA RTX 3090. During training, there was a 10% chance of replacing the text prompt with an empty string. This encouraged the model to generate unconditional images, reducing over-reliance on text prompts and enhancing the diversity and flexibility of the generated images.
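A condensed training-step sketch under the hyperparameters above. It assumes a Stable Diffusion-style LDM assembled from Hugging Face diffusers/transformers components (the model_id, scaling factor, and tokenizer settings are illustrative); the released ULGF code may differ in detail. The mask-weighted loss corresponds to the loss described in the next section, and the 10% prompt dropout is handled in the preprocessing sketch shown earlier.

```python
import torch
from torch.optim import AdamW
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from diffusers.optimization import get_cosine_schedule_with_warmup
from transformers import CLIPTextModel, CLIPTokenizer

# Pre-trained components stand in for the LDM backbone; repository id is illustrative.
model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)           # VAE frozen, as described above
text_encoder.requires_grad_(False)  # only the U-Net is fine-tuned here

steps_per_epoch = 9800 // 16        # RUOD training images / batch size
optimizer = AdamW(unet.parameters(), lr=1.5e-4)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=3000, num_training_steps=60 * steps_per_epoch)

def training_step(images, prompts, masks):
    """images: (B, 3, H, W) in [-1, 1]; masks: (B, 1, h, w) per-pixel loss weights."""
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
        tokens = tokenizer(prompts, padding="max_length", truncation=True,
                           max_length=tokenizer.model_max_length,
                           return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids).last_hidden_state

    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample

    loss = (masks * (noise_pred - noise) ** 2).mean()  # mask-weighted MSE
    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    return loss.detach()
```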
Loss function design
The diffusion model learns the data distribution of real images by progressively denoising random noise sampled from a normal distribution31. To learn how to gradually recover images from noise, we add noise at different time steps to the real images (Fig. 1A). Let \({x}_{t}\) represent the noisy image, \({x}_{0}\) be the real image, \({\epsilon }_{t}\) be the noise, and \({\alpha }_{t}\) be the noise coefficient. The noisy image can be represented as:
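$$x_{t}=\sqrt{\alpha_{t}}\,x_{0}+\sqrt{1-\alpha_{t}}\,\epsilon_{t}$$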
The training process of the diffusion model is the process of predicting and separating the noise in the noisy images. The loss function is designed to minimize the difference between the predicted noise and the actual added noise. In this study, let \(\varepsilon\) denote the noise added to the image, \({\epsilon }_{\theta }\) the predicted noise distribution, \({z}_{t}\) the noisy latent input, \({\tau }_{\theta }\) the conditional encoder, \(y\) the introduced condition, \(\delta\) the variational autoencoder’s encoding function, \(x\) the input image, \(t\) the time step, \(\odot\) the pixel-wise multiplication operator, and \(M\) the per-pixel loss weight. The loss function can be expressed as:
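$$L=\mathbb{E}_{\delta(x),\,y,\,\varepsilon\sim\mathcal{N}(0,1),\,t}\left[\left\Vert M\odot\left(\varepsilon-\epsilon_{\theta}\left(z_{t},t,\tau_{\theta}(y)\right)\right)\right\Vert_{2}^{2}\right]$$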
However, in underwater object detection tasks, directly making indiscriminate predictions on random noise is clearly unreasonable. The reason is that we prefer the model to learn the specific features of different types of targets rather than the environment. In detection tasks, we are detecting targets, so it is necessary to assign higher weight to the targets and reduce the focus on background information. We refer to the foreground and background weighting method in Geo, where smaller targets are assigned higher weights.
During the training of ULGF, we found that designing loss functions to make the model focus more on foreground targets can improve object detection performance. However, underwater scenes contain numerous overlapping and occluded objects, making the weight design for overlapping areas of multiple targets a key consideration. Directly assigning different weights to foreground objects makes it difficult to balance the weight distribution in occluded areas. To enhance the model’s ability to handle occlusion, we investigated this issue. For pixels belonging to three or fewer objects, we randomly assign weights based on the number of object categories while considering their physical occlusion relationships. When a pixel belongs to too many objects, we randomly assign a single object’s weight to that pixel. This strategy ultimately achieved the highest accuracy in image layout (Fig. 4E). In the figure, “min” represents selecting the smallest weight among multiple targets in the occluded area, “max” represents selecting the largest weight among multiple targets, “mean” represents taking the average weight of all targets in the area, and “random” refers to our ultimately chosen solution of randomly assigning weights.
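A sketch of the "random" weight-assignment strategy for occluded pixels; the front-to-back sampling bias is an assumption for illustration, since the text only states that physical occlusion relationships are taken into account.

```python
import random

def assign_occlusion_weight(pixel_weights, rng=random):
    """Pick one loss weight for a pixel covered by several targets.

    pixel_weights: candidate weights of every object covering the pixel,
    ordered from nearest (occluding) to farthest object. The three-object
    threshold follows the paper; the biased sampling is illustrative.
    """
    if len(pixel_weights) <= 3:
        # Few overlapping objects: sample among them, biased towards the
        # front-most objects to respect physical occlusion.
        bias = [1.0 / (i + 1) for i in range(len(pixel_weights))]
        return rng.choices(pixel_weights, weights=bias, k=1)[0]
    # Heavily overlapped pixel: fall back to a single random object's weight.
    return rng.choice(pixel_weights)
```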
Underwater image inference
The inference process involves progressively denoising a normal distribution noise to generate a real underwater image (Fig. 1C). It has been observed that the difference between the noisy images during training and the noise in inference is substantial, resulting in poor realism of the generated objects and the emergence of strange structures. We drew inspiration from the prior noise construction in object segmentation tasks32 and designed a prior image inference method specifically for underwater object detection.
We perform feature extraction on the images encoded by the VAE; specifically, we group the bounding boxes by target class according to the labels32 (Fig. 4A). We extract the feature information of each target from the encoded latent space (Fig. 1D). By selecting a high-dimensional space for information extraction, we can obtain more refined target features. We average the features of each target to reduce the influence of other types of targets and the environment that may exist in the occluded areas. Let the class be denoted as \({cls}\), \(P\) be the set of latent pixels belonging to \({cls}\), and \({num}\) be the total number of pixels in this class. The prior feature of \({cls}\) (denoted as \(N\)) can be represented as:
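$$N=\frac{1}{num}\sum_{p\in P}p$$

where each \(p\) is the latent feature value at a pixel belonging to \({cls}\).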
In the inference process, we extract the specific underwater scene and target prior features, and then fuse them with the underwater layout prompts and random normal distribution noise. The fused image, awaiting denoising, already contains the geometric information guidance. When combined with the text prompt guidance, it generates a realistic image.
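A sketch of this fusion step, assuming the standard img2img/SDEdit recipe of noising the prior latent map to an intermediate point of the schedule before denoising; the blending rule and the fusion-level convention of Fig. 4F are assumptions about an equivalent formulation, and the scheduler call follows the diffusers API.

```python
import torch

def fuse_prior_with_noise(prior_latent, scheduler, fusion_level=0.99):
    """Blend the class-prior latent map with Gaussian noise before denoising.

    fusion_level = 0.99 means denoising starts at 99% of the noise
    schedule (99 of 100 inference steps), following Fig. 4F.
    """
    num_steps = scheduler.config.num_train_timesteps
    t_start = int(fusion_level * num_steps) - 1
    noise = torch.randn_like(prior_latent)
    # Noise the prior map up to t_start, then denoise from there so the
    # early steps already contain the geometric/appearance guidance.
    noisy_start = scheduler.add_noise(
        prior_latent, noise, torch.tensor([t_start]))
    return noisy_start, t_start
```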
Underwater description production
We illustrate how the prompt works through pseudocode, as shown in Algorithm 1. First, initialize a list of location tokens to convert continuous coordinates into discrete symbols. Then, iterate through each bounding box, and after validating the data, convert the top-left and bottom-right coordinates into their corresponding location tokens. The description of each object is formed by combining the category name with the two location tokens. Finally, all object descriptions are joined with commas and prefixed with “An underwater scene image with” to form the complete scene description text. This conversion result is suitable for training vision-language models or serving as output for image understanding systems.
Algorithm 1
Convert Bounding Box Set to Text Description
Input:
- bboxes: List of bounding boxes, each element is [x1, y1, x2, y2, class_name]
- num_buckets: Number of buckets for position tokens
Output:
- Text string describing an underwater scene
Begin:
1. position_tokens = GeneratePositionTokenList(num_buckets)
2. Validate bbox count does not exceed preset limit
3. result_list = empty list
4. For each bbox ∈ bboxes do:
   a. Validate bbox class and coordinate values legality
   b. top_left_token = CoordinateToToken(bbox[0], bbox[1], position_tokens)
   c. bottom_right_token = CoordinateToToken(bbox[2], bbox[3], position_tokens)
   d. object_description = bbox[4] + top_left_token + bottom_right_token
   e. result_list.Append(object_description)
   End for
5. final_text = "An underwater scene image with " + Join(result_list, separator = ", ")
6. Return final_text
End
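A runnable Python rendering of Algorithm 1. The `<Lk>` token format follows the layout example given earlier, the 22-box limit follows the preprocessing pipeline, and the single-token-per-corner bucketing over a square grid is an illustrative choice rather than the exact Geo encoding.

```python
def generate_position_tokens(num_buckets):
    """One discrete location token per bucket, e.g. <L0> ... <L1023>."""
    return [f"<L{i}>" for i in range(num_buckets)]

def coordinate_to_token(x, y, tokens):
    """Map a normalized (x, y) in [0, 1] to a single location token.

    Assumes a square n x n grid of buckets; the paper's Geo-style
    encoding may discretize x and y differently.
    """
    n = int(len(tokens) ** 0.5)
    col = min(int(x * n), n - 1)
    row = min(int(y * n), n - 1)
    return tokens[row * n + col]

def bboxes_to_description(bboxes, num_buckets=1024, max_boxes=22):
    """bboxes: list of [x1, y1, x2, y2, class_name] with normalized coords."""
    tokens = generate_position_tokens(num_buckets)
    assert len(bboxes) <= max_boxes, "too many boxes for one image"
    parts = []
    for x1, y1, x2, y2, cls in bboxes:
        assert 0.0 <= x1 <= x2 <= 1.0 and 0.0 <= y1 <= y2 <= 1.0
        top_left = coordinate_to_token(x1, y1, tokens)
        bottom_right = coordinate_to_token(x2, y2, tokens)
        parts.append(f"{cls} {top_left}{bottom_right}")
    return "An underwater scene image with " + ", ".join(parts)

print(bboxes_to_description([[0.1, 0.2, 0.4, 0.6, "cuttlefish"]]))
```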
Computational requirements
Our framework requires at least 22 GB of GPU memory for training, while only 5 GB of GPU memory is needed for inference. This ensures that most personal computers or laptops with dedicated GPUs can handle the computational requirements of our framework. Additionally, during underwater object detection training, with the batch size and image size remaining unchanged, the additional burden imposed by our model is minimal. We have conducted a series of comparisons with current mainstream object detection frameworks.
We evaluate our framework’s performance in generating underwater object detection datasets from three key aspects. The first aspect focuses on improvements in underwater object detection performance. To ensure fair evaluation, we train the underwater image generation model using only the training set of the object detection dataset, preventing it from learning information from the validation set scenes. The mean Average Precision (mAP) serves as the evaluation metric for this aspect.
Furthermore, we assess the quality of the generated underwater images using the Fréchet Inception Distance (FID) as the evaluation metric. For methods based on physical models and those using unpaired image generation, we compare the synthesized underwater dataset with our reference underwater dataset. For approaches that rely on paired underwater-aerial images, we evaluate the generated underwater images against their corresponding ground-truth training images.
Additionally, we employ the mAP metric to evaluate the layout accuracy of the generated images. In this design, a higher mAP score indicates greater precision in the generated images and a closer resemblance to the real dataset.
Data availability
In terms of data availability, the datasets and models generated or used in this study have been made available in the following ways: The RUOD dataset can be accessed at https://github.com/dlut-dimt/RUOD. The UDD dataset is available at https://github.com/chongweiliu/udd_official. The underwater image generation dataset, the pre-trained detection model and weights related to ULGF are available from the corresponding author on reasonable request.
Code availability
The code for ULGF can be found at the following URL: https://github.com/maxiaoha666/ULGF.
References
Wu, Z. et al. Self-supervised underwater image generation for underwater domain pre-training. IEEE Trans. Instrum. Meas. 73, 5012714 (2024).
Zhou, J. et al. AMSP-UOD: when vortex convolution and stochastic perturbation meet underwater object detection. Proc. AAAI Conf. Artif. Intell. 38, 7659–7667 (2024).
Zeng, L., Sun, B. & Zhu, D. Underwater target detection based on Faster R-CNN and adversarial occlusion network. Eng. Appl. Artif. Intell. 100, 0952–1976 (2021).
Sun, B., Zhang, W., Xing, C. & Li, Y. Underwater moving target detection and tracking based on enhanced You Only Look Once and deep simple online and real-time tracking strategy. Eng. Appl. Artif. Intell. 143, 0952–1976 (2025).
Cao, J. et al. Unveiling the underwater world: CLIP perception model-guided underwater image enhancement. Pattern Recognit. 162, 111395 (2025).
Fu, Z., Wang, W., Huang, Y., Ding, X. & Ma, K. Uncertainty inspired underwater image enhancement. Proceedings of the European Conference on Computer Vision 465–482 (Springer, 2022).
Zhang, W. et al. Underwater image enhancement via minimal color loss and locally adaptive contrast enhancement. IEEE Trans. Image Process. 31, 3997–4010 (2022).
Peng, L., Zhu, C. & Bian, L. U-shape transformer for underwater image enhancement. IEEE Trans. Image Process. 32, 3066–3079 (2021).
Fu, C. et al. Rethinking general underwater object detection: datasets, challenges, and solutions. Neurocomputing 517, 243–256 (2023).
Chen, K. et al. GeoDiffusion: text-prompted geometric control for object detection data generation. In Proceedings of International Conference on Learning Representations, 1–12 (ICLR, 2024).
Redmon, J. et al. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (IEEE, 2016).
Liu, H., Song, P. & Ding, R. WQT and DG-YOLO: towards domain generalization in underwater object detection. Preprint at https://arxiv.org/abs/2004.06333 (2020).
Li, C. et al. YOLOv6: a single-stage object detection framework for industrial applications. Preprint at https://arxiv.org/abs/2209.02976 (2022).
Zhao, Y. et al. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16965–16974 (IEEE, 2024).
Li, C. et al. An underwater image enhancement benchmark dataset and beyond. IEEE Trans. Image Process. 29, 4376–4389 (2019).
Zhang, F. et al. Atlantis: enabling underwater depth estimation with stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11852–11861 (IEEE, 2024).
Bynagari, N. B. GANs trained by a two time-scale update rule converge to a local nash equilibrium. Asian J. Appl. Sci. Eng. 8, 25–34 (2019).
Li, C., Anwar, S. & Porikli, F. Underwater scene prior inspired deep underwater image and video enhancement. Pattern Recognit. 98, 0031–3203 (2020).
Wang, N. et al. UWGAN: Underwater GAN for real-world underwater color restoration and dehazing. Preprint at https://arxiv.org/abs/1912.10269 (2019).
Ye, T. et al. Underwater light field retention: neural rendering for underwater imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 488–497 (IEEE, 2022).
Salimans, T. et al. Improved techniques for training GANs. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 2234–2242 (2016).
Szegedy, C. et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9 (IEEE, 2015).
Wang, Z. et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004).
Yan, J. et al. Underwater image enhancement via multiscale disentanglement strategy. Sci. Rep. 15, 6076 (2025).
Liu, K. et al. A maneuverable underwater vehicle for near-seabed observation. Nat. Commun. 15, 10284 (2024).
Xu, S. et al. A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing 527, 204–232 (2023).
Rombach, R. et al. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10684–10695 (IEEE, 2022).
Zhang, L., Rao, A. & Agrawala, M. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In Proceedings of the 13th International Conference on Learning Representations (ICLR, 2024).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, 8748–8763 (PMLR, 2021).
Kingma, D. P. & Welling, M. Auto-encoding variational bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Gao, H. et al. SCP-Diff: spatial-categorical joint prior for diffusion based semantic image synthesis. Proceedings of the European Conference on Computer Vision 37–54 (Springer Nature Switzerland, 2024).
Ronneberger, O., Fischer, P. & Brox, T. U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention MICCAI 2015, 234–241 (Springer Int. Publ., 2015).
Acknowledgements
This research was supported in part by the National Natural Science Foundation of China (62403108, 42301256), the Liaoning Provincial Natural Science Foundation Joint Fund (2023-MSBA-075), the Aeronautical Science Foundation of China (20240001042002), the Scientific Research Foundation of Liaoning Provincial Education Department (LJKQR20222509), the Fundamental Research Funds for the Central Universities (N2426005), the Science and Technology Planning Project of Liaoning Province (2023JH1/11200011 and 2024JH2/10240049).
Author information
Authors and Affiliations
Contributions
Y.Z.: Data management, annotation of training datasets, data analysis, and interpretation. L.M.: Data management and preparation, algorithm development and validation, comparative experiments, data analysis and interpretation, and manuscript drafting. J.L.: Comparative experiments and manuscript drafting. Y.X.: Algorithm design and implementation, experimental validation, and data visualization. B.C.: Method design and supervision of the research process. L.L.: Project management, funding acquisition, resource provision, and manuscript review. C.W.: Validation of experimental results and critical feedback on the manuscript. W.C.: Data curation and model performance evaluation. Z.L.: Investigation and assistance with manuscript revision.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Engineering thanks Fenglei Han, Ajisha Mathias and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Philip Coatsworth.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhuang, Y., Ma, L., Liu, J. et al. A diffusion model-based image generation framework for underwater object detection. Commun Eng 5, 22 (2026). https://doi.org/10.1038/s44172-025-00579-z