Background & Summary

The Great Barrier Reef (GBR)1 is an invaluable ecological treasure, housing intricate and delicate deep-sea ecosystems that play a pivotal role in maintaining the region’s marine biodiversity2. As these ecosystems face increasing threats from climate change, habitat destruction, and anthropogenic activities3, understanding and preserving them have become paramount goals for marine conservation efforts4. Advanced tools for comprehending the complexities of these underwater environments, such as automated object (biota) detection5, have the potential to provide a better understanding of the deep-sea ecosystem more rapidly and efficiently.

Analysing images from remotely operated vehicles (ROVs) is a complex process that requires human expertise in marine biology, geology, and environmental dynamics. Skilled analysts regularly identify and categorise marine organisms, recognise geological features, and detect anomalies; however, the sheer volume of data accumulated from ROVs makes expert analysis infeasible. Software tools such as Coral Point Count with Excel extensions (CPCe)6, photoQuad7, Biigle8, and Squidle+9 have been used by experts to assist in the manual annotation of imagery with minimal to no automation in the annotation process. Some examples of manual and skilled data annotations from the literature are as follows. Schoening et al.10 annotated 1340 images taken through ROVs and towed cameras with the help of five expert annotators using Biigle. The authors estimated the abundances of 20 fauna categories around the DISCOL (disturbance and recolonisation experiment) in a manganese nodule area of the deep South Pacific11. Wright et al.12 manually annotated over 5,665 images of the One Tree Island shelf edge in the GBR to analyse changes in benthic community composition. Pawlik et al.13 visually analysed quadrat point-intercept data to calculate sponge cover in Caribbean mesophotic reefs. Bridge et al.14 analysed 718 images taken by an autonomous underwater vehicle (AUV) across transects around Hydrographers Passage in the GBR and manually identified 27 categories of macrofauna. Sih et al.15 analysed multibeam sonar data and baited ROV video footage across 42 sites in the central GBR and examined relationships between deep-reef fish communities and benthic habitat structure. They classified sponges, corals, macro-algae, reef fish, and other substrates. These methods are time-consuming and prone to bias, as only small, manageable representative samples of data are analysed due to time limitations16. Combining manual analysis with automation can improve the efficiency of data analysis, allowing the use of larger data samples and hence improving our understanding of the deep sea17.

In the literature, several machine learning models have been developed for the automatic classification of deep-sea flora and fauna. Lopez-Vazquez et al.18 have operated the crawler "Wally" (an Internet Operated Vehicle) at Barkley Canyon (Vancouver, Canada) at 870 m depth since 2009. In that study, the group collected over six thousand image samples that were classified into 10 classes using the NEPTUNE Canada Marine Life Field Guide19. They evaluated the raw images using image-quality metrics and designed an underwater image enhancement pipeline to mitigate problems such as image degradation. Furthermore, they reported that eight classical machine learning models improved the classification pipeline, achieving an accuracy of 88.64% on independent datasets. Maka Niu is a low-cost imaging and analysis tool designed collaboratively for deep-sea exploration in Hawaii20. The imagery and sensor data were uploaded to Tator21 for analysis, and the models gave good performance for object detection. Tator is an online video annotation platform based on RetinaNet and Detectron222. Williams et al.16 evaluated CoralNet, a machine learning-based image analysis tool23, for the automation of coral reef benthic cover estimation using image data from NOAA (National Oceanic and Atmospheric Administration)24. The results showed that CoralNet correlates strongly with human analysts in the main Hawaiian Islands and American Samoa, indicating that CoralNet can improve the efficiency and consistency of coral reef monitoring programs. All these methods are based on point annotations and perform well at predicting cover for benthic communities; however, they perform poorly at predicting, or cannot fully predict, the presence and absence of benthic communities.

Deep learning is a machine learning methodology that has been prominently used in computer vision problems such as image (object) classification25 and has also been applied to classifying underwater marine environments26,27,28. Hence, object detection holds immense potential for identifying and quantifying the diverse benthic organisms29 and substrate types that characterise the deep-sea ecosystems of the GBR. Object detection not only facilitates the creation of detailed habitat maps but also aids in conducting thorough biodiversity assessments30,31. Furthermore, it offers essential data that enable scientists to evaluate the health and resilience of these ecosystems, providing a foundation for evidence-based conservation strategies. However, the success of object detection tools in this context hinges on the availability of accurate and comprehensive labelled datasets. Zurowietz et al.32 developed a machine learning assisted image annotation method (MAIA) for environmental monitoring and exploration using autoencoder networks and mask region-based convolutional neural networks. The model greatly improved annotation speeds; however, it is limited to binary classification, considering only the classes shell and animal. Dawkins et al.33 created an open-source computer vision software platform called Video and Image Analytics for Marine Environments (VIAME) for automating the image analysis process, with applications in marine imagery or any other type of video analytics. The platform successfully classified 12 types of sea creatures with good accuracy. The unique challenges posed by underwater imaging, such as variable lighting conditions, species diversity and life characteristics, and imaging resolution, require machine learning models to be robust and capable of generalising well beyond their training data31,34. Consequently, a substantial, quality-controlled and diverse dataset becomes an indispensable asset for developing models that can effectively identify and classify various objects amidst the complexities of real-world underwater scenarios.

In this paper, we introduce the Deepdive dataset, a deep-sea ROV object detection dataset that we meticulously curate and label through a human-in-the-loop35 approach to ensure high accuracy. The Deepdive dataset comprises a large collection of images spanning many classes of benthic biota. In the data curation process, we manually label 4158 images across 62 classes by leveraging data acquired by ROVs equipped with high-resolution imaging systems, obtaining a dataset that features class imbalance. Hence, this study showcases the practical applications of image classification techniques in marine conservation and habitat mapping. We provide a comparison of prominent pre-trained deep learning models for classification, including ResNet, DenseNet, Inception, and Inception-ResNet, for benchmarking the dataset.

Methods

Problem Background

Underwater imaging presents a unique set of challenges arising from environmental and technical factors. Various aspects can significantly impact the quality of underwater images20, such as species characteristics, environmental conditions, substrate properties, and equipment (machine) limitations. Environmental factors such as water turbidity, lighting conditions, and currents can decrease image clarity. Certain types of substrates, such as sand and coral, can affect image contrast and sharpness. Hardware and software limitations such as camera resolution and autofocusing capabilities are crucial to image quality. For instance, low camera resolution leads to poor image quality, while lack of advanced autofocusing for mobile objects typically results in blurring. Additionally, the amount of natural light penetration, often limited in deeper waters, affects the overall image brightness. Species characteristics, such as the behaviour and mobility of underwater organisms, impose challenges on the acquisition of quality image-based data. Noise, caused by factors such as species sensitivity to sudden light changes, can further degrade image quality. Therefore, underwater imaging requires careful consideration of these challenges to capture accurate and informative visuals of aquatic environments and their inhabitants. Figure 1 shows a subset of these images that are free of blemishes, and Fig. 2 presents cases of data samples that are of lower quality due to various technical and biological factors. All of these images were extracted from videos of ROV SuBastian dives that explored the Ribbon Reef canyons in the submarine canyon system of the outer GBR in 202036,37. The ROV dive video and photo data are publicly available on National Computational Infrastructure (NCI) Australia37. Ref. 36 describes the full deployment details for the research vessel R/V Falkor cruises on which these data were collected. Figure 6 shows the location of the dives and the types of transects the ROV took through six of the canyons of the reef. Table 2 details the six dive videos, imaged at varying depths using a SULIS Subsea Z71 4K (12X zoom) science camera mounted on ROV SuBastian. We extracted frame grabs from all these videos at 10-second intervals, obtaining 1546 images.
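
The frame-extraction step can be reproduced with standard video tools. The following is a minimal sketch, assuming OpenCV and a local copy of one of the dive videos; the file and output names are hypothetical.

```python
# Minimal sketch: extract one frame grab every 10 seconds from a dive video.
# The video file name and output naming scheme are illustrative assumptions.
import cv2

capture = cv2.VideoCapture("rov_subastian_dive.mp4")  # hypothetical file name
fps = capture.get(cv2.CAP_PROP_FPS)
step = int(fps * 10)  # number of frames corresponding to a 10-second interval

frame_idx, saved = 0, 0
while True:
    success, frame = capture.read()
    if not success:
        break
    if frame_idx % step == 0:
        cv2.imwrite(f"frame_{saved:04d}.png", frame)  # save the frame grab
        saved += 1
    frame_idx += 1
capture.release()
```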

Fig. 1
figure 1

Arbitrary sample of images from the dataset showing the varying colours and textures in the image foreground and background, taken from the ROV video footage. The images were extracted from36,37.

Fig. 2
figure 2

Objects in the data that were unclear due to various environmental and technical factors. Images extracted from36,37.

Table 1 The dataset contains multiple classes of objects, and the number of images corresponding to each class is recorded.
Table 2 ROV SuBastian recorded multiple transects across the Ribbon Reef 5 canyons.

Human-in-the-loop

Human-in-the-loop (HITL) machine learning is a collaborative approach that leverages the complementary strengths of humans and machines in problem-solving35. This approach is widely employed across diverse applications, including image classification38,39, natural language processing40, and medical diagnosis41. Human involvement starts with collecting, labelling, cleaning, and organising data for machine learning. Human experts then provide feedback and expertise for iterative model training and evaluation. HITL approaches have several advantages over traditional machine learning methods, including higher accuracy, improved transparency, and reduced bias. However, they also have some drawbacks, such as increased development and maintenance costs, complexity in designing human-machine interfaces, and the need for human feedback, which can limit scalability35. Continuous improvement of a model’s performance largely depends on this feedback loop. HITL models have been used in the context of segmentation of Earth imagery42 as well as artificial intelligence-assisted annotation of coral reef orthoimages43.

Manual data labelling

In this study, we classify all objects into three broad categories: biota, general unknown biology, and non-living objects, following the categories described in Version 1.4 of the Collaborative and Automated Tools for Analysis of Marine Imagery (CATAMI) classification scheme44. The CATAMI classification system is a hierarchical taxonomy-based approach developed to classify marine habitats and biota observed in underwater imagery and videos. The system provides a framework for analysing individual points within images or videos, accommodating the limitations of identifying biota solely from imagery, such as the need for broad morphological categories for certain taxa like sponges. We classify the data in this study into 62 categories derived from 4158 images, covering different biota, general unknown biology, and non-living objects. Of these classes, 59 were biota. We created the general unknown biology class to include any biota we could not categorise. The remaining 2 classes were non-living and included coral rubble and wooden debris. Table 1 shows the number of instances of each class in the dataset. We used Squidle+45, a web-based image annotation tool, to upload and annotate the data, and then downloaded each of the images and cropped the identified objects into individual image instances. This was necessary because some frame grabs contained multiple classes of objects, whereas our dataset contains individual instances of each class per image. An exception is the tubeworms class, where it was not possible to separate individual objects due to their small size.
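
As an illustration of the cropping step, the sketch below cuts each annotated object out of its frame grab into an individual image instance. It assumes the Squidle+ annotations have been converted to a simple list of (image path, label, bounding box) tuples; this format and the example values are hypothetical.

```python
# Illustrative cropping of annotated objects into per-instance images.
# The annotation tuples below are hypothetical examples.
from PIL import Image

annotations = [
    ("frame_0001.png", "Sponges_Massive_forms", (120, 340, 410, 620)),  # (left, upper, right, lower)
    ("frame_0002.png", "Quill_Sea_Pens", (55, 100, 230, 480)),
]

for i, (image_path, label, box) in enumerate(annotations):
    frame = Image.open(image_path)
    crop = frame.crop(box)              # one object instance per output image
    crop.save(f"{label}_{i:04d}.png")   # file name encodes the class label
```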

We engage two research assistants (experts) to check each image label and reduce the uncertainty and error in the labelling process. The first expert marks and labels each object on the image in Squidle+, and the second expert then reviews the label in the next stage, while cropping out individual instances of each object from the frame grab. Additionally, uncertainty in species identification is accommodated by assigning labels at a coarser level within the hierarchy; species that could not be confidently identified were therefore placed at a higher level within the hierarchy. We exclude images that are blurred or affected by strong external influences (e.g., sand obstruction while the ROV is moving or sampling). Using the HITL approach described above, we curated 62 classes of objects; however, to reduce the extreme imbalance in the dataset, we kept only 33 classes. Figure 1 presents a subsample of the images from the dataset labelled using the CATAMI classification.

Deep learning models

We select multiple image classifiers based on convolutional neural network (CNN)46 architectures that have been pre-trained using large datasets. These models leverage the power of transfer learning, which has become an integral part of computer vision tasks47,48,49 such as image classification47,50, object detection51,52, and image segmentation53, achieving remarkable performance.

The ResNet (Residual Network) model revolutionised deep learning by introducing skip connections that enable the training of extremely deep neural networks54. ResNet’s residual connections ensure valuable information is efficiently propagated through the layers. Traditional deep learning models faced a limitation where adding more layers might lead to degradation in performance due to the vanishing gradient problem55. ResNet’s skip connections allow gradients to flow directly through the network, facilitating the training of networks with hundreds or even thousands of layers. By alleviating the degradation issue, ResNet enables the construction of deeper networks that can capture increasingly complex features, leading to state-of-the-art performance on various computer vision tasks such as image classification56, object detection57, and semantic segmentation53.
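
For illustration, a minimal residual block can be written in Keras as follows; the filter sizes are illustrative and not those of any specific ResNet variant, and the input is assumed to already have the same number of channels as the block.

```python
# Sketch of a residual block: the input is added back to the transformed output
# (skip connection), letting gradients flow directly through the network.
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                   # assumes x already has `filters` channels
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])                # skip connection
    return layers.Activation("relu")(y)
```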

DenseNet58 (Densely Connected Convolutional Networks) presents a pioneering architecture that fosters maximum information flow between layers by establishing dense connections. DenseNet’s densely connected blocks promote feature reuse and gradient flow, leading to better training efficiency and overall performance59. Unlike traditional CNNs, where each layer is connected only to its subsequent layer, DenseNet connects each layer to every other layer in a feedforward fashion within a dense block. This dense connectivity pattern encourages feature reuse and facilitates gradient flow throughout the network, mitigating vanishing gradient issues and promoting feature propagation58. Furthermore, DenseNet’s compact representation and parameter efficiency make it particularly well-suited for tasks with limited training data, as it enables the model to leverage information from earlier layers more effectively60. The intricate interconnections within DenseNet lead to impressive performance gains across various domains, including image classification61 and object detection62.
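
The dense connectivity pattern can be sketched in Keras as below; the growth rate and number of layers are illustrative.

```python
# Sketch of a dense block: each layer's output is concatenated with all
# previously computed feature maps, encouraging feature reuse.
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    for _ in range(num_layers):
        y = layers.Conv2D(growth_rate, 3, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, y])   # dense connection to all earlier layers
    return x
```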

The Inception Network is a deep CNN recognised for its novel inception module, which efficiently captures features in data across multiple spatial scales63. The inception module employs parallel convolutional operations with varying kernel sizes within each layer, enabling the network to simultaneously learn features at different resolutions. This approach enhances the network’s capacity to represent intricate image patterns by extracting local and global features effectively. Additionally, the Inception Network incorporates dimensionality reduction techniques, such as 1 × 1 convolutions, to manage computational complexity while maintaining information flow and improving efficiency. The inception module thus captures multiscale features and reduces computational complexity by utilising parallel convolutional layers with varying filter sizes64. Subsequent iterations of the Inception Network include Inception-V2, Inception-V364, and Inception-ResNet54. Inception Networks have consistently demonstrated competitive performance on image recognition benchmarks such as ImageNet. Due to its versatility and effectiveness, the Inception Network has become influential in computer vision, addressing a wide range of problems including image classification and segmentation65,66.
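
A simplified inception module is sketched below in Keras, with parallel branches of different kernel sizes and 1 × 1 convolutions for dimensionality reduction; the filter counts are illustrative rather than those of any published Inception variant.

```python
# Sketch of an inception module: parallel convolutions at several kernel sizes
# are concatenated, capturing features at multiple spatial scales.
from tensorflow.keras import layers

def inception_module(x):
    branch1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
    branch3 = layers.Conv2D(96, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    branch3 = layers.Conv2D(128, 3, padding="same", activation="relu")(branch3)
    branch5 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    branch5 = layers.Conv2D(32, 5, padding="same", activation="relu")(branch5)
    pool = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    pool = layers.Conv2D(32, 1, padding="same", activation="relu")(pool)
    return layers.Concatenate()([branch1, branch3, branch5, pool])
```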

Framework

In Stage 1 of our framework (Fig. 3), we acquire ROV data and store and annotate the video data on Squidle+. In Stage 2, we curate the Deepdive dataset by identifying and extracting different biota, general unknown biology, and non-living objects using HITL object detection from the ROV videos. We clean the dataset by removing images that are blurry or have other irregularities. We also exclude image classes that have fewer than 15 instances from the final Deepdive dataset to reduce the imbalance in the dataset. In Stage 3, we utilise three prominent families of pre-trained deep learning models for automated image classification, including variations of ResNet, DenseNet, and the Inception Network. The selected models in our framework have been pre-trained on the ImageNet dataset67, which features 1000 classes of objects. In Stage 4, we use the Deepdive dataset to create benchmark results for the multi-class classification task using the respective pre-trained deep learning models. We utilise transfer learning via pre-trained models by replacing the final fully connected layer in each model with a customised classifier layer to accommodate the distinct classes in the Deepdive dataset. We then retrain the respective models using the 33 classes of the Deepdive training dataset. We use the validation subset to monitor the model’s performance and prevent overfitting during the training process. We evaluate the classification performance using the Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC)68 scores on the test set.
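
The transfer-learning step in Stage 4 can be illustrated with the following Keras sketch, in which an ImageNet-pre-trained backbone has its classification head replaced by a 33-class layer; the input size, pooling, and optimiser settings are assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of transfer learning: reuse an ImageNet-pre-trained backbone and
# replace the final fully connected layer with a 33-class Deepdive classifier.
import tensorflow as tf

base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(250, 250, 3))

inputs = tf.keras.Input(shape=(250, 250, 3))
x = base(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(33, activation="softmax")(x)  # Deepdive classes
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```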

Fig. 3
figure 3

Framework for curating the Deepdive dataset by identifying and extracting different biota, general unknown biology, and non-living objects using HITL object detection from the ROV videos.

Data Records: Deepdive Dataset

The Deepdive dataset (accessible from69) contains a zip archive of all 3994 deep-sea biota images, split into three folders (training, validation, and testing) using a 60:20:20 split. Each of the three folders contains 33 sub-folders, each representing a different class of images. The training subset consists of 2,667 images, the validation subset contains 667 images, and the test subset comprises 660 images. We group the images into sub-folders to ease the selection of specific classes when a model only needs to be trained to predict a handful of classes. We kept the sub-folder names the same across the training, validation, and testing subsets for consistency.
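
Because the folder structure follows the standard one-sub-folder-per-class layout, the three subsets can be loaded directly with Keras utilities, as in the sketch below; the local directory names are assumptions based on the extracted archive.

```python
# Sketch of loading the Deepdive training/validation/testing folders.
# The "deepdive/..." paths are hypothetical local directory names.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "deepdive/training", image_size=(250, 250), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "deepdive/validation", image_size=(250, 250), batch_size=32)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "deepdive/testing", image_size=(250, 250), batch_size=32)
```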

Technical Validation

We preprocess the data to create a coherent set of images that can be utilised directly for classification tasks. Our approach involves carefully selecting and applying various preprocessing techniques. We first isolated the classes of images that had at least 15 instances identified using HITL annotation. This was done to reduce the extreme imbalance in the Deepdive dataset; hence, we ended up with 33 distinct classes. Figure 4 shows the distribution of all images present in the Deepdive dataset. It clearly shows the imbalance in the dataset even after removing approximately 30 classes due to the lack of images for those classes. The ratio between the most and least populated classes is 33:1. We were unable to improve this ratio, but we ensured adequate representation of all classes across the three subsets by randomly selecting images for each subset. We then standardised the image sizes by resizing them to a consistent resolution of 250 × 250 pixels. This was achieved through bilinear interpolation of the original image to ensure conformity to the new size. We rescaled the pixel values of the images to the range [0, 1], as the colour channels were originally in the range [0, 255]. This ensures homogeneity and compatibility with the classification architecture.
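
The resizing and rescaling steps can be expressed as a simple preprocessing function, as sketched below with TensorFlow; this is a minimal sketch of the described steps, not our exact pipeline.

```python
# Sketch of the preprocessing: bilinear resize to 250 x 250 and rescaling
# of pixel values from [0, 255] to [0, 1].
import tensorflow as tf

def preprocess(image, label):
    image = tf.image.resize(image, (250, 250), method="bilinear")
    image = tf.cast(image, tf.float32) / 255.0   # rescale colour channels
    return image, label

# Example usage with a tf.data dataset of (image, label) pairs:
# train_ds = train_ds.map(preprocess)
```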

Fig. 4
figure 4

Distribution of all classes in the Deepdive dataset, indicating the class imbalance.

Usage Notes

To ensure the usability of our dataset, we performed testing on the dataset for the image classification task. We selected the deep learning image classifiers mentioned previously and created a framework to integrate the Deepdive dataset into this process. We downloaded the dataset from Zenodo and unzipped the files to obtain the full dataset. In the framework (Fig. 3), we loaded the images from the training and validation folders and trained seven different pre-trained models, namely ResNet-50, ResNet-101, ResNet-152, DenseNet-121, DenseNet-169, Inception-V3, and Inception-ResNet-V2, from the Keras library70. We ran 30 independent experimental runs of the training, validation, and testing cycle for each classifier, with random initialisation of the final fully connected layer of the model for each training cycle. This was done to create benchmark results for the dataset and to guarantee consistent and reproducible outcomes. The models were trained using TensorFlow70 in an Anaconda environment running on a Windows system with an Nvidia 3090 graphics processor, an Intel i9 processor, and 128 gigabytes of memory.
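
The per-class evaluation can be reproduced along the lines of the sketch below, which computes one-vs-rest AUC scores on the test set with scikit-learn; it assumes a trained model and a test_ds dataset such as those in the earlier sketches, and may differ in detail from our exact evaluation code.

```python
# Sketch of ROC-AUC evaluation on the test set, assuming `model` and `test_ds`
# from the earlier sketches (hypothetical names).
import numpy as np
from sklearn.metrics import roc_auc_score

y_true, y_prob = [], []
for images, labels in test_ds:
    y_prob.append(model.predict(images, verbose=0))  # class probabilities
    y_true.append(labels.numpy())
y_true = np.concatenate(y_true)
y_prob = np.concatenate(y_prob)

# Macro-averaged one-vs-rest AUC across the 33 classes
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(f"Macro-averaged AUC: {auc:.3f}")
```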

The results show that the Inception-ResNet model achieved the best classification accuracy of 65%, as given in Fig. 5. All the other models had accuracies below 50%. We further analyse the AUC scores for each of the models, given in Table 3. The Inception-ResNet-V2 model had the best prediction performance across each of the classes in the dataset. Figure 7 shows the ROC-AUC plot for a single run of the Inception-ResNet-V2 model. We did not find a strong correlation between the number of instances present per class and the classification performance. This contradicts the typical behaviour of machine learning models on class-imbalanced datasets, where larger classes have better classification performance. The class Quill (Sea Pens) only had 15 instances in the dataset; however, the model performed exceptionally well in classifying it. We further investigated this issue by reviewing the ImageNet dataset67 used in pre-training the ResNet model and found that the Quill class was present in that dataset. Table 1 highlights all the classes in the ImageNet dataset that are also present in the Deepdive dataset. All of these classes have better classification performance, as the model is pre-trained on them and then exposed to the same classes again during our retraining of the classification layer. All of these classes have an AUC over 0.94, regardless of the number of instances present for them. Apart from these special cases, we find better classification in classes with more images present; this is common in imbalanced datasets, and we see the same trend here. It is worth noting that the Stalked Erect sponge and Laminar Erect sponge classes are an exception, with an AUC of over 0.98 despite having fewer than 20 instances each. However, we believe that the Arborescent Stumpy sponge class provides sufficient training data for the model to identify these two classes, as they belong to the same sponge class and share many similar features.

Fig. 5
figure 5

Model classification accuracy on testing data across 30 experimental runs.

Table 3 AUC values for each of the classes averaged over 30 experimental runs for each model.
Fig. 6
figure 6

Map of the Ribbon Reef canyon system, including Ribbon Reef 5, taken from71. The black lines in panel C represent the ROV dive transects for four dives. Panels A and B give the relative positions of the ROV dive sites.

Fig. 7
figure 7

ROC-AUC plot for an experimental run of the best performing model (Inception-ResNet-V2).

We presented a deep learning framework with a human-in-the-loop approach for ROV image analysis, highlighting the effectiveness of the Inception-ResNet model. The Deepdive dataset is limited in size when compared to conventional deep learning datasets such as ImageNet67, which features more than 3.2 million images. However, Deepdive has gone through rigorous quality control through the human-in-the-loop methodology and is comparable to similar underwater image datasets, as outlined in the introduction10,18. Furthermore, some classes have limited images, leaving the dataset class-imbalanced. To maximise the performance of the classification models, we excluded the classes with fewer than 15 images from the experiments. There is potential for future improvement through data augmentation for handling class imbalance and uncertainty quantification using Bayesian deep learning models.