Abstract
The rapid evolution of underwater and underground Internet of Things (IoT) infrastructures has created an urgent need for precise, power-efficient, real-time object detection systems that can function under extreme seawater conditions. Current solutions suffer from noise, poor feature discrimination, and high computational cost. Inspired by recent developments in machine-learning-driven localization, trustworthy routing in underwater wireless sensor networks (UWSNs), and contemporary sonar-based estimation methods, this work proposes a new Ensembled Deep Hybrid Learning (EDHL) system that combines multi-modal IoT sensing with Inception-based deep feature extraction and Gradient Boosting classification. Unlike conventional CNN or signal-processing-only models, the proposed EDHL model integrates multi-scale visual, seismic, thermal, electromagnetic, and radar features to enhance resistance to turbidity, multipath distortion, and dynamic environmental changes. Experimental results show an accuracy of 98.39%, low memory usage, and consistent inference time, surpassing state-of-the-art models and aligning with current developments in adaptive filtering, DOA/DOD estimation, and lightweight deep detectors. The proposed system offers a foundation for future autonomous marine exploration, underwater mapping, and intelligent UWSN deployments.
Introduction
Underwater object detection has received significant international interest because of its applications in marine ecology, oceanographic surveying, underwater robotics, security surveillance, and buried-infrastructure investigation1,2. Conventional systems depend heavily on sonar, optical imaging, or single sensor modalities, which tend to deteriorate in the presence of turbidity, multipath reflections, poor illumination, or sediment interference3,4. Recent research indicates that machine learning is taking center stage in the development of underwater communication, localization, routing reliability, and environmental perception, particularly in machine-learning-assisted magnetic induction UWSNs, adaptive coil design for omnidirectional communication, and intelligent routing structures for cost-effective data delivery in UWSNs5,6.
Additionally, the modernization of UWSNs reflects the growing importance of robust localization and detection algorithms that leverage multimodal sensor data and deep learning systems7,8. Contemporary work on joint DOA/DOD estimation demonstrates the relevance of multi-scale signal interpretation under noisy marine operating conditions, while adaptive Kalman-filtering-based multi-target tracking systems show the need for uncertainty-aware prediction in dynamic underwater environments9,10,11. Meanwhile, developments in YOLO-based object detectors, and more advanced versions of YOLOv8 in particular, demonstrate how lightweight convolutional and transformer-inspired models can improve feature extraction in low-visibility underwater images12,13,14. Yet, despite these improvements, current models continue to struggle with:
- High levels of noise in the sensing modalities.
- Low feature stability under turbidity and reflections.
- Poor inter-sensor fusion.
- High computational costs that prevent edge deployment.
To overcome these weaknesses, this paper introduces a new Ensembled Deep Hybrid Learning (EDHL) model, which incorporates IoT-based multimodal sensing, Inception-based multi-scale feature extraction, and Gradient Boosting classification15,16. Unlike previous efforts that consider only localization, routing, sonar estimation, or optical detection, the EDHL framework offers an integrated detection approach that combines radar, seismic, thermal, electromagnetic, and visual evidence into a single inference channel17,18,19. The goal is to achieve real-time, resource-efficient, and highly precise underground and underwater object detection, in line with current research directions in UWSNs, machine learning, adaptive filtering, and intelligent sensing20,21,22. Figure 1 illustrates the hidden potential unlocked by IoT.
Main contributions of this work
- Advanced Integration of IoT and AI: This research pioneers the integration of IoT technologies with advanced artificial intelligence models to enhance underwater object detection capabilities. This combination allows for real-time, accurate detection and monitoring of various underwater objects.
- Development of the EDHL Model: The formulation and implementation of an EDHL Model that integrates Inception networks with Gradient Boosting. This model stands out for its ability to extract multi-scale features and enhance classification accuracy in a challenging underwater environment.
- Robust Preprocessing Techniques: Implementation of sophisticated data preprocessing techniques including Median Filtering and Z-Score Standardization, which significantly improve the robustness and accuracy of the model by reducing noise and normalizing data from multiple sensor inputs.
- Comprehensive Validation: Extensive testing and validation of the model across various scenarios and datasets demonstrate its efficacy and robustness. The model's performance is further evidenced by superior precision, recall, and F1-score metrics compared to baseline models.
- Real-world Application Potential: Discussion on the practical applications of the model in marine biology, environmental monitoring, and commercial underwater operations, highlighting its capability to operate autonomously in remote areas and handle complex, noisy sensor data efficiently.
Section "Related work" reviews existing underwater object detection methods, focusing on the limitations of current technologies and highlighting the need for advances through deep learning and IoT. Section "Methodology" details the EDHL model, explaining its architecture and the integration of Inception networks with Gradient Boosting. Section "Results and discussions" presents the model's superior performance metrics, demonstrating its effectiveness in realistic settings. Section "Conclusion and future work" concludes with the study's achievements and prospects for integrating advanced sensor technologies and real-time detection to improve operational efficiency and accuracy in underwater exploration.
Related work
Detecting objects on the seabed is crucial for exploring the marine environment, but the underlying conditions present challenges such as light attenuation, scattering, and background interference. Current models fail to achieve high robustness, require high computational costs, and exhibit high false-detection rates. An efficient approach was presented for underwater object detection and image enhancement through deep learning techniques: true colors are first restored via FUnIE-GAN, and the enhanced images are then fed into the YOLOv7-GN detection network23. A new ACC3-ELAN network was proposed to improve feature fusion, together with an AC-ELAN-t module that simplifies the network by pruning. Experimental results on the DUO dataset reveal enhanced detection capabilities, which are of interest for underwater object detection on embedded devices.
Automatic modulation classification is essential for the surface and underwater sensors of the Internet of Underwater Things (IoUT), with deep learning (DL) improving classification accuracy. This work addresses the problem of implementing DL algorithms on edge devices with low computational capacity. Network pruning is outlined as an approach to reducing network complexity and increasing resource utilization in order to minimize interference. A novel lightweight CNN framework (DLocean) was proposed to classify modulated signals across a range of signal-to-noise ratios (SNRs)24. The model achieves 93% accuracy when tested on multiple edge devices and loses only 4% accuracy at SNR = 5 dB, showing that it can significantly decrease the complexity of deep networks without degrading classification performance across the wide range of scenarios required for real-world IoUT systems.
Underwater acoustic homing weapons (UAHWs) are efficient underwater weapons with sophisticated technologies and highly advanced features, capable of locating, recognizing, and attacking targets rapidly. Target acquisition is specific to the UAHW interaction, and their efficiency results from their speed and accuracy. A stacking-ensemble-based real-time target identification system for UAHWs is proposed. A target exhibits mirroring characteristics under the broadband detection signals emitted by UAHWs, so features associated with the distribution of energy and the target's broadband correlation were extracted25. To address the imbalance in the original sea-trial dataset, the SMOTE technique was employed to generate an approximately balanced dataset. The stacking ensemble model was then trained and evaluated separately on the original and balanced datasets, and finally deployed on an embedded device to put the approach into practice. Data from actual UAHW sea trials was used to validate the proposed strategy, with 5-fold cross-validation applied in the experiments. The approach surpassed the individual classifiers, achieving an average accuracy of 93.3%, and met real-time requirements with a single-cycle inference time of 15 ms.
The identification of underwater objects has become a vital element in the development of aquaculture, environmental observation, marine research, and related fields, owing to the steady advance of ocean observation technologies. Underwater photographs suffer from high noise, hazy objects, and multi-scale content, which make it hard for deep-learning-based target recognition systems to cope. These issues are addressed by modifying DETR to be more suitable for aquatic scenes26. First, a learnable query recall method is given that is easy to implement and significantly improves object identification by reducing the effect of noise. Second, a lightweight adaptor is introduced to provide multi-scale features for the encoding and decoding processes, aiding the recognition of small and irregular items in underwater scenarios. Third, a hybrid of smooth L1 and CIoU loss functions is used to optimize bounding-box regression. Finally, the resulting network is compared with other state-of-the-art approaches on the RUOD dataset, and the experimental results evidence the viability of the proposed approach.
Pollution of the marine and ocean environment is among the most severe environmental challenges affecting the world. This article determines that the effects of marine plastics are negative, making them a threat within marine habitats. Plastic waste falls to the ocean bottom and decomposes, forming microplastic particles after undergoing various changes; these tiny plastic pieces kill marine life and drive species toward extinction. An underwater dataset of 13,089 pictures, each 300 × 300 pixels, containing both marine organisms and marine debris was collected27. This dataset is used to develop the deep feature generator in the described technique. The feature generator produces 6000 features using a ResNet101 network in a sample-projector construction, of which the 1000 most relevant features are selected using Neighborhood Component Analysis (NCA). These feature vectors are then classified with the k-nearest neighbor (kNN) technique. Tenfold cross-validation was used for validation. With the hybrid dataset, the maximum accuracy achieved was 99.35%. Based on the object comparisons and the computed performance figures, the proposed scheme is a success.
A multiscale convolution and efficient channel attention based underwater acoustic target classification framework was used that exploits related discriminative auditory co-features such as the MFCC and the GFCC28. The model showed strong performance on the ShipsEar and DeepShip datasets across different noise levels, highlighting the importance of adaptability in passive-sonar-based ocean monitoring. The attention-enhanced CNN model has good feature-fusion properties for classifying audio data, although it is confined to acoustic modalities: it does not use multi-sensor fusion or spatial object detection. Conversely, the suggested EDHL structure combines heterogeneous IoT data of visual, seismic, thermal, and electromagnetic signals, allowing multimodal feature learning beyond sound signals. Furthermore, in contrast to their unimodal focus, our model targets deployment readiness in real-time IoT ecosystems with hybrid learning networks rather than pure attention-based pipelines.
A CNN-LSTM hybrid real-time tsunami prediction model was presented based on seismic waveforms, satellite data, and oceanographic sensor values29. The authors apply the STFT to convert their time series into spectrograms, allowing spatial feature extraction with the CNN and temporal learning with the LSTM. Experimental analysis showed better quality than single CNN and RNN models, especially in early-warning cases. Nevertheless, that paper concerns temporal prediction rather than the spatial delineation of underwater objects, and does not extend to object detection or buried-object recognition. By contrast, the EDHL model addresses visual and subsurface object detection with ensemble deep learning that is more robust across a variety of settings. Moreover, while their technique is based on sequential pattern modelling, our framework uses Inception-based multiscale vision features with Gradient Boosting to perform classification more efficiently in noisy IoT settings.
Underwater object detection has been studied in a similar manner, and the inherent difficulties of the environment, including light absorption, turbidity, scattering, and complex background noise, have consistently hampered visibility and object discernibility. Current deep learning architectures are generally not robust to such variations, leading to increased false-positive rates, sensitivity to noise, and intensive parameter optimization. In addition, most existing methods, whether acoustic, CNN-based visual, or temporal-predictor-based, require single-modality input or substantial computational complexity, making their application to edge and IoT devices infeasible. The proposed work, on the contrary, overcomes these limitations through a multimodal IoT-based EDHL framework that combines information collected by GPR, thermal, seismic, and electromagnetic sensors. Inception-based multiscale feature extraction is coupled with Gradient Boosting so that the model provides high-precision classification in the presence of noise while remaining lightweight and resource-efficient. In contrast to traditional deep networks, which need GPU-level processing and cloud connectivity, this scheme is designed for real-time deployment on constrained edge devices, balancing accuracy and efficiency to yield a scalable, field-ready detector.
Methodology
The approach for improving underwater object detection consists of a novel framework, the Ensembled Deep Hybrid Learning (EDHL) Model, that combines Inception networks and Gradient Boosting. It starts with the deployment of IoT sensors that capture multi-modal data from the underwater environment. The data are preprocessed with median filtering and Z-score standardization to reduce noise and remove extreme inputs. The resulting data are passed through the Inception network, which is designed to identify the multi-scale features important for recognizing various underwater objects. These features are then classified using Gradient Boosting, a machine learning technique that refines each detection by improving on previous classifications. This methodology is designed to tackle the challenges of detection in water, including light reflection and scattering, using deep learning to enhance detections and reduce errors under water conditions.
Data collection
The data in this study were obtained entirely from the publicly available Oceanic Life Dataset on Kaggle30. This dataset was chosen because it includes a wide range of categories of underwater objects and is applicable to training deep learning models for underwater conditions. It is composed of high-resolution RGB pictures of a variety of underwater types, including fish, coral, sponges, seaweed, crustaceans, and other marine organisms. To balance the learning process and lower class bias, stratified sampling was used to divide the dataset into training (70%), validation (15%), and testing (15%) splits. All images were processed with resizing, normalization, noise reduction, and augmentation methods, namely rotation, contrast adjustment, flipping, and blur-based perturbation. The dataset is publicly available, which makes future research and benchmarking efforts reproducible and transparent. Table 1 presents the sample distribution by class.
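The stratified 70/15/15 split described above can be sketched with scikit-learn's `train_test_split` applied twice, once to carve out the 70% training portion and once to halve the remainder. The class counts below are hypothetical stand-ins, not the actual Oceanic Life Dataset distribution:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced label vector: 60 + 30 + 10 samples over 3 classes.
labels = np.repeat(np.arange(3), [60, 30, 10])
indices = np.arange(len(labels))

# First split: 70% train, 30% held out, stratified by class.
train_idx, rest_idx, y_train, y_rest = train_test_split(
    indices, labels, test_size=0.30, stratify=labels, random_state=0)

# Second split: halve the held-out portion into validation and test.
val_idx, test_idx, y_val, y_test = train_test_split(
    rest_idx, y_rest, test_size=0.50, stratify=y_rest, random_state=0)
```

Stratification keeps each split's class proportions equal to the full dataset's, which is what reduces class bias during training.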
Data preprocessing
Noise reduction
Noise reduction is a necessary part of data preprocessing for underground object detection. Because data acquired from IoT devices such as Ground Penetrating Radar (GPR), electromagnetic, seismic, and thermal sensors include random fluctuations caused by environmental interference, a noise reduction technique must be applied. One such technique is the Median Filter, which works well for this aim. It replaces each point with the median of its neighbors, effectively reducing noise within the signal without distorting its significant features. This ensures that sensor readings, such as GPR reflections or seismic signal amplitudes, are clear and free of noise induced by, for instance, ground vibrations or interference from nearby operating electronics.
The Median Filter replaces each value in a signal with the median of neighboring values within a specified window size. For a signal \(\:S\) of length \(\:n\) and a window size \(\:k\):

$$S^{\prime}\left(i\right)=\text{median}\left(S\left(i-\tfrac{k-1}{2}\right),\ldots,S\left(i\right),\ldots,S\left(i+\tfrac{k-1}{2}\right)\right)$$

where \(\:{S}^{{\prime\:}}\left(i\right)\) is the noise-reduced value at position \(\:i\).
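A minimal NumPy sketch of this filter follows; clipping the window at the signal boundaries is one of several possible edge conventions, and the spike values are illustrative:

```python
import numpy as np

def median_filter(signal, k=3):
    """Replace each sample with the median of its k-sized neighborhood.

    The window is clipped at the signal edges (one common edge policy).
    """
    signal = np.asarray(signal, dtype=float)
    half = k // 2
    out = np.empty_like(signal)
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out[i] = np.median(signal[lo:hi])
    return out

# A single impulsive spike (e.g. a GPR glitch) is suppressed:
noisy = [1.0, 1.0, 9.0, 1.0, 1.0]
clean = median_filter(noisy, k=3)
```

Unlike a moving average, the median leaves step edges in the signal intact, which is why it preserves sharp reflection features.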
Outlier detection and removal
Outliers will distort the way the model interprets the data. They can be due to inherent sensor problems, sudden environmental changes, or faulty data transmission. To control their effects, outlier detection techniques are used to isolate anomalous values. A standard statistical solution is the Z-score method, which standardizes the data and flags points beyond a threshold, conventionally 3 standard deviations from the mean. Detected outliers are then deleted or corrected so that the resulting dataset is suitable for optimization.
The Z-score of a data point \(\:{x}_{i}\) in a dataset \(\:X\) is calculated as:

$$Z_{i}=\frac{x_{i}-\mu_{X}}{\sigma_{X}}$$

where \(\:{\mu\:}_{X}\) is the mean of the dataset \(\:X\), and \(\:{\sigma\:}_{X}\) is the standard deviation of the dataset. Data points with \(\:\left|{Z}_{i}\right|>3\) are typically considered outliers.
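The 3-sigma rule can be sketched in a few lines of NumPy. The readings are hypothetical; note that a single spike can only exceed |Z| = 3 when the sample is large enough (with n points, the maximum attainable |Z| is (n−1)/√n):

```python
import numpy as np

def remove_outliers_zscore(x, threshold=3.0):
    """Drop points whose absolute Z-score exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) <= threshold]

# Eleven normal readings plus one spurious spike at 50.0:
readings = np.append(np.full(11, 10.0), 50.0)
filtered = remove_outliers_zscore(readings)
```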
Data synchronization
In underground object detection, a number of sensors are employed that collect data in different forms, including GPR reflections, electromagnetic field changes, and seismic ground movement. Because these sensors produce such a wide range of data, it is important to harmonize their readings. This synchronization is normally carried out using time stamps, so that data from all sensors are brought to a common point in time (or space) for matched readings. This alignment enables the model to map sensor data correctly across the different modalities, enhancing detection capability.
Data from multiple sensors must be synchronized based on time stamps. If \(\:{t}_{sensor1}\) and \(\:{t}_{sensor2}\) are the time stamps for sensor 1 and sensor 2 respectively, synchronization ensures:

$$\left|t_{sensor1}-t_{sensor2}\right|\le \epsilon$$

for a small tolerance \(\:\epsilon\). This guarantees that corresponding readings across sensors are aligned for accurate correlation.
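Timestamp alignment can be sketched as nearest-neighbour pairing within a tolerance. The sensor names, sampling rates, values, and the 0.05 s tolerance below are illustrative assumptions:

```python
import numpy as np

def synchronize(t1, v1, t2, v2, tol=0.05):
    """Pair each reading from sensor 1 with the nearest-in-time reading
    from sensor 2, keeping only pairs whose timestamp gap is within tol."""
    pairs = []
    t2 = np.asarray(t2)
    for ta, va in zip(t1, v1):
        j = int(np.argmin(np.abs(t2 - ta)))
        if abs(t2[j] - ta) <= tol:
            pairs.append((ta, va, v2[j]))
    return pairs

# Hypothetical GPR stream at 10 Hz and a seismic stream with a dropout:
t_gpr  = [0.00, 0.10, 0.20, 0.30]
v_gpr  = [1.2, 1.4, 1.1, 1.3]
t_seis = [0.01, 0.11, 0.52]
v_seis = [0.8, 0.9, 0.7]
aligned = synchronize(t_gpr, v_gpr, t_seis, v_seis)
```

Readings without a partner inside the tolerance (here the GPR samples at 0.20 s and 0.30 s) are dropped rather than paired with a stale measurement.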
Data normalization
Normalization is performed to standardize all sensor readings to a particular range. Since different sensors can generate data on different scales, it is important to rescale the values so that no feature dominates the model merely because of a large numerical scale. For GPR readings, Min-Max scaling is performed, which normalizes values into the range 0 to 1. Seismic and thermal data are usually close to normally distributed, so standardization is applied to shift the mean to zero and the standard deviation to one. This step helps the model identify patterns in the data without being affected by differing magnitudes.
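Both rescaling schemes can be sketched in NumPy; the input values are illustrative stand-ins for GPR and seismic readings:

```python
import numpy as np

def min_max_scale(x):
    """Scale values into [0, 1] — the scheme used for GPR amplitudes."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Zero-mean, unit-variance scaling — used for seismic/thermal data."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

gpr = min_max_scale([2.0, 4.0, 6.0])
seismic = standardize([1.0, 2.0, 3.0])
```

In practice the min/max and mean/std would be computed on the training split only and reused for validation and test data, to avoid leaking statistics across splits.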
Missing data handling
Sensor networks may suffer from gaps in collected data or node washout, meaning that some measurements are missing. To deal with this, linear interpolation is used for continuous sensor data: a missing value is estimated from the neighboring known values, keeping the data continuous. For large gaps, linear interpolation might not be accurate enough, in which case a deep learning model can be trained to estimate the missing values from the other data points. Such models can give more precise on-demand estimates by taking into account the patterns and trends in the surrounding data.
For a missing value \(\:{x}_{m}\) between two known values \(\:{x}_{i-1}\) and \(\:{x}_{i+1}\), linear interpolation estimates \(\:{x}_{m}\) as:

$$x_{m}=x_{i-1}+\frac{t_{m}-t_{i-1}}{t_{i+1}-t_{i-1}}\left(x_{i+1}-x_{i-1}\right)$$

where \(\:{t}_{m}\) is the time at which the missing data point occurs, and \(\:{t}_{i-1}\), \(\:{t}_{i+1}\) are the times of the neighboring known values.
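The formula translates directly to code; the function name and the sample times/values are illustrative:

```python
def interpolate_missing(t_prev, x_prev, t_next, x_next, t_m):
    """Linearly estimate the value at time t_m between two known samples."""
    frac = (t_m - t_prev) / (t_next - t_prev)
    return x_prev + frac * (x_next - x_prev)

# A reading lost at t = 2.0 s between samples at t = 1.0 s and t = 3.0 s:
x_m = interpolate_missing(1.0, 10.0, 3.0, 14.0, 2.0)
```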
Dimensionality reduction
To handle high-dimensional data from multiple sensors, dimensionality reduction is applied using t-SNE (t-distributed Stochastic Neighbor Embedding), which reduces the feature space while preserving its most essential structure. This method reduces the dimensionality of the dataset while retaining relevant information. Reducing dimensionality lowers training complexity, and the model's generalization capability increases because only the important features, which are essential for identifying underground objects, are retained.
The objective function is the Kullback-Leibler divergence between the two similarity distributions:

$$C=KL\left(P\parallel Q\right)=\sum_{i\ne j}P_{ij}\log\frac{P_{ij}}{Q_{ij}}$$

where \(\:{P}_{ij}\) is the probability of similarity between points \(\:i\) and \(\:j\) in the high-dimensional space, and \(\:{Q}_{ij}\) is the probability of similarity between the same points in the low-dimensional space.
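A minimal sketch using scikit-learn's `TSNE`; the feature matrix is a random stand-in for fused sensor features, and the dimensions are arbitrary:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical stand-in for fused sensor features: 60 samples, 32 dims.
features = rng.normal(size=(60, 32))

# Embed into 2 dimensions; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedded = tsne.fit_transform(features)
```

One practical caveat of this choice: scikit-learn's `TSNE` has no `transform` method for unseen samples, so the embedding cannot be applied directly to new data at inference time, a constraint worth noting when t-SNE sits inside a detection pipeline rather than a visualization step.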
Data aggregation
Data aggregation simplifies the dataset while preserving important patterns for easier analysis. Measurements from a sensor can be grouped over ordered time intervals or spatial domains, and statistics such as the mean and standard deviation computed. This captures changes over time or space while making the data less extensive and minimizing the repetition of similar values. Care is needed, however, because measurement errors and offset noise can be exaggerated by the aggregation function, degrading the quality of the detection dataset.
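Windowed aggregation can be sketched as follows; the window length and signal are illustrative:

```python
import numpy as np

def aggregate_windows(signal, window=4):
    """Summarize a signal over fixed-length windows by mean and std.

    Any trailing partial window is dropped for simplicity.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal) // window * window
    blocks = signal[:n].reshape(-1, window)
    return blocks.mean(axis=1), blocks.std(axis=1)

# Two stable regimes collapse to two (mean, std) summaries:
means, stds = aggregate_windows([1, 1, 1, 1, 2, 2, 2, 2], window=4)
```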
Data transformation
The data obtained from a sensor may be in a raw format and need transformation before use in model training. For instance, raw GPR depth data can be converted into 2-D or even 3-D images illustrating the underground features. Likewise, time-series signals such as seismometer recordings may require transformation into a format that captures temporal measures. These transformations help deep learning models relate spatial and temporal structure in the data, improving object detection. For transforming a one-dimensional GPR signal into a two-dimensional image, the matrix \(\:I\) can be formatted as:

$$I=\begin{bmatrix}S_{1}\left(t_{1}\right) & S_{1}\left(t_{2}\right) & \cdots & S_{1}\left(t_{N}\right)\\ S_{2}\left(t_{1}\right) & S_{2}\left(t_{2}\right) & \cdots & S_{2}\left(t_{N}\right)\\ \vdots & \vdots & \ddots & \vdots\\ S_{M}\left(t_{1}\right) & S_{M}\left(t_{2}\right) & \cdots & S_{M}\left(t_{N}\right)\end{bmatrix}$$

where \(\:{S}_{m}\left({t}_{n}\right)\) is the signal strength at depth \(\:m\) and time \(\:{t}_{n}\). This matrix represents a 2D image in which each pixel corresponds to a specific time-depth pair.
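In NumPy this reorganization is a single row-major reshape; the trace below is a synthetic stand-in for the flattened samples \(\:{S}_{m}\left({t}_{n}\right)\), and the dimensions are arbitrary:

```python
import numpy as np

# Hypothetical flattened GPR trace: M depth bins × N time samples stored 1-D.
M, N = 4, 6
raw = np.arange(M * N, dtype=float)   # stand-in for the S_m(t_n) samples

# Row m = depth bin, column n = time step — the 2-D "image" I.
image = raw.reshape(M, N)
```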
Data labeling
For supervised learning, the dataset must be labeled according to the type of underground object identified, such as pipes, cables, rocks, or voids. This is very important for training the model to distinguish between different subterranean materials. Where manual labeling of sensor readings is infeasible, as with very large datasets, clustering algorithms can be used to derive pseudo-labels based on the similarity of the readings. This unsupervised approach can still help the model discover sensible differences between underground objects even without human-labeled data.
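Pseudo-labeling via clustering can be sketched with k-means; the two synthetic reading groups below are hypothetical stand-ins for, say, "pipe-like" versus "rock-like" sensor signatures:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two well-separated hypothetical groups of 3-dimensional sensor readings:
group_a = rng.normal(loc=0.0, scale=0.1, size=(50, 3))
group_b = rng.normal(loc=5.0, scale=0.1, size=(50, 3))
readings = np.vstack([group_a, group_b])

# Derive pseudo-labels from similarity alone (no human annotation):
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(readings)
```

The cluster IDs are arbitrary (cluster 0 may correspond to either group), so downstream use only relies on readings sharing a label, not on which label it is.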
Correlation-based feature selection (CFS)
Correlation-based Feature Selection (CFS) is one of the most widely used feature selection methods; it can enhance the performance of a machine learning model by finding the most important elements of a feature set. The foremost objective of CFS is to identify a small subset of features that are most relevant to the target variable while containing the least inter-correlation among themselves. In this way CFS ensures that the selected features are good predictors of the target variable and minimally dependent on one another, which helps avoid overfitting. The extracted feature set from the multimodal sensor inputs and Inception-based deep feature maps initially comprised 312 features. After applying CFS, the feature space was narrowed to 58 highly relevant features. These features showed good correlation with the target classes and very limited inter-feature redundancy, which played a significant role in enhancing the performance of the proposed EDHL model. Reducing 312 features to 58 improved not only classification efficiency but also memory usage and computational cost during training and inference. This best subset, which retained all the discriminative information needed for accurate detection, was used in all further experiments and measurements.
Fundamentally, CFS is rooted in the notion of selecting a feature subset whose features correlate well with the target variable while exhibiting low inter-correlation. In other words, CFS intends to select features that contain fresh and meaningful information about the target, rather than features that are redundant or highly correlated with one another. This also aids the interpretability of the model and increases computational efficiency by reducing the dimensionality of the feature space. CFS can be expressed as an optimization problem in which a subset of features is evaluated with a merit function that measures the predictive power of the feature set. The merit of a subset is defined as:

$$Merit_{S}=\frac{k\:\overline{r_{cf}}}{\sqrt{k+k\left(k-1\right)\overline{r_{ff}}}}$$

where \(\:k\) is the number of features in the subset \(\:S\), \(\:\overline{{r}_{cf}}\) is the average correlation between each feature and the target variable, and \(\:\overline{{r}_{ff}}\) is the average inter-correlation between the features in \(\:S\).
This formula represents the trade-off that CFS aims to achieve: maximizing the correlation of the features with the target variable while minimizing the correlation among the features themselves. When the features are almost identical to one another, the denominator of the merit function is large, leading to a small merit score. On the other hand, if the features correlate highly with the target and very little with each other, the merit score is higher, making the subset more beneficial. By using the CFS filter we can decrease the dimensionality of the data, eliminating features that are not significant for the prediction task. This improves the model's effectiveness by limiting the learned features to the informative ones, which may help minimize overfitting, a situation in which noise in the data is learned as patterns. CFS is most useful when the number of available features is large but the number of relevant features is small. By choosing an appropriate subset of features, CFS can enhance the training of machine learning models, especially architecturally complex models such as deep neural networks, where reducing the input features can improve training time and generalization.
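The merit function can be computed directly from Pearson correlations. In the sketch below, the synthetic features are constructed so that one pair is relevant but complementary (`a`, `b`) and the other relevant but redundant (`a`, `c`, a near-copy of `a`); the redundant pair should score lower despite equal relevance:

```python
import numpy as np

def cfs_merit(X, y):
    """Merit_S = (k * mean|r_cf|) / sqrt(k + k*(k-1) * mean|r_ff|)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(k)])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for i in range(k) for j in range(i + 1, k)])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
n = 500
y = rng.normal(size=n)
a = y + 0.5 * rng.normal(size=n)    # relevant
b = y + 0.5 * rng.normal(size=n)    # relevant, fairly independent of a
c = a + 0.01 * rng.normal(size=n)   # relevant, but nearly a copy of a

merit_complementary = cfs_merit(np.column_stack([a, b]), y)
merit_redundant = cfs_merit(np.column_stack([a, c]), y)
```

With equal per-feature relevance, the higher inter-correlation of the redundant pair inflates the denominator and lowers its merit, exactly the trade-off the formula encodes.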
Figure 2 shows the architecture of the proposed model.
Domain-Specific features
Ground Penetrating Radar (GPR) was instrumental in underground object detection, since it emits a pulse into the ground and receives signals reflected from subsurface objects. Common features extracted from GPR data were reflection signals, attenuation rates, and depth information. Reflection coefficients represented the echoes returned when the radar wave struck bodies of different dielectric properties, such as pipes, rocks, or voids. These signals depended on the size of the target, the density of its material, and its depth. Attenuation rates determined how much the radar signal had diminished at a particular layer of the ground; the signal strength was higher where object density was high, for instance at metal pipes, and lower where there were rocks or soil. Another key parameter was depth information, which made it possible to locate objects beneath the earth's surface based on the time required for the radar signals to return.
Seismic sensors recorded vibrations travelling through the ground, making them efficient at detecting underground structures or materials. Significant attributes derived from seismic data were vibration amplitude, wave velocity, and frequency bands. Vibration amplitude, the strength of the seismic waves, could be used to determine the presence of massive, solid objects in the ground; a high-amplitude seismic signal, for instance, suggested a rock formation or underground structure. Wave velocity was another significant characteristic, because different media transmit seismic waves at different speeds; measuring the velocity of these waves enables one to determine what kind of material lies beneath the surface, whether soft soil, concrete, or rock. Frequency bands also aided the identification of underground objects by showing how energy was distributed across frequencies: low-frequency signals propagated through deeper regions of the ground, while high-frequency waves gave finer perception of the near-surface area. All these features made seismic sensors essential instruments for identifying and classifying underground objects.
Thermal sensors offer a different perspective on underground detection by measuring subsurface temperatures. Thermal features obtained from the sensor data include temperature gradients and temporal variations in thermal patterns. Temperature gradients show how temperature changes with depth, which is important for detecting objects that either produce heat or trap it, such as water pipes or thermal layers in the ground. For instance, buried utilities such as heated pipelines cause temperature variations in the surrounding soil and are therefore easier to find. Time-based thermal fluctuation, the second feature, describes how temperature varies over time; these changes reveal subsurface structures that alter the thermal conditions of their surroundings. Regions with slow heat dissipation indicate dense materials or voids that accumulate heat. These thermal features are most useful as an adjunct to other sensor data, adding complementary information about the subsurface.
Ensembled deep hybrid learning (EDHL) model
The proposed EDHL model, Inception-Gradient Boosting, adopts the advantages of both deep learning and machine learning: an Inception network for feature extraction and Gradient Boosting for classification. This combination yields a highly efficient model for tasks such as detecting underground objects, which require recognizing varied patterns in large volumes of data. The InceptionV3 variant serves as the platform for multi-scale feature extraction in the EDHL framework. The network was initialized with ImageNet weights and then fine-tuned on the underwater dataset. Its architecture has 48 layers, consisting of convolutional, pooling, and Inception modules. Input images were resized to 299 × 299 × 3, matching the InceptionV3 input. The feature maps from the final Inception module were fed into a Global Average Pooling (GAP) layer, giving a 2048-dimensional feature vector. This was narrowed to 312 intermediate features with a fully connected layer, and Correlation-based Feature Selection (CFS) then reduced the set to 58 features. These 58 features formed the input to the Gradient Boosting classifier, whose number of estimators was set to 300, learning rate to 0.05, and maximum tree depth to 4. To make the experiments reproducible, all of them were implemented with TensorFlow and Scikit-learn.
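A minimal sketch of the classification stage just described, using the quoted Gradient Boosting hyperparameters. The 58-dimensional vectors stand in for the real InceptionV3 → GAP → dense → CFS features, which are replaced here by synthetic data so the sketch is self-contained:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the 58 CFS-selected Inception features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 58))
y = (X[:, :5].sum(axis=1) > 0).astype(int)  # toy binary target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Hyperparameters quoted in the text: 300 estimators,
# learning rate 0.05, maximum tree depth 4.
clf = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=4, random_state=0
)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))
```

In the full pipeline, `X` would be produced by a fine-tuned InceptionV3 with GAP pooling followed by the 312-unit dense layer and CFS; only the classifier head is shown here.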
The Inception network is a deep convolutional network that effectively extracts multi-scale features from the input data. In contrast with standard CNNs that use one filter size per layer, the Inception model applies several filters of various sizes within a single layer. This lets it register patterns at different scales simultaneously, which is especially useful for structurally complex data such as GPR, seismic, or thermal input. In underground detection, features may be as small as cables or as large as geological formations, so a model must detect features at different resolutions. By running parallel branches of convolution layers with different kernel sizes (for example, 1 × 1, 3 × 3, or 5 × 5), the Inception network detects both subtle and big-picture features. This parallel processing gives a broad perspective on the data so that no significant feature is missed during extraction.
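To make the parallel-branch idea concrete, here is a toy single-channel Inception-style module in NumPy/SciPy. The fixed averaging kernels are stand-ins for learned filters, and a real InceptionV3 block also adds 1 × 1 bottlenecks and many filters per branch; this is a sketch of the mechanism, not the production architecture:

```python
import numpy as np
from scipy.signal import convolve2d

def inception_module(x, k1, k3, k5):
    """Toy single-channel Inception module: convolve the input with
    1x1, 3x3 and 5x5 kernels in parallel, add a 3x3 max-pool branch,
    and concatenate along a new channel axis ('same' padding keeps
    the spatial size so the branches can be stacked)."""
    b1 = convolve2d(x, k1, mode="same")
    b3 = convolve2d(x, k3, mode="same")
    b5 = convolve2d(x, k5, mode="same")
    # 3x3 max pooling with stride 1 via a sliding-window maximum.
    pad = np.pad(x, 1, mode="edge")
    bp = np.stack([
        pad[i:i + x.shape[0], j:j + x.shape[1]]
        for i in range(3) for j in range(3)
    ]).max(axis=0)
    return np.stack([b1, b3, b5, bp], axis=-1)  # concat on channels

x = np.random.default_rng(1).normal(size=(32, 32))
y = inception_module(x,
                     np.ones((1, 1)),          # identity 1x1 branch
                     np.ones((3, 3)) / 9,      # 3x3 averaging branch
                     np.ones((5, 5)) / 25)     # 5x5 averaging branch
print(y.shape)  # spatial size preserved, one channel per branch
```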
The Inception module processes the input \(\:X\) (e.g., sensor data from GPR, seismic, or thermal sensors) using parallel convolution operations with multiple filter sizes. For an input feature map \(\:X\), the output feature map \(\:Y\) from the Inception module is:

\(\:Y=concat\left(Con{v}_{1\times\:1}\left(X\right),Con{v}_{3\times\:3}\left(X\right),Con{v}_{5\times\:5}\left(X\right),MaxPoo{l}_{3\times\:3}\left(X\right)\right)\)
Where \(\:Con{v}_{k\times\:k}\left(X\right)\) represents a convolution operation with a filter size of \(\:k\times\:k\), \(\:MaxPoo{l}_{3\times\:3}\left(X\right)\) represents a \(\:3\times\:3\) max-pooling operation, and \(\:concat\) refers to the concatenation of feature maps along the depth (channel) axis. Inception modules are applied to the input data sequentially to extract the spatial and contextual features of the subsurface signals. These features describe the underground objects, including their shape, depth, material, and reflection intensity, and are very valuable in the next phase, classification. After the Inception network extracts the features, they are classified by Gradient Boosting, an ensemble learning algorithm famous for its accuracy on out-of-sample and high-dimensional data. Gradient Boosting builds a sequence of weak learners, usually decision trees, in which each model corrects the mistakes of its predecessors; each successive model adaptively learns from the errors of the previous ones. The extracted feature vector \(\:f\) is fed into the Gradient Boosting model, which iteratively improves the prediction of a base model (typically decision trees) by minimizing a loss function \(\:L\) with respect to the model predictions. The prediction at iteration \(\:t\), denoted by \(\:{\widehat{y}}^{\left(t\right)}\), is updated as follows:

\(\:{\widehat{y}}^{\left(t\right)}={\widehat{y}}^{\left(t-1\right)}+\eta\:{h}^{\left(t\right)}\left(f\right)\)
Where \(\:{\widehat{y}}^{\left(t\right)}\) is the predicted output at iteration \(\:t\), \(\:{h}^{\left(t\right)}\left(f\right)\) is the decision tree at iteration \(\:t\) applied to the feature vector \(\:f\), and \(\:\eta\:\) is the learning rate. Within the EDHL model, Gradient Boosting employs the multi-scale features provided by the Inception network to improve the classification of various underground objects. The features extracted through Inception are fed to the Gradient Boosting model, which combines decision trees into an improved classifier. The boosting process reduces prediction error by emphasizing misclassified data points, making the model more accurate. This is especially important for detection below ground, where the environment is usually noisy and fine structures in the data must be resolved to detect objects such as pipes, rocks, or voids. The goal of Gradient Boosting is to minimize the loss function \(\:L\), which could be mean squared error (MSE) for regression or log-loss for classification:

\(\:L=\sum\:_{i=1}^{N}l\left({y}_{i},{\widehat{y}}_{i}\right)\)
Where \(\:{y}_{i}\) is the true label of the \(\:i\)-th sample, \(\:{\widehat{y}}_{i}\) is the predicted label of the \(\:i\)-th sample, \(\:l\) is the chosen loss function (e.g., log-loss for classification), and \(\:N\) is the total number of samples. Furthermore, the Gradient Boosting classifier improves performance on imbalanced data by sharpening the precision of less frequent classes (i.e., rare subterranean items). Its strength in combining multiple weak classifiers into a single diverse and accurate predictive model suits this application, where the extracted features must be processed with precision. The final prediction of the EDHL model is the output after the last iteration of Gradient Boosting:

\(\:{\widehat{y}}_{final}={\widehat{y}}^{\left(T\right)}={\widehat{y}}^{\left(0\right)}+\sum\:_{t=1}^{T}\eta\:{h}^{\left(t\right)}\left(f\right)\)
This combines the multi-scale features extracted by the Inception network and refines the predictions through the Gradient Boosting process over \(\:T\) iterations, leading to an accurate classification of underground objects. The proposed Inception-Gradient Boosting EDHL model works as an ensemble in which deep learning (Inception) and machine learning (Gradient Boosting) benefit from each other: the Inception network excels at feature extraction, particularly for complicated and large datasets, while Gradient Boosting improves classification by combining many decision trees into a single strong learner. This hybrid approach offers several key advantages:
- Multi-scale feature capture: the parallel filters of the Inception network let the model learn micro-level and macro-level features at the same time, providing the feature space that boosting needs to improve.
- Improved accuracy: coupling Gradient Boosting with the deep features allows the model to learn repeatedly from its mistakes and thereby improve classification accuracy.
- Generalization: the ensemble nature of Gradient Boosting penalizes complexity and tunes the model toward good generalization on unseen data, so it does not overfit.
- Handling complex data: the Inception network handles heterogeneous, ill-structured data effectively, and Gradient Boosting is a powerful classifier, which makes this model suitable for detecting underground objects in noisy, multimodal datasets.
The proposed Inception-Gradient Boosting EDHL model is a very efficient algorithm for underground object detection. The model integrates the multi-scale features of the Inception network with high-precision classification from Gradient Boosting for accurate and efficient detection of multiple targets. This integration of deep learning and ensemble learning gives the model the ability to capture the complexities within the data and make accurate predictions in different underground environments.

Algorithm: Underwater object detection using hybrid methods
Novelty of the work
The major novelty of this work is the integration of IoT with the Ensembled Deep Hybrid Learning (EDHL) model, a contribution that is both timely and powerful for advancing the state of underwater and underground object detection. IoT sensors produce raw, unstructured data in large amounts, while deep learning models usually need clean, structured data and strong computing equipment to work best. The integration described here fills that gap with real-time, resource-aware detection over data streams collected directly from distributed IoT devices. This is more than a technical combination: it shows how multi-modal IoT signals (GPR, seismic, thermal, electromagnetic) can be preprocessed, fused, and interpreted efficiently by a hybrid learning pipeline. Moreover, the proposed EDHL, which pairs Inception networks for feature extraction with Gradient Boosting for refined classification, was specifically selected to address the practical limitations of current solutions in noisy and resource-limited settings. In contrast to strictly academic benchmarks, this IoT-EDHL integration illustrates a practical implementation model that is scalable, energy-efficient, and able to support real-life uses, including marine exploration, underground utility mapping, and environmental surveillance. The novelty therefore lies not in devising an entirely new model in isolation, but in the strategic coupling of IoT data acquisition with a hybrid ensemble uniquely tailored to the operational and precision requirements of this field, which has not previously been accomplished in a single piece of work.
Results and discussions
The proposed model was implemented in a reliable environment, with Jupyter Notebook as the development platform and sufficient computing power for the required computations. The hardware setup for the experiments, an Intel® Core i7-1355U CPU with 6 GB of RAM, was chosen to be representative of real-life deployment scenarios rather than a high-end laboratory setup. Because the proposed EDHL model is intended to operate in edge, IoT, and resource-constrained settings, evaluating performance on a modest CPU-based system indicates its practicability, scalability, and deployability on systems without GPUs. To avoid ambiguity, the training and fine-tuning phases were conducted on a machine with a GPU (NVIDIA RTX 3060 with 12 GB VRAM), whereas the reported runtime, inference latency, and resource consumption metrics were measured on the CPU alone, to reflect low-power operating conditions. This two-stage experimental structure is consistent with standard deployment models, in which models are trained on high-performance computers but deployed on smaller processors, field systems, or handheld machines. The model maintained competitive accuracy and runtime performance despite the limited hardware, which further supports its relevance to real-world deployments where GPU acceleration cannot be assured. This setup not only supports computationally intensive work but also improves the model's speed for real-time object detection in underwater environments.
Utilizing Inception for feature extraction and Gradient Boosting for classification, the EDHL model represents a robust and effective strategy for complex operations such as underground object detection. The working principle combines multi-scale deep feature extraction with state-of-the-art boosting methods to achieve high detection and classification accuracy. First, the EDHL model takes multi-modal sensor data from diverse underground detection methods, including GPR, seismic sensors, electromagnetic sensors, and thermal sensors. Each provides a different type of data, such as images, time series, or temperature readings. The data is preprocessed, by cleaning to remove noise, normalizing, and scaling, before it is fed into the model for analysis. Figure 3 shows a sample of the dataset.
Table 2 and Fig. 4 compare different preprocessing techniques according to accuracy. The no-preprocessing baseline demonstrates that the model is sensitive to noisy, unprocessed data, achieving the worst accuracy of 90.12% and precision of 89.45%. This result underscores the role of data preprocessing: on raw data, the model has low accuracy in identifying underground objects. Without preprocessing, the data may contain noise, extreme values, and features not scaled to a common range, making it hard for the model to extract relevant patterns. The recall and F1-score of 88.90% and 89.17%, respectively, further confirm that skipping data preparation makes it difficult to balance precision and recall.
Min-Max Normalization was effective in enhancing performance by standardizing the dataset to a smaller range. Accuracy improved to 92.78%, along with better precision and recall of 91.67% and 91.32%, respectively. Nevertheless, as the F1-score of 91.49% indicates, normalization minimizes fluctuations in the features but does not actually remove noise and outliers. Performance increased further to 93.50% with Z-Score Standardization, which scales the values to a mean of zero and a standard deviation of one, showing that standardization is more efficient than Min-Max Normalization when features are normally distributed. For noise reduction, Median Filtering was applied, and Outlier Removal (Z-score) was used to discard the highest and lowest values. Median Filtering, which replaces noisy samples with the median of nearby values, increased accuracy to 94.32%, precision to 93.15%, and recall to 92.75%. This method benefits the model because it discards noise while retaining useful characteristics. Outlier removal via Z-scores achieved 93.89% accuracy, slightly less than Median Filtering but far superior to no preprocessing at all. By trimming the noise that would otherwise have distorted decision making, this technique raised the F1-score to 92.43%.
Using t-SNE for dimensionality reduction, the model achieved an accuracy of 95.05%, precision of 93.95%, and recall of 93.56%, demonstrating that reducing the number of features can be advantageous provided the essence of the data remains intact. With lower dimensionality, the model can focus on only the most important features in the data without overfitting the training set. The finest results were obtained when Median Filtering and Z-Score Standardization were applied together. This combined preprocessing method achieved the highest accuracy at 96.16%, with precision of 95.21%, recall of 94.71%, and an overall F1-score of 94.51%. The comparison of precision, recall, and F1-score across preprocessing techniques is illustrated in Fig. 5. This approach offers a balanced solution to noise, outliers, and data normalization, giving the best results for underground object detection.
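A minimal sketch of the best-performing preprocessing combination (median filtering followed by z-score standardization). The sample signal and window size are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess(signal: np.ndarray, window: int = 3) -> np.ndarray:
    """Median filtering to suppress impulsive noise, followed by
    z-score standardization (zero mean, unit standard deviation)."""
    smoothed = median_filter(signal, size=window)
    return (smoothed - smoothed.mean()) / smoothed.std()

raw = np.array([1.0, 1.1, 9.0, 1.2, 1.0, 1.3, 1.1])  # 9.0 is a spike
clean = preprocess(raw)
print(clean.round(2))
```

The median step removes the impulsive spike before standardization, so the outlier cannot dominate the mean and standard deviation used for scaling; this is the rationale for applying the two steps in this order.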
The preprocessed data is then fed to the Inception network, whose goal is feature extraction from the sensor data. The Inception architecture passes the input through multiple convolutional filters of different sizes at once within one layer, for example 1 × 1, 3 × 3, and 5 × 5. This allows the network to learn multi-scale features, which are very important for sensing underground objects of varying sizes and shapes, such as pipes, rocks, or voids. For instance, the 1 × 1 filters capture small cables or fractures, while the 5 × 5 filters suit large underground rocks or geological structures. The outputs of these different filter sizes are then concatenated, joining both fine and broad characteristics into a single feature map. Several layers of such Inception modules are stacked, enabling the model to capture more numerous and intricate patterns and dependencies in the data. Finally, the feature maps of the Inception layers are flattened into a feature vector that completely characterizes the raw sensor data.
Table 3 compares the feature selection techniques, including SELECT-NSGAIII and FS-DE, in terms of accuracy, precision, recall, and F1-score. Without feature selection, the model had an accuracy of 92.45% because it was trained with unwanted features that degraded its performance. Precision, recall, and F1-score were all close to 91%, affected by the same interference. Training without feature selection always introduced noise from irrelevant features in the input data, which impaired classification.
With Principal Component Analysis (PCA), accuracy increased slightly to 93.10%, as depicted in Fig. 6. Applying PCA to reduce the number of features removed redundant data and achieved higher precision and recall of 91.95% and 91.22%, respectively. Nevertheless, PCA's reduction process may eliminate important features, which is why the improvement was not as marked as anticipated. Recursive Feature Elimination (RFE) reached a higher accuracy of 94.02% because the algorithm eliminates unimportant features step by step.
With improved feature selection, precision, recall, and F1-score increased to 92.86%, 92.10%, and 92.47%, respectively, affirming that RFE is more efficient for selective data classification, as shown in Fig. 7. The performance of Mutual Information and the Chi-Square Test was also boosted, coming closest to that of RFE: Mutual Information produced an accuracy of 93.78% and Chi-Square 93.45%. The two techniques were quite comparable, although the former was slightly more precise at the cost of a lower recall. These methods quantify how variables depend on the target, but their results were worse because they lack an explicit subset-selection procedure like that of RFE or CFS.
The proposed Correlation-based Feature Selection (CFS) model stood out, producing the highest accuracy of 98.39%. CFS evaluates features by their correlation and selects those that are highly predictive with minimal overlap, yielding the highest precision of 97.82%, recall of 98.02%, and F1-score of 97.91%. L1 Regularization (Lasso) gave a mild enhancement, with an accuracy of 93.68%: Lasso discourages the use of irrelevant features, but its accuracy remained lower than that of CFS and RFE. Identifying the ideal feature set confirmed that the proposed CFS model outperformed the other techniques, providing the highest accuracy and classification performance. Because the maps from the Inception network are high-dimensional, the subsequent step is dimensionality reduction of the feature space. This can be done with methods such as AutoEncoders, which compress the feature vectors into a lower-dimensional space while preserving the most significant features. This reduces computational time and helps avoid overfitting, especially on big datasets. The compressed feature vector is then sent to the classification layer. Finally, after feature extraction and compression, EDHL applies Gradient Boosting for classification. Gradient Boosting is an ensemble learning method that combines a number of weak learners, most commonly decision trees, into a strong learner; each tree is trained to minimize the residuals of the prior model. In this process the model refines its predictions by optimizing a loss function suited to the problem, for example log-loss for classification.
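The trade-off CFS makes, high feature-class correlation against low feature-feature redundancy, can be sketched with a simplified greedy search over the CFS merit function. This is an illustrative stand-in with synthetic data, not the paper's exact procedure (CFS is usually paired with a best-first search):

```python
import numpy as np

def cfs_select(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Greedy correlation-based selection: repeatedly add the feature
    maximizing the CFS merit m*r_cf / sqrt(m + m*(m-1)*r_ff), where
    r_cf is the mean |feature-class correlation| of the subset and
    r_ff its mean |feature-feature correlation|. Simplified sketch."""
    n_feat = X.shape[1]
    r_cf = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(n_feat)])
    selected = []
    while len(selected) < k:
        best_j, best_merit = -1, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            subset = selected + [j]
            m = len(subset)
            rcf = r_cf[subset].mean()
            if m == 1:
                rff = 0.0
            else:
                rff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                               for i, a in enumerate(subset)
                               for b in subset[i + 1:]])
            merit = m * rcf / np.sqrt(m + m * (m - 1) * rff)
            if merit > best_merit:
                best_merit, best_j = merit, j
        selected.append(best_j)
    return selected

# Two informative columns plus four noise columns: the greedy merit
# search should favour the informative pair.
rng = np.random.default_rng(0)
informative = rng.normal(size=(200, 2))
y = (informative.sum(axis=1) > 0).astype(float)
X = np.hstack([informative, rng.normal(size=(200, 4))])
print(cfs_select(X, y, 2))
```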
Table 4 and Fig. 8 show the performance comparison of numerous models for underground object detection in terms of accuracy. The CNN model had the lowest accuracy of 89.31% and F1-score of 87.05%, supporting the hypothesis that it is weaker at handling multi-modal sensor data. The ResNet and DenseNet models had higher accuracies of 91.24% and 92.13% due to their deeper architectures, which capture high-level features, as shown by their higher precision and recall. DenseNet proved fairly well balanced, with an F1-score of 90.52%.
The precision, recall, and F1-score comparison of the existing models and the proposed model is shown in Fig. 9. VGG had a moderate performance of approximately 90.89% accuracy and an 89.35% F1-score, reflecting its older architecture compared to ResNet and DenseNet. The AutoEncoder-LSTM model, which uses the AutoEncoder for feature dimensionality reduction and the LSTM for temporal data, reached 91.87% accuracy, demonstrating the suitability of this approach for sequential data analysis; however, its F1-score of 90.00% shows room for optimizing both precision and recall. The Hybrid ResNet-LightGBM model was slightly better, with an accuracy of 93.02%, followed by Inception-XGBoost at 92.74%. Their use of gradient boosting for classification improved precision and recall, producing better results.
EfficientNet-Gradient Boosting yielded an accuracy of 93.41% and an F1-score of 91.64%. The Capsule Network also displayed high accuracy, with precision of 91.20% and recall of 90.82%, showing its capability to learn the spatial hierarchies present in the data. The Proposed EDHL (Inception-GB) model yielded the best overall performance, with the highest accuracy of 98.39%, precision of 97.82%, recall of 98.02%, and F1-score of 97.91%. The combination of Inception for feature extraction and Gradient Boosting for classification best addressed the challenges of underground object detection and was the best-performing model by sensitivity, specificity, and overall accuracy.
At each stage of Gradient Boosting, a decision tree is fitted to the feature vector delivered by the Inception network. The model determines the type of underground object (for instance pipe, rock, or void) by minimizing the error of the previous iteration's prediction. The Gradient Boosting component repeats this process, adjusting the model output with each newly fitted decision tree and thereby reducing both the bias and the variance of the result. Gradient Boosting also makes it easier to work with noisy, complicated data: it is particularly useful for detecting underground objects because it concentrates on hard-to-classify cases, improving performance on easily overlooked patterns such as low-contrast reflections or minimal temperature changes. The last stage of the EDHL model merges the Inception network with the Gradient Boosting classifiers. Recall that Gradient Boosting forms its prediction as a weighted combination of several decision trees; this greatly enhances the results, as the ensemble process cancels out the individual classifiers' errors.
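The iterative refinement described here can be observed directly with scikit-learn, whose `staged_predict_proba` exposes the intermediate predictions after each boosting step; the training log-loss should shrink as iterations accumulate, mirroring the stage-wise update rule. The data and hyperparameters below are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.05,
                                 max_depth=4, random_state=0).fit(X, y)

# staged_predict_proba yields the model's probabilities after each
# boosting iteration t, so the training log-loss traces the stage-wise
# improvement of the ensemble.
losses = [log_loss(y, p) for p in clf.staged_predict_proba(X)]
print(round(losses[0], 3), round(losses[-1], 3))
```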
Table 5 shows the false positive rate (FPR) of the different models for underground object detection. Of all the models, CNN has the worst FPR of 11.48%, meaning it tends to label non-object regions as underground objects. This high error rate indicates that CNN is less effective at interpreting the complexity and noise in the sensor data and often produces wrong or spuriously positive results. ResNet and DenseNet provide enhancements with FPR values of 9.13% and 8.62%, signifying that increased depth yields better feature extraction. Still, as complexity grows, for example with multiple sensor modalities, these models show deficiencies, as seen from their higher FPR compared to the more sophisticated models. The VGG and AutoEncoder-LSTM models follow with FPRs of 9.21% and 8.89%, respectively, performing relatively well when handling sequential data and image-based features.
The Hybrid ResNet-LightGBM and Inception-XGBoost models show even better results, with FPRs of 7.52% and 7.95%, respectively. Their incorporation of gradient boosting into deep learning architectures further minimizes false positives by sharpening the models' decision margins. EfficientNet-Gradient Boosting has a still lower false positive rate of 7.36%, demonstrating that the model remains accurate despite its reduced complexity.
The Proposed EDHL (Inception-GB) model yields the lowest FPR of 4.02%, underlining its capacity to minimize false detections. By using Inception networks to extract multi-scale features, this model also classifies better with the help of Gradient Boosting, which suppresses noise and thus raises fewer false alarms. The low FPR underscores the success of the ensembled approach in distinguishing underground objects from non-object areas. Compared to the other models, the EDHL model produces significantly fewer false positives, making it useful for practical applications that demand high accuracy, such as utility mapping or archaeology, as shown in Fig. 10. While the proposed method may show slightly lower recall and a somewhat higher FPR than some state-of-the-art baselines, its better accuracy makes its decisions more reliable and less noisy in actual usage scenarios. The final prediction of the model classifies underground objects by their presence: pipes, cables, rocks, or voids, for example. The ensemble method guarantees the consistency and generalization of the prediction across varied sensor data and environmental noise.
Table 6 illustrates the false negative rate (FNR) comparison of the different models. The CNN model again has the highest FNR, at 12.23%, meaning it often misses true underground objects. This is a serious disadvantage in practice, for example in utility mapping or archaeological surveying, where a missed object can be very costly. The ResNet and VGG models improve on this with FNRs of 10.08% and 10.25%, respectively, but still miss a substantial number of objects. DenseNet gives a 9.44% FNR, showing that it captures more relevant features, though significant scope remains for reducing missed detections. The AutoEncoder-LSTM, with an FNR of 9.99%, demonstrates that this model handles time-series data more efficiently while still being imperfect. The Hybrid ResNet-LightGBM and Inception-XGBoost models provide even better FNRs of 8.35% and 8.77%; Fig. 11 gives a clear illustration of the FNR distribution across models. Combining deep learning architectures with boosting algorithms minimizes false negatives, and although these hybrids outperform the simpler models above, they remain insufficient for applications requiring very high recall.
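For reference, the FPR and FNR figures in Tables 5 and 6 follow the standard confusion matrix definitions, sketched below. The toy counts are invented to land near the proposed model's ~4% rates, not drawn from the actual test set:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def fpr_fnr(y_true, y_pred):
    """False positive rate FP/(FP+TN) and false negative rate
    FN/(FN+TP) from a binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn), fn / (fn + tp)

# Toy split: 100 negatives with 4 false alarms,
# 100 positives with 4 misses.
y_true = np.array([0] * 100 + [1] * 100)
y_pred = np.array([0] * 96 + [1] * 4 + [1] * 96 + [0] * 4)
fpr, fnr = fpr_fnr(y_true, y_pred)
print(fpr, fnr)  # 0.04 0.04
```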
The EfficientNet-Gradient Boosting and Capsule Network models offer marginal enhancements, with FNRs of 8.12% and 8.61%, respectively, and hence better recall than the earlier models. The Proposed EDHL (Inception-GB) model developed in this study has an FNR of 3.87%, better than all the other models. This low FNR demonstrates the proposed model's strength in localizing true underground objects with a very low chance of false negatives. The use of Inception for multi-scale feature extraction together with Gradient Boosting for classification makes this model reliable, identifying nearly all objects in practical use. The Inception-Gradient Boosting EDHL has several advantages over conventional models: first, the Inception network perceives ultra-fine granularity at the multi-scale feature extraction level, so it does not miss features in complicated data structures. This makes the model very versatile across underground objects, from cables to large geological phenomena.
Table 7 highlights the differences in runtime among the models employed for underground object detection. The CNN model had the shortest runtime at 45.6 s, which is expected because CNNs are shallower than more recent deep learning architectures such as ResNet or DenseNet, whose additional layers take longer to process the data. ResNet and DenseNet took 52.4 s and 56.3 s respectively, demonstrating the extra time consumed by more elaborate models that learn more complex features.
In terms of inference time, the VGG model runs slightly faster than ResNet and DenseNet, completing the task in 48.7 s; although it uses more layers than the CNN, with 19 layers in total, its topology is much simpler. The Autoencoder-LSTM, which combines autoencoders with sequence modeling, requires 53.2 s, indicating its capacity for handling time-series data at the cost of considerable processing time. The model with the longest runtime is the Hybrid ResNet-LightGBM at approximately 57.8 s, much longer than the original ResNet despite its impressive accuracy, because the boosting process is integrated with deep learning.
Other models such as Inception-XGBoost and EfficientNet-Gradient Boosting take moderate times of 54.9 s and 50.2 s respectively, reflecting a balance between efficiency and complexity. Figure 12 depicts the runtime comparison of the proposed model against the other models. The Capsule Network, which in principle captures spatial hierarchies, takes 51.1 s to accomplish the task. The EDHL (Inception-GB) model proposed here takes 49.5 s, competitive with the other state-of-the-art models. This runtime reflects the balance between feature extraction by the Inception model and classification by the Gradient Boosting model. Notably, the proposed model's relatively low yet reasonable runtime makes it acceptable for many real-life problems where computational time matters.
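Runtimes of this kind are typically measured by wrapping the inference loop in a wall-clock timer and averaging over several runs after a warm-up pass. A minimal sketch (the `model_fn` stand-in is hypothetical, not the paper's model):

```python
import time

def time_inference(model_fn, inputs, warmup=2, runs=5):
    """Average wall-clock time per full pass over `inputs`, after warm-up."""
    for _ in range(warmup):          # warm caches / lazy init before timing
        for x in inputs:
            model_fn(x)
    start = time.perf_counter()
    for _ in range(runs):
        for x in inputs:
            model_fn(x)
    return (time.perf_counter() - start) / runs

# Example with a trivial stand-in model:
avg = time_inference(lambda x: x * x, list(range(1000)))
print(f"avg pass time: {avg:.6f} s")
```

Averaging after warm-up is what makes per-model runtimes such as those in Table 7 comparable, since first-run overheads are excluded.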
Table 8 compares the performance of the various models on three primary metrics: specificity, AUC-ROC, and sensitivity. Specificity reflects the accuracy of identifying non-object areas, and sensitivity reflects the model's ability to identify true underground objects. The AUC-ROC score provides an overall estimate of model quality: it measures the area under the ROC curve, which indicates how well a model distinguishes object regions from non-object regions. The CNN model has the lowest specificity and sensitivity, at only 88.52% and 87.77% respectively, confirming its weak ability to identify true positives and avoid false negatives. The ResNet and DenseNet models perform better: ResNet has a specificity of 90.87% and sensitivity of 89.92%, while DenseNet posts somewhat higher values, with a specificity of 91.38% and sensitivity of 90.56%. The VGG model is moderate and roughly equivalent to ResNet, with a specificity of 90.79% and a slightly lower sensitivity of 89.84%.
From Fig. 13, the performance of the AutoEncoder-LSTM model is quite satisfactory, with a specificity of 91.11% and sensitivity of 90.98%, indicating that a model trained on time-series data is more effective than CNN or VGG. The ResNet-LightGBM and Inception-XGBoost models achieve greater performance, with specificities of 92.48% and 92.05% and sensitivities above 91%, showing their effectiveness in classifying true underground objects. EfficientNet-Gradient Boosting performs similarly, with a specificity of 92.64% and a sensitivity of 91.73%, balancing model complexity against performance. The Proposed EDHL (Inception-GB) model performs best in specificity, AUC-ROC, and sensitivity among all compared models. Its specificity of 95.98% means the algorithm rarely flags regions that do not actually contain underground objects, while its sensitivity of 97.98% means that very few objects truly belonging to the underground category are missed. The proposed EDHL model also has the highest AUC-ROC at 98.75%, proving that it efficiently separates object and non-object regions and making it the most accurate underground object detection model in the comparison.
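All three metrics can be computed directly from confusion counts and ranking scores. A self-contained sketch with illustrative values (not the paper's data); the AUC here uses the rank-statistic form, i.e. the probability that a randomly chosen positive outranks a randomly chosen negative:

```python
def specificity(tn, fp):
    """TN / (TN + FP): how often non-object areas are correctly rejected."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """TP / (TP + FN): how often true objects are correctly detected."""
    return tp / (tp + fn)

def auc_roc(pos_scores, neg_scores):
    """Probability a random positive scores above a random negative
    (ties count as 0.5), which equals the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative counts and scores only:
print(specificity(tn=960, fp=40))                  # 0.96
print(sensitivity(tp=980, fn=20))                  # 0.98
print(auc_roc([0.9, 0.8, 0.7], [0.3, 0.2, 0.4]))   # 1.0
```

The pairwise-comparison form of AUC is exact but quadratic in sample count; production toolkits compute the same quantity from sorted ranks.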
Second, the Gradient Boosting stage integrated into the model ensures that predictions are as accurate as possible, since it refines the classification with each iteration. The architecture of Gradient Boosting focuses successive learners on the more difficult cases, decreasing the number of mistakes the model makes. By combining deep learning with boosting, the EDHL model serves as an effective tool for handling high-dimensional data with high accuracy, as demonstrated by the further tests in this study. The combined EDHL model using Inception and Gradient Boosting has thus proven a strong solution for underground object detection: it delivers high accuracy and efficiency in detecting and classifying complex underground objects through multi-scale feature extraction and iterative boosting-based classification.
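The iterative "focus on the hard cases" behavior described above is the core of gradient boosting: each round fits a weak learner to the current residuals, which are largest for the examples the ensemble still gets wrong. A toy pure-Python sketch using depth-1 stumps on 1-D features (illustrative only; the paper's model boosts over Inception-extracted deep features):

```python
def fit_stump(xs, residuals):
    """Find the threshold stump minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, rounds=20, lr=0.5):
    """L2 gradient boosting: each stump fits the current residuals."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]  # largest on hard cases
        s = fit_stump(xs, residuals)
        stumps.append(s)
        pred = [p + lr * s(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Toy binary problem: feature <= 2 -> class 0, feature > 2 -> class 1
model = gradient_boost([1, 2, 3, 4], [0, 0, 1, 1])
print(model(1.5) < 0.5, model(3.5) > 0.5)  # True True
```

Library implementations such as LightGBM or XGBoost follow the same residual-fitting loop with deeper trees, regularization, and gradient/hessian statistics instead of raw residuals.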
Table 9 and Fig. 14 show the comparison of memory requirements of the different models, illustrating the computational cost in memory. CNN takes the least memory of all at 320 MB, attributable to its basic design, though this simplicity hurts performance compared with the deeper models typically employed. Since ResNet and DenseNet are deeper networks, they consume more memory: ResNet uses 450 MB while DenseNet uses 510 MB. The memory footprint of these models is proportional to their number of parameters; the increased parameter count allows them to learn complex patterns, but at the cost of computational resources. The VGG model requires 490 MB, slightly less than DenseNet and more than ResNet, as expected given its architecture. The AutoEncoder-LSTM costs 460 MB, underlining the extra memory needed not only for feature extraction but also for time-series processing. The Hybrid ResNet-LightGBM and Inception-XGBoost models use just under half a gigabyte each, at 480 MB and 470 MB respectively. These hybrid deep-learning-plus-boosting models require more memory because they incorporate several computational components.
The improvements in runtime and memory consumption outlined above are modest in individual benchmarks but have significant implications for edge and IoT deployment. Even a 5% to 10% reduction in runtime is significant in hardware-accelerated designs, since it is multiplied across millions of inferences in real-time applications such as continuous video surveillance or autonomous navigation. Similarly, a small percentage decrease in memory use reduces costly off-chip DRAM accesses, a major bottleneck for power consumption and thermal sensitivity. The differences against the baseline models may appear small, but they become more pronounced as model complexity grows, with larger architectures widening the gap. Moreover, the proposed design aims not at raw speed alone but at balanced trade-offs among latency, reconfigurability, and thermal resilience, which makes incremental gains particularly valuable in real-world operation. The reported improvements are therefore practical, with cumulative benefits in energy savings, reliability, and scalability that a bare series of numerical comparisons cannot fully express.
The Capsule Network has a memory usage of 475 MB due to its intricate structure for encoding spatial relations in the data. The Proposed EDHL (Inception-GB) model uses 455 MB, which is efficient given its deep image and texture feature extraction. The performance of the proposed model thus remains reasonably high while its memory consumption stays competitive with the other models in the literature, an important property for real-world applications where memory is a limited resource.
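Memory footprints of this kind can be profiled in several ways; for Python-side allocations, the standard library's `tracemalloc` gives a simple peak-usage estimate (note this measures heap allocations only, not framework-internal or GPU buffers):

```python
import tracemalloc

tracemalloc.start()
# Stand-in workload: allocate some feature-map-like buffers
buffers = [[0.0] * 10_000 for _ in range(100)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak Python-heap usage ~ {peak / 1e6:.1f} MB")
```

For whole-process figures comparable to the MB totals in Table 9, resident-set-size tools at the OS level are more appropriate, but the principle of reporting peak usage over a run is the same.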
Table 10 compares the proposed EDHL model with CNN-based, hybrid, and transformer-based architectures in terms of accuracy, memory, and runtime. Figure 15 provides the comparison of precision across models. Transformer models such as ViT and Swin are highly accurate (around 97%), but their high cost (about 1200 MB of memory and 95 s of runtime) makes them impractical for edge and IoT deployments. Conventional CNNs and ResNet-based models have smaller memory footprints and shorter runtimes, yet their accuracy sits between 92% and 94%.
The ResNet-LightGBM hybrid configuration attains slightly better accuracy but runs in 58 s. Conversely, the Proposed EDHL (Inception-GB) model achieves the highest accuracy (98%) together with the lowest memory usage (455 MB) and shortest runtime (49.5 s) of all compared methods. This performance-to-efficiency ratio confirms that EDHL is a more viable and scalable option for real-time deployment, particularly where transformers are ruled out by resource constraints. The result of the proposed model is presented in Fig. 16, which shows the distribution of correctly and incorrectly identified underwater objects.
Conclusion and future work
In this paper, a new IoT-based Ensembled Deep Hybrid Learning (EDHL) system was developed to address the long-standing issues of underwater and underground object detection, particularly those arising from sensor noise, environmental variability, and the computational limitations of UWSN and underwater sensing conditions. By combining multi-modal IoT data with Inception-based multi-scale feature extraction and Gradient Boosting classification, the proposed system proved more accurate, robust, and computationally efficient than current deep learning and hybrid models. The findings confirm that EDHL is not only effective in feature discrimination but also substantially reduces false alarms, making it appropriate for real-time use in resource-constrained underwater networks. The methodology is also consistent with current developments in machine-learning-based localization for MI-UWSNs, robust and energy-efficient routing, adaptive sonar-based DOA/DOD estimation, advanced Kalman-filter-based multi-target tracking systems, and recent lightweight underwater object detectors such as improved YOLOv8 variants. Looking forward, the framework offers many integration possibilities with distributed UWSN communication modules, adaptive tracking algorithms, and intelligent underwater navigation systems. Future work can include integrating transformer-guided detection backbones, uncertainty-aware Kalman filtering for dynamically tracking multiple targets, and applying the architecture to cooperative swarm-based AUV/ROV deployments. In addition, the model can be scaled to support real-time learning in changing ocean conditions and optimized for ultra-low-power embedded hardware, enabling scalable and autonomous marine exploration, infrastructure inspection, and environmental monitoring in future underwater sensing ecosystems.
Data availability
The datasets generated during and/or analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request.
References
Sheena Christabel, P. et al. Underwater animal identification and classification using a hybrid classical-quantum algorithm. IEEE Access 11, 141902–141914 (2024).
Hao et al. Underwater object detection method based on improved faster RCNN. Appl. Sci. 13 (4), 2746 (2023). https://doi.org/10.3390/app13042746
Abhishek et al. Enhancing Underwater object detection: leveraging YOLOv8m for improved subaquatic monitoring. SCI. 793 , (2023). https://doi.org/10.1007/s42979-024-03170-z
Ayush et al. Implementation of image recognition for human detection in underwater images. IRJAEH 2 (01), 1–5. https://doi.org/10.47392/IRJAEH.2024.0001 (2024).
Gang, Q. et al. Machine Learning-Based prediction of node localization accuracy in IIoT-Based MI-UWSNs and design of a TD coil for omnidirectional communication. Sustainability 14 (15), 9683. https://doi.org/10.3390/su14159683 (2022).
Khan, Z. U. et al. Machine Learning-based Multi-path Reliable and Energy-efficient Routing Protocol for Underwater Wireless Sensor Networks. In 2023 International Conference on Frontiers of Information Technology (FIT), 316–321 (Islamabad, Pakistan, 2023) https://doi.org/10.1109/FIT60620.2023.00064
Hao et al. Spatio-Temporal feature enhancement network for blur robust underwater object Detection. In IEEE TCDS, (2024). https://doi.org/10.1109/TCDS.2024.3386664
Zheng et al. Underwater fish object detection with degraded prior Knowledge, electronics. Basel 13 (12), 2346 (2024). https://doi.org/10.3390/electronics13122346
Ashok, P. et al. Absorption of echo signal for underwater acoustic signal target system using hybrid of ensemble empirical mode with machine learning techniques. MTA 82, 47291–47311. https://doi.org/10.1007/s11042-023-15543-2 (2024).
Muhammad, A. et al. Exploration of contemporary modernization in UWSNs in the context of localization including opportunities for future research in machine learning and deep learning. Sci. Rep. 15, 5672. https://doi.org/10.1038/s41598-025-89916-y (2025).
Ma, X. et al. Joint DOD and DOA Estimation for bistatic MIMO sonar based on reduced-order regularized MFOCUSS. SIViP 19, 194. https://doi.org/10.1007/s11760-024-03802-0 (2025).
Jingchun, Z. et al. HCLR-Net: hybrid contrastive learning regularization with locally randomized perturbation for underwater image Enhancement. IJCV (2024). https://doi.org/10.1007/s11263-024-01987-y
Himanshu et al. An ensemble mosaicking and ridge let based fusion technique for underwater panoramic image reconstruction and its refinement. MTA 82, 33719–33337 (2024). https://doi.org/10.1007/s11042-023-14594-9
Ma, X., Shen, Z. et al. A modified adaptive Kalman filter algorithm for the distributed underwater multi-target passive tracking system. JASA Express Lett. 5 (1), 016001 (2025).
Ma, X. et al. Research on pedestrian and vehicle detection method based on improved YOLOv8 model. SIViP 19, 1167. https://doi.org/10.1007/s11760-025-04691-7 (2025).
Ziran et al. Self-attention and long-range relationship capture network for underwater object detection. JKSUCIS 36 (2), 101971 (2024). ISSN 1319–1578. https://doi.org/10.1016/j.jksuci.2024.101971
Kayode Saheed, Y. & Ebere Chukwuere, J. CPS-IIoT-P2Attention: explainable privacy-preserving with scaled dot-product attention in cyber-physical system-industrial IoT network. IEEE Access. 13, 81118–81142 (2025). https://doi.org/10.1109/ACCESS.2025.3566980
Saheed, Y. K. & Sanjay M. CPS-IoT-PPDNN: a new explainable privacy preserving DNN for resilient anomaly detection in cyber-physical systems-enabled IoT networks. Chaos, Solit. Fract. 191: 115939, ISSN 0960 – 0779 (2025) https://doi.org/10.1016/j.chaos.2024.115939
Saheed, Y. K., Abdulganiyu, O. H. & Tchakoucht, T. A. Modified genetic algorithm and fine-tuned long short-term memory network for intrusion detection in the internet of things networks with edge capabilities. Appl. Soft Comput. 155, 1568–4946. https://doi.org/10.1016/j.asoc.2024.111434 (2024).
Saheed, Y. K., Abdulganiyu, O. H. & Tchakoucht, T. A. A novel hybrid ensemble learning for anomaly detection in industrial sensor networks and SCADA systems for smart city infrastructures. J. King Saud Univ. Comput. Inf. Sci. 35 (5), 101532 (2023). https://doi.org/10.1016/j.jksuci.2023.03.010
Saheed, Y. K., Abdulganiyu, O. H., Majikumna, K. U., Mustapha, M. & Workneh, A. D. ResNet50-1D-CNN: a new lightweight resNet50-One-dimensional Convolution neural network transfer learning-based approach for improved intrusion detection in cyber-physical systems. Int. J. Crit. Infrastruct. Prot. 45, 1874–5482 (2024).
Wen et al. Denoising multiscale back-projection feature fusion for underwater image enhancement, AS 14(11): 4395, (2024). https://doi.org/10.3390/app14114395
Changhong et al. Lightweight underwater object detection algorithm for embedded deployment using higher-order information and image enhancement, JMSE 12(3): 506, (2024). https://doi.org/10.3390/jmse12030506
Xiaoyu et al. Edge-Enabled modulation classification in internet of underwater things based on network pruning and ensemble learning. IEEE ITJ. 11 (8), 13608–13621. https://doi.org/10.1109/JIOT.2023.3338147 (2023).
Jianjing et al. Real-time underwater acoustic homing weapon target recognition based on a stacking technique of ensemble learning, JN 11(12): 2305, (2023). https://doi.org/10.3390/jmse11122305
LXi et al. Underwater object detection method based on learnable query recall mechanism and lightweight adapter, PO 19(2): e0298739, (2024). https://doi.org/10.1371/journal.pone.0298739
Kubra et al. Projector deep feature extraction-based garbage image classification model using underwater images, MTA, (2024). https://doi.org/10.1007/s11042-024-18731-w
Yang, J. et al. Underwater Acoustic target classification using auditory fusion features and efficient convolutional attention network, IEEE Sens. Lett. 9(3): 1–4, Mar. Art. no. 7001304, (2025). https://doi.org/10.1109/LSENS.2025.3541593
Karthihadevi, M. et al. Deploying deep learning for real-time tsunami monitoring: a CNN-LSTM model for sensor-based prediction, In 2025 International Conference on Inventive Computation Technologies (ICICT), 1548–1555 (Kirtipur, Nepal, 2025) https://doi.org/10.1109/ICICT64420.2025.11004902
Oceanic Life Dataset. Kaggle. https://www.kaggle.com/datasets/cyanex1702/oceanic-life-dataset (accessed 12 May 2025).
Zhang, S. et al. Underwater Object Detection Based on Improved CNN. IJRTI 7, 22–28 (2023). https://doi.org/10.5121/ijrti.2023.7805
Bajpai, V. et al. Underwater moving object detection using an End-to-End Encoder-Decoder with ResNet. CVF 2023, 45–58. https://doi.org/10.1109/CVPRW.2023.00345 (2023).
Li, J. et al. A hybrid deep learning method for underwater object detection using densenet. Sensors 23 (4), 256–270. https://doi.org/10.3390/s2304156 (2023).
Wang, X. et al. Improved VGG for underwater object detection. MATPR 80 (3), 1940–1945. https://doi.org/10.1016/j.matpr.2023.06.125 (2023).
Uma, N. et al. Underwater human detection using faster RCNN and AutoEncoder-LSTM, SD:MT, 80: 2201–2208, (2023). https://doi.org/10.1016/j.matpr.2023.08.036
Jiang, L. et al. A novel hybrid ResNet-LightGBM approach for enhanced underwater object detection. JMSE 12 (issue 1), 58–72. https://doi.org/10.3390/jmse12010058 (2024).
Liu, Y. et al. Enhanced underwater object detection using inception-XGBoost, Access*, 11: 9012–9025, (2023). https://doi.org/10.1109/ACCESS.2023.3245809
Jain, S. et al. DeepSeaNet: improving underwater object detection using EfficientNet and gradient boosting. ArXiv:2306 06075. 2024 https://doi.org/10.48550/arXiv.2306.06075 (2024).
Huang, F. et al. Capsule networks for robust underwater object detection. JoOE 49 (2), 224–239. https://doi.org/10.1109/JOE.2024.003224 (2024).
Acknowledgements
Not applicable.
Funding
The authors declare that this research was not funded by any organization.
Author information
Contributions
S.T: Writing – original draft, Visualization, Software, Validation, Methodology, Conceptualization. J.V: Writing – revision, Validation, Supervision, Conceptualization.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tada, S., Jeevanantham, V. Exploring oceanic depths: unveiling hidden treasures with IoT and ensembled deep hybrid learning model. Sci Rep 16, 5333 (2026). https://doi.org/10.1038/s41598-026-35634-y