Introduction

Though they have lovely purple blossoms, water hyacinths are a serious ecological hazard to freshwater ecosystems. Their rapid proliferation upsets the delicate balance of aquatic ecosystems, degrades water quality, and endangers the local flora and fauna. Water hyacinths have a variety of harmful effects. Their quick growth causes dense mats to form, which effectively block the entry of sunlight. This suppresses phytoplankton, the microscopic photosynthetic organisms that produce much of the dissolved oxygen in the water.

The situation is made worse by the fact that decomposing water hyacinths consume enormous quantities of dissolved oxygen, creating oxygen-depleted zones that are harmful to fish and other aquatic organisms. Apart from causing oxygen depletion, the thick mats form a physical barrier that restricts light penetration and limits the total area that native aquatic organisms can inhabit. As fish and other species struggle to find shelter and nourishment, the existing food chain is disrupted, and the drop in oxygen levels further lowers overall biodiversity.

The detrimental effects of water hyacinths go beyond ecological concerns. The dense mats may hinder natural water-purification processes by trapping debris and contaminants below their surface. This can contribute to eutrophication, a condition in which excess nutrients accumulate in the water. Eutrophication in turn triggers toxic algal blooms, which degrade water quality and emit foul odours. Water hyacinths also pose major risks to navigation and infrastructure.

Water hyacinth proliferation can be inhibited through several different approaches. Considerable research has gone into building robots that gather trash from the ocean autonomously through the integration of robotic arms, computer vision, and mechanical design. Their modular architecture and renewable energy sources enable customization according to the kind of garbage they are supposed to target. Recent advances in engineering offer further ways to mitigate the ecological effects of water hyacinths. Autonomous and remotely operated vessels known as “surface skimming vessels” were developed specifically for skimming the surface of water bodies and gathering floating garbage, including debris, microplastics, and oil spills. These vessels can cover wide areas effectively and operate continuously for long stretches of time. Microrobot swarms are another potential novel idea, comprising swarms of tiny robots that can remove microplastics or contaminants while navigating through the water. These robots are designed to interact with one another and reach locations that are not accessible by conventional means.

Fig. 1

(a) Water hyacinth growth rate (in Sq. ft) (b) Seasonal calendar of water hyacinth growth.

Invasive aquatic plants such as water hyacinth also disrupt irrigation, hydroelectric power production, and transportation, contributing to annual revenue losses of millions of dollars across the nation, a burden that is evident from the growth rates of hyacinth shown in Fig. 1.

From what has been reported, we can infer that water hyacinths endanger aquatic ecosystems and other vital infrastructure. Their rapid growth disturbs the natural equilibrium, reduces biodiversity, and makes essential activities such as irrigation and navigation harder to carry out. Fortunately, advancements in technology offer promising ways to limit the impact and spread of water hyacinths. By combining modern methods, we can counteract this invasive threat and conserve our precious freshwater resources.

Related work

The field of medical image segmentation has witnessed remarkable advancements, particularly through the development of deep learning architectures tailored for specific diagnostic applications. One significant innovation is the enhancement of the U-Net architecture, specifically refined for lung CT image segmentation in pneumoconiosis cases. This enhanced model incorporates a double convolution structure coupled with a residual mechanism, which effectively mitigates the risks of overfitting while ensuring that critical details in the lung regions are preserved. Moreover, the addition of a Squeeze-and-Excitation (SE) attention mechanism allows the model to focus more on salient feature information, leading to improvements in segmentation accuracy, positive predictive value (PPV), and segmentation consistency (SC)1. The employment of the Gaussian Error Linear Unit (GeLU) activation function facilitates a more complex nonlinear representation, enabling better generalization and performance. The model’s lightweight architecture addresses computational bottlenecks, which is particularly crucial in clinical settings where rapid analysis is required.

Data preprocessing techniques also play a vital role in enhancing the model’s effectiveness. For instance, the use of the watershed algorithm for pixel segmentation, alongside three-channel processing, helps to mitigate binarization errors, ultimately resulting in higher accuracy in the segmented images. Extensive experimentation demonstrated that the refined U-Net outperforms traditional models, showcasing not only superior segmentation accuracy but also more precise localization of target areas. This work stands out as a pioneering effort to integrate multiple novel techniques into the U-Net architecture, significantly enhancing its performance in medical image segmentation tasks and underlining its value in diagnosing and treating pneumoconiosis.

Another noteworthy contribution to medical image analysis is the introduction of DRD-UNet, specifically designed for multi-class semantic segmentation of breast cancer histopathology images stained with hematoxylin and eosin (H&E). The standout feature of DRD-UNet is its DRD processing block, which embeds dilated convolutions, residual connections, and dense layers, automating the feature extraction process and thereby improving segmentation accuracy. The model allows for the capture of intricate details from histological images, which are often challenging due to high resolution and complex structures. Comprehensive comparisons between DRD-UNet and other existing architectures, including standard and modified versions of U-Net, revealed significant performance advantages, particularly in metrics such as the Jaccard index, Dice coefficient, and overall accuracy2.

DRD-UNet consistently achieved high specificity rates, particularly for the necrosis and stroma classes, highlighting its robustness in differentiating various tissue types—a crucial aspect in clinical diagnostics. This work not only presents a new architectural approach but also reinforces the efficacy of deep learning models in digital pathology, emphasizing their role in improving diagnostic accuracy and clinical decision-making in breast cancer pathology.

Additionally, the introduction of Multi-layer Convolutional Sparse Coding (ML-CSC) blocks within the CSC-Unet model series represents a significant evolution in semantic segmentation methodologies unlike methods inferred from3. Traditional convolution operations often lack the capacity to extract sufficient semantic and appearance cues, particularly in complex cases. By reformulating the convolution process to include ML-CSC blocks, researchers have observed enhanced feature richness and spatial detail retention, leading to performance improvements across multiple datasets, including DeepCrack, Nuclei, and CamVid4. The experimental results validate that the CSC-Unet significantly outperformed standard U-Net models, achieving accuracy increases of up to 3.43%. This research highlights the potential of ML-CSC blocks not only to elevate the CSC-Unet model but also to inform enhancements in other convolutional structure-based segmentation networks, paving the way for future exploration in medical imaging and beyond.

Several other U-Net variants illustrate the breadth of this design space. UNet++ outperforms both U-Net and WPE in DSR tasks, achieving significant WER reductions with UNet++-based approaches, and a frequency-dependent convolution scheme (FDCS) provides further performance gains5. DTA-UNet enhances feature extraction and segmentation accuracy by combining Dynamic Convolution Decomposition (DCD) with a Triple Attention (TA) mechanism that improves the skip connections and highlights lesions, evaluated on the COVID-SemiSeg, ISIC 2018, and stroke datasets6. A systematic review used PRISMA to select UNet-based studies and assessed bias with five methods (ranking, radial, regional area, PROBAST, and ROBINS-I) for biomedical image segmentation (BIS) of vascular and non-vascular images7. CAT-Unet significantly improves medical image segmentation by adding a Coordinate Attention (CA) module for contextual information and replacing the original U-Net skip connections with Skip-NAT8. UNet2+ and UNet3+ redesign the skip connections for better segmentation, and UNet-sharp (UNet#) combines dense and full-scale skip connections9. Res-UNet and nnUNet excel in brain tumour segmentation and multi-class heart segmentation, with nnUNet proving most effective for polyp and myocardium segmentation10. LS-UNet improves road crack detection through real-time segmentation with depthwise separable convolutions that reduce feature redundancy11. Finally, a lightweight Mamba UNet that integrates Mamba with UNet outperforms existing state-of-the-art segmentation models, using a Residual Vision Mamba Layer for feature extraction while requiring 116× fewer parameters and 21× lower computation cost12.

Feature extraction is a critical component of pattern recognition, and recent advancements have led to more effective methodologies that enhance system performance across various applications. One significant development is the hybrid feature extraction approach employed in palm vein recognition systems. This technique integrates wavelet transformation and Histogram of Oriented Gradients (HOG) to provide a more comprehensive representation of palm vein patterns13. The wavelet transformation decomposes images into multiple components, capturing essential details while discarding less relevant information. This method focuses on the approximation component, which retains the crucial features necessary for accurate classification.

By combining wavelet transformation with HOG features, the hybrid DWT-HOG approach has demonstrated significant improvements in recognition accuracy and other performance metrics, such as Equal Error Rate (EER) and Area Under the Receiver Operating Characteristic Curve (AUC). The architecture comprises a sequential convolutional neural network (CNN) structure that incorporates convolutional layers, max-pooling layers, and a flattened layer, ultimately achieving an impressive classification accuracy of 99.85% on the CASIA dataset. This performance not only surpasses previous methodologies but also underscores the effectiveness of advanced feature extraction techniques in biometric applications, where precision is paramount.

Another noteworthy advancement in feature extraction is the use of an improved Vision Transformer (ViT) model for crop pest image recognition. This model employs a unique block partitioning method alongside transformer architecture and attention mechanisms to enhance spatial feature extraction from images14. By resizing RGB images and segmenting them into non-overlapping patches, the ViT model facilitates a focused analysis of image features. The self-attention mechanism allows the model to identify and emphasize relevant features, improving its ability to classify various crop diseases. The results demonstrate the model’s effectiveness in distinguishing seven distinct categories of pest damage, showcasing the potential of combining deep learning with image processing techniques for improved agricultural outcomes.

The introduction of the HDF-Net model further underscores the importance of advanced feature extraction methods in image forensics. HDF-Net employs a dual-stream network structure that integrates RGB features with Spatial Residual Mapping (SRM) to localize tampering artifacts in images effectively15. The model utilizes a Multi-View Vision Transformer (MViT) to extract both global and local noise features, enabling the identification of subtle discrepancies indicative of tampering. The architecture comprises three complementary modules—Spatial Transformation Processing (STP), Feature Transformation Stream (FTS), and Tampering Edge Refinement (TER)—which collectively enhance feature extraction and artifact localization. Quantitative evaluations indicate that HDF-Net outperforms existing state-of-the-art models across multiple benchmark datasets, establishing its robustness in the realm of image forensics.

Moreover, the significance of skip connections in deep neural networks has been reinforced in recent studies, particularly concerning the degradation issues observed in deeper architectures16. Skip connections facilitate the transfer of information from earlier layers to deeper layers, significantly enhancing the model’s capacity to learn complex features. DenseNets, for instance, utilize skip connections to concatenate feature maps from preceding layers, resulting in improved feature extraction capabilities. The integration of deep supervision in segmentation tasks further bolsters this approach, enabling models to capture both coarse and fine-grained semantic information. The proposed UNet model17 exemplifies this methodology, showcasing consistent improvements in performance metrics across various datasets, including enhanced Intersection over Union (IoU) and Panoptic Quality (PQ) scores.

Channel and Spatial Attention Feature Extraction (CSA-FE) integrated with a standard Vision Transformer (ViT) module enhances feature extraction in remote sensing images for disaster response and land cover analysis18. Pix2Pix GAN has been applied to semantic segmentation and object delineation, while U-Net provides precise semantic segmentation that retains spatial information for automated feature extraction from high-resolution satellite data19. PiNet outperforms state-of-the-art methods in salient object detection (SOD) through level-specific feature extraction for better saliency cues and progressive, coarse-to-fine refinement of saliency maps20. Sensor fusion and object detection enhance perception and obstacle recognition for autonomous driving in Indian road conditions by handling unpredictable traffic patterns, diverse road infrastructure, and challenging weather conditions21.

Hyperspectral image classification has been addressed with segmented principal component analysis (Seg-PCA) to reduce spectral dimensions, enhancing the extraction of global and local intrinsic characteristics and improving classification accuracy through the integration of 3D-2D CNNs and multi-branch feature fusion22. A method for skin lesion classification combines a segmentation mask with the lesion image, using Gaussian blur to obscure unimportant areas23. Segmented-Incremental-PCA (SIPCA) for hyperspectral image classification segments the image into correlated band subgroups and applies Incremental-PCA to enhance classification accuracy, achieving 91.22% and outperforming traditional methods such as PCA and SPCA24. The E-VGG19 model enhances the traditional VGG19 by integrating max pooling and dense layers, achieving improved accuracy in classifying skin lesions into malignant and benign categories25. The VGG19 architecture has also been employed to diagnose eye diseases, demonstrating its reliability and accuracy in identifying conditions such as cataracts and diabetic retinopathy26.

The advancements in pattern recognition are not limited to medical imaging and biometrics; they also play a critical role in the field of autonomous navigation. Effective pattern recognition techniques are essential for the identification and classification of various objects in the environment, enabling autonomous systems to make informed decisions. Recent studies27 have highlighted the importance of employing advanced neural network architectures, such as YOLO models, for real-time object detection in navigation tasks. The YOLO architecture’s grid-based approach allows for the efficient detection of objects within images by dividing them into smaller cells, with each cell responsible for recognizing objects in its vicinity28. This method significantly enhances the model’s ability to identify smaller and less frequent features, thereby improving overall diagnostic accuracy.

Recent findings demonstrate that models like YOLOv7m and YOLOv8m excel in recognizing complex patterns associated with anatomical structures in dental radiographs. High precision and recall rates indicate these models’ capability to learn and generalize effectively from training data, allowing for accurate recognition of specific dental categories and treatments29. However, challenges remain in segmenting dental restorations and differentiating between similar categories, such as crowns and metallic restorations. This highlights the need for robust pattern recognition capabilities, as misclassifications can lead to diagnostic errors.

To further enhance pattern recognition in autonomous navigation, the incorporation of diverse and comprehensive training datasets is essential. Including a variety of conditions, scenarios, and environmental factors can significantly improve the generalizability of models. The application of data augmentation techniques can artificially increase the diversity of training data, allowing models to learn more robust patterns29. Furthermore, the potential of integrating advanced techniques like Federated Learning offers a promising avenue for enhancing pattern recognition in AI models. This approach facilitates the training of models on decentralized data sources while preserving privacy, ultimately leading to more comprehensive and representative datasets.

A ship traffic pattern recognition method integrates multi-attribute trajectory similarity, considering static and dynamic ship features alongside port geospatial features, enhancing the precision of maritime navigation and improving safety and efficiency in ship traffic management30. A framework for recognizing vessel navigation patterns uses snapshot perspectives, focusing on instantaneous group behaviours; it identifies patterns such as convoying, turning, and mooring, enhancing understanding and prediction of maritime traffic dynamics through advanced algorithms31. A Take One Class at a Time (TOCAT) classification strategy recognizes vessel types for maritime navigation using AIS data, with an adaptive model trained on radar data parameters32. Object detection in maritime navigation has been performed with the YOLOv5 neural network family, specifically trained on the SAR Ship Dataset, to enhance recognition with a greater quantity of training data33.

The effectiveness of pattern recognition methodologies is paramount in autonomous navigation systems, particularly in identifying obstacles and making real-time decisions. The integration of advanced algorithms and models, coupled with robust training methodologies, can significantly enhance the accuracy and reliability of these systems. As AI continues to evolve, its capacity to recognize complex patterns in various environments will be pivotal in improving navigational precision and safety.

The continuous exploration of advanced methodologies in pattern recognition and feature extraction is driving innovation across diverse fields, including medical imaging, biometrics, agriculture, and autonomous navigation. The integration of sophisticated deep learning architectures, attention mechanisms, and hybrid feature extraction techniques is enhancing the performance and reliability of these systems. As research progresses, these advancements will not only improve diagnostic accuracy and agricultural outcomes but also contribute to the development of more effective autonomous navigation systems. The future of pattern recognition holds immense potential for advancing technology and improving outcomes in various applications.

Methodology

Dataset

We constructed a dedicated real-world dataset, termed WHD-7282 (Water Hyacinth Detection – 7,282 images), to train and evaluate the proposed cascaded detection pipeline. All 7,282 RGB images were captured directly from the moving catamaran prototype operating at its final mounting height (40–60 cm above the water surface) and typical speeds of 0.3–0.8 m/s, using the same Sony IMX219 8 MP camera module later deployed on the vessel. This strategy ensures zero domain gap between training and real-world operation, including natural motion blur, platform vibration, water glare, and perspective variations. Images were collected between December 2024 and March 2025 from seven different freshwater bodies in and around Chennai, India (four ponds, two irrigation canals, and one 0.8 ha lake), covering water hyacinth densities from 15% to 95% of the water surface. The dataset intentionally includes challenging co-occurring elements such as water lettuce (Pistia stratiotes), Salvinia, algae blooms, lotus leaves, floating plastic debris, reflections, ripples, shadows from overhanging trees, birds, and fishing traps. Environmental conditions comprise sunny (38%), partly cloudy (35%), overcast (18%), and light rain (9%) scenarios, recorded between 07:30 and 18:00 h, resulting in diverse water colours (clear, green, tea-coloured, and turbid brown) and lighting directions. Each image was manually annotated with pixel-accurate binary masks distinguishing water hyacinth from all other regions using Label Studio, followed by cross-verification and refinement by three independent annotators. Images were randomly split, stratified by location and weather condition, into 5,826 training (80%) and 1,456 validation (20%) samples. During training, the images were resized to 512 × 512 pixels and augmented on-the-fly with random horizontal/vertical flips, rotations (± 30°), brightness/contrast adjustments (± 30%/± 25%), and mild Gaussian/motion blur. The complete dataset, splits, and annotations will be released publicly upon publication to serve as a benchmark for aquatic invasive weed detection.
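The augmentation policy above can be reproduced with a short, self-contained pipeline. The following sketch is illustrative only and assumes the albumentations library; the exact probabilities and the apply_augmentation helper name are our own choices rather than details taken from the original implementation.

```python
import albumentations as A

# Illustrative on-the-fly augmentation matching the ranges described above:
# random flips, ±30° rotations, ±30% brightness / ±25% contrast, mild blur.
train_augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Rotate(limit=30, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.30, contrast_limit=0.25, p=0.7),
    A.OneOf([A.GaussianBlur(blur_limit=(3, 5)), A.MotionBlur(blur_limit=5)], p=0.3),
])

def apply_augmentation(image, mask):
    """Apply the same random transform to an RGB image and its binary mask."""
    out = train_augment(image=image, mask=mask)
    return out["image"], out["mask"]
```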

Overview of the proposed system

In this section we introduce our prototype framework, shown in Fig. 2, which illustrates the working setup in a real-time scenario. The following subsections articulate the major aspects involved in building the prototype.

Fig. 2

Visual illustration of the objective proposed.

The system is designed to be deployed in water bodies contaminated by water hyacinth. The deployed system surveys the water body to identify and locate water hyacinth and autonomously navigates towards the detected plants. It then removes the hyacinth through a collecting unit at the rear of the vessel. Building this model involves multiple hardware and software components, including image segmentation and detection, autonomous navigation, real-time hyacinth detection, and the embedded system for the catamaran. The functionality and significance of each of these aspects are explained in the following sections.

Hardware design

This section describes the development of a hardware prototype designed to autonomously detect and remove water hyacinths. The system utilizes a Jetson Nano board as its central processing unit, interfaced with brushless DC (BLDC) motors for propulsion. To achieve autonomous navigation, a servo motor is employed to steer the system in different directions.

For safe and effective motor control, an external electronic speed controller (ESC) acts as an intermediary between the Jetson Nano board and the BLDC motors. The Jetson Nano generates low-power control signals through pulse-width modulation (PWM). The ESC translates these signals and couples them to the high-voltage power source; this translation is crucial because the PWM signals alone lack the power to drive the BLDC motors directly. Moreover, the ESC provides short-circuit protection, a vital safety measure for high-speed BLDC motors.

Fig. 3

Hardware design of proposed prototype model.

The system is equipped with an external camera module. This camera plays a key role in the navigation algorithm by detecting the presence of water hyacinths within the designated water body. A detailed explanation of the navigation control algorithm is provided in subsequent sections. Image segmentation and classification facilitate precise hyacinth identification: the camera-captured image data are fed into a deep learning model that performs image segmentation followed by classification based on colour variations, enabling the system to locate hyacinths within predefined quadrants. Based on the quadrant of detection, the navigation system formulates an appropriate course of action.

The prototype design shown in Fig. 3 proves to be a promising model for autonomous detection and removal of water hyacinths. The integration of the Jetson Nano, BLDC motors, and a camera with deep learning image segmentation demonstrates an effective approach to managing invasive aquatic plants.

Hyacinth detection

UNet image segmentation

In our proposal, we implement an encoder-decoder based UNet architecture, shown in Fig. 4, for image segmentation. The real-time camera capture is fed to this network, which differentiates plants and weeds from the rest of the deployed environment. The UNet model is trained on the generated dataset of real-time RGB images together with their binary segmentation masks. After training, the model performs segmentation of plant and non-plant regions on the real-time captured images. This segmentation helps the classifier make predictions that are substantially better than those of a conventional CNN classifier alone. Algorithm 1 shows the modified UNet algorithm.

Fig. 4

Visual illustration of model architecture.

The UNet architecture is a curated composition of standard neural network operations such as convolution, max pooling, transposed convolution, and concatenation. To understand the arithmetic of the UNet model, we walk through the following formulation.


Algorithm 1: Modified UNet.

For training the UNet to perform image segmentation, we proceed with the following formulation. The entire dataset comprises \(\:{N}_{t}\) original image samples \(\:\{{T}_{i}{{\}}^{{N}_{t}}}_{i=1}\), where \(\:{T}_{i}\:\epsilon\:{R}^{H\times\:W\times\:D}\:\) is the input training volume, and \(\:\{{M}_{i}{{\}}^{{N}_{t}}}_{i=1}\), where \(\:{M}_{i\:}\epsilon\:{R}^{H\times\:W\times\:D}\:\) is the ground-truth binary mask of the original input image. The segmentation stage begins with pre-processing, where the input training data \(\:\{{T}_{i},{M}_{i}{{\}}^{{N}_{t}}}_{i=1}\) are rescaled and resized to the desired form. Both \(\:{T}_{i}\) and \(\:{M}_{i}\) are rescaled from the range [0, 255] to the range [0, 1] by a factor of 1/255, resized to a target size of (512 × 512) with batch size = 32, and zipped together as \(\:{A}_{i}\), concluding the preprocessing step.
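A minimal sketch of this preprocessing step is shown below. It assumes TensorFlow's tf.data API and image/mask arrays already loaded in memory; the function and variable names (preprocess, build_dataset) are illustrative, not taken from the original code.

```python
import tensorflow as tf

IMG_SIZE = (512, 512)
BATCH = 32

def preprocess(image, mask):
    """Rescale T_i and M_i to [0, 1] and resize to 512x512, as described above."""
    image = tf.image.resize(tf.cast(image, tf.float32) / 255.0, IMG_SIZE)
    mask = tf.image.resize(tf.cast(mask, tf.float32) / 255.0, IMG_SIZE, method="nearest")
    return image, mask

def build_dataset(images, masks):
    """Zip T_i and M_i into pairs A_i, preprocess, and batch them."""
    ds = tf.data.Dataset.zip((tf.data.Dataset.from_tensor_slices(images),
                              tf.data.Dataset.from_tensor_slices(masks)))
    return ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).batch(BATCH)
```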

The matrix \(\:{A}_{i}\), computed from \(\:\{{T}_{i},{M}_{i}{{\}}^{{N}_{t}}}_{i=1}\) after rescaling and resizing, is fed to the input layer, and Eq. 1 is executed in the neural network for a mask of dimension \(\:n\times\:n\).

$$\:{E}_{1}=Conv2D\left(Conv2D\left({A}_{i}\right)\right)$$
(1)

Equation 1 gives the result \(\:{E}_{1}\) of stage 1 of the encoder, which is carried forward to the subsequent layers of the architecture. Subsequent layers of the model always depend on the outcomes of earlier stages of the encoder. The computations at the further stages are as follows:

$$\:{E}_{2}=Conv2D\left(Conv2D\left(MaxPool\left({E}_{1}\right)\right)\right)$$
(2)

At stage two of the encoder phase, we compute \(\:{E}_{2}\) by means of Eq. 2 using the result \(\:{E}_{1}\) obtained from Eq. 1.

$$\:{E}_{3}=Conv2D\left(Conv2D\left(MaxPool\left({E}_{2}\right)\right)\right)$$
(3)
$$\:{E}_{4}=Conv2D\left(Conv2D\left(MaxPool\left({E}_{3}\right)\right)\right)$$
(4)
$$\:{E}_{5}=Conv2D\left(Conv2D\left(MaxPool\left({E}_{4}\right)\right)\right)$$
(5)

The result of each stage sequentially serves as the input to the subsequent stage of the encoder phase, from stage 1 through stage 5. In the presented proposal we restrict the encoder to five stages and then proceed to the decoder phase of the UNet architecture. The decoder phase consumes the consolidated results of the encoder stages in a sequential manner: \(\:{E}_{1\:},{E}_{2},{E}_{3},{E}_{4},{E}_{5}\) are the computations at the end of each encoder stage, derived from Eqs. 1–5. The decoder phase then performs the inverse operation at every stage, symmetric to the corresponding encoder stage.
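To make the encoder equations concrete, the sketch below expresses Eqs. 1–5 as a Keras functional graph. It is an illustrative reconstruction under our own assumptions: the paper does not state filter counts, kernel sizes, or activations, so the values used here (64–1024 filters, 3 × 3 kernels, ReLU) follow the conventional UNet rather than the original implementation.

```python
from tensorflow.keras import layers

def encoder_stage(x, filters):
    """One encoder stage: Conv2D(Conv2D(MaxPool(x))), mirroring Eqs. 2-5."""
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

# Stage 1 (Eq. 1) has no pooling; later stages follow Eqs. 2-5.
inputs = layers.Input(shape=(512, 512, 3))
e1 = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
e1 = layers.Conv2D(64, 3, padding="same", activation="relu")(e1)   # E1
e2 = encoder_stage(e1, 128)    # E2
e3 = encoder_stage(e2, 256)    # E3
e4 = encoder_stage(e3, 512)    # E4
e5 = encoder_stage(e4, 1024)   # E5
```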

$$\:{D}_{5}=Conv2D\left(UpSamp\left({E}_{5}\right)\right)$$
(6)

Decoding commences at stage 5 through Eq. 6, where the inverse of max pooling, referred to as up-sampling, is employed to undo the spatial reduction introduced by max pooling during encoding. \(\:{D}_{5}\) is the decoded computation at stage 5, which is fed to the subsequent decoding stages.

$$\:{S}_{4}={D}_{5}\left|\right|{E}_{4}$$
(7)
$$\:S{{\prime\:}}_{4}=Conv2D\left(Conv2D\left({S}_{4}\right)\right)$$
(8)

The primary significance of the UNet architecture lies in linking the encoded result with the decoded result from the previous stage. This is done by the concatenation operation, where the matrices from both inputs are stacked along the third dimension, i.e. the −1 axis; the third coordinate of the concatenated output equals the sum of the third coordinates of the two inputs. Here we concatenate the encoded result of stage 4, \(\:{E}_{4}\), with the decoded result from stage 5, \(\:{D}_{5}\), giving the concatenated result \(\:{S}_{4}\) at stage 4 according to Eq. 7. This outcome is then convolved at stage 4 by Eq. 8.
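Continuing the encoder sketch above, one decoder stage (Eqs. 6–8, and analogously Eqs. 9–16) can be written as follows. Again this is an illustrative Keras reconstruction rather than the authors' code; the decoder_stage helper and the filter counts are our own assumptions, and e2–e5 refer to the tensors defined in the previous sketch.

```python
from tensorflow.keras import layers

def decoder_stage(d_prev, e_skip, filters):
    """One decoder stage: up-sample and convolve the previous decoder output
    (Eqs. 6/9/12/15), concatenate with the matching encoder output along
    axis -1 (Eqs. 7/10/13/16), then apply two convolutions (Eqs. 8/11/14)."""
    d = layers.UpSampling2D(size=2)(d_prev)
    d = layers.Conv2D(filters, 3, padding="same", activation="relu")(d)
    s = layers.Concatenate(axis=-1)([d, e_skip])
    s = layers.Conv2D(filters, 3, padding="same", activation="relu")(s)
    s = layers.Conv2D(filters, 3, padding="same", activation="relu")(s)
    return s

s4 = decoder_stage(e5, e4, 512)   # D5 -> S4 -> S'4
s3 = decoder_stage(s4, e3, 256)   # D4 -> S3 -> S'3
s2 = decoder_stage(s3, e2, 128)   # D3 -> S2 -> S'2
```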

$$\:{D}_{4}=Conv2D\left(UpSamp\left(S{{\prime\:}}_{4}\right)\right)$$
(9)

Decoding continues at stage 4 through Eq. 9, and its computation feeds the concatenation and convolution of the succeeding level. After decoding at stage 4, the outcome \(\:{D}_{4}\) is concatenated with the encoded result of stage 3, \(\:{E}_{3}\), as given by Eqs. 10 and 11.

$$\:{S}_{3}={D}_{4}\left|\right|{E}_{3}$$
(10)
$$\:S{{\prime\:}}_{3}=Conv2D\left(Conv2D\left({S}_{3}\right)\right)$$
(11)
$$\:{D}_{3}=Conv2D\left(UpSamp\left(S{{\prime\:}}_{3}\right)\right)$$
(12)

Decoding continues at stage 3 through Eq. 12 and serves the concatenation and convolution of the succeeding level. After decoding at stage 3, the outcome \(\:{D}_{3}\) is concatenated with the encoded result of stage 2, \(\:{E}_{2}\), as given by Eq. 13.

$$\:{S}_{2}={D}_{3}\left|\right|{E}_{2}$$
(13)
$$\:S{{\prime\:}}_{2}=Conv2D\left(Conv2D\left({S}_{2}\right)\right)$$
(14)
$$\:{D}_{2}=Conv2D\left(UpSamp\left(S{{\prime\:}}_{2}\right)\right)$$
(15)

The last level of decoding commences at stage 2 through Eq. 15; its output \(\:{D}_{2}\) is concatenated with the encoded result of stage 1, \(\:{E}_{1}\), as given by Eq. 16.

$$\:{S}_{1}={D}_{2}\left|\right|{E}_{1}$$
(16)

After all levels of encoding and decoding, with the corresponding concatenations performed, we arrive at the layer preceding the output layer. Here we feed the concatenated outcome \(\:{S}_{1}\) through a sequence of convolutions, giving the result \(\:F\) by Eq. 17, which completes the five levels of encoding and decoding in the UNet architecture.

$$\:F=Conv2D\left(Conv2D\left({S}_{1}\right)\right)$$
(17)

 

The final output layer of the UNet architecture primarily applies a sigmoid activation to the result \(\:F\:\) from Eq. 17, produced after all stages of encoding and decoding, generating the segmented output \(\:{Y}_{i}\) by means of Eq. 18 for every sample \(\:\{{T}_{i},{M}_{i}{{\}}^{{N}_{t}}}_{i=1}\) of the dataset.

$$\:{Y}_{i}=\:Out\left(F\right)\:;\:{Y}_{i}\:\to\:\:Segmented\:Image\:$$
(18)
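The final stage closes the sketch started above: the last decoder output is concatenated with \(\:{E}_{1}\), convolved (Eqs. 15–17), and mapped to a single-channel sigmoid mask (Eq. 18). As before, this is an illustrative Keras reconstruction that reuses inputs, e1, s2, and decoder_stage from the earlier sketches; the loss and optimizer are assumptions, since the paper does not specify them.

```python
from tensorflow.keras import Model, layers

# Continuing from the encoder/decoder sketches above (inputs, e1, s2 defined there).
f = decoder_stage(s2, e1, 64)                            # D2 -> S1 -> F (Eqs. 15-17)
outputs = layers.Conv2D(1, 1, activation="sigmoid")(f)   # Y_i, Eq. 18
unet = Model(inputs, outputs, name="modified_unet")

# Assumed training configuration -- the paper does not state the loss or optimizer.
unet.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```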

 

VGG19 classification algorithm

The segmented output from the UNet architecture is fed into a pretrained VGG19 network to detect the presence of hyacinth in the real-time images. The classification of the segmented image directs the Jetson Nano to either move in that direction or scan for hyacinth in other directions. VGG19 is a popular image classifier built on the convolutional neural network (CNN) architecture. It takes the segmented image from the UNet model, resized to 224 × 224 pixels, as input. Stacks of convolutional layers with small (3 × 3) filters analyse the image and extract features such as edges and shapes. Fully connected layers at the end of the model convert the extracted features into probabilities indicating the presence of hyacinth in the input image. This probability drives a binary classification: hyacinth detected or no hyacinth detected. VGG19 was trained on the labelled segmented outputs of the UNet segmentation model while leveraging its pre-trained feature extraction layers, yielding a cascaded framework of UNet image segmentation and the CNN-based VGG19 classifier (Fig. 5). The weights and biases of this cascaded framework were deployed on the Jetson Nano for identifying the presence of hyacinth in the frame and navigating in response.
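A compact sketch of such a transfer-learning classifier is given below, assuming Keras' bundled ImageNet VGG19 weights. The head size, optimizer, and freezing policy are our assumptions rather than details reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG19

# Pre-trained convolutional base; its feature-extraction layers are kept frozen.
base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

inputs = layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.vgg19.preprocess_input(inputs)
x = base(x, training=False)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)          # assumed head size
outputs = layers.Dense(1, activation="sigmoid")(x)   # hyacinth present / absent

classifier = Model(inputs, outputs, name="vgg19_hyacinth_classifier")
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```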

Fig. 5

Visual illustration of VGG19 classifier architecture.

Autonomous navigation algorithm

This section describes the control system for an autonomous aquatic platform designed to navigate towards and collect water hyacinths. The system leverages an external camera for hyacinth detection and rudders for maneuverability. The process is initiated by activating the servo motor that controls the rudders. The rudder is initially set to a neutral state, i.e. a 0-degree deviation with respect to the center axis of the model.

Following this initial movement, the camera captures a real-time view of the water surface on which the system is deployed. The captured data are then fed into the hyacinth detection algorithm (Fig. 6), a deep learning model based on UNet image segmentation whose weights and biases are extracted and loaded onto the Jetson Nano controller to identify the presence of water hyacinths within a predefined quadrant mapping of the camera frame.

Fig. 6

Flow of autonomous navigation algorithm deployed on hardware setup.

If the hyacinth detection algorithm does not identify any hyacinths, the control system steers the vessel back to a neutral rudder position by applying an 80-degree rudder deviation. If hyacinths are still not identified, the control system moves the vessel forward for a period of 15 s while attempting to detect them; otherwise, the system repeats the process by capturing a new frame in real time and running the hyacinth detection algorithm again. If no hyacinths are identified even after surveying for a predefined wait-time limit, the shore detection algorithm drives the setup back to shore.

If a hyacinth is identified, the system determines the specific quadrant (1st, 2nd, 3rd, or 4th) relative to the system's position. This quadrant identification allows the control system to actuate the navigation precisely in the appropriate direction. Based on the identified quadrant, the rudder control mechanism commands the system to bring the detected hyacinth into the center frame. The rudder movements are designed to navigate the system towards the detected hyacinth. For instance, if a hyacinth is identified in Quadrant 1, the rudders are set to a +30-degree deviation from the median position. This rudder configuration controls the airflow that pushes the system into alignment with the hyacinth, effectively guiding the hyacinth into the collection unit. Similar rudder adjustments of −60, −30, and +60 degrees are made for hyacinths detected in Quadrants 2, 3, and 4 respectively, ensuring the system navigates towards the hyacinth.

Once the hyacinth is bounded inside the center frame, the system propels towards it and collects it in the collection unit. The entire process then restarts: the camera captures fresh data, the hyacinth detection algorithm analyses it, and the cycle of detection and corresponding navigation continues.

This continuous loop allows the autonomous system to navigate through a water body and collect water hyacinths effectively through strategic rudder movements based on real-time camera captures, removing them for a cleaner aquatic environment.

From detection to actuation: real-time control logic on Jetson Nano

The entire perception-to-action pipeline runs at 14–18 fps on the Jetson Nano 4 GB with no GPU underclocking, following the decision logic in Table 1. The transition from vision to mechanical commands is deterministic and requires < 50 ms after each camera frame:

1. The 640 × 480 RGB frame is resized to 512 × 512 and fed to UNet → binary vegetation mask (35 ms).

2. Pixels classified as vegetation are cropped and passed to the frozen VGG19 classifier → binary decision “water hyacinth present/absent” (12 ms).

3. If confidence > 0.85, the mask is divided into four equal quadrants and the quadrant(s) containing ≥ 8% hyacinth pixels are marked positive.

4. A simple rule-based controller instantly maps the dominant quadrant to rudder angle and thrust (Table 1); a minimal sketch of this logic is given below. PWM signals are sent directly to the two ESCs and the steering servo via the Jetson’s GPIO using the Adafruit PCA9685 library.

Table 1 Decision table executed every frame on Jetson Nano.

This extremely lightweight logic (≈ 200 lines of Python) ensures immediate and predictable mechanical response with zero reliance on complex path-planning algorithms, making the system behaviour transparent, debuggable, and robust in real water conditions.
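The sketch below illustrates the per-frame decision logic just described. It is a simplified reconstruction, not the authors' code: the quadrant-to-rudder mapping follows the angles given in the navigation section (+30°, −60°, −30°, +60° for Quadrants 1–4), the quadrant layout and thrust levels are placeholders, and set_rudder_angle/set_thrust are hypothetical stubs standing in for the PCA9685 PWM wrappers.

```python
from typing import Optional
import numpy as np

CONF_THRESHOLD = 0.85         # VGG19 confidence gate (step 3)
QUADRANT_MIN_FRACTION = 0.08  # >= 8% hyacinth pixels marks a quadrant positive
RUDDER_ANGLE = {1: +30, 2: -60, 3: -30, 4: +60}  # degrees, from the navigation rules

def set_rudder_angle(angle_deg: float) -> None:
    """Hypothetical wrapper around the PCA9685 servo channel (stubbed here)."""
    print(f"rudder -> {angle_deg:+.0f} deg")

def set_thrust(level: float) -> None:
    """Hypothetical wrapper around the ESC PWM channels (stubbed here)."""
    print(f"thrust -> {level:.0%}")

def dominant_quadrant(mask: np.ndarray) -> Optional[int]:
    """Split the binary mask into four equal quadrants (layout assumed:
    Cartesian convention) and return the one with the largest hyacinth
    fraction, or None if no quadrant reaches the 8% threshold."""
    h, w = mask.shape
    quads = {1: mask[:h // 2, w // 2:], 2: mask[:h // 2, :w // 2],
             3: mask[h // 2:, :w // 2], 4: mask[h // 2:, w // 2:]}
    fractions = {q: quad.mean() for q, quad in quads.items()}
    best = max(fractions, key=fractions.get)
    return best if fractions[best] >= QUADRANT_MIN_FRACTION else None

def control_step(mask: np.ndarray, confidence: float) -> None:
    """Map the current detection to rudder and thrust commands (rule-based)."""
    if confidence < CONF_THRESHOLD:
        set_rudder_angle(0)   # neutral rudder, keep searching
        set_thrust(0.3)       # placeholder search thrust
        return
    quadrant = dominant_quadrant(mask)
    if quadrant is None:
        set_rudder_angle(0)
        set_thrust(0.3)
    else:
        set_rudder_angle(RUDDER_ANGLE[quadrant])
        set_thrust(0.8)       # placeholder: advance towards the hyacinth
```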

Biomass collection mechanism

Once water hyacinth is centred in the field of view, the catamaran advances at full thrust. A 60 cm-wide intake ramp at the bow height guides floating plants onto a rear-mounted mesh conveyor belt driven by a separate 12 V geared DC motor (Fig. 3d). Plants are lifted and deposited into a detachable 25 kg (wet weight) basket. No cutting or shredding is performed; the system relies on passive scooping, which is simple, energy-efficient and avoids fragmentation that could aid re-infestation.

Experiments

As discussed in Sect. 3.3.1, the model is trained on a curated dataset of RGB images paired with their masks. Masking is the process of distinguishing plant from non-plant regions by segmenting out only the plant as a binary image. The shape of the input RGB image is (512, 512, 3) after all rescaling and resizing operations needed to achieve the desired efficiency; similarly, the binary mask images take the shape (512, 512, 1) after preprocessing. From Fig. 7 we can observe that the third coordinate of the shape triplet corresponds to the number of channels, which is 3 for an RGB image (red, green, and blue) and 1 for a binary mask, used exclusively to mask the desired pixels of the RGB image. The objective met by the UNet image segmentation model is to segment only the plant regions in the output. The model proves capable of segmenting out only the plant in any real-time image, provided the input feed comes from an environment where the plants are hyacinths, since the model is trained on a dataset whose images predominantly contain water hyacinth rather than generic plant types.

Fig. 7

Implementation setup of image segmentation.

Results and discussion

Our proposed methodology has shown significant performance in segmenting hyacinths in the deployed environment. The model assists in identifying the definite presence of hyacinth in real time by running the proposed UNet-based segmentation model. Figure 8 provides a visual understanding of the model's efficiency. The loss and accuracy metrics of the trained model are monitored at every iteration of training and plotted; the plots reveal the trend of the metrics at each stage of the training process.

Fig. 8

Resultant metrics to evaluate the trained UNet Model.

As the graphs in Fig. 8 show, the loss and accuracy metrics tend to saturate within a certain range after 20 epochs for both training and validation. Training the model only up to this saturation point reduces the chance of overfitting. From the plot we can also infer that the slope is steepest over the first 10 epochs, flattens between 10 and 20 epochs, and hardly changes after 20 epochs. Accuracy saturates in the range 0.87 to 0.92, and loss saturates in the range 0.062 to 0.13. The curves plotted for the training dataset exhibit a convex shape, which conveys that optimality is approached over the course of training. The best weights of the model are extracted and deployed on the Jetson Nano board to perform hyacinth segmentation in real time through the external camera module interfaced with the hardware setup.

Fig. 9

Visualization of segmentation performed by UNet model.

As shown in Fig. 9, our UNet-based framework generates hyacinth-segmented images and is capable of segmenting only the water hyacinth plants from the environment image. In our proposal we evaluate the Dice coefficient, which indicates the effectiveness of segmenting the desired portions of the image; the Dice score is widely used in the literature to quantify segmentation quality. Figure 9 presents samples from different environments along with the Dice score computed against the original mask. The Dice coefficient and Intersection over Union are defined in Eqs. 19 and 20.

The Dice coefficient (\(\:{\delta\:}_{Dice}\)) is computed from the array of the original mask (\(\:{m}_{i}\)) and the array of the segmented outcome \(\:\left({g}_{i}\right)\) produced by the model for a particular input image (\(\:{n}_{i}\)):

$$\:{\delta\:}_{Dice}=\frac{2\:{\sum\:}_{i=1}^{N}{m}_{i}{g}_{i}}{{\sum\:}_{i=1}^{N}{{m}_{i}}^{2}+{\sum\:}_{i=1}^{N}{{g}_{i}}^{2}}$$
(19)

 

Similarly, the IoU score (\(\:{\delta\:}_{IoU}\)) is computed from the array of the original mask (\(\:{m}_{i}\)) and the array of the segmented outcome \(\:\left({g}_{i}\right)\) produced by the model for the same input image (\(\:{n}_{i}\)):

$$\:{\delta\:}_{IoU}=\frac{{\sum\:}_{i=1}^{N}{m}_{i}{g}_{i}}{{\sum\:}_{i=1}^{N}{{m}_{i}}^{2}+{\sum\:}_{i=1}^{N}{{g}_{i}}^{2}-{\sum\:}_{i=1}^{N}{m}_{i}{g}_{i}}$$
(20)
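For reference, Eqs. 19 and 20 translate directly into a few lines of NumPy. This is a generic implementation of the standard metrics, not code taken from the paper; a small epsilon is added only to guard against empty masks.

```python
import numpy as np

def dice_coefficient(mask: np.ndarray, pred: np.ndarray, eps: float = 1e-7) -> float:
    """Eq. 19: 2*sum(m*g) / (sum(m^2) + sum(g^2)) for binary arrays m, g."""
    m, g = mask.astype(np.float64).ravel(), pred.astype(np.float64).ravel()
    return float(2.0 * np.sum(m * g) / (np.sum(m ** 2) + np.sum(g ** 2) + eps))

def iou_score(mask: np.ndarray, pred: np.ndarray, eps: float = 1e-7) -> float:
    """Eq. 20: intersection over union of the same binary arrays."""
    m, g = mask.astype(np.float64).ravel(), pred.astype(np.float64).ravel()
    inter = np.sum(m * g)
    return float(inter / (np.sum(m ** 2) + np.sum(g ** 2) - inter + eps))
```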

 

Fig. 10

(a) Boxplot with distribution IoU Coeff (\(\:{\delta\:}_{IoU})\) (b) Boxplot with distribution for Dice Coeff (\(\:{\delta\:}_{Dice})\) (c) Distribution-Rug Plot for Dice_Coeff (\(\:{\delta\:}_{Dice})\) Vs IoU_Coeff (\(\:{\delta\:}_{IoU})\) (d) Bar Plot for statistical parameters of Dice_Coeff (\(\:{\delta\:}_{Dice})\) Vs IoU_Coeff (\(\:{\delta\:}_{IoU})\).

As observed from Fig. 10 (a) and Fig. 10 (b), the boxplots illustrate the statistical spread and distribution of both the Dice coefficient \(\:{\delta\:}_{Dice}\:\) and the IoU score \(\:{\delta\:}_{IoU}\). Comparing the interquartile ranges of \(\:{\delta\:}_{Dice}\) (Table 2) and \(\:{\delta\:}_{IoU}\) (Table 3), we can clearly identify that IoU displays a wider interquartile range than the Dice coefficient, encapsulating more samples within it. In contrast, \(\:{\delta\:}_{Dice}\:\) has higher 25th and 75th percentile values than \(\:{\delta\:}_{IoU}\).

Table 2 Descriptive statistics for \(\:{\delta\:}_{Dice}\:\).
Table 3 Descriptive statistics for \(\:{\delta\:}_{IoU}\).

Along with the segmentation, a VGG19 classification model has been appended to the proposed framework in order to effectively detect the presence of water hyacinth in real-time images captured from the external camera module; its metrics are summarized in Table 4 and Fig. 11. The metrics evaluated to analyse the performance of our classification model are \(\:Precision=\:\frac{TP}{TP+FP}\), \(\:Recall=\:\frac{TP}{TP+FN}\), \(\:F1\:Score\:=\:\frac{TP}{TP+\frac{1}{2}(FP+FN)}\), and \(\:MCC\:=\:\frac{(TP\:\times\:\:TN)\:-\:(FP\:\times\:\:FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\:\).
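These metrics can be computed directly from the confusion-matrix counts. The short sketch below mirrors the formulas above and is provided for reproducibility only; it is not taken from the original code, and it omits zero-division guards for brevity.

```python
import math

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and MCC from confusion-matrix counts, as defined above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = tp / (tp + 0.5 * (fp + fn))
    mcc = ((tp * tn) - (fp * fn)) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}
```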

The fine-tuned VGG19 classifier, operating on segmented regions produced by UNet, achieved 96% overall accuracy on the held-out validation set of 1456 images. Detailed per-class results are presented in Table 4: precision 0.96, recall 0.95, and F1-score 0.97 for the water hyacinth class. The corresponding confusion matrix is shown in Fig. 11.

Table 4 VGG19 model evaluation.
Fig. 11

VGG19 Classifier Performance Metrics.

The resulting framework achieves strong efficiency and reliable real-time performance once its weights and biases are deployed on the Jetson Nano, which acts as the brain of the hardware setup.

Field testing and data collection conditions

The full system was tested between January and March 2025 in three real freshwater bodies in Chennai, India: (i) a 15 × 20 m university pond, (ii) a 0.6 ha natural lake containing mixed native vegetation, and (iii) a 150 m canal section. Tests covered sunny, partly cloudy and light-rain conditions from 08:00–17:00 h. All 7282 training and validation images were captured directly from the moving prototype (speed 0.3–0.8 m/s, camera height 40–60 cm) using the final mounted IMX219 8 MP module, thereby eliminating domain shift caused by camera motion, vibration or changing viewpoint.

Scalability and operational limitations

The prototype was intentionally designed as a minimal-cost, easily replicable platform for community-level deployment. The perception and control software is completely independent of hull size. Scaling to larger water bodies only requires a bigger hull and more powerful motors/batteries — no changes to the trained models or navigation logic are needed. For reservoirs > 10 ha, multiple units can be deployed as an ultra-low-cost swarm.

The current prototype is limited to small water bodies (< 2 ha), daylight and light-weather operation, and ~ 45–60 min runtime with 25 kg biomass capacity due to its minimal size and budget. Detection accuracy drops 8–12% in highly turbid water, and night or heavy-weather use is not supported. These constraints are hardware-related and can be overcome by modest scaling (larger hull, solar charging, NIR camera) without changing the core algorithms.

Conclusion

This paper introduces a pioneering approach to address the pervasive environmental challenge posed by the rapid proliferation of water hyacinth in aquatic ecosystems. We developed an intelligent navigation system aided by methodologies that distinctively identify the hyacinth ahead of the vessel and guide navigation accordingly, a significant stride towards mitigating the adverse impacts on water quality, biodiversity, and human activities. Our deep learning based UNet image segmentation model was trained on water hyacinth samples and coupled with real-time image classification by a CNN-based VGG19 classifier to identify and locate water hyacinth patches within the designated water body. The UNet segmentation was found to be reliable, achieving 90.26% accuracy with a state-of-the-art mean Dice score (\(\:{\underset{\_}{\delta\:}}_{Dice}=\:0.927\)) and mean IoU score (\(\:{\underset{\_}{\delta\:}}_{IoU}=\:0.875\)), with interquartile ranges \(\:IQ{R}_{Dice}=0.041\) and \(\:IQ{R}_{IoU}=0.073\). Similarly, the VGG19 classifier performs best with 96% accuracy, \(\:AUC=\:0.994\:\), and \(\:MCC\:=\:0.925\:\), metrics that argue the classification is trustworthy enough to deploy on the Jetson Nano. The weights and biases extracted from the cascaded UNet and VGG19 framework were translated onto the hardware setup, making the design automated and deployable in real time. The proposed system holds significant promise for advancing the goals of sustainable development, particularly in the realms of clean water and sanitation, industry innovation, and infrastructure development. By offering a scalable and cost-effective solution for water hyacinth removal, it contributes to the conservation of aquatic ecosystems, the protection of underwater life, and the balance of our ecosystem. The current RGB-based system suffers only a modest 4–6% drop in Dice score under moderate rain and shadows because UNet primarily exploits texture and shape rather than absolute colour; extremely turbid water or heavy glare can still reduce contrast. Planned enhancements include (i) the addition of near-infrared imaging, (ii) multi-frame temporal consistency checks, and (iii) continual domain adaptation to further increase robustness across water colour, weather, and lighting variations.