Introduction

In 2022, global fisheries and aquaculture production reached an all-time high. According to statistics from the Food and Agriculture Organization of the United Nations (FAO), the total global production of fisheries and aquaculture increased to 223.2 million tonnes, comprising 185.4 million tonnes of aquatic animals and 37.8 million tonnes of algae1. The fisheries economy holds a significant position in the global economic system. However, in current aquaculture, common fish diseases, such as bacterial infections and parasitic infestations, are associated with economic losses in many farming scenarios2. As the scale of aquaculture continues to expand, the risk of fish suffering from diseases and their potential impact also rise accordingly3. Consequently, early detection of health anomalies enables the implementation of targeted intervention measures. By informing scientific prevention and control strategies, this approach serves to reduce the risk of extensive disease spread and associated economic losses.

Manual visual inspection remains a common practice for fish health monitoring, which often requires catching a portion of the fish for examination outside the aquaculture environment. This process can be labor-intensive and may inadvertently cause stress or physical damage to the fish4. Furthermore, identifying subtle surface abnormalities through human observation is inherently challenged by environmental factors such as underwater turbidity and fluctuating light conditions, which can limit the consistency of detection, especially in large-scale farming systems. In recent years, with the rapid development of computer vision technology, automated, non-contact underwater imaging methods have emerged as a prominent research direction5. By leveraging deep learning models for continuous monitoring, this approach enables the acquisition of health data without disrupting the fish’s natural behavior, thereby providing a more scalable solution for identifying health trends and the severity of ailments in intensive aquaculture.

Among existing computer vision-based methods for screening fish health abnormalities based on surface-associated symptoms, the mainstream technical approaches still focus on the improvement of CNN and the YOLO series of models. Specifically, while CNN-based methods6,7,8,9,10 exhibit excellent performance in terms of detection accuracy, their application is often characterized by higher computational demands, making it difficult to achieve the low-latency processing required for continuous real-time monitoring. In contrast, models based on improvements to the YOLO series11,12,13,14,15 demonstrate advantages in real-time performance due to their end-to-end architectural design. However, these lightweight architectures may face challenges in maintaining high precision when dealing with complex backgrounds or fine-grained pathological features. Furthermore, Huda et al.16 proposed a soft voting ensemble strategy integrating YOLO and RT-DETR for ornamental fish health screening. Although this method balances accuracy and speed, the ensemble nature of the approach inherently increases the total parameter count and inference overhead compared to single-stage models, potentially hindering efficiency in high-throughput operational environments.

To address the aforementioned challenges and achieve a better balance between detection accuracy and inference efficiency, this paper proposes a novel real-time object detection method for the task of fish body surface symptom detection: Real-Time DETR with Global-Local Adaptive Enhancement for Efficient Object Detection (RT-GalaDet). This method is based on the RT-DETR framework17 and, while maintaining high detection accuracy, significantly enhances the model’s lightweight nature and inference speed. This ensures better fulfillment of the practical needs for real-time, efficient and accurate detection of fish body surface symptoms in aquaculture settings. RT-GalaDet achieves precise detection of fish body surface symptoms by introducing an adaptive enhancement mechanism that balances global and local features, effectively strengthening the model’s ability to capture fine-grained pathological details of subtle symptoms. The proposed framework provides a streamlined solution characterized by low computational complexity and high throughput, demonstrating strong potential for enabling continuous, real-time monitoring of aquaculture environments. When fish with body surface symptoms are detected, the system provides automated alerts, offering preliminary visual cues that can serve as a reference for professional health assessment. This early-warning mechanism provides a critical window for aquaculture personnel to evaluate potential risks, thereby supporting more timely management decisions that may help mitigate the impact of disease-related losses.

The main contributions of this paper are summarized as follows:

(1) We propose RT-GalaDet, an improved real-time model tailored for fish body surface symptom detection. By integrating State Space Modeling and a local enhancement mechanism, the model achieves a more balanced performance between inference speed and detection accuracy compared to existing CNN and YOLO-based architectures in aquaculture scenarios.

(2) We introduce LocalVSS, a lightweight feature enhancement module that leverages the linear complexity of SS2D to capture long-range spatial dependencies while utilizing local convolutions to preserve fine-grained pathological details. This dual-path design enhances the model’s ability to distinguish subtle lesions from complex underwater backgrounds without the high computational overhead of standard self-attention mechanisms.

(3) We demonstrate the efficacy of a triple-optimized architecture consisting of State Space Modeling, Local Enhancement, and Lightweight Neck Compression. The proposed RT-GalaDet maintains high precision while significantly reducing computational overhead and model size compared to the baseline RT-DETR. This structural efficiency minimizes computational resource consumption, establishing a streamlined technical framework that is well-suited for efficient, real-time monitoring applications where low latency is critical.

Related works

RT-DETR

RT-DETR is an efficient detector designed for real-time object detection tasks. Its core innovation lies in the introduction of a hybrid encoder structure and an uncertainty-minimizing query selection strategy, which successfully extends the DETR framework18 to real-time application scenarios. Experiments on the COCO val2017 dataset19 demonstrate that RT-DETR maintains a high inference speed while exhibiting certain advantages in detection accuracy compared to current mainstream real-time detection frameworks. Thanks to its favorable balance of performance and efficiency, RT-DETR has been widely applied across various practical fields, including agriculture20 and industry21.

However, the AIFI (Adaptive Intra-scale Feature Interaction) module in RT-DETR relies on the multi-head self-attention mechanism for feature modeling, primarily focusing on high-level semantic features. Although this approach effectively captures long-range spatial dependencies, the computational overhead of the self-attention mechanism remains substantial as the number of input channels and the sequence length increase. Specifically, the quadratic complexity \(O(N^2)\) relative to the sequence length N can still pose a significant burden when processing high-resolution inputs. This characteristic can limit the model’s inference efficiency and scalability in resource-constrained environments.

SS2D

In response to the high computational complexity of the self-attention mechanism, State Space Models (SSMs)22 have emerged as an effective alternative. The core advantage of SSMs lies in their ability to capture long-range contextual dependencies with linear complexity relative to the sequence length. However, existing SSMs are primarily designed for one-dimensional sequence modeling. When directly applied to two-dimensional image data, simply flattening the image into a 1D sequence ignores the spatial structural information and may introduce a significant computational burden as the sequence length increases quadratically with image resolution. Moreover, while SSMs excel at sequence modeling, they often lack inherent spatial positional awareness, making it difficult to effectively leverage the two-dimensional relationships in vision applications. To address the mismatch between one-dimensional scanning and two-dimensional data, the 2D-Selective-Scan (SS2D) module23 was proposed. As shown in Fig. 1, SS2D traverses each input patch along four different scanning paths. Each resulting sequence is processed independently through the S6 module24 and the outputs are finally merged.

While SS2D demonstrates strong long-range contextual information modeling capabilities, its application in fish body surface symptom detection presents unique requirements. Pathological manifestations, such as small spots, localized ulcers, or parasitic infections, typically appear as fine-grained anomalies that occupy a very small fraction of the image. In underwater imaging, these subtle features are often obscured by complex environmental factors, including turbidity, suspended particles and non-uniform lighting. While long-range spatial dependencies help in locating the fish within the frame, the precise classification of health status relies more heavily on capturing local textures, edges and minute color variations. Therefore, to optimize performance for this specific task, it is essential to balance the long-range modeling of SS2D with enhanced local feature extraction, ensuring that the model remains sensitive to localized pathological details amidst potentially noisy underwater backgrounds.

Fig. 1

Illustration of 2D-Selective-Scan (SS2D)23 (CC BY 4.0).

Data acquisition and processing

Data acquisition

The dataset used in this experiment is sourced from the public dataset fish-project Computer Vision Model25. Following the exclusion of redundant categories, a total of 5,601 original images containing 6,049 bounding box annotations were retained. These annotations cover four fish species: Striped Beakfish (Oplegnathus fasciatus), Black Sea Bream (Acanthopagrus schlegelii), Korean Rockfish (Sebastes schlegelii), and Red Sea Bream (Pagrus major). These species are labeled in the source data as Doldom, Gamseongdom, Jopi-bollag, and Chamdom, respectively. The detailed distribution of these 5,601 original images across the 20 distinct species-health categories is summarized in Table 1. Crucially, the source dataset is characterized by image-level class homogeneity, wherein all bounding box annotations within a given image correspond exclusively to a single category. This characteristic is critical for the subsequent valid execution of image-level stratified sampling.

Table 1 Distribution of original images.

Data preprocessing

To address the ambiguity between taxonomic classification and pathological detection, we standardized the label schema prior to training. In the original dataset, labels referring solely to the species name, such as “Chamdom,” implicitly represented healthy individuals. To eliminate this inconsistency, we explicitly remapped these species-only labels to a “species-healthy” format, as illustrated by the specific label “chamdom-healthy.” Consequently, the dataset comprises 20 distinct fine-grained categories structured as the Cartesian product of the four species and five health statuses: Healthy, Hemorrhage, Ulceration, Eye Injury, and Fin Injury. In the dataset annotations, these specific pathological conditions correspond to the label suffixes “healthy,” “bleeding,” “ulcer,” “eyedefect,” and “findefect,” respectively. Under this schema, the model is tasked with simultaneous species identification and health status assessment. In terms of annotation protocol, each fish instance is labeled according to its most prominent health manifestation. This ensures clear boundaries for the classification task and allows the model to learn symptom features within the context of specific species textures. Representative images of the four healthy fish species are presented in Figure 2.

Fig. 2

Image dataset examples: (a) striped beakfish, (b) black sea bream, (c) Korean rockfish, (d) red sea bream.

To address the significant class imbalance inherent in practical aquaculture environments, we adopted a stratified sampling method at the image level to partition the dataset prior to the application of any data augmentation techniques. Crucially, to mitigate the risk of data leakage arising from potentially correlated frames, we performed a de-duplication screening process on the source dataset. Near-identical images were identified and removed to ensure that the Training, Validation, and Test sets represent visually distinct samples. Following this cleaning step, the stratification was performed based on the image-level category labels to ensure a consistent distribution across the Training (4,475 images), Validation (549 images), and Test (577 images) sets. This image-level split protocol combined with de-duplication precludes the risk of data leakage, thereby ensuring the validity of the model evaluation. Concurrently, to enhance the model’s ability to learn minority class features and reduce background over-reliance, we employed the CopyPaste method26 exclusively on the training set. In this protocol, the CopyPaste operation treated each individual fish and its corresponding bounding box as the fundamental unit for transfer. To ensure experimental reproducibility, a global random seed of 42 was applied to all stochastic augmentation processes. We imposed a spatial constraint where the Intersection over Union (IoU) between the pasted box and any existing objects was kept below 0.2. Furthermore, for each training image, we selected and randomly pasted 1 to 3 additional bounding boxes from the 16 categories exhibiting fish body surface symptoms. A blending transparency of 0.7 was applied to the pasted regions to maintain visual consistency. This targeted augmentation increased the training set size from 4,475 to 4,816 images while resulting in a final total of 6,762 bounding box annotations, as detailed in Table 2.
The Validation and Test sets remained unaltered to ensure fair evaluation.
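As a concrete illustration of the spatial constraint described above, the IoU screening that decides whether a candidate paste location is acceptable can be sketched as follows; the function names are illustrative rather than taken from the authors' pipeline.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def paste_allowed(candidate, existing_boxes, max_iou=0.2):
    """Accept the pasted box only if its IoU with every existing object stays below 0.2."""
    return all(iou(candidate, b) < max_iou for b in existing_boxes)
```

In practice a rejected candidate location would simply be re-sampled until the constraint is satisfied or a retry budget is exhausted.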

Table 2 Number of training bounding box annotations.

Image enhancement pipeline

To ensure consistent image characteristics and mitigate the discrepancies introduced by data augmentation, we followed a structured preprocessing pipeline. First, to maintain distribution consistency across the entire dataset, we applied CLAHE (Contrast Limited Adaptive Histogram Equalization)27 and image sharpening to the training, validation, and test sets uniformly. Specifically, CLAHE was implemented with a contrast limit of 2 and a grid size of \(8\times 8\) to prevent abnormal shifts in brightness caused by the superposition of backgrounds in augmented images. Subsequently, a sharpening convolution kernel with a center value of 5 and surrounding values of −1 was applied to highlight subtle pathological features, such as eye defects and hemorrhagic spots. This uniform application ensures that the model encounters the same grayscale range and texture characteristics during both training and evaluation, thereby improving the discrimination of local details. The comparison of images before and after enhancement is shown in Figure 3.
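The sharpening step can be illustrated with a minimal NumPy sketch. The 4-neighbour form of the kernel below (which sums to 1 and therefore preserves mean brightness) is an assumption, since the text specifies only the centre and surrounding values; in a full pipeline the CLAHE step would typically precede it, e.g. via OpenCV's `cv2.createCLAHE(clipLimit=2, tileGridSize=(8, 8))`.

```python
import numpy as np

# Sharpening kernel with centre 5 and surrounding -1, in its 4-neighbour form
# (sums to 1, so flat regions are left unchanged). The exact neighbourhood
# shape is an assumption: the text specifies only the centre/surround values.
KERNEL = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=np.float32)


def sharpen(img):
    """Apply the 3x3 sharpening convolution with edge-replication padding."""
    h, w = img.shape
    padded = np.pad(img.astype(np.float32), 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            out += KERNEL[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return np.clip(out, 0, 255).astype(np.uint8)
```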

Fig. 3

Image comparison before and after enhancement: (a) healthy chamdom, (b) hemorrhagic gamseongdom, (c) image after CopyPaste processing, (d) image after CLAHE and sharpening processing.

Network architecture

The main body of the RT-GalaDet network consists of three parts: the Backbone, the Neck and the Head, with the structure shown in Figure 4. Within the Backbone, we integrated the SPPF (Spatial Pyramid Pooling - Fast) module, which is a hallmark of the YOLO architecture, to efficiently aggregate multi-scale features and expand the effective receptive field. Building upon this YOLO-inspired design, the Backbone further incorporates Stem, LocalVSS, and ChannelMerge modules to extract multi-dimensional features and capture long-range dependencies while preserving focus on local details. The Neck utilizes VoVGSCSP and GSConv modules for multi-scale feature fusion. This structural optimization facilitates the detection of fish body surface symptoms under constrained computational resources. Finally, the Head leverages the RT-DETR Decoder for precise box regression and category classification.

Fig. 4

RT-GalaDet model architecture.

Backbone

Stem

While State Space Models (SSMs) excel at capturing global dependencies, their sequential scanning mechanism can sometimes lead to insufficient differentiation between foreground objects and complex backgrounds in two-dimensional image tasks28. As illustrated in Figure 5, the Stem is designed with a lightweight structure aimed at providing local priors to the model, thereby reducing background interference. The Stem performs two successive downsampling operations, reducing the input spatial dimensions of the image to one-quarter of the original, while mapping the channels to a final output of 128 to facilitate processing by subsequent modules. To better preserve details and gradients in low-level features, we followed the Vision Transformer29 in adopting the GELU activation function. However, to prioritize computational efficiency and reduce the hardware overhead during inference, we implemented the \(\tanh\) approximation of GELU, as defined in Equation (1). Unlike ReLU, which applies hard zero-clipping to all negative inputs, GELU weights the input by its magnitude based on the Gaussian cumulative distribution function. This allows for a non-zero gradient in the negative domain and provides a stochastic regularization effect that helps preserve fine-grained structural information during early feature extraction. To maintain consistency with the mainstream activations used in detection tasks, we adopted SiLU as the activation function after the second downsampling, preserving strong signals in the high-dimensional semantic space.

$$\begin{aligned} GELU(x) \approx 0.5x\left( 1+\tanh \left( \sqrt{\frac{2}{\pi }}\left( x+0.044715x^3\right) \right) \right) \end{aligned}$$
(1)
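Equation (1) can be checked numerically with a few lines of Python:

```python
import math


def gelu_tanh(x):
    """tanh approximation of GELU, as in Eq. (1)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

For example, `gelu_tanh(1.0)` is approximately 0.841, close to the exact Gaussian-CDF form, while large negative inputs are smoothly suppressed toward zero rather than hard-clipped as in ReLU.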
Fig. 5

Stem module architecture.

LocalVSS

To address the challenge of small target detection for fish surface lesions, the LocalVSS module is proposed to combine long-range spatial dependency modeling with enhanced local feature perception. While the SS2D mechanism provides a global receptive field with linear computational complexity \(\mathcal {O}(N)\), it inherently lacks the inductive bias for local spatial correlations. To compensate for this, we designed the LocalVSS module with a structural focus on spatial redundancy reduction and local feature reinforcement, comprising three components: Projection, LCBlock (LiteConvBlock), and LocalConv. Specifically, LCBlock utilizes depthwise separable convolutions to inject local priors without escalating the computational overhead, thereby maintaining the efficiency-performance trade-off. This configuration balances the long-range modeling strengths of SSMs with the fine-grained local sensitivity required for subtle lesion detection. The structural details are illustrated in Fig. 6.

Fig. 6

LocalVSS module architecture.

When the feature map is input into the LocalVSS module, it first passes through a conditional Projection layer. This layer checks the number of channels of the feature map; if it differs from the module’s expected channel count, the feature map is transformed to the expected channel count via a point-wise convolution, BatchNorm2d, and the GELU activation function. Otherwise, the projection layer is skipped and the feature map proceeds directly to the LCBlock.

The LCBlock is a lightweight residual convolution block designed to enhance local spatial priors while maintaining efficiency. The process begins with a Depthwise Convolution (DWConv), which extracts local spatial features with minimal computational cost. This is followed by a Batch Normalization (BN) layer and two successive Point-wise Convolutions (PWConv). The first PWConv performs a linear transformation to manage channel interactions, while the second restores the channel dimensions. A GELU activation is embedded between the two PWConvs to introduce non-linearity. This architectural sequence ensures the decoupled processing of spatial and channel information, significantly reducing parameters compared to standard \(3\times 3\) convolutions. Finally, a residual connection is integrated to stabilize gradient flow. The formal operator-level definition of LCBlock is provided in Equation (2).

$$\begin{aligned} Y = X + PWConv(GELU(PWConv(BN(DWConv(X))))) \end{aligned}$$
(2)
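A minimal PyTorch sketch of Equation (2) is given below. The hidden-channel expansion ratio and layer names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LCBlock(nn.Module):
    """Sketch of LCBlock: DWConv -> BN -> PWConv -> GELU -> PWConv, with a residual."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion  # expansion ratio is an assumption
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.pw1 = nn.Conv2d(channels, hidden, 1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        # Y = X + PWConv(GELU(PWConv(BN(DWConv(X)))))
        return x + self.pw2(self.act(self.pw1(self.bn(self.dw(x)))))
```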

After processing by the LCBlock, the feature map undergoes normalization via Norm (Layer Normalization). By rescaling the features to a consistent mean and variance, Norm facilitates smoother gradient propagation and ensures numerical stability during the subsequent scanning process. The normalized feature map then passes through the SS2D module to extract long-range contextual information. Finally, the resulting output is integrated via a residual connection with the feature map from the initial Projection layer, serving as the input for the LocalConv module.

Related research has demonstrated that enhancing local features or incorporating a local enhancement branch can significantly improve model performance on small objects and fine-grained image textures30,31. In this paper, we designed a lightweight local enhancement module called LocalConv. It uses a \(3\times 3\) Depthwise Convolution to extract local spatial features for each channel, followed by BatchNorm and SiLU, and then uses a learnable parameter \(\alpha\) to control the weight of this branch. \(\alpha\) is initialized to 0, which renders the branch inert in the early stages of training and acts as a warm-up mechanism. During training, the model automatically adjusts this parameter as it gradually learns the local features of the feature map, which helps maintain stability and controllability.
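The gating behaviour of LocalConv can be sketched as follows. The residual combination with the input is an assumption based on the description that the branch is harmless when \(\alpha\) is 0.

```python
import torch
import torch.nn as nn


class LocalConv(nn.Module):
    """Sketch of LocalConv: DWConv -> BN -> SiLU, gated by a learnable alpha
    initialised to 0 so the branch contributes nothing at the start of training."""

    def __init__(self, channels: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()
        self.alpha = nn.Parameter(torch.zeros(1))  # warm-up gate

    def forward(self, x):
        # residual combination is an assumption; with alpha = 0 the output equals x
        return x + self.alpha * self.act(self.bn(self.dw(x)))
```

At initialisation the module is an identity mapping, and gradient descent is free to grow \(\alpha\) as the local branch becomes useful.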

At the end of the Backbone, we integrated the SPPF (Spatial Pyramid Pooling - Fast) module32. As an optimized variant of SPP, SPPF achieves effective receptive-field aggregation by cascading pooling operations with a fixed kernel size. This structure facilitates the fusion of multi-scale features while maintaining a lower computational overhead compared to traditional pyramid pooling, making it ideal for real-time detection. By aggregating information across varying spatial scales, SPPF enhances the model’s ability to represent targets of different sizes, thereby improving its robustness against scale variations in fish body symptoms.
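The cascaded-pooling structure of SPPF can be sketched as below; channel widths are illustrative. Three successive \(5\times 5\) max-pools emulate pooling at kernel sizes 5, 9, and 13 while reusing intermediate results.

```python
import torch
import torch.nn as nn


class SPPF(nn.Module):
    """Sketch of SPPF: cascaded 5x5 max-pools concatenated with the input branch."""

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)        # receptive field 5
        y2 = self.pool(y1)       # effective receptive field 9
        y3 = self.pool(y2)       # effective receptive field 13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```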

ChannelMerge

In RT-GalaDet, ChannelMerge is used to connect the LocalVSS modules. We introduce the space-to-depth method33 for downsampling the feature maps, which reorganizes the four sub-positions of every \(2\times 2\) pixel block into different channels of the same pixel. Unlike traditional pooling operations, the space-to-depth downsampling method does not discard the pixel information of the feature map; it merely rearranges the spatial information into the channel dimension. Subsequent point-wise convolution then fuses this localized information, which helps reduce the model’s parameter count and computational overhead.
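The space-to-depth rearrangement can be illustrated with a short NumPy sketch (the subsequent point-wise convolution is omitted). Every \(2\times 2\) pixel block is moved into the channel dimension, so no pixel information is discarded.

```python
import numpy as np


def space_to_depth(x, block=2):
    """Rearrange each block x block spatial patch into the channel dimension.
    x: (C, H, W) with H, W divisible by `block`;
    returns (C * block**2, H // block, W // block)."""
    c, h, w = x.shape
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, bh, bw, H', W')
    return x.reshape(c * block * block, h // block, w // block)
```

Unlike max- or average-pooling, the operation is a pure permutation: the set of pixel values before and after is identical.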

Neck

In the original YOLO-style Neck, the C2f module is commonly used to aggregate multi-scale information, and its multi-layer residual stacking structure grants it strong learning capabilities. However, because C2f involves multiple \(3\times 3\) convolutions and residual units, the original Neck has a relatively large number of parameters and FLOPs, which can easily become a performance bottleneck in lightweight models. To better meet the lightweight requirements of fish body surface symptom detection, RT-GalaDet introduces a lighter Neck called Slimneck34, which is based on VoVNet35.

The Slimneck is primarily composed of the VoVGSCSP and GSConv modules, whose operational workflows are illustrated in Fig. 7. In the neck architecture, the feature map output from SPPF is upsampled and concatenated with the LocalVSS output from the Backbone. Subsequently, feature fusion is executed by the VoVGSCSP module. Following two layers of VoVGSCSP processing, the feature map passes through a GSConv layer for channel mixing before entering the subsequent VoVGSCSP module. Finally, the output feature maps from the last three VoVGSCSP modules are fed into the RT-DETR Decoder for final reconstruction and decoding.

Fig. 7

Architectural details of VoVGSCSP and GSConv.

GSConv first processes the input feature map using a \(3\times 3\) standard convolution. The resulting intermediate feature map is then transformed via a \(5\times 5\) depthwise convolution. Subsequently, the feature maps from these two branches, the initial \(3\times 3\) output and the \(5\times 5\) depthwise result, are integrated through a concatenation operation. To prevent feature aggregation from being restricted to specific channel partitions and to minimize the computational overhead associated with memory dimension transformations, GSConv performs a channel shuffle operation. This mechanism mixes the information across channels, yielding an output feature map that maintains the same channel dimensions as those prior to the shuffle.

As a lightweight feature fusion block, VoVGSCSP first compresses the input channels and partitions them into two parallel branches, each with a width equal to 0.5 times the output channel dimension. Feature aggregation is performed in one branch through two consecutive GSConv layers, while the other branch serves as a shortcut. Notably, to maintain feature integrity, only the first GSConv layer utilizes the SiLU activation function, whereas the second GSConv layer is implemented without an activation function. The results from both branches are then concatenated and projected back to the target channel count. Consequently, the model’s parameter count and computational complexity are significantly reduced while preserving its overall expressive power.
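The split/shuffle mechanism of GSConv described above can be sketched in PyTorch. The layer ordering follows the textual description, and the two-group channel shuffle is an assumption about the exact shuffle pattern; this is not the authors' code.

```python
import torch
import torch.nn as nn


class GSConv(nn.Module):
    """Sketch of GSConv: a 3x3 standard conv to half the output channels, a 5x5
    depthwise conv on that result, concatenation, then a channel shuffle."""

    def __init__(self, c_in: int, c_out: int, stride: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_half, 3, stride, 1, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        a = self.conv(x)
        b = self.dwconv(a)
        y = torch.cat([a, b], dim=1)
        # channel shuffle with 2 groups: interleave the two branches so that
        # aggregation is not restricted to either channel partition
        n, c, h, w = y.shape
        return y.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)
```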

Loss function

In terms of the loss function, this paper adopts RT-DETR’s Classification Loss, Bounding Box L1 Regression Loss and Distribution Focal Loss. The total loss is obtained by the weighted summation of these three losses, using weights of 0.5, 7.5, and 1.5 respectively36. The calculation method for the Classification Loss is shown in Equation (3). For each prediction i, \(p_i\) denotes the predicted classification score and \(q_i\) represents the target score. Specifically, \(q_i\) is assigned the value of the IoU between the predicted and ground-truth boxes for positive samples, while it is set to zero for negative samples. The hyperparameters \(\alpha\) and \(\gamma\) are typically set to 0.75 and 2.0 respectively, consistent with the standard implementation in RT-DETR. By assigning larger contributions to high-quality positive samples via the IoU-aware target \(q_i\) and suppressing easy negative samples through the focal modulation term \(p_i^\gamma\), this loss formulation facilitates more effective optimization for multi-class symptom detection. Regarding bounding box regression, \(\hat{b}=(\hat{x},\hat{y},\hat{w},\hat{h})\) represents the location of the predicted box and \(b=(x,y,w,h)\) denotes the ground-truth coordinates. The L1 Regression Loss is calculated according to Equation (4). Furthermore, to refine box boundaries, the Distribution Focal Loss (DFL) is utilized as shown in Equation (5). For a continuous ground-truth coordinate \(y \in [0, n-1]\), let \(l = \lfloor y \rfloor\) and \(r = l + 1\) denote the two closest discrete integers. The model outputs a probability distribution vector \(\textbf{p} = [p_0, p_1, \dots , p_{n-1}]^\top\), which is obtained via a Softmax layer. The terms \(p_l\) and \(p_r\) in Equation (5) represent the specific elements within the vector \(\textbf{p}\) at indices l and r, respectively.
This ensures that the network shifts the probability mass toward the integers nearest to the true floating-point coordinate, thereby improving localization precision.

$$\begin{aligned} \mathcal {L}_{\text {cls}} = {\left\{ \begin{array}{ll} - q_i \Big [ q_i \log (p_i) + (1-q_i) \log (1-p_i) \Big ], & q_i> 0, \\ - \alpha \, p_i^\gamma \, \log (1-p_i), & q_i = 0. \end{array}\right. } \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {L}_{\text {box}} = \left| \hat{x} - x \right| + \left| \hat{y} - y \right| + \left| \hat{w} - w \right| + \left| \hat{h} - h \right| \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {L}_{\text {dfl}} = -\left[ (r - y) \log (p_l) + (y - l) \log (p_r) \right] \end{aligned}$$
(5)
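Equations (4) and (5) translate directly into a few lines of Python; the `dfl` sketch below assumes the target y lies strictly between two valid bins so that both \(p_l\) and \(p_r\) carry probability mass.

```python
import math


def l1_box_loss(pred, gt):
    """Bounding box L1 regression loss, Eq. (4); boxes as (x, y, w, h)."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def dfl(prob, y):
    """Distribution Focal Loss, Eq. (5), for one coordinate.
    prob: softmax distribution over n integer bins; y: continuous target in [0, n-1]."""
    l = int(math.floor(y))
    r = l + 1
    return -((r - y) * math.log(prob[l]) + (y - l) * math.log(prob[r]))
```

For instance, with the distribution `[0.1, 0.1, 0.5, 0.2, 0.1]` and target `y = 2.3`, the loss weights the log-probabilities of bins 2 and 3 by 0.7 and 0.3, pushing probability mass toward the two integers bracketing the true coordinate.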

Result analysis and model evaluation

This paper utilized the proposed RT-GalaDet, an improved RT-DETR incorporating YOLO-style architectural components, to realize the training, validation and testing of a real-time detection model for fish body surface symptoms. The experimental environment included a computer configured with the Linux operating system, an NVIDIA GeForce RTX 4090D (24GB) graphics card and an AMD EPYC 9754 CPU. The software environment comprised CUDA 12.4, Python 3.9 and PyTorch 2.8.

Evaluation metrics

The model’s overall performance is comprehensively assessed by combining multiple metrics. For evaluating detection accuracy, this paper utilizes Precision, Recall, \(\text {mAP}_{50}\), and \(\text {mAP}_{50-95}\). To assess computational efficiency, we report Parameters and GFLOPs. Crucially, given the multi-class nature of the dataset, Precision and Recall are calculated on a per-category basis rather than as a binary disease classification. For any specific category c, Precision measures the proportion of correct predictions among all detections assigned to that class. Conversely, Recall measures the proportion of actual ground-truth instances of category c that were successfully detected. Let \(TP_c\), \(FP_c\), and \(FN_c\) represent the number of true positives, false positives, and false negatives for category c, respectively. The calculations for Precision and Recall are defined in Equations (6) and (7).

$$\begin{aligned} \text {Precision}_c = \frac{TP_c}{TP_c + FP_c} \end{aligned}$$
(6)
$$\begin{aligned} \text {Recall}_c = \frac{TP_c}{TP_c + FN_c} \end{aligned}$$
(7)
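Equations (6) and (7) amount to a few lines of Python:

```python
def precision_recall(tp, fp, fn):
    """Per-category precision and recall, Eqs. (6)-(7); counts for one class c."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```

For example, a class with 8 true positives, 2 false positives and 4 missed instances yields a precision of 0.8 and a recall of about 0.67.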

The Average Precision (AP) for each category is computed as the area under the Precision–Recall (PR) curve. Following the standard COCO evaluation protocol, an all-points interpolation strategy is adopted to ensure that the interpolated precision is a non-increasing function of recall and to mitigate the effects of local ranking fluctuations. Specifically, for a given category c, the interpolated precision \(P_{\text {interp}}(r)\) at recall level r is defined as the maximum measured precision over all recall levels \(\tilde{r}\) greater than or equal to r. The AP is then obtained by integrating the interpolated precision over the interval [0, 1], as shown in Equation (8), where k indexes the discretized sampling points along the PR curve. To ensure reproducibility, we adopted the standard COCO evaluation protocol provided by pycocotools, where the evaluation considers a maximum of 100 detections per image. AP and mAP were computed based on the full precision–recall curves, without applying any manually specified confidence threshold, following the official COCO evaluation procedure.

$$\begin{aligned} AP = \int _{0}^{1} P_{\text {interp}}(r) \, dr \approx \sum _{k=1}^{N} \left[ r(k) - r(k-1) \right] \cdot P_{\text {interp}}(r(k)) \end{aligned}$$
(8)
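The all-points interpolation and the discretized sum in Equation (8) can be sketched as follows; the toy precision–recall samples are illustrative only.

```python
import numpy as np

def average_precision(recall, precision):
    """All-points interpolated AP: make precision non-increasing in recall,
    then integrate the resulting step function over [0, 1] (Equation 8)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Interpolated precision: running maximum taken from the right
    for k in range(len(p) - 2, -1, -1):
        p[k] = max(p[k], p[k + 1])
    # Sum the rectangle areas where recall increases
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy PR samples, sorted by increasing recall
ap = average_precision(np.array([0.2, 0.4, 0.8]), np.array([1.0, 0.8, 0.5]))
```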

As this study involves 20 distinct categories spanning both healthy individuals and those exhibiting surface abnormalities, the mean Average Precision (mAP) is employed to evaluate performance by averaging the \(AP_c\) values across this complete set of classes at a specific IoU threshold. Specifically, \(\text {mAP}_{50}\) denotes the mean AP at an IoU threshold of 0.5 as formulated in Equation (9), where \(AP_c\) represents the average precision for the c-th category. To provide a more rigorous assessment of localization refinement and multi-class detection, \(\text {mAP}_{50-95}\) is further calculated by averaging the mAP values across ten IoU thresholds, ranging from 0.50 to 0.95 with a step size of 0.05.

$$\begin{aligned} \text {mAP}_{50} = \frac{1}{C} \sum _{c=1}^{C} AP_{c} \Big |_{\text {IoU}=0.5} \end{aligned}$$
(9)

Regarding the model’s computational complexity, we use Parameters to measure the total number of learnable parameters in the model and GFLOPs to measure the number of floating-point operations required for one forward inference pass. These metrics evaluate the model’s spatial footprint and computational efficiency, respectively. Since our primary architectural objective is to develop a highly efficient and lightweight detection framework, we aim to reduce the model’s parameter count and floating-point operations while simultaneously improving detection accuracy.
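To make the two complexity metrics concrete, the parameter count and FLOPs of a single convolutional layer can be derived analytically; the layer shape below is hypothetical, not taken from RT-GalaDet.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Parameters and floating-point operations of one k x k convolution
    producing a (c_out, h_out, w_out) feature map."""
    params = c_out * (c_in * k * k + (1 if bias else 0))
    # Each output element needs c_in*k*k multiply-adds, counted as 2 FLOPs each
    flops = 2 * c_in * k * k * c_out * h_out * w_out
    return params, flops

# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 80x80 output map
params, flops = conv2d_cost(c_in=64, c_out=128, k=3, h_out=80, w_out=80)
print(f"{params:,} parameters, {flops / 1e9:.3f} GFLOPs")
```

Summing such per-layer costs over the whole network yields the Parameters and GFLOPs figures reported in the tables.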

Results analysis

To ensure a rigorous and fair comparison, strict parity controls were enforced across all experimental settings. Specifically, the input sample size for both the proposed RT-GalaDet and all baseline models was standardized to \(640\times 640\) pixels. During the training phase, identical protocols were maintained for all models: a unified random seed of 42 was applied to guarantee reproducibility, the batch size was set to 16, the training epoch count was set to 200, and the AdamW optimizer37 was employed with an initial learning rate of 0.0001. Furthermore, a warmup period of 2,000 steps was implemented, and Automatic Mixed Precision (AMP) was enabled to optimize computational efficiency. Crucially, to isolate architectural advantages from data processing effects, the Image Enhancement Pipeline and the training-time CopyPaste augmentation were applied uniformly to all comparative models.
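The fixed-seed protocol described above can be sketched as follows; the PyTorch-specific calls noted in the comment mirror common practice and are assumptions, since the exact seeding code is not given here.

```python
import random
import numpy as np

def set_seed(seed=42):
    """Fix the random number generators so training runs are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    # In a PyTorch training script one would additionally call (assumption):
    # torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)

set_seed(42)
first = np.random.rand(3)
set_seed(42)
second = np.random.rand(3)  # identical to `first`, confirming determinism
```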

Table 3 presents a comparative evaluation of the baseline models and the proposed RT-GalaDet under two distinct configurations: Raw Data and the Full Scheme. The 'Raw Data' status refers to models trained and evaluated on original imagery without the proposed specific interventions. Conversely, the 'Full Scheme' integrates both the CopyPaste augmentation strategy and the Image Enhancement Pipeline. The results demonstrate that this synergistic approach leads to consistent improvements across the primary detection metrics, particularly in Precision and Recall. Although a marginal decrease is observed in \(\text {mAP}_{50-95}\) for RT-GalaDet, the overall performance gains indicate that the combination of targeted data augmentation and consistent feature enhancement effectively improves the model’s ability to extract discriminative features from aquaculture imagery.

Table 3 Comparison of model performance on raw data versus the full scheme.

To analyze misclassifications, we employed an object detection confusion matrix. Unlike the mAP calculation, the confusion matrix serves to simulate a specific deployment decision boundary; therefore, we applied a fixed confidence threshold of 0.25 and an IoU matching threshold of 0.5. A prediction is considered a True Positive if its IoU with a ground-truth box of the same class exceeds 0.5. The matrix explicitly includes a "Background" row and column to account for unmatched instances: the "Background" row represents False Negatives, while the "Background" column captures False Positives. The confusion matrix on the test set is shown in Figure 8. As observed from the results, specific misclassifications persist, particularly between the Eye Damage and Ulceration categories. This phenomenon likely arises because the localized lesion areas for these symptoms are relatively small, posing a significant challenge for the model’s subtle feature extraction and spatial localization. Furthermore, we attribute certain detection errors to the limited sample size of diseased fish, which hinders the model’s capacity to distinguish fine-grained pathological features from surrounding healthy tissue. Despite these challenges, the model achieved high accuracy across most categories, demonstrating overall excellent performance in classification accuracy.
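The matching logic behind such a matrix can be sketched as follows; this is a simplified greedy version of the procedure, and the box format and tie-breaking details are assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def confusion_matrix(preds, gts, n_classes, conf_thr=0.25, iou_thr=0.5):
    """Rows index predictions, columns index ground truth; index n_classes is
    'Background' (row = false negatives, column = false positives)."""
    bg = n_classes
    cm = np.zeros((n_classes + 1, n_classes + 1), dtype=int)
    matched = set()
    for box, cls, conf in preds:
        if conf < conf_thr:
            continue  # simulate the deployment decision boundary
        best, best_iou = None, iou_thr
        for i, (gbox, gcls) in enumerate(gts):
            ov = iou(box, gbox)
            if i not in matched and ov >= best_iou:
                best, best_iou = i, ov
        if best is None:
            cm[cls, bg] += 1            # unmatched prediction: false positive
        else:
            matched.add(best)
            cm[cls, gts[best][1]] += 1  # matched pair (correct or confused)
    for i, (_, gcls) in enumerate(gts):
        if i not in matched:
            cm[bg, gcls] += 1           # undetected ground truth: false negative
    return cm
```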

Fig. 8 Object detection confusion matrix on test set.

To address these issues, future work can supplement the number of diseased fish samples through real-world photography or methods like Generative Adversarial Networks (GANs)38. Alternatively, more advanced data augmentation techniques39 can be employed to expand the minority samples and increase the number of features available for the model to learn. Regarding the model itself, further optimization can be achieved by introducing lightweight local attention mechanisms40 to enhance the model’s ability to distinguish fine-grained lesions.

To comprehensively evaluate the performance of RT-GalaDet, we compare it with several representative real-time object detection models on the augmented fish body surface symptom dataset. The quantitative results are summarized in Table 4. Overall, YOLO-series models exhibit clear advantages in terms of parameter efficiency. However, with the exception of YOLOv12s, whose Precision is marginally higher than that of RT-GalaDet, the remaining YOLO variants demonstrate notably lower Precision and Recall, indicating inferior detection reliability. Using YOLOX as a reference, RT-GalaDet achieves improvements of up to 25.3% in Precision and 6.1% in Recall. Although its Precision is comparable to that of the most accurate model among the compared methods, namely YOLOv12s, RT-GalaDet outperforms YOLOv12s by 9.8%, 4.3%, and 3.6% in Recall, \(\text {mAP}_{50}\), and \(\text {mAP}_{50-95}\), respectively. In comparison with YOLO-NAS, which attains similar mAP values with lower GFLOPs, RT-GalaDet yields gains of 11.1% in Precision and 9.3% in Recall, while reducing the number of model parameters by 29.6%. These results demonstrate that RT-GalaDet achieves a more favorable balance between lightweight design and high-accuracy detection. Furthermore, although the RTDETR-YOLOv8s baseline benefits from a hybrid encoder–decoder architecture, its mAP performance remains suboptimal for this specialized detection task.

As classic RT-DETR models, RTDETR-Resnet18 and RTDETR-L achieve high Precision, but their high parameter counts and computational complexity make them less capable of meeting real-time detection requirements compared to the proposed model in this paper. Specifically, compared to the large-scale RTDETR-L, RT-GalaDet reduces the parameter count by \(59.21\%\) and GFLOPs by \(76.88\%\), while maintaining superior accuracy and achieving higher Recall. Even compared to the relatively lightweight RTDETR-Resnet18, RT-GalaDet significantly reduces the parameter count and GFLOPs by \(33.3\%\) and \(57.1\%\), respectively. While existing lightweight models like NanoDet-m41 are superior in lightweight characteristics, their detection accuracy fails to meet the requirements for practical application.

The proposed RT-GalaDet achieves a comprehensive performance with a Precision of \(93.3\%\), Recall of \(89.7\%\), \(\text {mAP}_{50}\) of \(89.0\%\), and \(\text {mAP}_{50-95}\) of \(79.0\%\), all while maintaining a low parameter count and computational cost. Compared to RTDETR-YOLOv8s, RT-GalaDet improves Precision, Recall, \(\text {mAP}_{50}\), and \(\text {mAP}_{50-95}\) by \(1.2\%\), \(1.9\%\), \(1.5\%\), and \(1.2\%\), respectively, while simultaneously reducing GFLOPs by \(9.1\%\). This demonstrates that the architectural optimizations effectively suppress computational overhead without compromising feature expression. According to the prevailing consensus in the computer vision community42,43,44,45, a model achieving 30 frames per second (FPS) is generally considered sufficient for real-time video stream processing. In this study, the FPS of RT-GalaDet was evaluated under a standardized inference protocol, with an input resolution of \(640 \times 640\) pixels and a batch size of 1 to emulate real-time deployment conditions. Under these settings, RT-GalaDet achieves an FPS of 51.98, indicating its strong potential for real-time applications.
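The FPS figure reported above follows the standard batch-size-1 protocol; a minimal timing harness for any callable model might look like the following, where the dummy model is merely a placeholder for a real detector.

```python
import time
import numpy as np

def measure_fps(model, input_shape=(1, 3, 640, 640), warmup=10, runs=100):
    """Average frames per second for batch-size-1 inference.
    Warmup passes are discarded so startup cost does not skew the timing."""
    x = np.zeros(input_shape, dtype=np.float32)
    for _ in range(warmup):
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    elapsed = time.perf_counter() - start
    return runs / elapsed

# Placeholder standing in for a detector's forward pass
dummy_model = lambda x: x.mean()
fps = measure_fps(dummy_model, runs=50)
```

For a GPU model, one would additionally synchronize the device before reading the clock, since kernel launches are asynchronous.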

Table 4 Model comparison results.

Ablation study

To validate the effectiveness of the proposed improvements, this paper designed five sets of ablation experiments. All experiments were conducted on the same computing platform using an identical dataset and hyperparameter configuration. The detection head was kept fixed during both training and evaluation, and the random seed was set to 42, thereby ensuring the determinism and comparability of the experimental process. The experimental groups included the original RTDETR-YOLOv8s model; a model retaining the Neck but replacing the Backbone with SS2D; a model using SS2D and LCBlock; a model using SS2D, LCBlock and LocalConv; and finally, the model including all Backbone improvements while also replacing the Neck.

The ablation results in Table 5 indicate that substituting the RTDETR-YOLOv8s backbone with SS2D effectively reduces GFLOPs, owing to the simplified computational overhead of the SS2D mechanism. Simultaneously, this architectural shift yields measurable improvements in Recall and mAP. The slight decrease in Precision is likely associated with SS2D’s enhanced focus on spatial feature extraction, which broadens the response range for low-level features. While this facilitates the detection of small targets and blurry boundaries, it may also increase sensitivity to ambiguous background regions, potentially leading to more false positives50. The subsequent integration of the LCBlock introduces a more selective local spatial modeling mechanism. This architectural shift prioritizes highly discriminative regions, which is reflected in the increased Precision but a trade-off in Recall and mAP, suggesting a more restrictive feature activation pattern51. This localized selectivity is compensated by the addition of LocalConv, which broadens the receptive field for subtle edges and small targets. By balancing long-range spatial dependencies with local refinement, the model recovers the metrics lost during the LCBlock phase, leading to a significant overall performance gain52. Finally, the introduction of Slimneck optimizes the model’s structural efficiency, achieving a reduction in parameter count and GFLOPs while maintaining stable Precision and \(\text {mAP}_{50-95}\) alongside increased \(\text {mAP}_{50}\). This consistent trend of improvement across multiple evaluation dimensions, validated under a fixed-seed deterministic protocol, confirms the complementary nature of the proposed modules.

Table 5 Module ablation experiment results.

Figure 9 presents the feature activation visualization generated using Grad-CAM53 to interpret the model’s decision-making process across different scenarios. To ensure consistency and reproducibility, our visualization protocol targets the final convolutional layer of the neck, where the integration of local and global features is most prominent. Crucially, to resolve the ambiguity inherent in multi-object detection contexts, we implemented an instance-specific visualization protocol: the gradient calculation is performed individually for each detected object query by back-propagating the score of its highest-confidence class. The raw activation scores are min–max normalized to the [0, 1] range to stabilize the visualization scale.
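The min–max normalization step of this visualization protocol can be sketched as follows; the toy activation scores are illustrative only.

```python
import numpy as np

def minmax_normalize(cam, eps=1e-8):
    """Rescale a raw Grad-CAM activation map to [0, 1] so heatmaps from
    different object queries share a comparable visualization scale."""
    lo, hi = cam.min(), cam.max()
    return (cam - lo) / (hi - lo + eps)

raw = np.array([[-2.0, 0.0], [1.0, 6.0]])  # toy per-pixel activation scores
heat = minmax_normalize(raw)
```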

Fig. 9 Visual interpretation of model predictions using Grad-CAM (a) Original image (b) RTDETR-SS2D (c) RTDETR-SS2D-LCBlock (d) RTDETR-SS2D-LCBlock-LocalConv (e) ours.

The visualization presents a comparative analysis between a single specimen exhibiting body surface abnormalities, displayed in the left column of Figure 9, and multiple healthy specimens, displayed in the right column. As illustrated in Figure 9b, the exclusive utilization of the SS2D module results in diffuse activation patterns distributed across the background in both scenarios. This observation reflects the capacity of the module for long-range contextual modeling yet highlights a deficiency in specific feature localization. The progression from Figure 9c to 9e demonstrates that the sequential integration of local feature enhancement and channel mixing components induces a discernible trend of spatial concentration within the activation maps. In the proposed RT-GalaDet shown in Figure 9e, this refinement culminates in a dual capability. Specifically, for the specimen with body surface abnormalities in the left column, high-intensity activation kernels are precisely centered on the ulcerated lesions. Conversely, for the multi-object healthy scenario in the right column, the heatmaps align strictly with the species-specific body stripes of each individual fish. This explicit spatial alignment confirms that the proposed architectural components effectively refine the spatial selectivity of the model. Consequently, the network prioritizes valid discriminative cues, encompassing both symptomatic lesions and morphological textures, while successfully mitigating reliance on environmental shortcuts.

Discussion

Model performance and advantages

The proposed RT-GalaDet model demonstrates consistent improvements over the RTDETR-YOLOv8s baseline in terms of Precision, Recall, and mAP. These results suggest that the proposed structural enhancements achieve a favorable balance between detection accuracy and computational efficiency. Specifically, the integration of LocalVSS enhances spatial feature extraction by effectively capturing the morphological and textural characteristics of fish lesions, thereby improving the recognition of fine-grained symptoms. In addition, the Slimneck-based feature fusion strategy substantially reduces computational overhead, further improving the model’s suitability for real-time detection scenarios.

Robustness evaluation under simulated environmental conditions

To preliminarily assess the intrinsic resilience of the proposed RT-GalaDet against environmental perturbations, we conducted a controlled simulation experiment on the test set. To ensure the determinism and verifiability of this stress test, all stochastic augmentations were generated with a fixed random seed of 42. The degradation intensity was carefully calibrated to preserve the semantic integrity of the ground-truth annotations, ensuring that image features remained visually distinguishable for valid evaluation.

As illustrated in Figure 10, the simulation of low-illumination conditions involved reducing the image brightness by \(10\%\) to \(30\%\) and the contrast by \(10\%\) to \(15\%\), as shown in Figure 10b. To simulate turbid water environments, the original image in Figure 10a was first subjected to Gaussian blur with kernel sizes ranging from 2 to 4 pixels to model light scattering effects. Subsequently, a \(40\%\) probability of RGB channel shifts within the range of \(\pm (3, 10)\) was applied, in which the red channel was decreased while the green or blue channels were increased to reproduce the greenish or yellowish color bias commonly observed in turbid water bodies. In addition, Gamma correction was applied with a \(30\%\) probability using values in the range of 0.85 to 1.15, reflecting brightness attenuation variations at different water depths, resulting in Figure 10c. Furthermore, the degradation strategies illustrated in Figure 10b and Figure 10c were superimposed to generate Figure 10d, which represents the combined effects of simulated low illumination and water turbidity. After applying these respective degradations, the model was evaluated directly using the original training hyperparameters without any test-time adaptation. The quantitative results, strictly limited to the scope of these simulated conditions, are reported in Table 6.
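Two components of this degradation pipeline, the brightness/contrast reduction and the gamma correction, can be sketched with simple pixel operations; the blur and RGB-shift steps are omitted here, and the uint8 image format and draw order are implementation assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, as in the stress test

def simulate_low_light(img):
    """Reduce brightness by 10-30% and contrast by 10-15% (uint8 RGB image)."""
    brightness = 1.0 - rng.uniform(0.10, 0.30)
    contrast = 1.0 - rng.uniform(0.10, 0.15)
    out = img.astype(np.float32) * brightness   # darken
    mean = out.mean()
    out = (out - mean) * contrast + mean        # flatten contrast around the mean
    return np.clip(out, 0, 255).astype(np.uint8)

def simulate_gamma(img):
    """Gamma correction in [0.85, 1.15], modeling depth-dependent attenuation."""
    gamma = rng.uniform(0.85, 1.15)
    return (255.0 * (img / 255.0) ** gamma).astype(np.uint8)

degraded = simulate_gamma(simulate_low_light(np.full((4, 4, 3), 200, dtype=np.uint8)))
```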

Fig. 10 Visualization of synthetic environmental degradations (a) Original image (b) Simulated low-light condition (c) Simulated turbid water condition (d) Combined synthetic degradation.

Table 6 Performance evaluation under simulated environmental conditions.

The evaluation across the synthetically degraded test set indicates that the proposed model maintains operational stability under these specific simulated perturbations. Specifically, the \(\text {mAP}_{50}\) exhibited marginal decreases of \(3.1\%\) and \(3.3\%\) under simulated low-illumination and turbid scenarios, respectively. Even under the combined synthetic degradation scenario, the \(\text {mAP}_{50}\) remained robust with a decrease of only \(3.3\%\). Analysis of the performance shift in this composite simulation reveals a drop in Recall to \(84.6\%\) alongside an increase in Precision to \(93.6\%\). This phenomenon is likely linked to the distribution of detection confidence scores under severe noise. The simulated environmental interference acts as a natural filter that suppresses the confidence scores of both ambiguous targets and potential background false positives. Since only detections exceeding a fixed confidence threshold are retained, the model primarily outputs high-confidence predictions associated with the most salient features. While this filtering effect increases the miss rate, it simultaneously prunes low-confidence false positives, thereby yielding a higher Precision. These results demonstrate that RT-GalaDet retains feature discriminability within the scope of these synthetic visual disturbances.

Model limitations

Owing to the inherent limitations of computer vision methods, ordinary underwater image acquisition equipment cannot obtain evidence of internal fish diseases without harming the fish. The methodology of this study is therefore restricted to detecting symptoms on the fish body surface. Disease in fish is often accompanied by both external symptoms and behavioral abnormalities, so fish disease diagnosis requires a comprehensive judgment based on multiple symptoms; the method in this study, however, detects only body surface symptoms and does not cover abnormal fish behavior. Consequently, the proposed method cannot detect pathogens inside the fish. It can serve only as a pre-warning and auxiliary detection tool rather than a replacement for existing fish disease diagnosis methods, and professional personnel are still required to confirm the disease after symptom analysis. Additionally, because the current model operates at the object detection stage and does not integrate object tracking or re-identification methods, it cannot track the detected fish exhibiting body surface symptoms.

In terms of experimental data, the images used in this study were collected in a controlled laboratory environment, which typically features more uniform backgrounds and lower noise levels compared to open-water aquaculture. While our simulation experiments targeted specific stressors such as low illumination and turbidity, real-world farming environments present a higher complexity of stochastic noise, varying farming densities, and complex light scattering. Furthermore, due to dataset constraints, our evaluation was limited to a specific range of fish species. The model’s generalizability to species of significantly different sizes or to individuals exhibiting compound symptoms, where multiple pathologies overlap on a single fish, remains to be verified. These factors define the current boundary of our robustness analysis; consequently, the reported performance should be interpreted as the model’s resilience within a defined simulated scope rather than a comprehensive validation across all potential real-world aquaculture variables.

Regarding model performance, as the complexity of real aquaculture environments increases further, the model’s generalization ability still needs improvement under practical conditions such as low illumination and poor water quality. Although the model proposed in this paper reduces computational complexity to some extent, lightweight design carries an accompanying trade-off: it risks limiting the sufficiency of feature fusion54.

Future work

Regarding model functionality, adapting the model to different fish species under various aquaculture conditions will be a future research direction. Expanding the dataset itself is the most direct and effective approach. Future work will involve collecting images in real aquaculture scenarios and extending data application to other fish species of different sizes. This will broaden the application scenarios of the model to target a greater variety of fish in real-world settings. Building upon this, the introduction of object tracking55 and re-identification56 methods can enable precise localization and tracking of every detected fish exhibiting body surface symptoms. Furthermore, since the onset of fish diseases is often accompanied by the concurrent appearance of multiple body surface symptoms and behavioral anomalies, future research will also include the simultaneous detection of multiple symptoms occurring on the body surface of the same fish. By incorporating time-series features, the continuous action detection of fish can be achieved under a multi-task model architecture57. Combining multiple symptoms (both external and behavioral) will provide a more accurate basis for the definitive diagnosis of fish diseases.

To enhance model performance in complex scenarios, knowledge distillation58 serves as a promising strategy for compensating for semantic information loss incurred by structural simplification. Specifically, a high-capacity teacher model, such as RT-DETR-L or a ResNet-101-based variant, can be employed to guide the training of the student model. By implementing feature-based distillation at critical bottleneck layers, the student model is incentivized to internalize rich semantic representations. To facilitate effective knowledge transfer, projection layers are utilized to align feature manifolds across different architectures, while a weighted multi-task loss function balances the distillation objective with primary detection accuracy. Such a mechanism mitigates potential information loss without increasing inference latency. Furthermore, bridging the gap between laboratory results and real-world deployment requires advanced anti-interference techniques. Potential directions include dynamic channel response suppression to enhance resilience against sensor noise59, and the integration of frequency-domain information to preserve high-frequency details for small or blurry targets60. Beyond visual enhancements, domain adaptation61,62 can be leveraged to minimize distribution discrepancies between source and target environments. Finally, multimodal learning63 offers a path to incorporate non-visual data, supplementing the model’s learnable features with complementary information from diverse modalities to strengthen overall diagnostic reliability.
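The feature-based distillation with projection alignment outlined above can be sketched in a framework-agnostic way; the tensor shapes, the MSE objective, and the loss weighting are illustrative assumptions rather than a finalized design.

```python
import numpy as np

rng = np.random.default_rng(0)

def distillation_loss(student_feat, teacher_feat, projection, alpha=0.5, det_loss=0.0):
    """Feature-based distillation: project student features into the teacher's
    channel dimension, penalize the mean-squared gap, and combine with the
    primary detection loss via a weighted multi-task objective."""
    projected = student_feat @ projection            # (N, C_s) -> (N, C_t)
    distill = np.mean((projected - teacher_feat) ** 2)
    return alpha * distill + (1.0 - alpha) * det_loss

# Hypothetical shapes: student with 128 channels, teacher with 256
student = rng.standard_normal((64, 128))
teacher = rng.standard_normal((64, 256))
proj = rng.standard_normal((128, 256)) * 0.01        # learnable projection layer
loss = distillation_loss(student, teacher, proj, alpha=0.5, det_loss=1.2)
```

In training, the projection matrix would be a learnable layer updated jointly with the student, so no extra cost is incurred at inference time.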

Conclusion

This paper proposes a lightweight framework for the real-time detection of fish body surface symptoms in underwater environments, providing an efficient automated solution to facilitate early disease monitoring in aquaculture. We improved the RT-DETR framework through modular modifications to construct RT-GalaDet, an end-to-end model that integrates State Space Modeling and a Local Enhancement mechanism. This approach simultaneously enhances detection performance and reduces the model’s computational overhead. In terms of evaluation metrics, RT-GalaDet achieved a Precision of \(93.3\%\), Recall of \(89.7\%\), \(\text {mAP}_{50}\) of \(89.0\%\), \(\text {mAP}_{50-95}\) of \(79.0\%\), a Parameter count of 13,399,974, GFLOPs of 25.0, and an \(\text {FPS}\) of 51.98. The results indicate that RT-GalaDet provides a viable technical baseline for the real-time monitoring of fish body surface symptoms. By enabling automated pre-warning of visible pathological signs, the model serves as an effective decision-support tool in aquaculture, potentially contributing to the mitigation of disease-related mortality and the enhancement of early-stage health management in fishery production. This paper also analyzes the limitations of computer vision methods for fish body surface symptom detection, providing feasible directions for future work.