Introduction

The zebrafish (Danio rerio) holds a prominent position as a model organism in biomedical research1. Its high genomic homology with humans2, coupled with characteristics such as transparent embryos, rapid growth, and strong reproductive capacity, makes it an ideal model for genetic studies, developmental biology, neuroscience, and toxicological research3,4,5. As a highly social species, zebrafish frequently form coordinated swimming collectives, a behavior known as shoaling, in which individuals interact to create globally ordered group movement. Shoaling serves as a critical paradigm for investigating zebrafish environmental adaptation mechanisms and social interactions6.

Although computer vision techniques have been increasingly applied to shoaling behavior analysis, challenges remain in achieving reliable and interpretable quantification, particularly under complex conditions. The earliest investigation can be traced back to 2002 when Israeli7 first analyzed fish group behavioral responses under hypoxic stress, revealing a downward shift tendency in shoaling centroid within oxygen-depleted tanks. It was not until 2014 that Sadoul8 developed an algorithm based on ImageJ, employing a relatively straightforward threshold segmentation method for fish group extraction. They proposed two quantitative indices, Group Dispersion Index (GDI) and Group Activity Index (GAI), where GDI quantifies dispersion through total perimeter calculation of black pixels, while GAI estimates activity by subtracting pixel counts of fish groups between consecutive frames. These studies pioneered quantitative analysis but relied on conventional foreground extraction algorithms prone to detection omissions, limiting accuracy.

With the development of advanced detection algorithms, Yu9 investigated zebrafish group anomalies using a methodology based on motion feature statistics, in which the optical flow method was employed to derive shoaling velocity and angular parameters, and anomalous behaviors were then evaluated through two characteristic factors. Inspired by such indirect feature-based approaches, Zhao10 proposed an improved kinetic energy model for shoaling behavior quantification, utilizing the Lucas-Kanade optical flow method to determine group velocity and dispersion. However, the inherent light sensitivity of optical flow techniques can compromise activity intensity quantification. In 2020, Han11 developed a convolutional neural network-based model that integrates shoaling spatial distribution images with optical flow energy maps for behavior recognition. While deep learning improved robustness, the generated features were not readily interpretable, limiting application in ethological research and cross-study comparability.

In this study, we propose a cascaded detection-tracking framework for zebrafish shoaling behavior quantification, integrating a multi-scale detection model with global attention and an interactive multiple-model Kalman filter, along with a posture-aware appearance feature network. We further design a multidimensional feature set encompassing kinetic and spatial metrics and validate its effectiveness using ethanol exposure experiments. The contributions of this work include:

  (1) A multi-scale detection model with extended feature pyramid networks and global attention mechanisms, improving small-object detection while maintaining real-time processing.

  (2) An Interactive Multiple Model Kalman Filter and retrained posture-aware appearance feature network, improving tracking robustness in dense shoals and reducing identity switching.

  (3) A multidimensional feature set enabling quantitative analysis of ethanol-induced behavioral modulation, revealing a biphasic response: low-concentration ethanol increased shoal activity, whereas high concentrations reduced movement and dissolved shoal structure.

Method and materials

Ethics statement

All animal experimental procedures were conducted in accordance with the ARRIVE Essential 10 guidelines and approved by the Animal Ethics Committee of Tianjin University. According to the Chinese national standard (GB/T 27416–2014), hyperactivity and loss of equilibrium in fish are considered a humane endpoint. At this stage, we used 300 mg of benzocaine hydrochloride to minimize animal suffering.

Behavioral monitoring system

The camera was manufactured by Qingdao Vzense Technology Co., Ltd. Videos were recorded at 1920 × 1080 resolution at 30 fps. A monitoring system (Fig. 1) was constructed to capture image sequences of zebrafish. White balance lights were used to enhance image contrast and clarity. The illuminance was set at 300 lx because, according to Gerlai12, this level does not induce significant differences in zebrafish shoaling behavior. Additionally, a light-diffusing panel was employed to prevent pixel anomalies. Together, these measures improved image quality and facilitated subsequent image processing.

Fig. 1

The main components of the behavior monitoring system. Data are saved as digital image sequences with defined spatial (pixel) and temporal (FPS) resolutions.

The computational environment setup for neural network training is outlined in Table S1, with detailed hyperparameter configurations provided in Table S2.

A zebrafish imaging dataset comprising 9000 frames, captured over 5 min at 30 fps, was partitioned into training, validation, and test subsets. Manual annotation of training samples was performed using LabelImg, with each zebrafish annotated via an axis-aligned bounding box. The spatial coordinates and class labels were stored in Pascal VOC-compliant XML metadata files, ensuring compatibility with mainstream detection frameworks.

Test organisms, grouping, experimental procedures

Fifteen wild-type male red zebrafish (3–5 months old, bred in-house in our laboratory) were utilized in the experiment. Prior to testing, the fish were maintained in a 14:10 light–dark cycle within filtered water for two weeks, with daily water renewal, single feeding sessions, continuous oxygenation, and precise temperature control.

Adopting elements of Teles’ experimental design13, the subjects were randomly allocated to three ethanol (produced by Tianjin Hengxing Chemical Reagent Manufacturing Co., Ltd, 99.8% v/v) exposure groups: 0% (CN, n = 5), 0.50% v/v (EM, n = 5, 10 ml), and 1% v/v (EH, n = 5, 20 ml).

Each shoaling unit consisted of 5 individuals, a group size validated to reliably induce shoaling behavior14. As illustrated in Fig. 2, following 1-h acclimation and exposure phases, 15-min behavioral monitoring sessions generated 81,000 raw image frames for analysis.

Fig. 2

Ethanol exposure experimental workflow.

Study design

The proposed shoaling behavior quantification framework in this study is illustrated in Fig. 3.

Fig. 3

Proposed shoaling behavior quantification framework.

During the detection model development phase, a progressive pipeline spanning image preprocessing to detection model optimization was developed. During preprocessing, an improved contrast-limited adaptive histogram equalization (CLAHE) algorithm combined with gamma correction was employed to enhance anatomical features, delivering high-quality inputs for subsequent detection. Building upon the YOLOv8s baseline model, the multi-scale zebrafish detection model (ZebraYOLO) was established. Key enhancements include:

  (1) Extension of the feature pyramid hierarchy to improve detection accuracy and reduce missed detections.

  (2) Integration of a global attention mechanism to refine feature extraction from occluded or blurred targets.

  (3) Optimization of the loss function to reduce computational overhead and accelerate training.

During the tracking stage, the traditional multi-object tracking framework was upgraded through three enhancements:

  (1) Replacement of the default detector with ZebraYOLO, ensuring reliable input for tracking.

  (2) Introduction of an Interactive Multiple Model Kalman Filter (IMM-KF) integrating constant velocity, constant acceleration, and turning motion models to adaptively handle erratic zebrafish movement.

  (3) Retraining of the ReID network on our zebrafish dataset to build a posture-aware appearance feature network, enhancing identity discrimination capabilities.

Development of detection network

High-precision zebrafish detection faces a central challenge: missed detections and localization drift caused by the fish’s small body size and rapid movement. Therefore, we redesigned the YOLOv8s detection model with task-specific optimizations.

Extension of feature pyramid levels

The bottleneck in object detection primarily stems from the design of its foundational detection layers. YOLOv8 employs a feature pyramid network (FPN) architecture, generating feature maps at three hierarchical levels: P3 (80 \(\times\) 80), P4 (40 \(\times\) 40), and P5 (20 \(\times\) 20). As derived from Eq. (1), each 21 \(\times\) 21-pixel region in the input image corresponds to only one pixel in the P3 feature map. Statistical analysis reveals that zebrafish typically occupy 10 \(\times\) 30 to 10 \(\times\) 80 pixels in raw images, with the smallest individuals mapped to approximately 1.5 \(\times\) 4 pixels on the P3 layer. At this scale, morphological and textural details of fish are prone to information attenuation due to spatial compression effects, causing critical feature responses to fall below detection thresholds.

$$\left\{ \begin{gathered} r_{0} = 1 \hfill \\ r_{1} = k_{1} \hfill \\ r_{n} = r_{n - 1} + \left( {k_{n} - 1} \right)\prod\nolimits_{i = 1}^{n - 1} {s_{i} } \quad \left( {n \ge 2} \right) \hfill \\ \end{gathered} \right.$$
(1)

where \({k}_{n}\), \({s}_{n}\), and \({r}_{n}\) denote the kernel size, stride, and receptive field size of the \(n\)th layer, respectively.
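
The receptive field recursion and the scale mapping discussed above can be checked numerically. The following is a minimal sketch, assuming the standard form of the recursion, \(r_{n} = r_{n-1} + (k_{n} - 1)\prod_{i=1}^{n-1} s_{i}\), which reduces to \(r_{1} = k_{1}\) for the first layer. The three-layer stack is a hypothetical example and not the actual YOLOv8s backbone, and the strides of 8 (P3) and 4 (P2) assume a 640-pixel input, which is implied by the 80 \(\times\) 80 and 160 \(\times\) 160 map sizes but not stated explicitly.

```python
def receptive_field(layers):
    """Return the receptive field size after each layer.

    layers: list of (kernel_size, stride) tuples ordered from input to output.
    """
    fields = []
    r = 1      # r_0 = 1: a single input pixel
    jump = 1   # product of the strides of all preceding layers
    for k, s in layers:
        r = r + (k - 1) * jump
        fields.append(r)
        jump *= s
    return fields


def size_on_feature_map(size_px, stride):
    """Approximate extent of an object on a feature map with the given stride."""
    return size_px / stride


# Hypothetical stack of three 3x3 convolutions with stride 2.
print(receptive_field([(3, 2), (3, 2), (3, 2)]))  # [3, 7, 15]

# Smallest fish reported in the text: about 10 x 30 pixels.
for name, stride in (("P3", 8), ("P2", 4)):
    w = size_on_feature_map(10, stride)
    h = size_on_feature_map(30, stride)
    print(f"{name}: {w:.2f} x {h:.2f} feature-map pixels")
```

Under these assumptions the smallest fish covers roughly 1.25 \(\times\) 3.75 cells on P3 but 2.5 \(\times\) 7.5 cells on a stride-4 map, which is the motivation for adding a finer detection layer.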

Figure 4 illustrates the core architectural components of the YOLOv8s baseline model. To overcome the limitations of shallow feature representation, we augmented the CSPDarknet backbone with an upsampling module. This module upsamples the P3 layer (80 \(\times\) 80) by a factor of 2 to generate a 160 \(\times\) 160 P2 layer. A novel feature fusion pathway from P2 to P3 was integrated into the FPN-PAN structure, followed by the addition of a P2 detection head in the Head section. With this design, the receptive field becomes 9 pixels. The improved architectural configuration is depicted in Fig. 5.

Fig. 4

Core architectural components of the YOLOv8s baseline.

Fig. 5

Schematic diagram of the hierarchically expanded mapping architecture.

Integration of global attention mechanism

The rapid movement of zebrafish often results in motion blur, which reduces feature confidence in contour regions. Attention mechanisms, initially proposed by Bahdanau15, dynamically enhance edge gradient features by enabling neural networks to allocate computational resources adaptively, focusing on critical information while suppressing irrelevant noise. Traditional attention mechanisms rely on core components such as encoder-decoder frameworks, context vectors, and alignment models. While these mechanisms effectively filter important feature channels and highlight local salient regions, they neglect global interactions across channel-spatial domains, leading to diminished target feature responses and loss of contextual information.

To address these limitations, this study embeds a Global Attention Mechanism16 (GAM) before the Detection Head of YOLOv8s (Fig. 4c). GAM enhances the network’s perceptual capability by establishing global dependencies among input features. As illustrated in Fig. 6, GAM consists of two cascaded submodules: a Channel Attention Submodule (Fig. 6a), which introduces 3D feature stacking and a two-layer Multilayer Perceptron (MLP) to amplify global interactions across channels, and a Spatial Attention Submodule (Fig. 6b), which removes pooling operations and employs convolutional layers to strengthen spatial information fusion, mitigating feature degradation.

Fig. 6

Diagram of global attention mechanism.

Improvement of loss function

To enhance zebrafish localization accuracy, we introduce the Minimum Point Distance Intersection over Union (MPDIoU) loss function17, an improved bounding box regression loss that optimizes detection precision by holistically integrating multiple geometric parameters. Additionally, to address partial zebrafish occlusion scenarios, this study introduces the Soft-NMS algorithm. By gradually decreasing the confidence scores of overlapping bounding boxes rather than completely suppressing them, this approach effectively preserves partially occluded targets. The progressive score attenuation mechanism mitigates overly aggressive suppression in traditional NMS (Non-Maximum Suppression), thereby enhancing detection performance for dense targets while maintaining the integrity of overlapping biological specimens.
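
The score-attenuation idea behind Soft-NMS can be illustrated with a short sketch. The text does not say which decay function was used, so the Gaussian variant below, together with the values of `sigma` and `score_thresh`, is an assumption made for illustration only. Boxes are given as `(x1, y1, x2, y2)` tuples.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the highest-scoring remaining box.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        box, score = candidates.pop(best)
        kept.append((box, score))
        # Attenuate the scores of the remaining boxes according to their overlap.
        decayed = []
        for other, s in candidates:
            s *= math.exp(-(iou(box, other) ** 2) / sigma)
            if s >= score_thresh:
                decayed.append((other, s))
        candidates = decayed
    return kept


# Two heavily overlapping fish and one isolated fish (made-up coordinates).
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
for box, score in soft_nms(boxes, scores):
    print(box, round(score, 3))
```

In this toy case hard NMS with a typical IoU threshold would discard the second box entirely, whereas the soft variant keeps it with a reduced score, which is the behavior the text relies on for partially occluded fish.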

Figure 7 illustrates the computational framework of MPDIoU. During training, this function minimizes the loss value, guiding the predicted bounding box (\({B}_{prd}\)) to converge toward the ground truth bounding box (\({B}_{gt}\)). All factors in existing bounding box regression losses can be determined using the coordinates of four key points. The transformation formulas are as follows:

$$\left| C \right| = \left[ {\max \left( {x_{2}^{gt} ,x_{2}^{prd} } \right) - \min \left( {x_{1}^{gt} ,x_{1}^{prd} } \right)} \right] \times \left[ {\max \left( {y_{2}^{gt} ,y_{2}^{prd} } \right) - \min \left( {y_{1}^{gt} ,y_{1}^{prd} } \right)} \right]$$
(2)
$$x_{c}^{gt} = \frac{{x_{1}^{gt} + x_{2}^{gt} }}{2},y_{c}^{gt} = \frac{{y_{1}^{gt} + y_{2}^{gt} }}{2}$$
(3)
$$y_{c}^{prd} = \frac{{y_{1}^{prd} + y_{2}^{prd} }}{2},x_{c}^{prd} = \frac{{x_{1}^{prd} + x_{2}^{prd} }}{2}$$
(4)
$$\omega_{gt} = x_{2}^{gt} - x_{1}^{gt} ,h_{gt} = y_{2}^{gt} - y_{1}^{gt}$$
(5)
$$\omega_{prd} = x_{2}^{prd} - x_{1}^{prd} ,h_{prd} = y_{2}^{prd} - y_{1}^{prd}$$
(6)

where \(\left|C\right|\) represents the area of the minimum enclosing rectangle covering both the ground truth annotation bounding box and the predicted bounding box. \(({x}_{c}^{gt},{y}_{c}^{gt})\) and \(({x}_{c}^{prd},{y}_{c}^{prd})\) denote the center coordinates of the ground truth annotation bounding box and the predicted bounding box, respectively. \({\omega }_{gt}\) and \({h}_{gt}\) represent the width and height of the ground truth annotation bounding box, while \({\omega }_{prd}\) and \({h}_{prd}\) correspond to the width and height of the predicted bounding box.
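
For orientation, a minimal sketch of an MPDIoU-style loss is given below. The closed form used here, \(1 - (IoU - d_{1}^{2}/(w^{2}+h^{2}) - d_{2}^{2}/(w^{2}+h^{2}))\), where \(d_{1}\) and \(d_{2}\) are the distances between the top-left and bottom-right corner pairs and \(w\), \(h\) are the input image dimensions, follows the MPDIoU publication as commonly stated. It is not spelled out in this section, so it should be read as an assumption. The function name and the example image size are hypothetical.

```python
def mpdiou_loss(pred, gt, img_w, img_h):
    """MPDIoU-style loss for two boxes given as (x1, y1, x2, y2).

    Assumed form: 1 - (IoU - d1^2 / (w^2 + h^2) - d2^2 / (w^2 + h^2)).
    """
    x1p, y1p, x2p, y2p = pred
    x1g, y1g, x2g, y2g = gt

    # Intersection and union areas.
    iw = max(0.0, min(x2p, x2g) - max(x1p, x1g))
    ih = max(0.0, min(y2p, y2g) - max(y1p, y1g))
    inter = iw * ih
    union = (x2p - x1p) * (y2p - y1p) + (x2g - x1g) * (y2g - y1g) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners.
    d1_sq = (x1p - x1g) ** 2 + (y1p - y1g) ** 2   # top-left corners
    d2_sq = (x2p - x2g) ** 2 + (y2p - y2g) ** 2   # bottom-right corners
    norm = img_w ** 2 + img_h ** 2

    return 1.0 - (iou - d1_sq / norm - d2_sq / norm)


gt_box = (100, 100, 130, 110)   # a 30 x 10 pixel fish (made-up coordinates)
print(mpdiou_loss(gt_box, gt_box, 640, 640))              # 0.0 for a perfect match
print(mpdiou_loss((104, 102, 134, 112), gt_box, 640, 640))
```

Because both corner distances enter the loss, a prediction with the correct overlap but a shifted or rescaled box is still penalized, which is helpful for thin, elongated targets.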

Fig. 7

Computational framework of MPDIoU.

Incorporating these enhancements, the restructured multi-scale zebrafish detection model architecture, designated as ZebraYOLO, is depicted in Fig. 8.

Fig. 8

Architectural framework of ZebraYOLO for detection.

Establishment of tracking framework

The stochastic nature of fish shoal movement, marked by occlusion18, nonlinear motion patterns, and high inter-individual similarity19, poses significant challenges to the conventional Deep Simple Online and Realtime Tracking (DeepSORT) algorithm. To address these challenges, we established a novel zebrafish tracking framework that improves tracking robustness in dense, dynamic shoaling scenarios, enabling precise behavioral quantification.

Fusion of multi-scale detection models

In zebrafish multi-object tracking tasks, the performance of the detection module directly determines the stability and accuracy of tracking outcomes. ZebraYOLO was therefore integrated as the detector for the tracking system, as it provides high-quality input by delivering precise positional information of zebrafish across frames. The positional data is then fed into the tracking system, where a ReID (Re-Identification) network extracts texture features of targets, generating a feature vector for each region to enable similarity measurement. During the cascade matching stage, a hybrid metric combining the Mahalanobis distance and the cosine distance is employed to evaluate the association likelihood between detection regions and motion-predicted bounding boxes. Finally, unique identity assignments are made based on the association matching results, and the frame-by-frame tracking data is encoded into video outputs.
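
The hybrid association metric can be sketched as follows. This is an illustrative simplification and not the exact implementation: it uses only the 2D box center for the Mahalanobis term, and the weighting factor `lam`, the gate value (the 0.95 chi-square quantile for two degrees of freedom), and all function names are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mahalanobis_sq_2d(z, mean, cov):
    """Squared Mahalanobis distance of a 2D detection from a predicted 2D position."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    # Uses the closed-form inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def association_cost(det_pos, det_feat, trk_pos, trk_cov, trk_feat,
                     lam=0.5, gate=5.9915):
    """Weighted motion/appearance cost; returns infinity if the motion gate is violated."""
    d_motion = mahalanobis_sq_2d(det_pos, trk_pos, trk_cov)
    if d_motion > gate:
        return math.inf
    d_appearance = cosine_distance(det_feat, trk_feat)
    return lam * d_motion + (1.0 - lam) * d_appearance


cov = [[4.0, 0.0], [0.0, 4.0]]
print(association_cost((101, 50), [0.9, 0.1], (100, 50), cov, [1.0, 0.0]))
print(association_cost((140, 50), [0.9, 0.1], (100, 50), cov, [1.0, 0.0]))  # inf
```

The resulting cost matrix over all track/detection pairs would then be passed to an assignment solver such as the Hungarian algorithm, which is omitted here.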

Interactive hybrid multi-model Kalman filter

Zebrafish often exhibit nonlinear motion characteristics in image sequences. A typical manifestation is the instantaneous transition from constant velocity to acceleration between consecutive frames, which can easily cause state mismatches in traditional tracking algorithms based on constant velocity assumptions. Taking the classical DeepSORT framework as an example, its single constant-velocity Kalman filter struggles to adapt to complex motion patterns, leading to prediction errors. To address this, we designed a multiple-model Kalman filter architecture that integrates constant velocity (CV), constant acceleration (CA), and constant turn rate and velocity (CTRV) motion models. This architecture achieves continuous, stable tracking of zebrafish during nonlinear motion, with the algorithmic structure illustrated in Fig. 9.

Fig. 9

Structure of motion prediction model.

Taking the \(j\) th model as an example, its state equation at time k is:

$${{\varvec{X}}}_{j,k}={{\varvec{A}}}_{j}{{\varvec{X}}}_{j,k-1}+{{\varvec{W}}}_{j,k-1}$$
(7)
$${{\varvec{Z}}}_{k}={{\varvec{H}}}_{j}{{\varvec{X}}}_{j,k}+{{\varvec{V}}}_{j,k}$$
(8)

where \({{\varvec{Z}}}_{k}\) is the observation vector, \({{\varvec{X}}}_{j,k}\) is the state vector, and \({{\varvec{H}}}_{j}\) and \({{\varvec{A}}}_{j}\) are the observation matrix and state transition matrix of the model, respectively. \({{\varvec{W}}}_{j,k}\) is the Gaussian process noise with covariance matrix \({{\varvec{Q}}}_{j}\), and \({{\varvec{V}}}_{j,k}\) is the Gaussian observation noise with covariance matrix \({{\varvec{R}}}_{j}\). According to Eqs. 7 and 8, the states and observations of the CV model, CA model, and CTRV model can be obtained, respectively.
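
To make the three motion hypotheses concrete, the sketch below writes out one noise-free prediction step for each model. The state layouts are assumptions chosen for illustration, since the section does not list the state vectors. Note that the CTRV step is nonlinear in the heading, so in practice it is either linearized or treated with a nonlinear filter, and how this is handled within the linear form of Eq. 7 is not specified here.

```python
import math


def cv_step(state, dt):
    """Constant velocity: state = [x, y, vx, vy]."""
    x, y, vx, vy = state
    return [x + vx * dt, y + vy * dt, vx, vy]


def ca_step(state, dt):
    """Constant acceleration: state = [x, y, vx, vy, ax, ay]."""
    x, y, vx, vy, ax, ay = state
    return [x + vx * dt + 0.5 * ax * dt ** 2,
            y + vy * dt + 0.5 * ay * dt ** 2,
            vx + ax * dt,
            vy + ay * dt,
            ax, ay]


def ctrv_step(state, dt, eps=1e-6):
    """Constant turn rate and velocity: state = [x, y, v, heading, turn_rate]."""
    x, y, v, theta, omega = state
    if abs(omega) < eps:
        # Negligible turn rate: fall back to straight-line motion.
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
    else:
        x += v / omega * (math.sin(theta + omega * dt) - math.sin(theta))
        y += v / omega * (math.cos(theta) - math.cos(theta + omega * dt))
    return [x, y, v, theta + omega * dt, omega]


dt = 1.0 / 30.0   # one frame at 30 fps
print(cv_step([100.0, 50.0, 60.0, 0.0], dt))
print(ca_step([100.0, 50.0, 60.0, 0.0, 300.0, 0.0], dt))
print(ctrv_step([100.0, 50.0, 60.0, 0.0, math.pi], dt))
```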

Then consider the transition probability matrix between models denoted as \({\varvec{P}}\), where \({P}_{i,j}\) represents the probability of the \(i\) th model transitioning to the \(j\) th model.

$${\varvec{P}}=\left[\begin{array}{ccc}{P}_{11}& \cdots & {P}_{1r}\\ \vdots & \ddots & \vdots \\ {P}_{r1}& \cdots & {P}_{rr}\end{array}\right]$$
(9)
Input

The state estimate of model \(j\) at time k-1 is defined as \({{\varvec{X}}}_{j,k-1}\), and its covariance matrix at time k-1 is \({{\varvec{P}}}_{j,k-1}\). The mixed state is \({{\varvec{X}}}_{0j,k-1}\), and the mixed covariance matrix is \({{\varvec{P}}}_{0j,k-1}\). The mixed initial state of model \(j\) is:

$${{\varvec{X}}}_{0j,k-1}=\sum_{i=1}^{r}{{\varvec{X}}}_{i,k-1}{\mu }_{ij,k-1}$$
(10)
$${\mu }_{ij,k-1}=\frac{{{\varvec{P}}}_{ij}{\mu }_{i,k-1}}{{C}_{j}}$$
(11)

where \({{\varvec{P}}}_{ij}\) is the transition probability from model \(i\) to \(j\), \({\mu }_{i,k-1}\) and \({C}_{j}\) are the probability of model \(i\) and the prediction normalization constant of model \(j\) at time k-1, respectively. The following equations are defined:

$${C}_{j}=\sum_{i=1}^{r}{{\varvec{P}}}_{ij}{\mu }_{i,k-1}$$
(12)

The mixed covariance of model \(j\) is:

$${{\varvec{P}}}_{0j,k-1}=\sum_{i=1}^{r}{\mu }_{ij,k-1}\left[{{\varvec{P}}}_{i,k-1}+{\varvec{M}}\right]$$
(13)

where,

$${\varvec{M}}=\left[{{\varvec{X}}}_{i,k-1}-{{\varvec{X}}}_{0j,k-1}\right]{\left[{{\varvec{X}}}_{i,k-1}-{{\varvec{X}}}_{0j,k-1}\right]}^{T}$$
(14)
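
The interaction (mixing) step of Eqs. 10–14 can be sketched in a few lines. For simplicity, this illustration assumes that all models share the same state dimension. A real CV/CA/CTRV bank has states of different sizes and needs an additional mapping between them, which is not covered here. The function name and numerical values are hypothetical.

```python
def imm_mix(states, covs, mu, trans):
    """IMM interaction step.

    states: list of r state vectors X_{i,k-1}
    covs:   list of r covariance matrices P_{i,k-1}
    mu:     list of r model probabilities mu_{i,k-1}
    trans:  r x r transition matrix, trans[i][j] = P_ij
    Returns (C, mixed_states, mixed_covs).
    """
    r, n = len(states), len(states[0])
    # Eq. 12: predicted probability (normalization constant) of each model j.
    c = [sum(trans[i][j] * mu[i] for i in range(r)) for j in range(r)]
    # Eq. 11: mixing weights mu_{ij,k-1}.
    w = [[trans[i][j] * mu[i] / c[j] for j in range(r)] for i in range(r)]

    mixed_states, mixed_covs = [], []
    for j in range(r):
        # Eq. 10: mixed state for model j.
        x0 = [sum(w[i][j] * states[i][d] for i in range(r)) for d in range(n)]
        # Eqs. 13-14: mixed covariance, including the spread-of-means term M.
        p0 = [[0.0] * n for _ in range(n)]
        for i in range(r):
            diff = [states[i][d] - x0[d] for d in range(n)]
            for a in range(n):
                for b in range(n):
                    p0[a][b] += w[i][j] * (covs[i][a][b] + diff[a] * diff[b])
        mixed_states.append(x0)
        mixed_covs.append(p0)
    return c, mixed_states, mixed_covs


identity = [[1.0, 0.0], [0.0, 1.0]]
c, xs, ps = imm_mix(states=[[0.0, 1.0], [2.0, 1.0]],
                    covs=[identity, identity],
                    mu=[0.5, 0.5],
                    trans=[[0.9, 0.1], [0.1, 0.9]])
print(c, xs[0], ps[0])
```
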
KF filtering

Subsequently, the inputs of model \(j\), namely \({{\varvec{X}}}_{0j,k-1}\), \({{\varvec{P}}}_{0j,k-1}\), and the observation \({{\varvec{Z}}}_{k}\), are filtered. The prior (predicted) state \({{\varvec{X}}}_{j,k}^{-}\) and its covariance \({{\varvec{P}}}_{j,k}^{-}\) are first predicted from the mixed inputs:

$${{\varvec{X}}}_{j,k}^{-}={{\varvec{A}}}_{j}{{\varvec{X}}}_{0j,k-1}$$
(15)
$${{\varvec{P}}}_{j,k}^{-}={{\varvec{A}}}_{j}{{\varvec{P}}}_{0j,k-1}{{{\varvec{A}}}_{j}}^{T}+{{\varvec{Q}}}_{j}$$
(16)

The Kalman gain is then calculated as:

$${{\varvec{K}}}_{j,k}={{\varvec{P}}}_{j,k}^{-}{{\varvec{H}}}_{j}^{T}{\left[{{\varvec{H}}}_{j}{{\varvec{P}}}_{j,k}^{-}{{\varvec{H}}}_{j}^{T}+{{\varvec{R}}}_{j}\right]}^{-1}$$
(17)

The state and covariance matrix are then updated:

$${{\varvec{X}}}_{j,k}={{\varvec{X}}}_{j,k}^{-}+{{\varvec{K}}}_{j,k}\left[{{\varvec{Z}}}_{k}-{{\varvec{H}}}_{j}{{\varvec{X}}}_{j,k}^{-}\right]$$
(18)
$${{\varvec{P}}}_{j,k}=\left[{\varvec{I}}-{{\varvec{K}}}_{j,k}{{\varvec{H}}}_{j}\right]{{\varvec{P}}}_{j,k}^{-}$$
(19)
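
The per-model filtering step (Eqs. 15–19) is illustrated below for the simplest case: a one-dimensional constant-velocity model with state `[position, velocity]`, a position-only observation (`H = [1, 0]`), and process noise `Q = q * I`. These choices, the function name, and the noise values are assumptions made to keep the example short, and the matrix products are written out by hand for the 2 \(\times\) 2 case.

```python
def kf_step_cv(x0, p0, z, dt=1.0, q=0.01, r=1.0):
    """One Kalman predict/update cycle starting from the mixed inputs x0, p0.

    Returns (x, P, innovation, S), where S is the innovation variance.
    """
    # Eq. 15: x_pred = A x0 with A = [[1, dt], [0, 1]].
    x_pred = [x0[0] + dt * x0[1], x0[1]]

    # Eq. 16: P_pred = A P0 A^T + Q.
    ap = [[p0[0][0] + dt * p0[1][0], p0[0][1] + dt * p0[1][1]],
          [p0[1][0], p0[1][1]]]
    p_pred = [[ap[0][0] + ap[0][1] * dt + q, ap[0][1]],
              [ap[1][0] + ap[1][1] * dt, ap[1][1] + q]]

    # Eq. 17: K = P_pred H^T / S, where S = H P_pred H^T + R is a scalar here.
    s = p_pred[0][0] + r
    k = [p_pred[0][0] / s, p_pred[1][0] / s]

    # Eq. 18: state update with the innovation z - H x_pred.
    innovation = z - x_pred[0]
    x = [x_pred[0] + k[0] * innovation, x_pred[1] + k[1] * innovation]

    # Eq. 19: P = (I - K H) P_pred.
    p = [[(1.0 - k[0]) * p_pred[0][0], (1.0 - k[0]) * p_pred[0][1]],
         [p_pred[1][0] - k[1] * p_pred[0][0], p_pred[1][1] - k[1] * p_pred[0][1]]]
    return x, p, innovation, s


x, p, v, s = kf_step_cv([0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], z=1.3)
print(x, v, s)
```

The innovation and its variance returned here are the quantities needed later for the model likelihood.
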
Update

After each model has been filtered, its likelihood is calculated to evaluate how well the model fits the current state, i.e., the credibility of the CV, CA, and CTRV models. The likelihood function of model \(j\) at time k is:

$${\Lambda }_{j,k}=\frac{1}{\sqrt{2\pi |{{\varvec{S}}}_{j,k}|}}\text{exp}(-\frac{1}{2}{V}_{j,k}^{T}{{\varvec{S}}}_{j,k}^{-1}{V}_{j,k})$$
(20)

where \({V}_{j,k}\) is the measurement residual (innovation) and \({{\varvec{S}}}_{j,k}\) is its corresponding covariance matrix. Consistent with Eqs. 17 and 18, both are computed from the predicted state and covariance:

$${V}_{j,k}={{\varvec{Z}}}_{k}-{{\varvec{H}}}_{j}{{\varvec{X}}}_{j,k}^{-}$$
(21)
$${{\varvec{S}}}_{j,k}={{\varvec{H}}}_{j}{{\varvec{P}}}_{j,k}^{-}{{\varvec{H}}}_{j}^{T}+{{\varvec{R}}}_{j}$$
(22)

The likelihood values of each model are then used to update the model probabilities:

$${\mu }_{j,k}=\frac{1}{C}{\Lambda }_{j,k}{C}_{j}$$
(23)

where the normalization constant C is:

$$C=\sum_{j=1}^{r}{\Lambda }_{j,k}{C}_{j}$$
(24)
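
The model probability update (Eqs. 20, 23, and 24) can be sketched for a scalar innovation as follows. The scalar case is an assumption made for brevity, and the innovation values in the example are made up.

```python
import math


def gaussian_likelihood(innovation, s):
    """Eq. 20 for a scalar innovation with variance s."""
    return math.exp(-0.5 * innovation ** 2 / s) / math.sqrt(2.0 * math.pi * s)


def update_model_probabilities(innovations, s_values, c_pred):
    """Eqs. 23-24: posterior model probabilities mu_{j,k}.

    innovations: per-model innovation V_{j,k}
    s_values:    per-model innovation variance S_{j,k}
    c_pred:      per-model predicted probability C_j from the mixing step
    """
    weighted = [gaussian_likelihood(v, s) * c
                for v, s, c in zip(innovations, s_values, c_pred)]
    total = sum(weighted)   # normalization constant C
    return [w / total for w in weighted]


# Hypothetical innovations (in pixels) for the CV, CA, and CTRV models.
mu = update_model_probabilities([0.1, 2.0, 0.5], [1.0, 1.0, 1.0], [1 / 3, 1 / 3, 1 / 3])
print([round(m, 3) for m in mu])
```

The model whose prediction best matches the observation receives the largest probability, so the filter bank shifts weight toward the CTRV model when a fish turns and back toward the CV model during straight swimming.
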
Output

The filtering and prediction results of each model are weighted and fused to obtain the state and covariance matrix:

$${{\varvec{X}}}_{k}=\sum_{j=1}^{r}{\mu }_{j,k}{{\varvec{X}}}_{j,k}$$
(25)
$${{\varvec{P}}}_{k}=\sum_{j=1}^{r}{\mu }_{j,k}\left[{{\varvec{P}}}_{j,k}+\left({{\varvec{X}}}_{j,k}-{{\varvec{X}}}_{k}\right){\left({{\varvec{X}}}_{j,k}-{{\varvec{X}}}_{k}\right)}^{T}\right]$$
(26)

The predicted target position is given by \({{\varvec{X}}}_{k}\), and \({{\varvec{P}}}_{k}\) is its covariance. The per-model estimates and model probabilities serve as the input to the interaction step of the next frame.
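
The output fusion of Eqs. 25 and 26 is a probability-weighted combination of the per-model estimates. A minimal sketch follows, again assuming that all models share one state dimension and using hypothetical values.

```python
def imm_fuse(states, covs, mu):
    """Fuse per-model estimates into one state and covariance (Eqs. 25-26)."""
    n = len(states[0])
    # Eq. 25: probability-weighted mean of the model states.
    x = [sum(m * s[d] for m, s in zip(mu, states)) for d in range(n)]
    # Eq. 26: weighted covariances plus the spread of the model means.
    p = [[0.0] * n for _ in range(n)]
    for m, s, c in zip(mu, states, covs):
        diff = [s[d] - x[d] for d in range(n)]
        for a in range(n):
            for b in range(n):
                p[a][b] += m * (c[a][b] + diff[a] * diff[b])
    return x, p


identity = [[1.0, 0.0], [0.0, 1.0]]
x, p = imm_fuse([[0.0, 0.0], [2.0, 0.0]], [identity, identity], [0.5, 0.5])
print(x)   # [1.0, 0.0]
print(p)   # [[2.0, 0.0], [0.0, 1.0]]
```

The fused covariance grows along the axis on which the models disagree, reflecting the additional uncertainty when no single motion model clearly dominates.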

Retraining of appearance feature network

The appearance feature network, based on representation learning, adopts an architecture comprising a preprocessing module with dual 3 × 3 convolutional layers, a max-pooling layer for dimensionality reduction, a feature extraction layer composed of six-stage residual structures, and a normalized fully connected layer. This design effectively captures subtle inter-individual differences, enabling stable identity recognition even for targets within the same category. The residual modules in the feature extraction network, inspired by ResNet, implement two configurations: basic and downsampling types.

As shown in Fig. 10a, the basic residual block employs dual 3 \(\times\) 3 convolutional layers as its main feature transformation path, integrated with batch normalization (BN) layers and ReLU nonlinear activation to output activated values. This symmetrical structure achieves progressive extraction of high-order features while maintaining spatial resolution and channel dimensions of feature maps. Figure 10b illustrates the downsampling residual block, which implements a channel expansion strategy in its first layer. It compresses feature maps using a convolutional kernel with stride 2, while employing 1 \(\times\) 1 convolution for shortcut branch dimension alignment. This configuration doubles output channels while halving spatial resolution, aligning with deep network feature learning principles to enable multi-scale feature extraction for appearance representation.

Fig. 10

Residual unit structure in feature extraction network.

Although the original feature extraction network was trained on the Market1501 dataset (designed for pedestrian recognition with large rigid targets), the characteristics of that dataset differ significantly from the morphological features of small flexible organisms like zebrafish. To address this limitation, we retrained the feature extraction module specifically for fish tracking scenarios. By capturing fish-specific features through customized parameter optimization, the ReID model was systematically reconstructed for improved biological target adaptation.

Results

Evaluation of zebrafish detection

The evaluation framework employed in this study rigorously incorporates the following core metrics: Precision (Eq. 27), Recall (Eq. 28), and Average Precision (Eq. 29). To address practical deployment considerations, we further integrate Frames Per Second (FPS) as a critical processing throughput indicator.

$${\varvec{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(27)
$${\varvec{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(28)
$$AP@0.5 = \frac{1}{n}\sum\limits_{i = 1}^{n} {P_{i}^{IoU = 0.5} \left( {R_{i}^{IoU = 0.5} } \right)}$$
(29)

where TP denotes accurately detected zebrafish instances, FP represents background regions erroneously classified as zebrafish, and FN indicates undetected zebrafish instances misclassified as background. The Average Precision (AP) metric quantifies the area under the single-class precision-recall curve, capturing the model’s comprehensive detection capability across varying confidence thresholds. Frames Per Second (FPS) is operationally defined as the number of image frames processed per second by the detection system.
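
As a quick illustration of Eqs. 27 and 28 and of the FPS definition, the following sketch computes the metrics from raw counts. The counts and timing in the example are made up and do not correspond to any result in the tables.

```python
def precision(tp, fp):
    """Eq. 27: fraction of detections that are real zebrafish."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Eq. 28: fraction of zebrafish that were detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def frames_per_second(num_frames, elapsed_seconds):
    """Processing throughput of the detector."""
    return num_frames / elapsed_seconds


# Hypothetical counts: 5 fish in each of 100 frames, 485 found, 10 false alarms.
tp, fp, fn = 485, 10, 15
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
print(f"FPS       = {frames_per_second(100, 3.5):.1f}")
```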

To validate the synergistic effects of three architectural improvements, the P2 detection layer, Global Attention Module (GAM), and MPDIoU loss function on zebrafish detection performance, we conduct systematic ablation experiments. Using the YOLOv8s architecture as the baseline, we implement a phased integration where each enhancement module is incrementally incorporated.

Table 1 records the ablation experiment results. After introducing the P2 detection layer, the accuracy improved from 92.1 to 94.8%, indicating that high-resolution features enhance the detection capability for zebrafish. With the subsequent incorporation of GAM, the accuracy further increased by 2.9 percentage points and the recall rate reached 97.8%, demonstrating its effectiveness in reinforcing key features of fish. Although the computational complexity increased, resulting in an approximately 12% decrease in FPS, the system still maintained real-time performance relative to the image acquisition speed (30 FPS). The detection performance before and after the improvements is compared in Fig. 11. In the baseline model, low-confidence detections are suppressed, resulting in missed detections (Fig. 11a), whereas the reconstructed model improves feature extraction and lowers the response threshold, so that these targets are detected (Fig. 11b).

Table 1 Result of ablation study.
Fig. 11

(a) Baseline model zebrafish detection performance. (b) Improved model for zebrafish detection performance.

To validate the comprehensive performance of the proposed detection model ZebraYOLO in zebrafish detection tasks, this section conducts a comparative analysis with various mainstream object detection algorithms that have also been applied in fish detection. The selected models include single-stage detectors (SSD, RT-DETR, YOLOv5s, and CenterNet) and a two-stage model (Faster R-CNN). Figure 12 demonstrates the precision-recall (P-R) curves of different models, while a comprehensive evaluation is performed across three dimensions: detection accuracy, inference speed, and computational efficiency (Table 2).

Fig. 12

Precision-recall curve benchmarking against established fish detection methods.

Table 2 Performance comparison results of representative object detection algorithms.

As a classic two-stage detector, Faster R-CNN relies on the Region Proposal Network (RPN) to generate candidate boxes, resulting in 70.5 GFLOPs computational complexity and limited real-time performance. SSD detects objects through multi-scale feature maps, yet its compromised accuracy in small object detection stems from insufficient resolution in shallow feature layers. CenterNet achieves detection by predicting object centroids and dimensions, eliminating anchor box design with lower computational costs, although its centroid localization may exhibit significant errors under motion blur scenarios. RT-DETR, an improved variant of Deformable DETR, balances accuracy and speed, yet its precision remains insufficient to support reliable behavioral conclusions compared with our proposed model. Despite a slight decrease in inference speed, ZebraYOLO achieves breakthrough detection accuracy in zebrafish monitoring tasks while still maintaining practical real-time performance at 28.9 FPS.

Evaluation of zebrafish tracking

To evaluate the multi-target tracking algorithms comprehensively and objectively, the evaluation method proposed by Barreiros et al.20 is adopted, including the Correct Track Rate (CTR) and Correct Identify Rate (CIR). In addition, the Miss Rate (MR) proposed by Bai et al.21 is used, and the FPS index, defined as the number of frames processed by the tracking algorithm per second, is introduced to measure real-time tracking performance. These indicators are defined as follows:

$$CTR=\frac{\sum number\; of\; frames\; where\; each\; fish\; is\; correctly\; tracked}{number\; of\; fish\; individuals\; \times total\; tracked\; frames}$$
(30)
$$CIR=\frac{\sum correct\; identifications\; post-overlap}{total\; number\; of\; identified\; fish}$$
(31)
$$MR=\frac{\sum number\; of\; frames\; where\; tracking\; is\; lost\; per\; fish}{number\; of\; fish\; individuals \times total\; tracked\; frames}$$
(32)
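
The three tracking indicators reduce to simple ratios once per-fish frame counts are available. The sketch below assumes that these counts have already been obtained by comparing tracker output with ground truth. The function names and the example numbers are hypothetical.

```python
def correct_track_rate(correct_frames_per_fish, total_frames):
    """Eq. 30: share of (fish, frame) pairs that are correctly tracked."""
    n_fish = len(correct_frames_per_fish)
    return sum(correct_frames_per_fish) / (n_fish * total_frames)


def correct_identify_rate(correct_ids_after_overlap, total_identified):
    """Eq. 31: share of identities that remain correct after an overlap event."""
    return correct_ids_after_overlap / total_identified


def miss_rate(lost_frames_per_fish, total_frames):
    """Eq. 32: share of (fish, frame) pairs in which the track is lost."""
    n_fish = len(lost_frames_per_fish)
    return sum(lost_frames_per_fish) / (n_fish * total_frames)


# Hypothetical counts for a shoal of 5 fish tracked over 1000 frames.
correct = [990, 985, 970, 995, 980]
lost = [5, 10, 20, 3, 12]
print(f"CTR = {correct_track_rate(correct, 1000):.2%}")
print(f"CIR = {correct_identify_rate(47, 50):.2%}")
print(f"MR  = {miss_rate(lost, 1000):.2%}")
```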

To validate the performance of the proposed detection-based tracking (DBT) framework in addressing zebrafish nonlinear motion and identity (ID) switching, experiments were conducted with three configurations: YOLOv8s \(+\) original DeepSORT, ZebraYOLO \(+\) original DeepSORT, and ZebraYOLO \(+\) improved DeepSORT.

As shown in Table 3, introducing the proposed detection model ZebraYOLO incurs a processing speed penalty of 6 FPS, but it improves both the correct identification rate and the tracking accuracy by 1–2 percentage points because it mitigates trajectory fragmentation during tracking matching. This demonstrates that a high-performance detection model enhances back-end tracking performance within the DBT framework. Combining ZebraYOLO with the improved DeepSORT further raises the correct tracking rate to 98.22% and reduces the miss rate from 12.63 to 2.16%. This gain stems from the hybrid model architecture and zebrafish re-identification training, which effectively reduce trajectory prediction errors and ID switches during occlusions. While the enhanced framework increases computational overhead, lowering the FPS to 17.9, this trade-off remains acceptable in precision-critical laboratory environments.

Table 3 Comparison results of framework performance.

As shown in the upper section of Fig. 13, when occlusion occurs between Fish #2 and Fish #3 in the previous frame while Fish #3 suddenly changes direction, the original DBT framework fails to respond effectively, resulting in ID switches between the two fish. This limitation stems from two main factors: Firstly, the original DBT framework solely relies on the constant velocity motion assumption, which proves inadequate for predicting turning maneuvers. Secondly, the original ReID network demonstrates insufficient capability in feature modeling for individual fish, particularly lacking fine-grained discrimination between visually similar specimens.

Fig. 13

Performance comparison of the DBT framework in handling fish occlusion and orientation variations before and after architectural enhancements.

Subsequent experiments using the improved DBT framework (with randomly assigned initial IDs) demonstrate that the occluded fish pair now corresponds to Fish #3 (pink bounding box) and Fish #4 (blue bounding box). As illustrated in the lower section of Fig. 13, through the implementation of hybrid multi-model prediction and retraining of the ReID feature extraction network, the enhanced system successfully tracks this challenging scenario. The CTRV model accurately predicts the trajectory of the turning Fish #4, while the robust tracking performance also benefits from zebrafish-specific ReID retraining—the system maintains correct identification between visually similar Fish #3 and Fish #4 without ID confusion.

As shown in Table 4, the algorithmic framework designed in this study specifically for zebrafish behavioral characteristics achieves a higher CIR and a lower MR than the general-purpose object tracking algorithms FairMOT22 and ByteTrack. These generic algorithms exhibit significant limitations for this task because they are primarily designed for pedestrian tracking and are poorly adapted to the non-rigid forms and non-linear motion patterns characteristic of zebrafish. Although FairMOT achieved the fastest processing speed at 31.6 FPS, its correct identification rate was only 38.16%. ByteTrack improved the identification rate by associating low-confidence detection boxes, but its miss rate remained high at 79.87%. Both idTracker and idTracker.ai23 were developed specifically for animal tracking tasks. In our experiments, these two techniques achieved better results than the pedestrian tracking models, reaching a tracking accuracy of 82.09%. The updated version, idTracker.ai, improves upon the original idTracker24 by incorporating deep learning. It uses a two-stage cascaded CNN architecture in which the first stage handles recognition and the second stage assigns IDs. This enhancement raised the correct identification rate by 5 percentage points compared with the original idTracker. However, its reliance on a global optimization strategy results in high computational complexity, limiting its speed to 15.8 FPS. It also struggles with partial occlusions, leading to a miss rate of 13.55%. The performance of the end-to-end fish tracking network CMFTNet25 in this task indicates that its long-term tracking capability still requires improvement.

Table 4 Performance comparison of representative tracking algorithms.

Discussion

In this study, we developed a cascaded detection-tracking framework that improved tracking continuity and reduced identity switching, even in dense shoals exhibiting occlusion and nonlinear motion. Using this framework, we extracted a multidimensional set of shoaling behavior features to quantify group-level responses to ethanol exposure. These features included kinetic parameters (Shoal Migration Rate, SMR; Shoal Rotation Rate, SRR) and spatial distribution metrics (Shoal Dispersion, SD; Nearest Neighbor Distance, NND; and Inter-Individual Distance, IID). Definitions and biological significance for each parameter are provided in Table 5.
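As an illustration of how the spatial distribution metrics can be derived from tracked positions, the following minimal sketch computes NND and IID for a single frame. It assumes the definitions common in the shoaling literature (NND as the mean distance from each fish to its closest neighbor, IID as the mean distance over all pairs of fish); the definitions actually used in this study are those given in Table 5, and the function names and the coordinate format are hypothetical.

```python
import math
from itertools import combinations


def nearest_neighbor_distance(points):
    """Mean distance from each fish to its closest shoal mate (assumed NND definition)."""
    if len(points) < 2:
        raise ValueError("at least two fish are required")
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    return sum(nearest) / len(nearest)


def inter_individual_distance(points):
    """Mean distance over all unordered pairs of fish (assumed IID definition)."""
    if len(points) < 2:
        raise ValueError("at least two fish are required")
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical centroid coordinates (cm) of three fish in one frame.
    frame = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
    print(nearest_neighbor_distance(frame), inter_individual_distance(frame))
```

Because NND depends only on each fish's closest neighbor while IID averages over the whole shoal, the two metrics can respond differently to the same change in shoal structure.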

Table 5 Extracted feature set of shoaling behavior.

Experimental procedures for ethanol exposure are described in Sect. “Test organisms, grouping, experimental procedures”. Image processing produced 27,000 data points per experimental group, each containing five shoaling behavior parameters (SMR, SRR, SD, NND, and IID). To reduce noise while preserving temporal patterns, a 45-frame (1.5 s) averaging window was applied, yielding 600 averaged data points per group. Inter-group comparisons were conducted using Mann–Whitney U tests. Given the discrete nature of the data, Daubechies wavelet26 (N = 4) analysis was applied to smooth behavioural features. The statistical results are summarized in Table 6 and Fig. 14.
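The windowed averaging and the inter-group comparison can be sketched as follows. This is a minimal, dependency-free illustration rather than the pipeline used in the study: it assumes non-overlapping 45-frame windows (which turns 27,000 frames into 600 values), and it computes a two-sided Mann–Whitney U test with the normal approximation, tie correction, and no continuity correction. The function names are hypothetical; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used, and the wavelet smoothing step is not shown.

```python
import math


def block_average(values, window=45):
    """Average consecutive, non-overlapping windows; a trailing partial window is dropped."""
    n_blocks = len(values) // window
    return [
        sum(values[i * window:(i + 1) * window]) / window
        for i in range(n_blocks)
    ]


def _average_ranks(values):
    """Return 1-based ranks (ties share their mean rank) and the size of each tie group."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    ranks, tie_sizes = _average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    if sigma == 0:
        # All observations are identical, so the groups cannot be distinguished.
        return u, 1.0
    z = (u - n1 * n2 / 2) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value
```

A per-frame feature series would first be passed through `block_average`, and the averaged series of two groups would then be compared with `mann_whitney_u`. The normal approximation is reasonable for several hundred values per group but not for very small samples, where an exact test is preferable.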

Table 6 Statistical results of shoaling behavior parameters.
Fig. 14

Comparison of characteristic parameters between groups. (a) SMR (cm/s): EL vs. CN, P = 0.0071; EH vs. CN, P = 0.0056. (b) SRR: EL vs. CN, P = 0.0036. (c) SD (%): EL vs. CN, P = 0.0020; EH vs. CN, P = 0.0025. (d) NND (cm). (e) IID (cm): EL vs. CN, P = 0.035; EH vs. EL, P = 0.026; EH vs. CN, P = 0.0086.

Previous studies have shown that ethanol exposure alters zebrafish shoaling behavior27,28, with ethanol reaching the brain within 20–40 min post-ingestion29 and producing measurable behavioral effects. Consistent with these findings, our results indicate that ethanol exposure increases both shoal activity and disorganization. In particular, ethanol-exposed fish exhibited approximately 20% higher shoal migration rates than controls (Fig. 14a). This agrees with previous findings that 0.5% ethanol reduces anxiety and induces hyperactivity30, and with reports that sub-1% alcohol exposure increases locomotor activity in adult zebrafish31,32.

Low-dose ethanol also increased fish rotation rates (Fig. 14b), suggesting higher trajectory tortuosity, potentially due to balance impairment under mild anesthesia. Conversely, high-dose ethanol reduced shoal responsiveness, comparable to the activity changes observed in dimethyl sulfoxide (DMSO)-exposed zebrafish33. These results support the biphasic nature of alcohol’s effects, in which low-to-moderate doses act as stimulants and higher doses produce sedative effects, a pattern observed across species including rodents, non-human primates34, and humans35.

Ethanol exposure had no effect on the nearest neighbor distance within the shoal (Fig. 14d), but we observed a dose-dependent effect on shoal dispersion (Fig. 14c and e): higher ethanol concentrations reduced structural cohesion and disrupted collective organization. Unlike some stimulants that increase social interaction, ethanol exposure did not reduce shoal area; instead, higher ethanol concentrations (0.25–1%) consistently produced social behavior effects36. These findings are consistent with Lin et al.37, who reported dose-dependent increases in shoal area in ethanol-treated adult zebrafish. The neural mechanisms behind ethanol-induced behavioral changes remain unclear and are likely complex, involving multiple synaptic plasticity processes and diverse molecular targets38. Sedative effects of ethanol have been linked to modulation of the GABAergic system39, whereas stimulatory effects involve glutamatergic (including NMDA receptors), dopaminergic38, hypothalamic–pituitary–adrenal (HPA), vasopressin, and opioid pathways. Additionally, ethanol-induced shoal dissolution may be associated with astrocyte phenotype alterations40.

Despite the promising results of this study, several limitations must be acknowledged. First, all experiments were conducted under controlled laboratory conditions using a single strain of adult zebrafish. This design minimized biological and environmental variability, allowing us to focus on validating the technical framework; however, it limits the generalizability of our behavioral findings across other strains, developmental stages, or ecological contexts. Second, the relatively small sample size (N = 5 per group) was chosen for technical feasibility rather than hypothesis-driven inference and may limit statistical power. Third, while our model achieved high detection and tracking accuracy within the tested setting, its performance in more complex environments (e.g., variable lighting, occlusion in dense shoals, or field applications) remains to be validated. Furthermore, the temporal averaging method used for behavioral feature extraction may obscure transient but biologically meaningful short-term dynamics. Lastly, although we introduced a multivariate analysis (XGBoost + SHAP) to explore feature importance, further biological interpretation—particularly linking behavior to neurophysiological mechanisms—requires expanded datasets and cross-validation across laboratories. These limitations highlight key directions for future research and optimization.

While this study advances zebrafish shoaling quantification technology, the current model’s size and computational cost pose deployment challenges. Future work should explore knowledge distillation and neural architecture search to reduce model complexity while preserving performance. In addition, a real-time water-quality monitoring and toxicity-grading system built on this framework is a promising application5,41.