Introduction

Depressive disorders include a large range of conditions having in common the presence of specific symptoms, such as sadness, emptiness, irritable mood, and loss of pleasure or interest in activities, accompanied by somatic and cognitive changes that significantly affect the individual’s capacity to function1. This group of disorders represents a common and global medical problem: an estimated 3.8% of the world population experiences depression (approximately 280 million people), including 5% of adults, and it has been projected that, by 2030, this disease will rank first among the causes of disease burden2,3. Because depression responds incompletely to pharmacological treatments and existing treatments have a large number of side effects4,5,6, screening of novel antidepressants is still an important practice in current research. A classical preclinical model for screening antidepressant drugs or evaluating depressive-like behaviours in rodents is the Forced Swim Test (FST), originally developed by Porsolt in the 1970s to assess the anti-depressive properties of drugs7. The importance of this test is evident from the growing number of scientific publications utilizing it. For instance, a PubMed search using "Forced Swim Test" as a keyword yielded more than 8600 hits, with over 5500 just in the last 10 years (see Fig. 1).

Fig. 1

Historical count of the ‘forced swim test’ tag in PubMed. From 2000 to 2024, a PubMed search retrieved more than 8000 scientific articles using ‘forced swim test’ as a keyword. In particular, 5500 articles have been published in the past 10 years alone.

The test is based on the observation that rodents immersed in a cylinder filled with water, after initial intense escape-directed behaviour (i.e., swimming and climbing), stop struggling and show passive immobile behaviour (floating with only movements necessary to keep the nose above the water surface), when they learn that escape is impossible. Immobile behaviour is considered a passive coping strategy and is believed to reflect learned helplessness (i.e., behavioural despair) revealing depressive-like behaviour in rodents8.

Over the years it has been amply demonstrated that the FST is sensitive to all major classes of antidepressant drugs which consistently reduce the amount of immobility time in the test by increasing active escape behaviour9,10,11,12. A modified FST originally introduced by Detke and co-workers in 1995 can even distinguish between serotonergic and noradrenergic antidepressants through the distinction of active behaviours into swimming (horizontal movements across the water surface) and climbing (vertical movements against walls)10,13,14,15. Lately, the application of this test has expanded to include the evaluation of depressive-like states16,17,18,19,20 as well as coping strategies7,21,22 in animal models of psychiatric disorders and stress, becoming one of the most widely used behavioural tests worldwide22,23. Besides having good predictive validity, the FST is straightforward to conduct and requires minimal specialized equipment.

Nevertheless, there are still some major hurdles associated with it.

In a large majority of the studies, animals’ behaviours have been evaluated by hand by a trained annotator, and less frequently automated systems based on video analysis have been employed. Both methods have relevant but different limitations. The manual approach is prone to human subjectivity, making it difficult not only to replicate findings across laboratories but also to maintain standardized scoring methods over time in the same laboratory. Moreover, manual scoring requires a large amount of training time, it is very time-consuming, and even scoring by the same researcher shows changes over time due to the experience gained during work or other potential bias factors24,25,26. On the other hand, the currently available automatic systems are costly and do not allow adequate detection of the specific kind (i.e., swim vs climbing) of behaviour performed by the rodent. Indeed, these methods extrapolate the degree of mobility of the animal by evaluating the frame-to-frame variations and/or the position and velocity of the centroid of the animal or other key points (such as nose and tail)27,28,29,30,31,32,33.

Given this general framework, there is a clear need for alternative methods that overcome the limitations outlined above and characterize rodent behaviours more effectively34. To this end, we propose a new approach based on machine learning (ML) implemented via deep neural networks.

To decrease the human bias introduced by selecting specific extracted features, we work with the information given by the whole frame, taking into account the dynamical evolution encoded in temporal sequences of frames. Indeed, rodents’ behaviours feature complex sequences of dynamic actions, and therefore the information provided by a single frame is not sufficient to appreciate the spatiotemporal nature of the behaviour and to classify it properly. Hence, in our approach, small portions of the video are preprocessed into 3D-tensors35. The ML architecture chosen to process this kind of data is a 3D residual convolutional neural network (3D RCNN), since such architectures have already proven successful in the literature on human action recognition36,37. For the training and the later testing of the ML model, datasets of video-recorded FST were collected and, when necessary, manually scored in our laboratory.

Once the ML algorithm was trained and validated, we tested it by comparing its outcome with that produced by the human manual scoring on rats under the influence of the selective serotonin reuptake inhibitor fluoxetine and tricyclic antidepressant desipramine. The first is known to predominantly increase swimming while the latter evokes a marked enhancement in climbing behaviour14,15,38,39. Once a good performance of the ML algorithm was demonstrated, a second FST experiment with different antidepressants was carried out to confirm the ability of our ML algorithm to properly identify the antidepressant based on the behavioural repertoire evoked and described in the literature14,40,41,42. Results demonstrated that our ML algorithm successfully discriminates among the three behaviours, proving to be a standardized, unbiased, and objective method for behavioural analysis of the FST in rats. If appropriately extended and trained, this model could be applied to analyze various behavioural tests, including but not limited to the open field, fear conditioning, and drug-induced withdrawal symptoms.

Results

To develop the ML model, we used an in-lab-made training dataset consisting of 78 video-recorded forced swimming tests. To mitigate the disadvantages associated with running the study in different environmental settings, here we have provided geometrical details on how to perform the experiment (Table 1).

Table 1 Principal parameters used for the experiments.

Those videos were divided into 10,062 sub-videos of 3 s each, and every sub-video was mapped into a standardized 3D tensor through a Python script written using the OpenCV module43. For each sub-video, the main behaviour was identified by two experienced researchers who labeled the behaviours as immobility, swimming, or climbing. An example of the 3D tensors for each class of behaviour is shown in Fig. 2.

Fig. 2

Example of 15 frames extracted from the 3D-tensor, for each behaviour. Each 3D-tensor contains 75 frames. For the present representation, one frame was shown every five frames.

Then, this training dataset was split into 90% train and 10% validation sets, for the training of the ML model.

The ML model trained is a 3D RCNN permitting the extraction of complex spatio-temporal features useful for the classification task. The model was coded using the Keras library44 with tensorflow backend45. The architecture is sketched in Fig. 3.

Fig. 3

The model architecture used for behaviour classification. The model is composed of a cascade of 3D convolutions organized in residual blocks to extract the spatio-temporal features of the data. Then, after a global average pooling, a series of dense layers is used to learn non-linear combinations of the extracted features. Finally, a three-node softmax layer is used to produce a probability distribution over the possible behaviour labels. The behaviour corresponding to the highest probability is the output of the model. We set the kernel size of the 3D convolutions to (3,3,3), the kernel size of the 3D average pooling to (2,2,2), and the dropout value to 0.1. The variable n in a convolutional layer is the number of its output channels.

The model was trained with the Adam optimizer46 for 120 epochs, and we used early stopping on validation accuracy as regularization (accuracy of 88.89% on the validation set, Fig. 4A).

Fig. 4

Performance of the ML scoring compared with manual scoring. The confusion matrix on the validation set (A). The confusion matrix on the sum of the drug experiment videos (B). As an example, a video has been selected to show the concordance between the two scoring methods via the confusion matrix (C) and via the time plot in which any mismatch between the two methods is also shown (red bars). Each 5-min video is split into a sequence of 100 3-s tensors, used as inputs for the ML algorithm (D).

Comparison of the ML algorithm with the human manual scoring on the effects of fluoxetine (FLX) and desipramine (DMI) in the FST

The ML labeling and the manual scoring are in good agreement, as noticeable from the confusion matrix of the complete dataset in Fig. 4B, and the 86 ± 4% accuracy measured on the videos (mean ± standard deviation). The Cohen’s Kappa coefficient measured is 0.7809, showing a substantial agreement between the two scoring methods47. Figure 4C,D show an example of a video labeled by the annotator and by the ML: the confusion matrix of the single video and the time-plot of the two behaviour recognitions (by the annotator and the ML) with their differences (red bars), respectively.
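The agreement measures reported above can be reproduced from two parallel sequences of per-segment labels. The following is a minimal sketch, not the authors' code: the function names and the toy label sequences are illustrative assumptions, and the kappa formula is the standard unweighted Cohen's kappa.

```python
LABELS = ("immobility", "swimming", "climbing")

def confusion_matrix(manual, predicted, labels=LABELS):
    """Count label pairs. Rows: manual labels, columns: ML labels."""
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for m, p in zip(manual, predicted):
        matrix[index[m]][index[p]] += 1
    return matrix

def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) from a square confusion matrix."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal label frequencies
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Toy example (made-up labels): one swimming segment scored as climbing by the ML
manual = ["immobility"] * 4 + ["swimming"] * 4 + ["climbing"] * 2
ml = ["immobility"] * 4 + ["swimming"] * 3 + ["climbing"] * 3
acc, kappa = accuracy_and_kappa(confusion_matrix(manual, ml))
```

Kappa corrects the raw accuracy for the agreement expected by chance, which matters here because the three behaviours do not occur equally often.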

There are strong correlations between the two scoring methods (Fig. 5), as confirmed by the following analyses. For the total time assigned to each class, Pearson’s correlation gave the following scores: immobility: \(r=0.918\); swimming: \(r=0.858\); climbing: \(r=0.955\) (Fig. 5). ANOVA was then run on the behavioural data scored with either method, both to compare the two methods and to verify whether the ML algorithm could detect statistically significant differences in response to antidepressants as human manual scoring would.

Fig. 5

Comparison between the behaviour times measured with the two scoring systems. Scatter plots of the percentage of time spent performing each of the three behaviours show the positive correlation between the manual and ML labeling systems. As measured by Pearson’s correlation, the following scores are obtained: immobility \(r=0.918\); swimming \(r=0.858\); climbing \(r=0.955\).

ANOVA demonstrated no statistically significant difference between the manual and the ML scoring, whereas it detected a statistically significant effect of the drugs on the specific type of behaviour expressed by the animals. The time spent immobile decreased significantly in rats treated with DMI or FLX compared to the control group (Fig. 6A). Animals treated with FLX spent more time swimming than vehicle-treated animals (Fig. 6B), whereas animals treated with DMI spent more time climbing than the control group (Fig. 6C). Both scoring methods detected these statistically significant differences.

Fig. 6

Measurement of the effect of the drugs on FST behaviour measured with ML algorithm and manual scoring. Comparison between the manual and the ML algorithm scoring following drug treatment. When the ML algorithm was compared with manual scoring, no statistically significant differences were detected between the two methods for any behaviour assessed in the FST (Immobility: F(1,42) = 0.31, p = 0.579; Swimming: F(1,42) = 1.5, p = 0.227; Climbing: F(1,42) = 1.64, p = 0.208). Consistent with well-established literature on the effects of FLX (n = 8) and DMI (n = 8) in the FST, both methods successfully identified: (A) a decrease in immobility time (F(2,42) = 19.42, p < 0.0001) in rats treated with antidepressants compared to the vehicle group (Manual: vehicle vs. DMI, p < 0.0001; vehicle vs. FLX, p = 0.0003. ML: vehicle vs. DMI, p = 0.007; vehicle vs. FLX, p = 0.003) (B) the SSRI FLX preferentially increased swimming behaviour (F(2,42) = 13.34, p < 0.0001. Manual: vehicle vs. FLX, p = 0.002. ML: vehicle vs. FLX, p = 0.018); and (C) the TCA DMI preferentially increased climbing behaviour (F(2,42) = 13.02, p < 0.0001. Manual: vehicle vs. DMI, p = 0.0003. ML: vehicle vs. DMI, p = 0.008). Two-way ANOVA followed by Dunnett’s multiple-comparisons test was used. Data are presented as mean ± s.e.m. * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001 vs. vehicle treatment.

ML algorithm evaluation of the effects of amitriptyline (AMI), paroxetine (PRX), and venlafaxine (VLX) in the FST

Based on the previous results, we predicted that our ML algorithm would be able to discriminate between the various classes of antidepressants. To test this prediction, we used the algorithm to analyze the behavioural responses to three well-known antidepressants with distinct mechanisms of action: amitriptyline, paroxetine, and venlafaxine.

The analysis reveals a decrease in the immobility time in the animals treated with AMI and VLX compared to the vehicle group (Fig. 7A). Rats that received PRX and VLX significantly increase their time spent performing swimming (Fig. 7B), while animals treated with AMI significantly perform more climbing compared to the vehicle group (Fig. 7C).

Fig. 7

Effect of the drugs on FST behaviour measured with the ML algorithm. The ML algorithm accurately categorizes the effect of different classes of antidepressants in the FST. In line with the literature, our ML model assigned the TCA AMI (n = 5) high levels of climbing, whereas, for the SSRI PRX (n = 5), the model detected increased swimming. In the case of VLX (n = 5), which, compared to TCAs and SSRIs, has a more balanced noradrenergic/serotonergic profile, our model indicated a corresponding balance between swimming/climbing performance. (A) Immobility (F(3,16) = 4.48, p = 0.018. vehicle vs. AMI, p = 0.014; vehicle vs. PRX, p = 0.656; vehicle vs. VLX, p = 0.044); (B) Swimming (F(3,16) = 5.93, p = 0.006. vehicle vs. AMI, p = 0.97; vehicle vs. PRX, p = 0.027; vehicle vs. VLX, p = 0.011); (C) Climbing (F(3,16) = 10.55, p = 0.0005. vehicle vs. AMI, p = 0.023; vehicle vs. PRX, p = 0.075; vehicle vs. VLX, p = 0.541). One-way ANOVA followed by Dunnett’s multiple-comparisons test was used. Data are presented as mean \(\pm\) s.e.m. * p < 0.05 vs. the vehicle treatment.

These results are in line with data from the FST literature demonstrating that PRX and VLX preferentially increase the swimming time14,42, while AMI increases the climbing time in the rat40,41.

Discussion

The FST is a well-validated animal test fundamental to understanding the underlying pathophysiology of depressive disorders, assessing stress coping strategies, and evaluating the efficacy of existing and novel antidepressant drugs. However, despite the large use of this test in laboratory animals, a bias-free, accurate, reproducible, and less labor-intensive method to score the FST is still lacking.

In the present paper, we describe the development of an algorithm based on a 3D RCNN ML model, that automatically recognizes immobility, swimming, and climbing, the three main types of behaviour in the FST. Specifically, using a FST video dataset recorded in our laboratory, we first trained the ML algorithm to recognize these behaviours. Afterward, we validated it by applying it to the analysis of two additional datasets from experiments with antidepressants. The first drug experiment was labeled both by hand and by using our ML algorithm, while the second experiment was analyzed only by the ML algorithm.

The results demonstrate that, once trained, the model can correctly categorize and quantify the three behaviours in high accordance with the manual scoring of the two trained researchers. Moreover, data obtained following drug treatments demonstrated that the model is able to discriminate between different classes of antidepressants. In accordance with the literature, the model assigned high levels of climbing to desipramine and amitriptyline, two predominantly noradrenergic tricyclic antidepressants (TCAs), whereas for fluoxetine and paroxetine, two selective serotonin reuptake inhibitors (SSRIs), it detected higher levels of swimming than climbing10,13,14,15. Venlafaxine, compared to TCAs and SSRIs, has a more balanced noradrenergic/serotonergic profile. Accordingly, our model assigned to this drug a more balanced swimming/climbing performance. Finally, for all the antidepressants tested our ML model identified a significant increase in mobility, which is also consistent with the results of manual scoring reported in the literature10,13,14,15.

Compared to human manual scoring and the current automatic systems, our 3D RCNN model presents several advantages. For instance, we estimate that the time necessary to label a 5-min video by hand is at least 30 min, while our ML approach, considering also the whole preprocessing, takes around 1 min leading to significant time saving and reduced effort for the experimenter. Our ML algorithm is naturally blind to treatment conditions which eliminates the researcher’s confirmation bias. Multiple factors can influence human performance. For instance, with training over time, performance can progressively improve during the scoring phase. However, it can also worsen due to tiredness and distraction when scoring multiple videos consecutively. An ML-based scoring system is immune to these biases, ensuring consistent and standardized analysis.

Another source of bias is the interpretation of animal behaviours, which can vary among researchers leading to differences in classification and quantification. Our ML method provides an unambiguous and standardized tool for the identification of behaviours in the FST.

Compared to the other automatic systems currently available, which require feature extraction, the present 3D RCNN algorithm works directly with video pixels representing a significant advantage. Selecting specific features (such as the degree of mobility, or position coordinates of animal parts) is an arbitrary choice that may cause an oversimplified description of the rodent’s behaviour potentially leading to a loss of valuable information for its identification and labeling.

On the other hand, a potential disadvantage of our model is that it may not perform well in significantly different experimental environments, such as varying light conditions, camera position, or differences in the animal’s size relative to the cylinder. To mitigate these concerns here, we provide a detailed description of the environmental and technical parameters adopted to train and validate our ML model. In addition, it should also be considered that these limitations can be easily overcome by adopting a transfer-learning strategy and retraining the model on new datasets. Indeed, such approaches have proven successful in many computer vision applications (see, e.g.,48), enabling researchers to address data sparsity for specific tasks by retraining neural networks that were previously trained on more general tasks for which abundant datasets were available.

Another possible strategy is to quickly build a new dataset by manually correcting the labels predicted by the ML algorithm, following a human-in-the-loop paradigm. Retraining in this way can help refine the algorithm to better suit the specific needs of different laboratories.

In conclusion, we demonstrate that the proposed machine learning model coupled with the specific elaboration of video recordings can effectively discriminate the behavioural effects of different antidepressant drugs. This approach mitigates several biases typically associated with human labeling in recognizing rodents’ behaviours in the forced swim test.

Materials and methods

Animals

A total of 122 male and female Wistar rats (3–4 months old) were used for the present study. All subjects were bred in-house at the animal facility of the University of Camerino (Camerino, Italy). Rats were housed four per cage according to their sex and kept under a reversed 12:12 h light/dark cycle (lights off at 7 AM) in a temperature (20–22 °C) and humidity (45–50%) controlled room. Food (4RF18, Mucedola, Settimo Milanese, Italy) and tap water were provided ad libitum.

To train the ML algorithm, 78 male and female rats (n = 39/sex, weighing 350–500 g males and 250–300 g females) were used. The first (n = 24, 8/group) and the second experiment (n = 20, 5/group) were conducted in male Wistar rats only (400–500 g), following validated and well-established reference literature.

Before starting the experimental procedures, animals were handled 5 min a day for 5 days by the same operators who performed the experiments. Experiments were conducted during the dark phase of the light/dark cycle. All procedures were approved by the local ethical committee of the University of Camerino and the Italian Ministry of Health (prot.1D580.19). Experiments were conducted in adherence with the European Community Council Directive for Care and Use of Laboratory Animals and the National Institutes of Health Guide for the Care and Use of Laboratory Animals.

Experiments were carried out in accordance with ARRIVE 2.0 guidelines.

Forced swim test

The FST experiments were carried out following a previously described and validated procedure14,42. Briefly, swimming sessions were conducted by gently placing rats individually in a transparent plexiglass cylinder (height 50 cm, diameter 28 cm) filled with water to a depth of 30 cm at 23–25 °C.

For the drug experiments, two sessions were conducted: an initial 15-min pretest followed 24 h later by a 5-min test. Drug treatments were administered during the period between these two sessions. Following each swimming session, rats were removed from the cylinder, dried with paper towels, placed in front of a heat source for 20 min, and then returned to their home cages. Test sessions were video-recorded with a camera placed in front of the cylinders (resolution: 704 × 576 pixels; frame rate: 25 frames per second). The time spent in immobility, climbing, and swimming was measured.

Drugs

Fluoxetine hydrochloride (Sigma, St. Louis, MO) and desipramine hydrochloride (Sigma, St. Louis, MO) were dissolved in distilled water and administered subcutaneously (s.c.) at a dose of 20 mg/kg in a volume of 4 ml/kg.

Amitriptyline (E.G. S.p.A, Milan, Italy) and paroxetine (Angelini S.p.A., Rome, Italy) were diluted in distilled water and administered s.c. at a dose of 20 mg/kg in a volume of 4 ml/kg. Venlafaxine (Italfarmaco S.p.A, Milan, Italy) was prepared as above (diluted in distilled water, 20 mg/kg) and administered s.c. in a volume of 2 ml/kg.
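The doses and injection volumes above fix the solution concentration and the volume injected per animal. The following is a small arithmetic sketch; the function name and the 450 g example body weight are illustrative assumptions, not values prescribed by the protocol.

```python
def injection(body_weight_g, dose_mg_per_kg=20.0, volume_ml_per_kg=4.0):
    """Dose, injected volume, and solution concentration for one animal."""
    body_weight_kg = body_weight_g / 1000.0
    return {
        "dose_mg": dose_mg_per_kg * body_weight_kg,
        "volume_ml": volume_ml_per_kg * body_weight_kg,
        "concentration_mg_per_ml": dose_mg_per_kg / volume_ml_per_kg,
    }

# 20 mg/kg in 4 ml/kg corresponds to a 5 mg/ml solution
standard = injection(450)
# Venlafaxine: 20 mg/kg in 2 ml/kg corresponds to a 10 mg/ml solution
venlafaxine = injection(450, volume_ml_per_kg=2.0)
```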

All drugs were administered 23.5, 5, and 1 h prior to the 5-min swimming test49.

Drug doses were chosen based on published data14,40,41,42,50. To habituate rats to the drug injection procedure they received vehicle injections for three consecutive days before starting the experiments.

Experimental procedures

Dataset preprocessing

To provide the network with all spatiotemporal information necessary for behaviour recognition, as inputs we used Time x Length x Height (TxLxH) 3D spatiotemporal tensors, where T is the temporal dimension, and L and H are the spatial dimensions of the resized frame.

The training dataset was composed of 78 videos for a total of 503 min, and it was independently labeled by two experienced researchers. Any inconsistency between the two annotators was re-evaluated together to build a univocal dataset.

To label the dataset, the videos were analyzed by identifying the main behaviour in short temporal segments of the video, each lasting 3 s. This is a variant of the 5-s scoring method proposed by Slattery and Cryan (2012)49.

The length of 3 s was chosen as the best balance between two needs: segments long enough for the human annotators to recognize the behaviour clearly, and tensors kept as small as possible so that more instances could be extracted from each video and the computational resources involved remained limited.

In this way, 10,062 3D tensors were extracted from the 78 videos.

Swimming, climbing, and immobility were the three main behaviours analyzed. Since diving occurs rarely, it was labeled as swimming49.

The dataset was preprocessed through a Python script written using the OpenCV module43,44,45, as described below. The area of the cylinder was manually selected, and the resulting Region of Interest (ROI) was wrapped into a rectangle with a standardized dimension LxH. To enhance the visibility of the rat, every frame was subtracted from the first frame of the video, which shows no rat. The sequence of those images, organized in 3-s blocks, was concatenated to build a 3D tensor of size 75 × 64 × 128 (TxLxH), where 75 is the number of concatenated frames (3 s at 25 frames per second) and 64 × 128 is the dimension of a single wrapped frame. Finally, the pixels of the 3D spatiotemporal tensor were normalized. The whole process is shown in Fig. 8A. The occurrence of each behaviour in the dataset is shown in Fig. 8B.

Fig. 8

Dataset construction. To build the dataset the video is split into 3-s sub-videos and each sequence of frames composes a single data; every frame is subtracted from the first frame of the video containing no rodent; then, the ROI (blue line) of the cylinder is wrapped into a rectangle of standard dimension (LxH). Finally, the resulting 3D-tensor of dimension TxLxH is normalized. The related portion of the video was manually evaluated by two annotators to recognize immobility, swimming, and climbing behaviours (A). The final dataset is composed of the three interest behaviours (B).
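The tensor-building step can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' script: it assumes the ROI has already been cropped and resized to 64 × 128 grayscale frames (the OpenCV part is omitted), it takes the empty-cylinder frame as a separate `background` argument, and it uses per-tensor min-max scaling because the text does not specify the normalization. The function name `build_tensors` is hypothetical.

```python
import numpy as np

FPS = 25              # camera frame rate reported in the text
SEGMENT_S = 3         # segment length in seconds
T = FPS * SEGMENT_S   # 75 frames per tensor

def build_tensors(frames, background):
    """Turn ROI frames into normalized 3-s tensors.

    frames: array of shape (n_frames, 64, 128), already cropped and resized.
    background: array of shape (64, 128), a frame showing no rat.
    Returns an array of shape (n_segments, 75, 64, 128).
    """
    frames = np.asarray(frames, dtype=np.float32)
    background = np.asarray(background, dtype=np.float32)
    # Subtract every frame from the empty-cylinder frame
    diff = background[None, :, :] - frames
    # Group consecutive frames into 3-s blocks, dropping an incomplete tail
    n_segments = diff.shape[0] // T
    tensors = diff[: n_segments * T].reshape(n_segments, T, *diff.shape[1:])
    # Assumed normalization: scale each tensor to the range [0, 1]
    lo = tensors.min(axis=(1, 2, 3), keepdims=True)
    hi = tensors.max(axis=(1, 2, 3), keepdims=True)
    return (tensors - lo) / np.maximum(hi - lo, 1e-8)
```

With this segmentation, a 5-min test recorded at 25 frames per second (7500 frames) yields 100 tensors.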

The dataset was then randomly split into 90% train and 10% validation sets. The train dataset was used for the optimization of the ML parameters during the training phase, while the validation dataset was used to estimate the performance of the model on new data.
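A random 90/10 split of this kind can be sketched as below. The seed, the rounding rule, and the function name are assumptions made for illustration; the text does not state how the authors drew the split.

```python
import random

def split_indices(n_items, val_fraction=0.1, seed=0):
    """Shuffle item indices and return (train_indices, validation_indices)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    n_val = round(n_items * val_fraction)
    return indices[n_val:], indices[:n_val]

# 10,062 tensors, as in the training dataset described above
train_idx, val_idx = split_indices(10_062)
```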

Model architecture and training

The model architecture was based on a series of 3D convolutional layers and residual connections, coded using the Keras library44 with tensorflow backend45. The convolutional part was connected to a series of dense layers through a global pooling operation. The role of the dense part was to operate high-level processing on the features extracted by the convolutional layers. Finally, a three-neuron layer with softmax activation was used for the recognition of the behavioural classes.
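In generic notation (the symbols below are illustrative and are not taken from the authors' code), a residual block adds its input to the output of a stack of 3D convolutions \(\mathcal{F}\), and the final softmax layer converts the three output activations \(z_k\) into class probabilities, from which the predicted behaviour \(\hat{k}\) is taken:

\[
\mathbf{y} = \mathbf{x} + \mathcal{F}(\mathbf{x}), \qquad
p_k = \frac{e^{z_k}}{\sum_{j=1}^{3} e^{z_j}}, \qquad
\hat{k} = \arg\max_{k} \, p_k ,
\]

where \(k\) runs over immobility, swimming, and climbing. When the number of channels changes between the input and the output of a block, residual architectures commonly apply a projection to \(\mathbf{x}\) before the sum; whether this is done here is specified only by the architecture in Fig. 3.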

Since the behaviour recognition is invariant under the mirroring of the images around the vertical axis, the training dataset was batch-augmented by making use of this transformation51.
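The mirroring augmentation can be sketched with NumPy as follows. This is a hedged illustration: the function name is hypothetical, and which array axis holds the horizontal image dimension is an assumption (the last axis here), so `width_axis` should be adjusted to the actual tensor layout.

```python
import numpy as np

def mirror_augment(batch, labels, width_axis=-1):
    """Append a horizontally mirrored copy of every tensor in the batch.

    batch: array of shape (batch_size, T, ..., ...).
    labels: array of shape (batch_size,); mirroring leaves labels unchanged.
    width_axis: axis holding the horizontal image dimension (assumed).
    """
    mirrored = np.flip(batch, axis=width_axis)
    return (np.concatenate([batch, mirrored], axis=0),
            np.concatenate([labels, labels], axis=0))
```

Applied to a batch of 8 tensors, this returns 16 examples, which matches the effective batch content described for training.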

The model was trained with the Adam optimizer46 for 120 epochs, exploiting the computational acceleration provided by a state-of-the-art graphics processing unit (an RTX A6000 GPU). The batch size was set to 8; with the batch augmentation just described, each batch contained 16 examples. The learning rate was \(10^{-4}\).

To prevent the model from overfitting on the train dataset, early stopping was used. The model showing the highest accuracy on the validation set was selected and then used for pharmacological validation experiments.
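The model-selection rule described above amounts to keeping the checkpoint with the highest validation accuracy. A minimal sketch follows; the accuracy values are made up purely for illustration and the function name is hypothetical.

```python
def best_epoch(val_accuracies):
    """Return (epoch_index, accuracy) of the best epoch; earliest wins ties."""
    best_index = max(range(len(val_accuracies)),
                     key=lambda i: val_accuracies[i])
    return best_index, val_accuracies[best_index]

# Made-up validation accuracies, one per epoch (not the study's values)
history = [0.61, 0.74, 0.82, 0.80, 0.82]
epoch, accuracy = best_epoch(history)
```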

The library versions used are reported in Table 2.

Table 2 The version of the main libraries used.

Comparison of the ML algorithm with the human manual scoring: evaluation of the forced swimming effects of FLX and DMI

To evaluate the accuracy and reliability of the trained ML algorithm, a pharmacological experiment was performed to test the forced swimming response to established antidepressant drugs. Two groups of rats subjected to a 15-min pretest received FLX (20 mg/kg, 4 ml/kg) or DMI (20 mg/kg, 4 ml/kg) 23.5 h, 5 h, and 1 h prior to the start of the 5-min test. A third group of animals was treated with saline and served as a control. Videos were analyzed both by one of the two experienced annotators and by the trained ML algorithm.

As previously described, the videos were split into 3-s sub-videos and preprocessed. Each resulting 3D tensor was analyzed by the trained model (Fig. 9).

Fig. 9

Pipeline of ML annotation applied to a video. The video is split into 3-s sub-videos, and they are preprocessed as already described in Fig. 8: every frame of the sequence is subtracted from the first frame of the original video containing no rat; then, the ROI (blue line) of the cylinder is wrapped into a rectangle of standard dimension (LxH). Finally, the resulting 3D-tensor of dimension TxLxH is normalized. The resulting 3D tensor is evaluated by the model already trained.
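Once every 3-s tensor has a predicted label, the time spent in each behaviour follows by counting segments. The sketch below assumes one label per segment and uses hypothetical names; the example label sequence is invented for illustration.

```python
from collections import Counter

SEGMENT_S = 3
BEHAVIOURS = ("immobility", "swimming", "climbing")

def time_budget(segment_labels, segment_s=SEGMENT_S):
    """Seconds and percentage of test time spent in each behaviour."""
    counts = Counter(segment_labels)
    total_s = len(segment_labels) * segment_s
    return {
        behaviour: {
            "seconds": counts[behaviour] * segment_s,
            "percent": 100.0 * counts[behaviour] * segment_s / total_s,
        }
        for behaviour in BEHAVIOURS
    }

# Invented labels for a 5-min test (100 segments of 3 s)
labels = ["immobility"] * 50 + ["swimming"] * 30 + ["climbing"] * 20
budget = time_budget(labels)
```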

ML algorithm evaluation of the effects of AMI, PRX, and VLX in the FST

A second pharmacological experiment was carried out to assess the effectiveness of our ML algorithm in distinguishing the subtle differences in the swimming response elicited by different classes of antidepressants. For this purpose, employing the same protocol described above, AMI (20 mg/kg, 4 ml/kg), PRX (20 mg/kg, 4 ml/kg), and VLX (20 mg/kg, 2 ml/kg) were used.

Statistical analysis

Behavioural data were analyzed by ANOVA using GraphPad Prism version 9.5.1 (GraphPad Software, San Diego, California, USA).

In the first experiment, the comparison between the annotator and ML scoring methods was evaluated for each behaviour via two-way ANOVA with scoring method and treatment as between-subjects factors. In the second experiment, where only the ML scoring method was applied, one-way ANOVA was used with drug as the between-subjects factor.

ANOVA was followed by the Dunnett’s post-hoc test when appropriate, and statistical significance was conventionally set at p < 0.05.
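For readers who want to see where the F values and their degrees of freedom come from, the one-way case can be sketched in plain Python. This is an illustration only: the analyses in this study were run in GraphPad Prism, the function name is hypothetical, the example data are invented, and the Dunnett post-hoc test is not included.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variability of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variability of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Invented data: three groups of three observations
f_value, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

With four groups of five animals, as in the second experiment, the same formula gives degrees of freedom (3, 16), matching the F(3,16) values reported in Fig. 7.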

To evaluate the correlation of the two scoring methods the Pearson’s correlation was used.
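Pearson's \(r\) between the per-video times obtained with the two methods can be computed as in the sketch below. The function name and the example values are illustrative assumptions, not data from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Invented per-video immobility times (s): manual vs. ML scoring
manual_times = [150, 120, 90, 60]
ml_times = [147, 126, 84, 66]
r = pearson_r(manual_times, ml_times)
```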