Background & Summary

Under the current scenario of global biodiversity loss, there is an urgent need for more precise and informed environmental management1. In this context, data derived from animal monitoring plays a crucial role in informing environmental managers for species conservation2,3. Animal surveys provide important data on population sizes, distribution, and trends over time, which are essential to assess the state of ecosystems and identify species at risk4,5. By using systematic monitoring data on animal and bird populations, scientists can detect early warning signs of environmental change, such as habitat loss, climate change impacts, and pollution effects6,7,8. This information helps environmental managers develop targeted conservation strategies, prioritize resource allocation, and implement timely interventions to protect vulnerable species and their habitats2,9,10. Furthermore, bird surveys often serve as indicators of local ecological conditions, given birds’ sensitivity to environmental changes, making them invaluable in the broader context of biodiversity conservation and ecosystem management11. However, monitoring birds, like any other animal group, is highly resource-consuming. Automated monitoring systems that reduce the investment required to obtain accurate population data are therefore much needed.

The first step towards algorithms that detect species automatically is to build datasets with information on species traits on which those algorithms can be trained. For example, a common way to classify species is by their vocalizations12. For this reason, organizations such as the Xeno-Canto Foundation (https://xeno-canto.org) compiled a large-scale online database13 of bird sounds with more than 200,000 voice recordings and 10,000 species worldwide. This dataset was crowdsourced and is still growing today. The huge amount of data it provides has facilitated the organization of challenges to create bird-detection algorithms using acoustic data in understudied areas, such as those led by the Cornell Lab (https://www.birds.cornell.edu). This is the case of BirdCLEF202314 and BirdCLEF202415, which used acoustic recordings of eastern African and Indian birds, respectively. While these datasets contain many short recordings from a wide variety of birds, other authors have released datasets composed of fewer but longer recordings, which better imitate real wildlife scenarios. Examples are NIPS4BPlus16, which contains 687 recordings totalling 30 hours, and BirdVox-full-night17, which consists of 6 recordings of 10 hours each.

Although audio is a common way to classify bird species and the field of bioacoustics has grown tremendously in recent years, another possible approach to identifying species automatically is using images18. One such bird image dataset is Birds52519, which offers a collection of almost 90,000 images covering 525 different bird species. Another standard image dataset is CUB-200-201120, which provides 11,788 images of 200 different bird species. This dataset provides not only bird species labels, but also bounding boxes and part locations for each image. There are also datasets aimed at specific world regions, such as NABirds21, which includes almost 50,000 images of the 400 most common birds seen in North America. This dataset provides a fine-grained classification of species, as its annotations differentiate between male, female, and juvenile birds. These datasets can be used to create algorithms for the automatic detection of species from image data.

However, another important source of animal ecology information, much less studied because of the technological challenges involved in its use, is video. Video recordings may offer information not only about which species are present in a specific place, but also about their behavior. Information about animal behavior may be very relevant for understanding individual and population responses to anthropogenic impacts and has therefore been linked to conservation biology and restoration success22,23,24,25. Despite this potential for animal monitoring and conservation, databases on wildlife behavior are more limited. For example, the VB100 dataset26 comprises 1416 clips of approximately 30 seconds, involving 100 different North American bird species. The only dataset of annotated videos of bird behavior available in the literature is the Animal Kingdom dataset27, which is not specifically aimed at birds and contains annotated videos of multiple animals. Specifically, it contains 30,000 video sequences of multi-label behaviors involving 6 different animal classes; however, the number of bird videos was not specified by the authors. Table 1 summarizes the main information of the datasets reviewed.

Table 1 Summary of reviewed bird datasets.

Due to the scarcity of datasets of bird videos annotated with behaviors, this study presents the first fine-grained behavior detection dataset for birds. Unlike Animal Kingdom, where a video is associated with the multiple behaviors occurring in it, our dataset provides spatio-temporal behavior annotations. This means that videos are annotated per frame, with the behavior and the location of the bird (i.e., a bounding box) annotated in each frame. Moreover, the identification of the bird species appearing in the video is also provided. The proposed dataset is composed of 178 videos recorded in Spanish wetlands, more specifically in the region of Alicante (southeastern Spain). The 178 videos yield 858 behavior clips involving 13 different bird species. The average duration of the behavior clips is 19.84 seconds and the total duration of the recorded dataset is 58 minutes and 53 seconds. The annotation process involved several steps of data curation, with a technical team working alongside a group of professional ecologists. In comparison to other bird video datasets, ours is the first to offer annotations for species, behaviors, and localization. Furthermore, Visual WetlandBirds is the first dataset to provide frame-level annotations. A feature comparison of bird video datasets is presented in Table 2.

Table 2 Features comparison between available birds video datasets.

Table 3 reflects the different species collected for the dataset, distinguishing between their common and scientific names. The number of videos and minutes recorded for each species is also included.

Table 3 Statistics for each of the bird species.

Seven main behaviors were identified as the key activities recorded in our dataset. These represent the main activities performed by waterbirds in nature28. Table 4 lists these behaviors alongside the number of clips recorded for each of them and the mean duration of each behavior in frames. A clip is a segment of video in which a bird is performing a specific behavior.

Table 4 Total number of clips per behavior and their mean duration.

Figure 1 presents sample frames in which only a single bird individual can be distinguished. However, the dataset contains not only videos with a single individual, but also videos in which several birds appear together. This is the case for gregarious species, i.e., species that concentrate in an area to carry out different activities. Although the individuals of gregarious species often share the same behavior at the same time, it is also common for several behaviors to be seen in the same video at the same time. Figure 2 shows sample frames in which this phenomenon occurs. To obtain the statistics shown in Table 4, videos involving different birds and/or different activities performed sequentially were cut into clips in which a unique individual performs a unique behavior.

Fig. 1
figure 1

Sample frames from the dataset.

Fig. 2
figure 2

Sample frames where gregarious birds appear performing different behaviors.

Among the seven proposed behaviors, the differences between the Alert, Preening, and Resting behaviors should be underlined. These distinctions were established by the ecology team. We considered that a bird was Resting when it was standing without making any movement. A bird was performing the Alert behavior when it was moving its head from side to side, moving and looking around for possible dangers. Finally, we considered that a bird was Preening when it was standing and cleaning its body feathers with its beak. The remaining behaviors are self-explanatory and are therefore not described.

As can be seen in Table 4, the number of clips per behavior is unbalanced between classes. This is because videos in which some specific behaviors occur are harder to obtain, as happens with Flying and Preening, the activities with the lowest number of clips in the dataset. These behaviors are performed less frequently and are therefore more difficult to record. To collect more data on these less common behaviors, more hardware and human resources (i.e., cameras and professional ecologists) are needed to cover a wider area of the wetlands. Alternatively, techniques such as data augmentation29 can generate synthetic data from the existing material. While allocating more human and hardware resources would ensure that the quality of the new data remains high, it is also costly, as high-quality cameras and additional ecology professionals are expensive. In contrast, synthetic generation techniques are widely used in current research, as they provide a low-cost way of increasing the amount of data available for training. Although inexpensive, synthetic data can decrease the quality of the dataset, so a trade-off between real and synthetic data would be necessary to limit the cost without compromising the quality of the videos. Despite the unbalanced nature of the behaviors, no balancing technique was applied to the released dataset in order to maximize the number of different environments captured, thereby ensuring the variability of the contexts in which the birds are recorded.

Additionally, Table 4 also shows the mean duration of the clips per behavior. It is worth noting the difference in the number of frames between Flying, which represents the minimum with 61 frames, and Swimming, which represents the maximum with 257 frames. This difference is explained by the nature of the behaviors: swimming is naturally a slow behavior, which can be performed for a long time over the same area, whereas flying is a fast behavior, and the bird quickly leaves the camera's field of view, especially in videos obtained by camera traps, which cannot follow the bird while it is moving.

To collect the videos, we deployed a set of camera traps and high-quality cameras in Alicante wetlands. The camera traps automatically recorded videos triggered by motion detected in the environment. We complemented the camera-trap videos with recordings from high-quality cameras, in which a human operator controls the focus of the camera, obtaining better views and perspectives of the birds being recorded. The species recorded, the behaviors identified, and the camera deployment areas were defined by professional ecologists based on their expertise. Figure 3 shows video frame crops in which all the bird species in the dataset can be seen performing the different behaviors.

Fig. 3
figure 3

Video frame crops of bird species performing the seven behaviors composing the dataset.

After data collection, a semi-automatic annotation method composed of an annotation tool and a deep learning model was used to annotate the videos. After annotation, a cross-validation was conducted to ensure annotation quality. This method is explained in detail in the next section.

To test the dataset for species and behavior identification, two baseline experiments were carried out: one for the bird classification task, which involves the classification of the species and the correct localization of the bird given input frames, and a second one for the behavior detection task, which involves the correct classification of the behavior performed by one bird over a set of frames.

Methods

Data acquisition

The data were acquired within Alicante wetlands, specifically in La Mata Natural Park and El Hondo Natural Park (southeastern Spain). In these locations, we deployed a collection of high-resolution cameras and camera traps in different areas of the wetlands. These areas were determined by the species expected to be recorded, as different species are commonly seen in different wetland areas.

Camera traps are activated when movement is detected and can thus record for long periods of time without human intervention. The usage of automatic camera traps30,31,32 is common in wildlife monitoring, as it provides a low-cost approach to collecting video and image data from the environment. However, the focus of these cameras is fixed, and thus videos of the same individual are often short. Manual cameras require the presence of a human while recording and are therefore more time-consuming. Also, the presence of the camera operator may affect the animal’s behavior. However, they permit manual changes of the camera perspective in order to correctly record the bird’s behavior.

Two models of camera traps were used: the Browning Strike Force Pro HD and the Bushnell Core HD, both featuring a sensor resolution of 24 megapixels, a shot speed of 0.21 seconds, and a field of view of 55°. For manual recordings, the Canon PowerShot SX70 was employed, which has a sensor resolution of 20.3 megapixels and a shot speed of 5 × 10⁻⁴ seconds. As different camera models and capture settings were used, videos of different resolutions were obtained: 87 videos at 1920 × 1080 px, 75 videos at 1296 × 720 px, 14 videos at 1280 × 720 px, 1 video at 960 × 540 px, and 1 video at 3840 × 2160 px.

The species selected were those most commonly found in Alicante wetlands, facilitating the recording of videos and providing valuable data to the natural parks where the videos were recorded. In terms of behaviors, we identified the most representative ones for the selected species, in order to cover as much of the range of activities performed by the birds as possible.

To ensure the generalization capabilities of models trained on this dataset, a variety of lighting and seasonal conditions, backgrounds, viewpoints, and video resolutions were considered. Regarding lighting conditions, the professionals responsible for the recordings were instructed to capture footage of birds at different times of day, thereby enhancing data variability. The dataset includes diverse lighting scenarios such as daylight, sunset, low-light, and backlight. Low-light and backlight scenes pose additional challenges for detection models, as they reduce the visibility of color features (often relevant for species identification) and make it more difficult to distinguish bird silhouettes from the background (e.g., top-right crop in Fig. 3). Regarding seasonal conditions, video recordings were conducted throughout the entire year to ensure representation of the environmental variability associated with the four seasons. However, due to Alicante’s characteristically low annual precipitation, most of the videos in the dataset feature either sunny or cloudy weather conditions. While this may limit atmospheric diversity, it can also benefit the model training process by facilitating clearer visual identification of bird species, as the absence of rain-related distortions contributes to more interpretable video data.

To mitigate background bias in species detection, recordings were captured in a variety of natural contexts. Although the dataset was collected in wetland environments, it includes birds situated on water, the ground, grass, and tree branches (e.g., background differences between Alert crops in Fig. 3). Additionally, variations in lighting conditions affect water color, further contributing to background diversity (e.g., water color differences in Resting crops in Fig. 3). The dataset also includes a range of camera viewpoints. In some sequences, birds appear in the foreground, while in others, they are captured at greater distances, simulating real-world variability. Lastly, the inclusion of videos with different resolutions enhances the adaptability of models to real-world deployment settings, where camera quality may vary. For optimal performance in specific environments, it is recommended to fine-tune the models using data collected from the intended deployment context.

Data annotation

Accurate annotation of the captured data is a determining factor in obtaining relevant results when training deep learning models on these data. To ensure annotation accuracy, the use of annotation tools33,34 is widespread, as they provide a user-friendly interface that makes this process easy and accessible to non-technical staff.

There are many open-source annotation tools available on the market. CVAT (https://github.com/cvat-ai/cvat) is one of the most popular ones, as it provides annotation support for images and videos, including a variety of formats for exporting the data. VoTT (https://github.com/microsoft/VoTT) is also popular when annotating videos, as it offers multiple annotation shapes and integration with Microsoft services to easily upload data to Azure. Other simpler annotation tools are labelme (https://github.com/labelmeai/labelme) and LabelImg (https://github.com/HumanSignal/labelImg), which are aimed at annotating images and whose capabilities are more limited. For our purpose, we decided to use CVAT because of the large number of exportable formats available, the collaborative environment it offers, and its easy integration with semi-automatic and automatic annotation processes.

As the need for larger amounts of data to train deep learning models increased, researchers began to enhance annotation tools with automatic systems that could alleviate this task. Annotation tools integrate machine learning models35 that can automatically infer what would otherwise be manually annotated. Common tasks performed by automated annotation tools are object detection36 and semantic segmentation37. While the former predicts the bounding box and class of each object in the image, the latter predicts regions of interest associated with specific categories.

Although fully automated annotation systems have demonstrated strong performance, semi-automated annotation processes were ultimately chosen because they ensure the creation of highly accurate annotations while greatly reducing the amount of human intervention required. Semi-automated annotation approaches are widely used in the medical field38,39, where precision is a key factor throughout the design.

In this study, a semi-automated annotation approach was followed, based on CVAT and its integration with powerful computer vision models. Our approach consisted of six main steps: species classification, bird localization, behavior classification, subject identification, data curation, and post-processing. Each of these stages is described in more detail below, and Figure 4 shows the overall process.

  1. Manual species classification: In this first step, the ecologists manually labeled each video with the main bird species that appeared. The main species is that of the bird in the focus of the camera. This way, annotations of birds that do not belong to the main species are not included in the video annotations.

  2. Automated bird localization: Then, an object detection model was used to predict the bounding boxes of the birds appearing in each video frame. YOLOv740 was chosen as the object detection model for ease of implementation, as it is already integrated into CVAT. Since the model provided by CVAT is trained on general-purpose data, the class predicted by default for each bounding box is not the bird species, but the generic class bird. To avoid manually changing all the bounding box classes, we used an option provided by CVAT to associate a user-defined class with the class detected by the model. In this way, the class bird was associated with the species appearing in the video.

  3. Manual behavior classification: This manual stage had a twofold objective. First, the ecologists checked and corrected erroneous bounding boxes, and second, they annotated for each bounding box the behavior performed by the bird. To annotate the behaviors, CVAT bounding box tags were used.

  4. Automated subject identification: When using automatic annotation models such as YOLOv7, CVAT does not maintain bounding box correspondence between frames. In other words, if a video shows two birds performing different behaviors, there is no relationship between the bounding boxes of adjacent frames, so it is not possible to analyze each bird's behavior over time. To address this problem, the Euclidean distance41 was used to associate the bounding boxes of adjacent frames: the distance between the centers of bounding boxes in adjacent frames is computed, and the boxes with the minimum distance are matched (a minimal sketch of this matching is given after this list). The center of a bounding box was calculated as follows:

    $$c=\left(\frac{{x}_{\min }+{x}_{\max }}{2},\frac{{y}_{\min }+{y}_{\max }}{2}\right)$$
    (1)

    Given the centers of two bounding boxes in adjacent frames, the Euclidean distance was calculated as:

    $$d({c}_{1},{c}_{2})=\sqrt{{\left({x}_{2}-{x}_{1}\right)}^{2}+{\left({y}_{2}-{y}_{1}\right)}^{2}}$$
    (2)
    $$\begin{array}{rcl}{\rm{where}}\,{c}_{1} & = & ({x}_{1},{y}_{1})\\ {\rm{and}}\,{c}_{2} & = & ({x}_{2},{y}_{2})\end{array}$$
  5. Manual data curation: After the labeling of species, bounding boxes, behaviors, and subjects, an overall review of all annotations was conducted to ensure the high quality of the data. To conduct the review, the videos were distributed equally among all ecologists.

  6. Automated post-processing: Once the annotations were complete, their format was adapted to make them easy to use and understand. To achieve this, the approach used in the AVA-Kinetics dataset42 was followed, in which a single CSV file stores the localized behaviors of multiple subjects. To export the data into this format, the annotations were first exported from CVAT using the CVAT Video 1.1 format. Python scripts were then used to extract only the relevant information from the exported data and write it to the output CSV file.
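As referenced in step 4 above, the center-based matching of Eqs. (1) and (2) can be expressed as a short script. The following is a minimal sketch, not the exact script used during annotation; the box format and the greedy matching order are assumptions.

```python
import math

def center(box):
    """Center of a bounding box given as (x_min, y_min, x_max, y_max); Eq. (1)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def match_boxes(prev_boxes, curr_boxes):
    """Associate boxes of adjacent frames by minimum center distance; Eq. (2)."""
    candidates = []
    for i, pb in enumerate(prev_boxes):
        for j, cb in enumerate(curr_boxes):
            candidates.append((math.dist(center(pb), center(cb)), i, j))
    matches, used_prev, used_curr = {}, set(), set()
    for _, i, j in sorted(candidates):       # greedy: closest pairs of centers first
        if i not in used_prev and j not in used_curr:
            matches[i] = j                   # same subject in consecutive frames
            used_prev.add(i)
            used_curr.add(j)
    return matches
```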

Fig. 4
figure 4

Visual representation of stages involved in the annotation process. Birds are first classified into species by annotators and localized using a YOLO model. Then, annotators recognize bird behaviors, subjects are identified using a Python script, and finally, the data is curated and post-processed.

Annotation criteria

The annotation process involved addressing several specific challenges identified during this stage. Two primary issues emerged: the annotation of individual birds exhibiting multiple behaviors simultaneously, and the handling of minor sub-movements within a dominant behavior.

It is common for birds to perform more than one activity at the same time. However, in our annotation protocol, only a single behavior could be assigned to each bird per frame. In such cases, the behavior considered most biologically relevant was selected. In our dataset, this situation was associated with the behavior Feeding, which often co-occurred with locomotor behaviors such as Walking or Swimming. Based on input from the ecological experts involved in the project, Feeding was prioritized due to its higher biological relevance. This behavior is closely linked to key ecological functions. It also serves as an indicator of habitat quality, as successful foraging reflects the availability of adequate food resources within the wetland environment. Moreover, Feeding can provide insights into species-specific foraging strategies and dietary preferences, which are valuable for ecological monitoring and conservation applications. In addition, Feeding behavior tends to be more behaviorally diverse and species-specific, thereby offering richer information for training models to distinguish fine-grained differences between species (a central contribution of the dataset). In contrast, behaviors such as Walking or Swimming are more ubiquitous and less discriminative across species. Finally, as Feeding occurred less frequently than other behaviors, prioritizing it in multi-behavior frames also contributed to improving class balance across the annotated data. Figure 5 shows an example of how Feeding is prioritized.

Fig. 5
figure 5

Sample clip in which a bird performs the Feeding and Walking behavior simultaneously. In such cases, Feeding is prioritized by annotators due to its higher biological relevance.

Animals often switch between actions very quickly in response to the changing environment. Thus, for a collection of movements of a bird to be considered a behavior, it had to last a minimum of 30 frames; otherwise, the collection of movements was identified as a sub-movement of another main behavior, which is the one annotated for those frames. Moreover, this also facilitates the training of deep learning models, as very short behaviors are difficult to segment and classify due to the limited motion information they provide. The annotation tool used (CVAT, as described in the Data annotation subsection) provides a progress bar displaying the total number of frames and the current frame being viewed. This feature made it easier for annotators to determine whether a behavior lasted at least 30 frames. Figure 6 shows an example of this annotation strategy.

Fig. 6
figure 6

Within this clip, a bird is performing the Preening behavior but briefly interrupts it and transitions to Resting. As Resting lasts for fewer than 30 frames, it is considered part of the Preening behavior.
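For illustration only, the 30-frame rule can also be expressed programmatically over a per-frame label sequence. The sketch below is not part of the released annotation pipeline (the rule was applied manually by the annotators), and the absorption of a leading short run is simplified.

```python
MIN_FRAMES = 30  # minimum duration for a run of movements to count as a behavior

def enforce_min_duration(frame_labels, min_frames=MIN_FRAMES):
    """Absorb behavior runs shorter than min_frames into the preceding behavior."""
    # split the per-frame label sequence into runs of identical labels
    runs = []
    for label in frame_labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1
        else:
            runs.append([label, 1])
    # merge short runs into the preceding run (simplification: a leading short run is kept)
    merged = []
    for label, length in runs:
        if length < min_frames and merged:
            merged[-1][1] += length
        else:
            merged.append([label, length])
    # expand back to one label per frame
    smoothed = []
    for label, length in merged:
        smoothed.extend([label] * length)
    return smoothed
```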

Data Records

The dataset presented in this study is open access and available through Zenodo43. The Zenodo repository contains five main elements:

  • Videos folder: This folder contains the 178 videos that comprise the dataset. Videos are identified by their name, which is composed of a numeric value and the species that appears in the video, following the format "ID-VIDEO.SPECIES-NAME.mp4".

  • Bounding boxes CSV: The bounding_boxes.csv file contains all the annotations of the dataset. It has 10 columns, ordered as follows: global identifier of the row, video identifier, frame identifier within the video, activity identifier, subject identifier, species appearing in the video, and the four coordinates of the bounding box (top-left x-coordinate, top-left y-coordinate, bottom-right x-coordinate, and bottom-right y-coordinate). Each CSV row represents the information of one bounding box within one frame of a video (a loading sketch is given after this list).

  • Behavior identifiers CSV: The behavior_ID.csv file contains a mapping of the seven behavior classes that make up the dataset and their numeric identifiers.

  • Species identifiers CSV: The file species_ID.csv contains a mapping between the 13 different bird species and their numerical identifier.

  • Splits JSON: The splits.json file contains the videos associated with each train, validation and test split.
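A minimal sketch of loading the annotation files with pandas is shown below. The column order follows the description above, but the specific column names and the absence of a header row are assumptions made for illustration.

```python
import pandas as pd

COLUMNS = [
    "row_id", "video_id", "frame_id", "behavior_id", "subject_id",
    "species", "x_min", "y_min", "x_max", "y_max",
]

# per-frame bounding boxes with behavior and subject identifiers
boxes = pd.read_csv("bounding_boxes.csv", names=COLUMNS, header=None)

# mappings between class names and numeric identifiers
behaviors = pd.read_csv("behavior_ID.csv")
species = pd.read_csv("species_ID.csv")

# example: all annotated boxes of one video, ordered by frame
first_video = boxes["video_id"].iloc[0]
print(boxes[boxes["video_id"] == first_video].sort_values("frame_id").head())
```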

Technical Validation

To ensure high-quality recordings and accurate annotations, the entire process was carried out by expert ecologists. The ecologists used a semi-automated approach during the annotation process, as described in the Data annotation section.

Firstly, the video recordings were supervised by a group of experts who set up camera traps in strategic areas and also recorded some high-quality videos. For each video, these experts manually annotated the species appearing in it. The same experts then manually corrected bounding box errors and annotated bird behavior, together with a number of collaborators with a background in ecology. Finally, a stage of manual cross-checking of the annotations was carried out by the experts and collaborators. The expertise of the annotators responsible for collecting and annotating the videos, together with the final cross-review process, ensures the quality and cleanliness of the data.

To quantitatively evaluate the quality of the annotations, we conducted an inter-annotator agreement assessment. This evaluation measures the consistency of the labeling criteria adopted by different annotators using three complementary metrics: Cohen’s Kappa44, Fleiss’ Kappa45, and the macro-averaged F1 score. Furthermore, as this dataset is mainly intended for deep learning pipelines, baseline deep learning models trained on our dataset were developed. As mentioned previously, the purpose of this dataset is twofold, as it provides annotation data for bird species detection and behavior classification tasks. Thus, one baseline per task was developed using PyTorch as the coding platform.

Inter-annotator agreement assessment

In order to evaluate the annotation consistency between annotators, three metrics were used in the assessment: Cohen’s Kappa, Fleiss’ Kappa, and the macro-averaged F1 score. Cohen’s Kappa and Fleiss’ Kappa are widely used metrics for assessing annotation quality in multi-annotator settings, as they quantify agreement beyond what would be expected by chance. While Cohen’s Kappa measures the agreement between two annotators, Fleiss’ Kappa generalizes this concept to more than two annotators, offering a single global measure of inter-annotator reliability. This is particularly relevant for our dataset, which was annotated by four individuals. In contrast, the macro-averaged F1 score measures the degree to which annotators consistently assign the same class labels, evaluating agreement on a per-class basis. We specifically use the macro version of the F1 score because it treats each class equally, thus mitigating the effects of class imbalance. Table 5 reports the results obtained using these metrics. Since both Cohen’s Kappa and the macro F1 score are pairwise metrics, the table presents their average pairwise scores. For further information, Fig. 7 shows full pairwise agreement matrices.

Table 5 Results of the inter-annotator agreement evaluation. Cohen’s Kappa and Macro F1 values shown are the average of each of the pairwise values obtained.
Fig. 7
figure 7

Inter-annotator agreement matrices for the dataset. (a) Cohen’s Kappa. (b) Macro F1 Score.

The results reported in Table 5 indicate a high degree of annotation consistency. The average pairwise Cohen’s Kappa (0.858) and the overall Fleiss’ Kappa (0.855) suggest an excellent level of agreement among annotators. These results confirm that annotators followed a consistent set of criteria when labeling bird behaviors. Additionally, the macro-averaged F1 score of 0.946 highlights strong class-wise consistency, showing that annotators not only agreed in general but also consistently identified the same behavior categories across clips. This supports the reliability of the dataset for training and evaluating deep learning models.
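The agreement metrics above can be reproduced with standard Python libraries. The sketch below assumes the per-clip behavior labels of the four annotators have been aligned in the same order; the example labels are purely illustrative.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = annotated items, columns = the four annotators (illustrative labels only)
annotations = np.array([
    ["Feeding", "Feeding", "Feeding",  "Walking"],
    ["Alert",   "Alert",   "Alert",    "Alert"],
    ["Resting", "Resting", "Preening", "Resting"],
])

# average pairwise Cohen's Kappa and macro-averaged F1 score
kappas, f1s = [], []
for a, b in combinations(range(annotations.shape[1]), 2):
    kappas.append(cohen_kappa_score(annotations[:, a], annotations[:, b]))
    f1s.append(f1_score(annotations[:, a], annotations[:, b], average="macro"))
print("mean pairwise Cohen's Kappa:", np.mean(kappas))
print("mean pairwise macro F1:", np.mean(f1s))

# Fleiss' Kappa over all annotators at once
counts, _ = aggregate_raters(annotations)  # item x category count table
print("Fleiss' Kappa:", fleiss_kappa(counts, method="fleiss"))
```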

Species classification

As the dataset was primarily designed for training deep learning models, two baseline models were developed to evaluate its applicability. First, the baseline pipeline for species classification is introduced. This baseline is based on a YOLOv946 model trained for 50 epochs on the proposed dataset. YOLOv9 was selected due to its widespread use and strong reputation in object detection pipelines. The model is notable for its low inference times, making it suitable for real-time applications, while maintaining high accuracy across a wide range of scenarios.

Train, test, and validation splits were generated from the full set of videos with a 70-15-15 distribution. The splits were constructed using a stratified strategy based on the species and behaviors appearing in the videos, and the distribution was computed by taking into account the number of frames in each video (e.g., one video with 1,000 frames is equivalent to five videos with 200 frames). For efficient training, the frames extracted from the videos were downsampled by a factor of 10. This can be done without affecting the performance of the model, as the difference between successive frames is minimal. The frames were extracted while maintaining the source FPS (frames per second) of each video. Training used a learning rate of 0.01 and was run on a GeForce RTX 3090 GPU. A minimal sketch of this training configuration is shown below; the test results from the baseline are presented afterwards.
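The sketch assumes the Ultralytics implementation of YOLOv9; the dataset configuration file (wetlandbirds.yaml), which would list the extracted frame images, labels, and the 13 species names, is hypothetical and not part of the release.

```python
from ultralytics import YOLO

model = YOLO("yolov9c.pt")            # pretrained YOLOv9 checkpoint
model.train(
    data="wetlandbirds.yaml",         # hypothetical config: frame images, labels, 13 species
    epochs=50,                        # number of epochs reported for the baseline
    lr0=0.01,                         # initial learning rate reported for the baseline
    device=0,                         # single GeForce RTX 3090
)
metrics = model.val(split="test")     # precision, recall, mAP50, mAP50-95
```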

Table 6 shows the test results for species classification in terms of precision, recall, mAP50, and mAP50-95. mAP50-95 is a common object detection metric that refers to the mAP (mean Average Precision) computed over 10 different IoU (Intersection over Union) thresholds, from 0.50 to 0.95 in increments of 0.05. The results show that YOLOv9 achieves strong performance on this task, with a maximum precision of 0.835 and a recall of 0.759. The mAP metrics, which evaluate the accuracy of the bounding box localizations, also indicate robust performance, reaching 0.801 for mAP50 and 0.556 for mAP50-95. These are notable results, especially considering the challenge of achieving high mAP scores at stricter IoU thresholds.

Table 6 Results of the YOLO-based baseline developed for bird species classification.

To provide a more comprehensive understanding of the evaluation, the confusion matrix for the results is given. Figure 8 shows the confusion matrix, where it can be observed that the majority of the errors are due to the confusion of the ground truth class with the background class.

Fig. 8
figure 8

Confusion matrix of species classification pipeline.

Behavior detection

Secondly, the behavior detection baseline is presented. In this baseline, five different video classification models were trained end-to-end to perform the behavior classification task: Video MViT47, Video S3D48, Video SwinTransformer49, Video ResNet50, and TimeSFormer51. These models were selected due to their popularity for video classification tasks across a wide range of contexts, as well as their ease of use through the PyTorch and HuggingFace Transformers libraries, which facilitates the reproducibility of the experiments. Moreover, the selected models are based on different architectures commonly used in computer vision: while Video S3D and Video ResNet rely on convolutional networks, Video MViT, Video SwinTransformer, and TimeSFormer are built upon the Transformer architecture as their fundamental building block. Model architectures and pretrained weights were obtained from these libraries.

For the training, test, and validation splits, the same distribution was used as in the species classification baseline. Input videos were downsampled with a rate of 3, keeping only the first frame of each set of three consecutive frames. Regarding the training hyperparameters, the learning rate was tuned using a uniform sampling strategy with minimum and maximum values of 0.0001 and 0.01, respectively. As in the species classification baseline, training was performed on a GeForce RTX 3090 GPU. An illustrative fine-tuning sketch is shown below, and the results for each model are presented afterwards.
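The sketch uses Video ResNet as an example and assumes torchvision's pretrained video models; the clip-loading DataLoader and the learning rate value are placeholders, since the actual value was selected by the tuning procedure described above.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

NUM_BEHAVIORS = 7

# Kinetics-pretrained Video ResNet with a new 7-class behavior head
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_BEHAVIORS)
model = model.cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder within [1e-4, 1e-2]
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    """loader yields (clips, labels); clips shaped (B, C, T, H, W), frames downsampled x3."""
    model.train()
    for clips, labels in loader:
        clips, labels = clips.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
```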

From Table 7 it can be concluded that the Video ResNet model learns the complexity of the dataset best, achieving a maximum accuracy of 0.56. Conversely, the model with the lowest score is the S3D model, with an accuracy of 0.29. These results show the challenge posed by the dataset under study, which presents a limited amount of data. The limited amount of data available for training complex deep learning models highlights the need for more resources to capture additional footage. Furthermore, new training strategies and deep learning architectures that fit the characteristics of the data should be explored in order to improve the baseline results obtained.

Table 7 Results of the baseline models for behavior detection in terms of accuracy. The learning rate shown is the one that achieved the highest accuracies during hyperparameter tuning.

Usage Notes

Since the data annotations are provided in CSV format, it is recommended to use Python libraries such as pandas, which is specifically designed to read and manage tabular data such as CSV files. The official GitHub repository containing the code includes usage examples of how to load and prepare the data to be fed into deep learning models; the dataset.py script in the behavior_detection directory is recommended as a starting point.
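As a complement to the repository's dataset.py, the sketch below outlines a hypothetical minimal PyTorch Dataset that groups the annotation rows into behavior clips and loads the corresponding frames with OpenCV. The column names, the video-path lookup, and the grouping strategy are assumptions (consecutive runs of the same behavior are not split here).

```python
import cv2
import pandas as pd
import torch
from torch.utils.data import Dataset

COLUMNS = [
    "row_id", "video_id", "frame_id", "behavior_id", "subject_id",
    "species", "x_min", "y_min", "x_max", "y_max",
]

class WetlandBirdClips(Dataset):
    def __init__(self, csv_path, video_paths):
        boxes = pd.read_csv(csv_path, names=COLUMNS, header=None)
        # one sample per (video, subject, behavior) group -- a simplification
        self.clips = list(boxes.groupby(["video_id", "subject_id", "behavior_id"]))
        self.video_paths = video_paths  # dict: video_id -> full path to the .mp4 file

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        (video_id, _, behavior_id), rows = self.clips[idx]
        wanted = set(rows["frame_id"].tolist())
        cap = cv2.VideoCapture(self.video_paths[video_id])
        frames, frame_idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx in wanted:
                frames.append(torch.from_numpy(frame).permute(2, 0, 1))  # C, H, W (BGR)
            frame_idx += 1
        cap.release()
        clip = torch.stack(frames).permute(1, 0, 2, 3).float() / 255.0   # C, T, H, W
        return clip, int(behavior_id)
```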