Background & Summary

Non-human primates, our close evolutionary cousins, exhibit various complex behaviors, including the extensive use of acoustically diverse vocal signals for communication with conspecifics. Comparative research on non-human primates can yield valuable insights into the evolutionary development of speech and language. Although non-human vocalizations can be difficult for humans to decipher, large acoustic datasets may make it possible to identify nuances that are critical to communication among animals but imperceptible to the human ear. There has been considerable interest recently in the common marmoset (Callithrix jacchus) as a neuroscientific model organism1, and many attempts have been made to study and characterize its vocal repertoire2,3,4,5,6,7.

Several studies have used machine learning to detect, segment, and/or label marmoset vocalizations. For example, Turesson et al.8 attempted to identify the most robust classifier for a small labeled vocalization dataset (~300 samples), while Phaniraj et al.9 sought to optimize source identification from a small dataset of labeled marmoset vocalizations (~7 K samples), and Zhang et al.6 focused on finding the best supervised deep learning-based methods for detecting and classifying marmoset vocalizations in a medium-sized dataset of labeled vocalizations (~20 K samples).

However, the existing studies present several important limitations. First, the audio recording setups did not allow recording above a sampling rate of 48 kHz, which is insufficient to capture the full frequency range of marmoset vocalizations, corresponding to their hearing range of 125 Hz to 36 kHz10. Second, the existing datasets did not provide a sufficient number of labeled vocalizations to leverage advanced analytical methods. Fine-grained statistical analyses, such as those based on deep learning for decoding animal communication11, require substantial data (usually hundreds of thousands of samples, e.g., Best et al.12 for animal vocalizations, and up to millions in state-of-the-art DNNs). Finally, the manually labeled vocalizations in most marmoset research studies are often not shared publicly. To address this gap, we employed signal processing and deep learning tools to automatically segment and cluster vocalizations, following methods detailed in recent computational neuroethology literature12,13,14. We then implemented an iterative refinement process to label vocalizations across extensive recordings, minimizing the need for expert supervision.

Here, we present a large collection of vocalizations of marmosets. We have acquired and segmented over 800,000 vocalizations with a sampling rate of 96 kHz from a soundproofed animal facility room that contains three cages (~20 marmosets) over a period of three years. Marmosets are capable of producing a diverse array of vocalizations, including trills, phees, twitters, tsiks, seeps, and infant cries, even when kept in captivity2,4,15,16.

Such a high-throughput method might produce noisier segmentation and labeling; therefore, we validated our dataset by selecting a representative sample of 700 recordings, which were reviewed and cross-examined by four independent experts to ensure accuracy and consistency. Yet, our method has the benefit of opening the way for four data-driven approaches. First, our comprehensive dataset allows for the characterization of the acoustical properties of the marmoset vocal repertoire at a group level, such as investigating whether certain call types often occur together17. This strategy has been proposed to investigate inter-species communication differences13. Second, future works could leverage such a large amount of data for studying the cortical processing of vocalizations using deep learning. In the last decade, DNN-based representations have emerged as the class of computational model that correlates best with human brain responses to a wide range of auditory stimuli, including speech18,19, language20,21, natural sounds22, and even music23. They have also proven proficient in untangling the neuro-computational mechanisms of face processing in macaques24,25,26. Leading computational neuroscience groups have advocated that engineering new bio-inspired DNN-based representations would further help us understand the development, organization, and learning objectives of sensory cortical processing27,28,29,30,31. A key parameter in training such models is the ecological relevance of the training data32. No study has investigated the possible similarity between the cortical processing of monkey vocalizations and the representations learned by DNNs trained on monkey vocalizations. Indeed, training such networks would require substantial data, at least hundreds of thousands of input samples12, and up to millions for state-of-the-art approaches33,34; no dataset of this nature is, to date, publicly available for monkey vocalizations.
We propose a dataset of 253 hours of segmented marmoset vocalizations acquired during social interactions within three family groups, allowing the above-mentioned training regimes to be used for the first time for supervised and unsupervised training, with the latter seeming to be the most bio-plausible32. Such DNN models trained on marmoset vocalizations could then be mapped onto existing marmoset brain responses to vocalizations35,36,37,38, using representational similarity analysis39 (e.g., in22,40) or brain-scoring41. A third approach could focus on training classifiers with labeled data for passive monitoring in natural settings42. Since our vocalizations were recorded in laboratory settings, a model trained with our dataset might require transfer learning to match the distribution of the auditory environment to be monitored43. Finally, a recent key study in NeuroAI leveraged deep-learning-inspired approaches to map behavioral actions onto neural activity44, a line of research that could benefit from a high number of vocalizations.

However, it is important to acknowledge certain limitations. The dataset lacks information about the sender’s identity and the context in which the vocalization was produced. This absence of sender identification restricts the study of information encoded in vocalizations and the study of developmental processes (vocal ontogeny), such as the resemblance of infant to adult call types. Future research could benefit from monitoring systems that capture sender and contextual information.

Methods

Animal retrieval and care

This study involved a total of thirty-five common marmosets (Callithrix jacchus) from a single colony initially structured into three families (same dialect). The marmosets were kept on a 12-hour light/dark cycle that began at 8 a.m. They were fed three times per day (pellets around 9 a.m., vegetables around 11 a.m., treats and gum arabic around 3 p.m.). The cages contained no boxes or other materials that could absorb vocalizations or allow the animals to hide. Before entering the housing area, all staff (including researchers, veterinarians, and animal technicians) had to knock on the door to alert the animals and prevent any form of stress. Moreover, they were instructed not to talk in the presence of the marmosets. Most animals were present during the same period, except when a conflict or death occurred, at which point we re-established breeding pairs to maintain the colony. Consequently, while new family groups were formed over time, only three distinct families were in the room at any given moment, housed in three cages (each cage is 1.05 m long × 0.85 m wide × 2 m high). For more details on the periods of inclusion of each monkey, refer to Supplementary Table 1 and Supplementary Fig. 3. All animals included are the offspring of parents and grandparents that were born and raised in captivity for research purposes. All experimental procedures were in compliance with the European directive (2010/63/UE) and were approved by the Ethics Board of Institut de Neurosciences de la Timone (reference 2019010911313842).

Experimental setup

Acoustic recorders were set up in a lab with captive marmosets (Fig. 1). The recordings were made using one microphone (C-100, Sony Corporation, Japan; frequency response of 20 Hz to 50 kHz) placed directly in the room housing the three marmoset families (see Supplementary Table 1). Among the recorded tracks, we used the one with the best signal quality. The mixing desk (RME Fireface UFX II, RME, Germany) and the computer allowing the recording via Adobe Audition (Adobe, CA, USA) were located in an adjacent room. The husbandry and technical rooms are soundproofed from the rest of the laboratory animal facility. Audio data were recorded from December 2019 to April 2023, totaling 997 hours of recordings (WAV format, sampling rate: 96 kHz, depth: 32 bit).

Fig. 1

Schematic of the recording system. The diagram shown here is a schematic drawing of the recording setup; the relative sizes and positions of the components are not to scale. The husbandry room (1) contained three cages (only one visible here, (2)) and one microphone (3). The technical room (4) was separated by a wall and contained a mixing desk (5) and a computer (6) allowing the recording. The husbandry and technical rooms were soundproofed with specialized insulation (7).

Segmentation and labeling

To build a dataset of marmoset vocalizations annotated by type, we followed the pipeline shown in Fig. 2, which comprises three steps: Detection, Cluster-based labeling and Iterative label refinement. Each step is described in this section.

Fig. 2

Pipeline for creating the published database. The Detection phase aims to provide a first pool of segmented vocalizations to start the Cluster-based labeling phase. Later in the process, in the Iterative label refinement phase, all of the raw recordings are passed to the trained classifier.

The Detection phase aims to isolate an initial pool of vocalizations from background noise (examples of noise in Supplementary Fig. 5). We used a stationary noise reduction algorithm relying on spectral gating (noisereduce Python package13). We then identified the vocalization sound events in part of the recordings (2019-2020) using a dynamic-thresholding segmentation algorithm13. This algorithm dynamically sets a noise threshold based on the anticipated level of silence within a vocal behavior clip. It then identifies syllables as continuous vocal segments separated by noise (for more details, see ref. 13, Segmentation). This first phase yielded approximately 100,000 segmented audio events (including some ‘noise’ audio events at this stage) (Fig. 2, blue panel; see hyperparameters in Supplementary Table 2). A measurement of the detection and segmentation accuracy is provided and exemplified in Supplementary Fig. 1.
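The dynamic-thresholding idea can be sketched as follows. This is a minimal numpy illustration of the principle, not the implementation from ref. 13: the frame sizes, the quantile-based noise-floor estimate, and the 10 dB margin are illustrative assumptions.

```python
import numpy as np

def segment_by_energy(audio, sr, frame_len=1024, hop=512,
                      silence_quantile=0.2, margin_db=10.0):
    """Toy dynamic-thresholding segmentation: frames whose RMS energy
    exceeds an adaptive noise floor are grouped into syllables."""
    frames = np.lib.stride_tricks.sliding_window_view(audio, frame_len)[::hop]
    rms_db = 20 * np.log10(np.sqrt((frames ** 2).mean(axis=1)) + 1e-12)
    # Noise floor estimated from the quietest frames (anticipated silence).
    floor = np.quantile(rms_db, silence_quantile)
    active = rms_db > floor + margin_db
    # Merge consecutive active frames into (onset, offset) pairs in seconds.
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * hop / sr, (i * hop + frame_len) / sr))
            start = None
    if start is not None:
        segments.append((start * hop / sr, (len(active) * hop + frame_len) / sr))
    return segments

# Synthetic check: 0.3 s of faint noise, a 0.2 s tone burst, 0.3 s of noise.
sr = 96_000
rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(int(0.3 * sr))
tone = 0.5 * np.sin(2 * np.pi * 7000 * np.arange(int(0.2 * sr)) / sr)
audio = np.concatenate([noise, tone, noise])
segs = segment_by_energy(audio, sr)  # one segment near (0.3 s, 0.5 s)
```

In the synthetic example, the single detected segment closely matches the onset and offset of the embedded tone burst.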

Given the large number of utterances to label, we opted for a semi-automated procedure leveraging unsupervised and self-supervised machine learning strategies to explore the sound event space, label the vocalization types, and filter out noisy sound events (Fig. 2, orange panel). A convolutional auto-encoder (network architecture and training procedure detailed in12) was trained on spectrograms of 0.5-second-long acoustic extracts to encode them into a 16-dimensional latent space, allowing the measurement of vocalization similarity10,13. The representations were short-time Fourier transforms (STFT; Hann window of 1,024 samples, no FFT padding, hop size of 368), on which we applied a Mel-like bank of 128 triangular filters, logarithmically spread between 1 kHz and 48 kHz. The Mel scale is a popular choice of center frequencies aiming to mimic pitch perception characteristics of the human auditory system, which, in our case, we extend to higher frequencies to fully cover the range of marmoset vocalizations. These representations were subsequently treated as points in a feature space after applying the dimensionality reduction algorithm UMAP45. We then clustered vocalizations close to one another in feature space using a density-based algorithm46, allowing the annotation of vocalizations by type (Fig. 2, orange panel, ‘Clustered sound events’). Clusters, which encompass hundreds to thousands of sound events, were meticulously examined by experts, who associated these clusters with specific call types and filtered out any misclassifications. For each cluster, an expert reviewed a folder of spectrogram images, discarding any that did not align with the cluster’s general trend. Subsequently, these cluster sounds were either categorized by vocalization type or labeled as ‘noise.’ This process yielded a partially labeled database, essential for the subsequent iterative label refinement procedure.
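The spectrotemporal representation described above (STFT with a 1,024-sample Hann window and a hop of 368, followed by 128 log-spaced triangular filters between 1 kHz and 48 kHz) can be approximated as follows. This is a sketch using scipy, and the filterbank construction is a generic log-spaced implementation rather than the authors' exact code.

```python
import numpy as np
from scipy.signal import stft

def log_triangular_filterbank(n_filters, n_fft, sr, f_min, f_max):
    """Bank of triangular filters with log-spaced centre frequencies
    (a Mel-like scale extended to the marmoset frequency range)."""
    edges = np.geomspace(f_min, f_max, n_filters + 2)   # log-spaced edges
    bins = np.fft.rfftfreq(n_fft, 1 / sr)               # Hz of each STFT bin
    fb = np.zeros((n_filters, len(bins)))
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        rise = (bins - lo) / (ctr - lo)
        fall = (hi - bins) / (hi - ctr)
        fb[i] = np.clip(np.minimum(rise, fall), 0, None)
    return fb

sr, n_fft, hop = 96_000, 1024, 368
audio = np.sin(2 * np.pi * 8000 * np.arange(sr // 2) / sr)  # 0.5 s test tone
_, _, spec = stft(audio, fs=sr, window='hann', nperseg=n_fft,
                  noverlap=n_fft - hop, padded=False, boundary=None)
power = np.abs(spec) ** 2
fb = log_triangular_filterbank(128, n_fft, sr, 1_000, 48_000)
mel_spec = np.log(fb @ power + 1e-10)   # (128 filters, time frames)
```

For the 8 kHz test tone, the filter with maximal average energy lies where the log-spaced centre frequencies cross 8 kHz, confirming the filterbank's placement.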

After compiling the initial database, we engaged in an iterative process: we trained a classifier and then improved its predictions by visually inspecting and manually correcting multiple spectrograms displayed simultaneously. These spectrograms were sampled from instances where the classifier misidentified labels with high confidence (Fig. 2., green panel).

We continued this process until no outliers were identified for each label. We empirically determined, for each label, a confidence-score threshold above which the classifier’s predictions generalized well (Infant cry ≥ 0.5, Phee ≥ 0.7, Seep ≥ 0.86, Trill ≥ 0.86, Tsik ≥ 0.7, and Twitter ≥ 0.7). For example, if the classifier predicted that a given vocalization was a Phee call and the associated confidence score was greater than or equal to 0.7, we retained this prediction. Conversely, any vocalization with prediction confidence below these label-specific thresholds was re-labeled as ‘Vocalization’ (i.e., a vocalization of unknown type; examples can be found in Supplementary Fig. 6). At this stage, some call types were excluded because the classifier could not classify them robustly due to insufficient representation in the dataset (e.g., composite call types like Seep-Ek and Trill-Phee). Ultimately, we retained the six most reliably identified vocalization types: Infant cry, Phee, Seep, Trill, Tsik, and Twitter (see examples in Supplementary Fig. 4).
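The label-acceptance rule can be expressed compactly; the threshold values below are those stated in the text, while the function name is a hypothetical helper for illustration.

```python
# Label-specific confidence thresholds (values stated in the text).
THRESHOLDS = {"Infant cry": 0.50, "Phee": 0.70, "Seep": 0.86,
              "Trill": 0.86, "Tsik": 0.70, "Twitter": 0.70}

def finalize_label(predicted_label, confidence):
    """Keep the classifier's prediction only if its confidence reaches the
    label-specific threshold; otherwise fall back to the generic
    'Vocalization' (unknown type) label. Labels absent from THRESHOLDS
    (e.g., excluded composite call types) are never retained."""
    if confidence >= THRESHOLDS.get(predicted_label, 1.1):
        return predicted_label
    return "Vocalization"
```

For instance, a Phee predicted at 0.72 is retained, while a Trill predicted at 0.80 falls below its 0.86 threshold and is re-labeled ‘Vocalization’.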

Finally, in the last step (Iterative label refinement, box 2), we segmented all the raw recordings into overlapping windows and ran the trained classifier on each segment. We then used the points of maximum confidence to identify each vocalization’s onset and offset. At this stage, most onsets and offsets were slightly shifted. To correct for the classifier’s imprecision, we empirically adjusted each vocalization’s start and end times based on its predicted label post-classification (Supplementary Table 3). As a result of this process, we were able to segment 871,044 vocalizations (253 hours), of which 215,000 (72 hours) were identified as a specific type of vocalization (see Fig. 3 for the latent projection of all the vocalizations, colored by label). To allow users to investigate the dataset further and visualize the relationships between call types, we developed Marmaudio Explorer, an interactive visualization interface of the low-dimensional projection of the vocalizations (Supplementary Fig. 9). It allows users to select a group of points, which displays the spectrograms of the corresponding vocalizations in the Spectrograms panel, and saves the selected metadata and spectrogram images in a new folder for further investigation.
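The max-confidence localization and per-label boundary correction can be sketched as follows. The window timing and the ADJUST values are hypothetical placeholders: the actual per-label adjustments are given in the paper's Supplementary Table 3.

```python
import numpy as np

# Hypothetical per-label (onset, offset) corrections in seconds; the real
# empirically determined values are in Supplementary Table 3.
ADJUST = {"Phee": (-0.05, 0.05), "Trill": (-0.02, 0.02)}

def locate_call(confidences, window_starts, window_dur, label):
    """Place the call at the overlapping window of maximum classifier
    confidence, then apply the per-label boundary correction."""
    i = int(np.argmax(confidences))
    onset = window_starts[i]
    offset = onset + window_dur
    d_on, d_off = ADJUST.get(label, (0.0, 0.0))
    return max(0.0, onset + d_on), offset + d_off

# Overlapping 0.5 s windows with a 0.1 s hop over a 2 s recording.
starts = np.arange(0.0, 1.5 + 1e-9, 0.1)
conf = np.exp(-((starts - 0.7) ** 2) / 0.01)  # confidence peaks near t = 0.7 s
on, off = locate_call(conf, starts, 0.5, "Phee")
```

Here the Phee call is placed at the 0.7 s window and its boundaries are widened by the (hypothetical) ±0.05 s correction.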

Fig. 3

Latent projection of vocalizations. For each segmented vocalization, we computed a spectrotemporal representation. Using the trained encoder, we transformed these representations into a 16-dimensional space. From there, we employed the UMAP technique to map the data into a latent feature space. The colored points denote the predictions to which the classifier assigned a high confidence score.

With timestamps for each vocalization, the dataset could offer a starting point for exploring the sequential organization of the marmoset vocal repertoire at the group level, though not at the individual level. See Figs. 4, 5 for visual representations of the vocalizations’ distribution; see Supplementary Table 4 for the distribution by label. One additional application could be to monitor the times at which vocalizations are produced (see Supplementary Fig. 7 and the code used to produce it).
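As a first look at such group-level sequential structure, call-type transitions can be counted from the timestamped labels. This is a minimal sketch: the 2-second maximum gap between consecutive calls is an arbitrary illustrative choice, and the toy event list stands in for the real annotations.

```python
from collections import Counter

def transition_counts(events, max_gap=2.0):
    """Count label bigrams between consecutive calls separated by at most
    `max_gap` seconds, as a first look at group-level call sequences."""
    events = sorted(events)                      # (onset_seconds, label) pairs
    counts = Counter()
    for (t0, a), (t1, b) in zip(events, events[1:]):
        if t1 - t0 <= max_gap:
            counts[(a, b)] += 1
    return counts

# Toy timeline (onset in seconds, predicted label).
events = [(0.0, "Phee"), (0.8, "Phee"), (5.0, "Twitter"),
          (5.9, "Trill"), (6.4, "Trill")]
bigrams = transition_counts(events)
```

In the toy timeline, the Phee-to-Twitter gap (4.2 s) exceeds the cutoff and is therefore not counted as a transition.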

Fig. 4

Distribution of vocalizations. Distribution over call type of 871,044 vocalizations. Seeps are rarer than the other call types. Previous literature has described the Seep as a warning and an alarm call2,4,51; a smaller Seep sample size is therefore expected.

Fig. 5

Temporal distribution of vocalizations. Distribution over time of 215,000 labeled vocalizations (72 hours in total). For each month, the proportions of vocalization types are indicated in thousands of vocalizations and in hours. The proportion of labeled to unlabeled vocalizations is 25/75% (unlabeled omitted here). Initial recordings began as a proof of concept in 2019–2021, confirming the feasibility of segmenting and labeling vocalizations from raw data. Recording frequency gradually increased in 2022 to capture more detailed patterns. Starting in 2022, precise time information (hour, minute, second, and millisecond) was saved. Temporal variation reflects occasional disturbances in group structure, including conflicts and deaths.

Since several cages are in the same room, a few overlapping calls can still occur; in such cases, we consider the sound extract ‘noisy’ and remove it from the analysis to avoid misidentification. However, previous studies indicate that marmoset conversations follow a turn-taking system: for example, marmosets modulate their calls to reduce overlap47 or favor call production during periods of silence48.

Data Records

The data are publicly available on the Zenodo data repository49. The data consist of:

1. Vocalizations.zip: The 871,044 recorded audio files (FLAC format, sampling rate: 96 kHz, depth: 32 bit).

2. Annotations.tsv: The annotation file with 871,044 annotations. These annotations were obtained from the semi-automatic labeling (see above) and include details such as the predicted vocalization type. The content of each column in the annotation file is described in Table 1. Each annotation corresponds to a single vocalization in one file. Most files contain one vocalization, corresponding to one predicted call type, though a few files contain several vocalizations.

    Table 1 Annotation details.
3. Metadata.pdf: A metadata file that details the annotation definitions, the vocalization type error rates, the description of the recorded subjects, the temporal distribution of vocalizations by label over time, and the individual presence over time with family group (respectively Tables 1 and 2, Supplementary Table 1, Supplementary Table 4, and Supplementary Fig. 3).

    Table 2 Vocalization type error rates.
4. Raw_audio_example_2020_03_05_0.wav: A 5-minute-long example raw audio file.

5. Audio_Examples.zip: A set of audio example files (a small subset of representative examples sampled from the recorded audio files).

6. Noise_Examples.zip: A few representative audio example files of noise.

7. Code_Usage.zip: A code folder containing a sample Python code exemplifying data loading and plotting of vocalization spectrograms; a sample Python code exemplifying classifier loading and vocalization type prediction; the associated code to run the examples; and a classifier trained on the six vocalization types: Phee, Trill, Seep, Twitter, Tsik, and Infant cry (which can also be found in Supplementary Code 1,2).

8. Technical_Validation_Data.zip: The 700 representative recordings used for cross-examination by four experts.

The recorded audio files are divided into folders by month of recording, with no more than 10,000 files per folder. The annotation and metadata files are in tab-separated values (TSV) format to ease their use with automatic tools and allow direct upload into spreadsheet software. The metadata file includes descriptions of all identifiers in the annotation file. The example files contain several audio recordings that illustrate the different recorded sounds. They are provided to help users become familiar with the recorded data. These examples include Phee calls, Twitter calls, Infant cries, and examples of background noise.
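A minimal sketch of loading the annotation file is shown below. The column names `file_name` and `predicted_label` are hypothetical stand-ins: the actual column names are documented in Table 1 and the metadata file. A small in-memory TSV is used here so the snippet is self-contained.

```python
import csv
import io

def read_annotations(tsv_text):
    """Parse TSV annotation text into a list of dicts (one per vocalization)."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Tiny in-memory stand-in for Annotations.tsv; real column names are in Table 1.
demo = ("file_name\tpredicted_label\n"
        "2020_01/voc_000001.flac\tPhee\n"
        "2020_01/voc_000002.flac\tTrill\n")
rows = read_annotations(demo)
phees = [r for r in rows if r["predicted_label"] == "Phee"]
```

In practice, one would pass the contents of Annotations.tsv (or open the file directly with `csv.DictReader`) and resolve each `file_name` against the monthly folders in Vocalizations.zip.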

Technical Validation

The annotation types were defined by Manon Obliger-Debouche (M.O.D.) and Sabrina Ravel (S.R.). The recordings were annotated semi-automatically by Charly Lamothe (C.L.) and Paul Best (P.B.1). These observers were certified after annotating several days of recordings, which were then validated by an expert (M.O.D. or S.R.). 700 annotated recordings (100 per label type) were sampled randomly and then carefully re-annotated by M.O.D., C.L., P.B.1, and S.R. The annotators were presented with each of the 700 randomly sampled vocalizations along with the prediction of the trained classifier, and had to indicate with a yes/no reply whether they agreed with it (see example of the annotation interface in Supplementary Fig. 8).

Errors were counted when there was a discrepancy between the post-hoc and the original annotations, or when the post-hoc examination concluded that some doubt still existed (e.g., if only 3 out of the 4 annotators confirmed a label, it was considered an error; procedure from50). The error rate was computed as the number of errors divided by the total number of annotations, resulting in an average of 9.43% for vocalization type identification (i.e., 90.57% accuracy; confidence interval [CI]: 86.00–95.00%). Accuracy was calculated as 1 − error rate (see Table 2 for scores per label type). We quantified inter-rater reliability using the accuracy across vocalization categories, showing high agreement between specific raters (Supplementary Fig. 2).
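The error-counting rule above (any sample not confirmed unanimously by the four raters counts as an error) can be sketched as follows; the vote matrix is a toy example, not the actual validation data.

```python
def error_rate(votes_per_sample, n_raters=4):
    """A sample counts as an error unless all raters confirm the original
    label: a 3-of-4 split already counts as an error, following the
    procedure described in the text."""
    errors = sum(1 for votes in votes_per_sample if sum(votes) < n_raters)
    return errors / len(votes_per_sample)

# Toy example: each inner list holds the four raters' yes (1) / no (0) replies.
votes = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 1, 1]]
rate = error_rate(votes)   # 2 of the 4 toy samples are errors
acc = 1 - rate             # accuracy = 1 - error rate
```

Applying the same computation to the 700 validated recordings yields the per-label scores reported in Table 2.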