Fig. 2
From: Multi-modal Language models in bioacoustics with zero-shot transfer: a case study

Illustration of fixed-window sound event existence classification. In bioacoustics, a common approach to detecting sound events of interest is to classify audio segments of a fixed window size. The usual procedure begins by converting raw audio into a visual representation, such as a spectrogram. The spectrogram is then divided into segments using a fixed time window (7 s in this example) and a window step size (also 7 s in this example). A visual classification model predicts the presence or absence of the sound event of interest in each segment. These predictions yield approximate time stamps that localize the sound events. In practice, the step size is often smaller than the window size, so that consecutive windows overlap and the predictions have finer temporal resolution. For example, under the BEANS setup, the Jackdaw benchmark has a 2-second window size with a 1-second step size [51].
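The windowing and time-stamp recovery described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `window_bounds` and `merge_positive_windows` are hypothetical, the classifier is replaced by a hand-written list of per-window predictions, and trailing audio shorter than one window is assumed to be dropped.

```python
from typing import List, Tuple


def window_bounds(duration_s: float, window_s: float, step_s: float) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of every full window in the recording.

    Assumption: a trailing remainder shorter than one window is discarded.
    """
    if window_s <= 0 or step_s <= 0:
        raise ValueError("window_s and step_s must be positive")
    bounds = []
    i = 0
    while i * step_s + window_s <= duration_s:
        start = i * step_s
        bounds.append((start, start + window_s))
        i += 1
    return bounds


def merge_positive_windows(
    bounds: List[Tuple[float, float]], predictions: List[bool]
) -> List[Tuple[float, float]]:
    """Merge overlapping or touching positive windows into approximate event intervals."""
    events: List[Tuple[float, float]] = []
    for (start, end), positive in zip(bounds, predictions):
        if not positive:
            continue
        if events and start <= events[-1][1]:
            # This window overlaps or touches the previous event: extend it.
            events[-1] = (events[-1][0], max(events[-1][1], end))
        else:
            events.append((start, end))
    return events


# Non-overlapping case from the figure: 7 s window, 7 s step, on a 21 s recording.
figure_windows = window_bounds(21, 7, 7)

# Overlapping case: 2 s window, 1 s step, on a 5 s recording.
overlap_windows = window_bounds(5, 2, 1)

# Hypothetical classifier output, one boolean per window (not real model predictions).
overlap_events = merge_positive_windows(overlap_windows, [False, True, True, False])
```

With overlapping windows, an event is reported as the union of the positive windows that cover it, so the recovered interval is only as precise as the window and step sizes allow.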