Table 4 Experiment results on text prompts.
From: Multi-modal Language models in bioacoustics with zero-shot transfer: a case study
(a) CLAP-HTS-AT (2.1M) performance of recognizing birds in the background. Higher is better.

| Is this a sound of {} or frogs? | AP |
|---|---|
| Birds | 0.54 |
| Birds singing | 0.63 |
| Birds singing in the background | 0.73 |
| Birds singing far in the background | 0.79 |
| Supervised baseline AP | 0.88 |

(b) CLAP-HTS-AT (2.1M) performance of recognizing gunshot sounds in tropical rain forest. Higher is better.

| Is this a sound of {A} or {B}? | AP |
|---|---|
| A: Gunshots, B: Noise | 0.36 |
| A: Gunshots in the distance, B: Noise | 0.57 |
| A: Gunshots in the distance, B: Broken branches or noise | 0.67 |
| Supervised baseline AP | 0.64 |

(c) CLAP-PANN (128K) performance of recognizing meerkat sounds using a 2-second window.

| Is this a sound of {} or non-animal noise? | AP |
|---|---|
| Meerkats | 0.56 |
| Meerkats growling | 0.68 |
| Meerkats clucking | 0.80 |
| Meerkats clucking or growling | 0.79 |
| Growling | 0.63 |
| Clucking | 0.82 |
| Clucking or growling | 0.78 |
| Animals | 0.85 |
| Animals growling | 0.82 |
| Animals clucking | 0.86 |
| Animals clucking or growling | 0.88 |
| Supervised baseline AP | 0.94 |