Table 4 Experiment results on text prompts.

From: Multi-modal Language models in bioacoustics with zero-shot transfer: a case study

(a) CLAP-HTS-AT (2.1M) performance of recognizing birds in the background. Higher is better.

Prompt template: "Is this a sound of {} or frogs?"

Prompt fill for {}                       AP
Birds                                    0.54
Birds singing                            0.63
Birds singing in the background          0.73
Birds singing far in the background      0.79
Supervised baseline                      0.88

(b) CLAP-HTS-AT (2.1M) performance of recognizing gunshot sounds in tropical rain forest. Higher is better.

Prompt template: "Is this a sound of {A} or {B}?"

Prompt fill for {A} and {B}                                  AP
A: Gunshots, B: Noise                                        0.36
A: Gunshots in the distance, B: Noise                        0.57
A: Gunshots in the distance, B: Broken branches or noise     0.67
Supervised baseline                                          0.64

(c) CLAP-PANN (128K) performance of recognizing meerkat sounds using a 2-second window.

Prompt template: "Is this a sound of {} or non-animal noise?"

Prompt fill for {}                  AP
Meerkats                            0.56
Meerkats growling                   0.68
Meerkats clucking                   0.80
Meerkats clucking or growling       0.79
Growling                            0.63
Clucking                            0.82
Clucking or growling                0.78
Animals                             0.85
Animals growling                    0.82
Animals clucking                    0.86
Animals clucking or growling        0.88
Supervised baseline                 0.94
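
Each column in the sub-tables corresponds to one phrase substituted into the prompt template, and each resulting prompt is scored by AP. The following is a minimal sketch of how such prompts could be built and how an AP value could be computed. It assumes AP means average precision over clips ranked by the model's score for the target class, which the table itself does not define. The helpers `build_prompts` and `average_precision` are hypothetical names for illustration, not part of CLAP or of the authors' code, and no model call is shown.

```python
from typing import List, Sequence

# Templates and fill-in phrases copied from Table 4(a) and 4(b).
TEMPLATE_A = "Is this a sound of {} or frogs?"
FILLS_A = [
    "Birds",
    "Birds singing",
    "Birds singing in the background",
    "Birds singing far in the background",
]
TEMPLATE_B = "Is this a sound of {A} or {B}?"


def build_prompts(template: str, fills: Sequence[str]) -> List[str]:
    """Substitute each phrase into a single-slot template (hypothetical helper)."""
    return [template.format(fill) for fill in fills]


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k where a positive clip appears.

    Assumption: labels are 1 for the target class and 0 otherwise, and a
    higher score means the model favours the target class.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    # Rank clip indices from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("average precision is undefined without positive labels")
    return sum(precisions) / len(precisions)
```

As a toy check with made-up values, labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]` place the positives at ranks 1 and 3, so the result is (1/1 + 2/3) / 2, roughly 0.83. The two-slot template in (b) can be filled with keyword arguments, for example `TEMPLATE_B.format(A="Gunshots", B="Noise")`.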