Table 2 Average Precision (AP) of CLAP in zero-shot transfer, pretrained on a growing number of pairs, compared with a supervised baseline. Higher is better.

From: Multi-modal Language models in bioacoustics with zero-shot transfer: a case study

| Settings | Models | Jackdaw | Freefield | Warblr | Rfcx-Bird | Rfcx-Frog | Hiceas | Enabirds | Meerkat | Tropical-Gunshots |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised | ResNet-18 | 0.99 | 0.83 | 0.96 | 0.88 | 0.79 | 0.30 | 0.98 | 0.94 | 0.64 |
| Zero-Shot Transfer | CLAP-HTS-AT (450 K) | 0.95(↓) | 0.82(↓) | 0.96(-) | 0.70(↓) | 0.78(↓) | 0.29(↓) | 0.96(↓) | 0.81(↓) | 0.49(↓) |
| Zero-Shot Transfer | CLAP-HTS-AT (2.1 M) | 0.96(↓) | 0.84(↑) | 0.96(-) | 0.79(↓) | 0.81(↑) | 0.30(-) | 0.98(-) | 0.87(↓) | 0.67(↑) |
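
The markers in parentheses appear to compare each zero-shot AP with the supervised ResNet-18 AP on the same dataset. The sketch below recomputes them from the values in Table 2 under that assumption; the names `marker` and `annotate` are hypothetical helpers introduced here for illustration, not part of the original study.

```python
# AP values transcribed from Table 2, in column order.
DATASETS = [
    "Jackdaw", "Freefield", "Warblr", "Rfcx-Bird", "Rfcx-Frog",
    "Hiceas", "Enabirds", "Meerkat", "Tropical-Gunshots",
]
SUPERVISED = [0.99, 0.83, 0.96, 0.88, 0.79, 0.30, 0.98, 0.94, 0.64]
CLAP = {
    "CLAP-HTS-AT (450 K)": [0.95, 0.82, 0.96, 0.70, 0.78, 0.29, 0.96, 0.81, 0.49],
    "CLAP-HTS-AT (2.1 M)": [0.96, 0.84, 0.96, 0.79, 0.81, 0.30, 0.98, 0.87, 0.67],
}


def marker(zero_shot: float, baseline: float, tol: float = 1e-9) -> str:
    """Return '↑', '↓' or '-' for a zero-shot AP relative to the baseline AP.

    Assumption: the arrows in the table encode this comparison.
    """
    if abs(zero_shot - baseline) <= tol:
        return "-"
    return "↑" if zero_shot > baseline else "↓"


def annotate(scores, baseline=SUPERVISED):
    """Format each score as in the table, e.g. '0.84(↑)'."""
    return [f"{s:.2f}({marker(s, b)})" for s, b in zip(scores, baseline)]


if __name__ == "__main__":
    for model, scores in CLAP.items():
        print(model)
        for name, cell in zip(DATASETS, annotate(scores)):
            print(f"  {name}: {cell}")
```

Read this way, the model pretrained on 2.1 M pairs matches or exceeds the supervised baseline on six of the nine datasets, while the 450 K model does so only on Warblr.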