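Table 2. Average Precision (AP) of zero-shot CLAP, pretrained on a growing number of pairs, compared with a supervised ResNet-18 baseline. Higher is better. The marker after each CLAP score shows whether it is above (↑), below (↓), or equal to (-) the supervised score on that dataset.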
From: Multi-modal Language models in bioacoustics with zero-shot transfer: a case study
| Settings | Models | Jackdaw | Freefield | Warblr | Rfcx-Bird | Rfcx-Frog | Hiceas | Enabirds | Meerkat | Tropical-Gunshots |
|---|---|---|---|---|---|---|---|---|---|---|
| Supervised | ResNet-18 | 0.99 | 0.83 | 0.96 | 0.88 | 0.79 | 0.30 | 0.98 | 0.94 | 0.64 |
| Zero-Shot Transfer | CLAP-HTS-AT (450 K) | 0.95(↓) | 0.82(↓) | 0.96(-) | 0.70(↓) | 0.78(↓) | 0.29(↓) | 0.96(↓) | 0.81(↓) | 0.49(↓) |
| Zero-Shot Transfer | CLAP-HTS-AT (2.1 M) | 0.96(↓) | 0.84(↑) | 0.96(-) | 0.79(↓) | 0.81(↑) | 0.30(-) | 0.98(-) | 0.87(↓) | 0.67(↑) |
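
The markers in the table can be reproduced by comparing each CLAP score with the supervised ResNet-18 score in the same column. The following is a minimal sketch of that comparison, not code from the paper: the AP values are copied from the table, while the helper names `marker` and `compare` and the tolerance `tol` are illustrative assumptions.

```python
# AP values copied from Table 2, in column order.
datasets = [
    "Jackdaw", "Freefield", "Warblr", "Rfcx-Bird", "Rfcx-Frog",
    "Hiceas", "Enabirds", "Meerkat", "Tropical-Gunshots",
]
supervised = [0.99, 0.83, 0.96, 0.88, 0.79, 0.30, 0.98, 0.94, 0.64]
clap = {
    "CLAP-HTS-AT (450 K)": [0.95, 0.82, 0.96, 0.70, 0.78, 0.29, 0.96, 0.81, 0.49],
    "CLAP-HTS-AT (2.1 M)": [0.96, 0.84, 0.96, 0.79, 0.81, 0.30, 0.98, 0.87, 0.67],
}


def marker(score, baseline, tol=1e-9):
    """Return the table marker for one score (hypothetical helper)."""
    if abs(score - baseline) <= tol:
        return "-"
    return "↑" if score > baseline else "↓"


def compare(scores, baseline):
    """Return one marker per dataset (hypothetical helper)."""
    return [marker(s, b) for s, b in zip(scores, baseline)]


markers = {name: compare(scores, supervised) for name, scores in clap.items()}

# Number of datasets where zero-shot CLAP matches or beats the baseline.
at_least_baseline = {
    name: sum(m != "↓" for m in row) for name, row in markers.items()
}

for name, row in markers.items():
    cells = ", ".join(f"{d}: {m}" for d, m in zip(datasets, row))
    print(f"{name} ({at_least_baseline[name]}/{len(datasets)} at or above baseline): {cells}")
```

Under this comparison, the 450 K model matches or exceeds the supervised baseline on 1 of 9 datasets and the 2.1 M model on 6 of 9. The scores are rounded to two decimals in the table, so a "-" marker means equal at that precision only.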