Table 6 Performance comparison across different approaches
| Model | Accuracy | PPV | Sensitivity | F1 score |
|---|---|---|---|---|
| w/o speaker diarization | 63.64% | 48.12% | 63.64% | 54.80% |
| w/ pyannote [23] | 68.18% | 59.68% | 68.18% | 60.45% |
| w/ Microsoft [35] | 72.73% | 71.25% | 72.73% | 71.98% |
| w/ Google [36] | 69.14% | 71.52% | 93.22% | 80.94% |
| w/ human diarization + Mistral | 56.60% | 75.07% | 57.68% | 65.24% |
| w/ human diarization + Qwen | 49.21% | 73.51% | 43.88% | 54.95% |
| w/ human diarization + Gemini | 69.10% | 72.81% | 89.63% | 80.35% |
| w/ human diarization + GPT | 82.00% | 91.06% | 82.49% | 86.57% |
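
As a reading aid for the F1 column, the sketch below shows the standard relation between F1, PPV (precision), and sensitivity (recall): F1 is their harmonic mean. The function name `f1_from_ppv_sensitivity` is hypothetical, and how the metrics in Table 6 were actually computed is an assumption here; if they were averaged over classes, a reported F1 need not equal the harmonic mean of the reported PPV and sensitivity (the w/ pyannote row, for instance, does not).

```python
def f1_from_ppv_sensitivity(ppv: float, sensitivity: float) -> float:
    """Return the harmonic mean of PPV and sensitivity.

    Both inputs are percentages (0-100); the result is on the same scale.
    Hypothetical helper for illustration, not taken from the source.
    """
    if ppv + sensitivity == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# PPV and sensitivity copied from the "w/ Microsoft" row of Table 6.
print(round(f1_from_ppv_sensitivity(71.25, 72.73), 2))  # 71.98
```

Applied to the w/ Google row (PPV 71.52%, sensitivity 93.22%), the same formula gives 80.94%, which agrees with the tabulated F1 score.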