Table 6. Performance comparison across different approaches

From: Exploiting large language models for diagnosing autism associated language disorders and identifying distinct features

| Model | Accuracy | PPV | Sensitivity | F1 score |
|---|---|---|---|---|
| w/o speaker diarization | 63.64% | 48.12% | 63.64% | 54.80% |
| w/ pyannote [23] | 68.18% | 59.68% | 68.18% | 60.45% |
| w/ Microsoft [35] | 72.73% | 71.25% | 72.73% | 71.98% |
| w/ Google [36] | 69.14% | 71.52% | **93.22%** | 80.94% |
| w/ human diarization + Mistral | 56.60% | 75.07% | 57.68% | 65.24% |
| w/ human diarization + Qwen | 49.21% | 73.51% | 43.88% | 54.95% |
| w/ human diarization + Gemini | 69.10% | 72.81% | 89.63% | 80.35% |
| w/ human diarization + GPT | **82.00%** | **91.06%** | 82.49% | **86.57%** |

Bold values represent the best performance in each metric column.
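The table does not state how the F1 column is derived. A quick check suggests it is the harmonic mean of the PPV (precision) and sensitivity (recall) columns in the same row, F1 = 2·PPV·Sensitivity / (PPV + Sensitivity); treat this as an assumption, not the authors' stated method. The minimal sketch below recomputes F1 under that assumption. The names `ROWS`, `f1_from`, and `check_rows` and the 0.02-point tolerance are hypothetical choices for illustration. Under this assumption, seven of the eight rows reproduce to within rounding, while the pyannote row recomputes to about 63.65% rather than the listed 60.45%, which may reflect a different averaging scheme or a transcription error in the table.

```python
# Rows copied from Table 6 as (model, PPV %, sensitivity %, reported F1 %).
ROWS = [
    ("w/o speaker diarization", 48.12, 63.64, 54.80),
    ("w/ pyannote", 59.68, 68.18, 60.45),
    ("w/ Microsoft", 71.25, 72.73, 71.98),
    ("w/ Google", 71.52, 93.22, 80.94),
    ("w/ human diarization + Mistral", 75.07, 57.68, 65.24),
    ("w/ human diarization + Qwen", 73.51, 43.88, 54.95),
    ("w/ human diarization + Gemini", 72.81, 89.63, 80.35),
    ("w/ human diarization + GPT", 91.06, 82.49, 86.57),
]


def f1_from(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of PPV (precision) and sensitivity (recall).

    Assumption: this is how the table's F1 column was computed.
    """
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


def check_rows(rows, tol=0.02):
    """Return (model, recomputed F1, reported F1, matches) for each row.

    `tol` is a hypothetical tolerance in percentage points that absorbs
    the two-decimal rounding of the published values.
    """
    results = []
    for model, ppv, sens, reported in rows:
        recomputed = f1_from(ppv, sens)
        matches = abs(recomputed - reported) <= tol
        results.append((model, round(recomputed, 2), reported, matches))
    return results


if __name__ == "__main__":
    for model, recomputed, reported, ok in check_rows(ROWS):
        status = "ok" if ok else "differs"
        print(f"{model:32s} recomputed={recomputed:6.2f} "
              f"reported={reported:6.2f} {status}")
```

Using the harmonic mean of the row's own PPV and sensitivity, rather than averaging per-class F1 scores, is only what the published numbers appear to support; if the authors computed a support-weighted F1 over classes, the recomputed values would not in general match.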