Table 3 Comparison of performance metrics across models, including human performance. Statistical significance was assessed with a Student’s t-test comparing each AI model against human performance. Significance levels: *\(p<0.05\), **\(p<0.01\), ***\(p<0.001\). All values are percentages, reported as mean ± standard deviation. The highest value for each metric is shown in bold and underlined, and the second-highest in bold only.

From: AI-assisted phenotyping in a zebrafish hypophosphatasia model enables early and precise detection of skeletal alterations

| Metric | BEiT | ResNet | ViT | Human |
|---|---|---|---|---|
| AUC | \(\underline{\textbf{84.3} \pm \textbf{2.0}}\)*** | 77.0 ± 3.4*** | 71.8 ± 3.1*** | 54.6 ± 3.6 |
| Accuracy | \(\underline{\textbf{68.1} \pm \textbf{2.3}}\)*** | 58.7 ± 1.9* | 52.9 ± 3.1 | 38.0 ± 6.4 |
| F1 Score | \(\underline{\textbf{67.9} \pm \textbf{2.0}}\)*** | 58.4 ± 1.9** | 51.6 ± 3.9** | 36.8 ± 5.3 |
| Precision | \(\underline{\textbf{69.6} \pm \textbf{1.8}}\)*** | 60.5 ± 2.8*** | 55.4 ± 4.4*** | 38.2 ± 4.2 |
| TPR | \(\underline{\textbf{68.4} \pm \textbf{2.5}}\)*** | 58.4 ± 1.8** | 52.6 ± 3.4** | 37.6 ± 5.1 |
| FPR | \(\underline{\textbf{84.0} \pm \textbf{1.2}}\)*** | 79.2 ± 0.9** | 76.4 ± 1.6* | 68.9 ± 2.7 |
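
The caption states that each model was compared against human performance with a Student’s t-test and that the resulting p-values were mapped to asterisks. The sketch below shows how such a comparison could be computed from summary statistics alone (mean, standard deviation, group size). It is an illustration, not the authors' analysis code: the function names are hypothetical, and the group size `n = 5` used in the example is an assumption, since the number of runs or raters behind each cell is not given here.

```python
import math


def pooled_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's (equal-variance) two-sample t statistic from summary statistics.

    Returns the t statistic and the degrees of freedom (n_a + n_b - 2).
    """
    df = n_a + n_b - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err, df


def significance_stars(p_value):
    """Map a p-value to the asterisk notation defined in the table caption."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


if __name__ == "__main__":
    # AUC row: BEiT (84.3 ± 2.0) versus Human (54.6 ± 3.6).
    # ASSUMPTION: n = 5 per group is hypothetical; the table does not state it.
    n = 5
    t_stat, df = pooled_t_statistic(84.3, 2.0, n, 54.6, 3.6, n)
    print(f"t = {t_stat:.2f} with {df} degrees of freedom (hypothetical n = {n})")
```

Converting the t statistic into a p-value requires the cumulative distribution function of the t distribution with `df` degrees of freedom, which is available in common statistics libraries. The resulting p-value can then be passed to `significance_stars` to reproduce the notation used in the table.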