Table 4 Top 11 statistical analyses conducted in the studies
Statistical test | Definition | Number of studies |
|---|---|---|
T-Test | A statistical test used to determine if the means of two groups are significantly different from each other. It is commonly used to compare the performance of an LLM against a human benchmark54,86,89. | 17 |
Mann–Whitney U test | A non-parametric test used to compare two independent samples to assess whether their population distributions differ. It is an alternative to the T-test when the data is not normally distributed28,48. | 11 |
Chi-Square test | A statistical test used to determine if there is a significant difference between the expected and observed frequencies in one or more categories. It is commonly used to assess the goodness-of-fit between an LLM’s output and expected clinical outcomes71,90. | 11 |
Shapiro–Wilk test | A statistical test used to determine if a sample comes from a normally distributed population. It is often used to check the normality assumption for the application of other parametric tests53. | 6 |
ANOVA | A statistical test used to determine if there are any statistically significant differences between the means of two or more independent groups78,82. | 8 |
P-Value | The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It is used to judge the statistical significance of the differences observed between an LLM’s performance and a benchmark39,76. | 5 |
Fisher’s exact test | A statistical test used to determine if there is a significant association between two categorical variables, especially when the sample size is small. It is an alternative to the Chi-Square test in such cases90,91. | 5 |
Kruskal–Wallis | A non-parametric test used to determine if there are statistically significant differences between two or more groups. It is an alternative to the one-way ANOVA when the assumptions for ANOVA are not met73. | 5 |
Cohen’s Kappa | A statistical measure of inter-rater reliability, used to assess chance-corrected agreement between two raters (e.g., LLM vs human) in classifying or categorizing items79. | 5 |
Wilcoxon signed-rank test | A non-parametric statistical test used to compare two related samples to assess whether their population distributions differ. It is an alternative to the paired T-test when the data is not normally distributed92. | 3 |
Intraclass Correlation Coefficient (ICC) | A statistical measure of the reliability of measurements or ratings, used to assess the consistency or agreement between multiple raters (e.g., LLM vs human) on the same set of items93. | 3 |
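
As an illustration of one of the agreement measures in the table, the following is a minimal sketch of Cohen's kappa for two raters, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function name `cohens_kappa` and the labels are hypothetical and are not drawn from any of the reviewed studies.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for illustration only (not data from the reviewed studies).
llm_labels = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
human_labels = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

kappa = cohens_kappa(llm_labels, human_labels)
print(f"kappa = {kappa:.3f}")
```

In this example the two raters agree on 8 of 10 items (p_o = 0.80), while their label frequencies alone would produce agreement on about half the items (p_e = 0.52), so kappa is noticeably lower than the raw agreement rate.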