Table 4 Top 11 statistical analyses conducted in the studies

From: A framework for human evaluation of large language models in healthcare derived from literature review

| Statistical test | Definition | Number of studies |
| --- | --- | --- |
| T-Test | A statistical test used to determine whether the means of two groups differ significantly. It is commonly used to compare the performance of an LLM against a human benchmark [54,86,89]. | 17 |
| Mann–Whitney U test | A non-parametric test used to compare two independent samples to assess whether their population distributions differ. It is an alternative to the T-test when the data are not normally distributed [28,48]. | 11 |
| Chi-Square test | A statistical test used to determine whether there is a significant difference between the expected and observed frequencies in one or more categories. It is commonly used to assess the goodness-of-fit between an LLM’s output and expected clinical outcomes [71,90]. | 11 |
| Shapiro–Wilk test | A statistical test used to determine whether a sample comes from a normally distributed population. It is often used to check the normality assumption before applying parametric tests [53]. | 6 |
| ANOVA | A statistical test used to determine whether there are statistically significant differences between the means of two or more independent groups [78,82]. | 8 |
| P-Value | The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It is used to determine the statistical significance of the differences observed between an LLM’s performance and a benchmark [39,76]. | 5 |
| Fisher’s exact test | A statistical test used to determine whether there is a significant association between two categorical variables, especially when the sample size is small. It is an alternative to the Chi-Square test in such cases [90,91]. | 5 |
| Kruskal–Wallis | A non-parametric test used to determine whether there are statistically significant differences between two or more groups. It is an alternative to the one-way ANOVA when the assumptions for ANOVA are not met [73]. | 5 |
| Cohen’s Kappa | A statistical measure of inter-rater reliability, used to assess the agreement between two raters (e.g., LLM vs human) in classifying or categorizing items, corrected for the agreement expected by chance [79]. | 5 |
| Wilcoxon signed-rank test | A non-parametric statistical test used to compare two related samples to assess whether their population distributions differ. It is an alternative to the paired T-test when the data are not normally distributed [92]. | 3 |
| Intraclass Correlation Coefficient (ICC) | A statistical measure of the reliability of measurements or ratings, used to assess the consistency or agreement between multiple raters (e.g., LLM vs human) on the same set of items [93]. | 3 |
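
As an illustration of two of the measures defined in the table, the following minimal Python sketch computes Cohen's Kappa for LLM vs human labels and a two-sided Fisher's exact test for a 2×2 table, using only the standard library. The function names (`cohen_kappa`, `fisher_exact_2x2`) and all ratings and counts are hypothetical and invented for this example; they are not taken from the reviewed studies. The two-sided rule used here (summing all tables no more probable than the observed one) is one common convention, assumed for the sketch.

```python
from collections import Counter
from math import comb


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical ratings of ten LLM answers as acceptable ("yes") or not ("no").
llm_labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
human_labels = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
kappa = cohen_kappa(llm_labels, human_labels)
print(round(kappa, 3))  # 0.583

# Hypothetical counts: rows = LLM vs human, columns = correct vs incorrect.
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(round(p_value, 3))  # 0.486
```

In this toy example the raters agree on 8 of 10 items, but because 0.52 agreement is expected by chance, kappa is about 0.58 rather than 0.8. For real analyses, an established statistics package would normally be preferred over hand-written routines.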