npj Digital Medicine

Table 4 Top 11 statistical analyses conducted in the studies

From: A framework for human evaluation of large language models in healthcare derived from literature review

Statistical test	Definition	Number of studies
T-Test	A statistical test used to determine if the means of two groups are significantly different from each other. It is commonly used to compare the performance of an LLM against a human benchmark^54,86,89.	17
Mann–Whitney U test	A non-parametric test used to compare two independent samples to assess whether their population distributions differ. It is an alternative to the T-test when the data is not normally distributed^28,48.	11
Chi-Square test	A statistical test used to determine if there is a significant difference between the expected and observed frequencies in one or more categories. It is commonly used to assess the goodness-of-fit between an LLM’s output and expected clinical outcomes^71,90.	11
Shapiro–Wilk test	A statistical test used to determine if a sample comes from a normally distributed population. It is often used to check the normality assumption for the application of other parametric tests⁵³.	6
ANOVA	A statistical test used to determine if there are any statistically significant differences between the means of two or more independent groups^78,82.	8
P-Value	The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It is used to determine the statistical significance of the differences observed between an LLM’s performance and a benchmark^39,76.	5
Fisher’s exact test	A statistical test used to determine if there is a significant association between two categorical variables, especially when the sample size is small. It is an alternative to the Chi-Square test in such cases^90,91.	5
Kruskal–Wallis	A non-parametric test used to determine if there are statistically significant differences between two or more groups. It is an alternative to the one-way ANOVA when the assumptions for ANOVA are not met⁷³.	5
Cohen’s Kappa	A statistical measure of inter-rater reliability, used to assess the agreement between two raters (e.g., LLM vs human) in classifying or categorizing items⁷⁹.	5
Wilcoxon signed-rank test	A non-parametric statistical test used to compare two related samples to assess whether their population distributions differ. It is an alternative to the paired T-test when the data is not normally distributed⁹².	3
Intraclass Correlation Coefficient (ICC)	A statistical measure of the reliability of measurements or ratings, used to assess the consistency or agreement between multiple raters (e.g., LLM vs human) on the same set of items⁹³.	3
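
As an illustration of the Cohen’s Kappa row above, the following minimal Python sketch computes kappa for two raters who label the same items, for example an LLM and a human reviewer. It is not taken from any of the reviewed studies: the function name `cohens_kappa` and the example labels are hypothetical, and the sketch assumes unweighted kappa on nominal categories.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters over the same items.

    Hypothetical helper for illustration; assumes nominal categories.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(ratings_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label, so kappa is undefined.
        return float("nan")

    return (observed - expected) / (1.0 - expected)


# Made-up labels for ten items (assumption: not data from any study).
llm_labels = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
human_labels = ["yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]

kappa = cohens_kappa(llm_labels, human_labels)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this made-up example the raters agree on 8 of 10 items (observed agreement 0.80), while their marginal label frequencies imply a chance agreement of 0.52, so kappa is (0.80 - 0.52) / (1 - 0.52), roughly 0.58. Agreement among more than two raters would call for a different statistic, such as the ICC listed in the table.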
