Fig. 6: Decision tree for choosing the right statistical test based on type of assessment and metric.

The choice of statistical test depends on the type of data and the evaluation objective of each study. Parametric tests such as t-tests and ANOVA are appropriate when the data are approximately normally distributed and the goal is to compare means between groups (a t-test for two groups, ANOVA for three or more). Non-parametric tests such as the Mann–Whitney U test and the Kruskal–Wallis test are used when the normality assumption is not met, providing robust alternatives for comparing medians or distributions between groups for ordinal or non-normally distributed data. Chi-Square and Fisher’s Exact tests are suited to categorical data: they assess associations between variables or goodness-of-fit between observed and expected frequencies, which makes them appropriate for evaluating how well LLM-generated medical evidence fits clinical guidelines. Measures such as Cohen’s Kappa and the intraclass correlation coefficient (ICC) are used to assess inter-rater reliability; Cohen’s Kappa in particular quantifies agreement beyond what would be expected by chance, which strengthens the reliability of the evaluation results.
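The selection logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the decision tree of Fig. 6 itself. The function names `choose_test` and `cohen_kappa`, the category labels, and the parameters are hypothetical. Two rules are added as common conventions rather than taken from the text: preferring Fisher's exact test when expected cell counts are small, and splitting Kappa from ICC by rating type. The second function works through the standard Cohen's Kappa formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def choose_test(data_type, n_groups=2, normal=False, small_expected_counts=False):
    """Suggest a test family from the data type and study design.

    Labels and parameters are illustrative assumptions, not from the paper.
    """
    if data_type == "agreement_categorical":
        return "Cohen's kappa"
    if data_type == "agreement_continuous":
        return "ICC"
    if data_type == "categorical":
        # Assumed convention: Fisher's exact test for small expected counts.
        if small_expected_counts:
            return "Fisher's exact test"
        return "chi-square test"
    if data_type in ("continuous", "ordinal"):
        # Parametric tests only for continuous, approximately normal data.
        if data_type == "continuous" and normal:
            return "t-test" if n_groups == 2 else "ANOVA"
        return "Mann-Whitney U test" if n_groups == 2 else "Kruskal-Wallis test"
    raise ValueError(f"unknown data_type: {data_type!r}")


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    print(choose_test("continuous", n_groups=3, normal=True))
    print(choose_test("ordinal", n_groups=2))
    print(cohen_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"]))
```

In the last call the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is p_e = 0.5, so kappa is 0.5. For real analyses, established implementations are preferable to hand-written ones, for example `ttest_ind`, `f_oneway`, `mannwhitneyu`, `kruskal`, `chi2_contingency`, and `fisher_exact` in `scipy.stats`, and `cohen_kappa_score` in `sklearn.metrics`.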