Table 2 Evaluation metrics for healthcare chatbots

From: Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI

| User-centered metrics | Low-level metrics | Definition | Problem | Benchmark |
| --- | --- | --- | --- | --- |
| Accuracy | Intrinsic | Evaluating the linguistic quality of generated responses | Linguistic issues and irrelevant responses | OpenbookQA93, MedQA-USMLE94, QuAC95, BoolQ96, NaturalQuestions97, RAFT5, HellaSwag98, CNN99,100, XSUM101, BLiMP102, The Pile103, ICE104, TwitterAAE105, WikiFact106, NarrativeQA107 |
|  | SSI | Measuring the relevance of the generated response | Irrelevant responses | OpenAI Evals108, ParlAI109, SuperGLUE110, MMLU111, BigBench112, NarrativeQA107, OpenbookQA93, QuAC95, WikiFact106, BoolQ96, NaturalQuestions97, MedQA-USMLE94 |
|  | Robustness | Gauging the chatbot's resilience to disruptions | Lack of resilience and validity | GLUE113, CoQA114, LAMBADA1, TriviaQA115, ANLI116, MNLI117, SQuAD118 |
|  | Generalization | Assessing the chatbot's performance on unfamiliar tasks | Overfitting, limited transferability, and lack of validity | TyDiQA68, PromptBench119, AdvGLUE116, TextFlint120, DDXPlus116, MGSM121 |
|  | Conciseness | Measuring the conciseness of generated responses | Wordiness and redundancy | KoLA122, AlpacaEval8, PandaLM123, GLUE-X117, EleutherAIEval5 |
|  | Up-to-dateness | Evaluating how current the generated response is | Hallucination, outdatedness, and lack of validity | WikiFact106 |
|  | Groundedness | Evaluating the factual validity of generated responses | Outdatedness, lack of reasoning, lack of validity, and hallucination | LSAT124, Dyck125, Synthetic reasoning126, WikiFact106, bAbI127, Entity matching128, Data imputation129, HumanEval130, APPS131, MATH132, GSM8K133 |
| Trustworthiness | Safety and Security | Measuring compliance of generated responses with ethical standards | Toxicity | RealToxicityPrompts134, TruthfulQA135, CivilComments49, BOLD136, BBQ137 |
|  | Privacy | Evaluating the model's handling of sensitive user information | Lack of privacy | DP-SGD138,139 |
|  | Bias | Measuring bias in generated responses toward specific populations | Lack of personalization and toxicity | CrowS-Pairs140, WinoGender13, BBQ137, TruthfulQA135, RealToxicityPrompts134, CivilComments49 |
|  | Interpretability | Assessing how interpretable generated responses are to users | Lack of reasoning and hallucination | HumanEval130, APPS131, GSM8K133, HellaSwag98, LogiQA141, WikiFact106, Synthetic reasoning126, bAbI127, Dyck125, Entity matching128, Data imputation129, MATH132 |
| Empathy | Emotional Support | Measuring the chatbot's integration of user emotions | Lack of personalization and toxicity | TruthfulQA135, CivilComments49, IMDB142, BBQ137, BOLD136, RealToxicityPrompts134 |
|  | Health Literacy | Assessing response understandability across different levels of health knowledge | Lack of empathy and personalization | ParlAI109, SuperGLUE110 |
|  | Fairness | Evaluating the chatbot's consistency, quality, and fairness across demographic groups | Lack of personalization, empathy, reliability, and toxicity | OpenAI Evals108, ETHICS143, ParlAI109, IMDB142, MoralExceptQA144, MACHIAVELLI145, BOLD136, SOCIALCHEM-101146, TruthfulQA135, BBQ137, CivilComments49, RealToxicityPrompts134 |
|  | Personalization | Gauging the level of individualization in chatbot conversations | Toxicity and lack of personalization, empathy, and reliability | RealToxicityPrompts134, BOLD136, BBQ137, IMDB142, TruthfulQA135, CivilComments49 |
| Performance | Memory Efficiency | Measuring the chatbot's memory usage | Latency and lack of usability | ANLI116, ParlAI109 |
|  | FLOPs | Assessing the chatbot's floating-point operation count | Latency and lack of usability | ANLI116, ParlAI109 |
|  | Token Limit | Assessing the chatbot's computational and memory constraints on input length | Latency and lack of usability | – |
|  | Number of Parameters | Evaluating the model's data-processing and learning capacity | Latency and lack of usability | – |
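Many of the accuracy benchmarks listed above (e.g., BoolQ, NaturalQuestions, MedQA-USMLE) score a chatbot's answers with simple string-level metrics such as normalized exact match. The sketch below illustrates that style of scoring; the function names and normalization rules are illustrative, not taken from any benchmark's official evaluation harness:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return matches / len(references)

preds = ["The flu is viral.", "yes", "Aspirin"]
refs = ["the flu is viral", "No", "aspirin"]
print(accuracy(preds, refs))  # 2 of 3 normalized matches -> ~0.667
```

Token-overlap F1 is the usual companion metric when answers are free-form spans rather than short labels.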
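The performance metrics in the last group (parameter count, FLOPs) can be estimated directly from a model's architecture. The sketch below uses a simplified decoder-only transformer formula and the common rule of thumb that a forward pass costs roughly 2 FLOPs per parameter per generated token; both are approximations (biases, layer norms, and attention-score FLOPs are ignored), and the hyperparameter values are illustrative:

```python
def transformer_param_count(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough decoder-only parameter count: token embeddings plus, per layer,
    attention projections (4*d^2) and a 4x-wide MLP (8*d^2)."""
    per_layer = 12 * d_model * d_model
    return vocab_size * d_model + n_layers * per_layer

def flops_per_token(n_params: int) -> int:
    """Rule of thumb: a forward pass costs about 2 FLOPs per parameter
    per generated token."""
    return 2 * n_params

# GPT-2-small-like configuration, purely for illustration.
n = transformer_param_count(n_layers=12, d_model=768, vocab_size=50257)
print(f"~{n / 1e6:.0f}M parameters, ~{flops_per_token(n) / 1e6:.0f} MFLOPs/token")
```

Estimates like these explain why the table pairs these metrics with latency and usability problems: FLOPs per token translate almost directly into response delay on fixed hardware.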