Table 2 Evaluation metrics for healthcare chatbots
From: Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI
| User-centered metrics | Low-level metrics | Definition | Problem | Benchmarks |
|---|---|---|---|---|
| Accuracy | Intrinsic | Evaluating the linguistic quality and relevance of generated responses | Linguistic issues and irrelevant responses | OpenbookQA93, MedQA-USMLE94, QuAC95, BoolQ96, NaturalQuestions97, RAFT5, HellaSwag98, CNN99,100, XSUM101, BLiMP102, The Pile103, ICE104, TwitterAAE105, WikiFact106, NarrativeQA107 |
| | SSI | Measuring the relevance of the generated response | Irrelevant responses | OpenAI Evals108, ParlAI109, SuperGLUE110, MMLU111, BigBench112, NarrativeQA107, OpenbookQA93, QuAC95, WikiFact106, BoolQ96, NaturalQuestions97, MedQA-USMLE94 |
| | Robustness | Gauging the chatbot’s resilience to input disruptions | Lack of resilience and validity | GLUE113, CoQA114, LAMBADA1, TriviaQA115, ANLI116, MNLI117, SQUAD118 |
| | Generalization | Assessing the chatbot’s performance on unfamiliar tasks | Overfitting, limited transferability, and lack of validity | TyDiQA68, PromptBench119, AdvGLUE116, TextFlint120, DDXPlus116, MGSM121 |
| | Conciseness | Measuring the conciseness of generated responses | Wordiness and redundancy | KoLA122, AlpacaEval8, PandaLM123, GLUE-X117, EleutherAIEval5 |
| | Up-to-dateness | Evaluating how current the generated response is | Hallucination, outdatedness, and lack of validity | WikiFact106 |
| | Groundedness | Evaluating the factual validity of generated responses | Outdatedness, lack of reasoning, lack of validity, and hallucination | LSAT124, Dyck125, Synthetic reasoning126, WikiFact106, bAbI127, Entity matching128, Data imputation129, HumanEval130, APPS131, MATH132, GSM8K133 |
| Trustworthiness | Safety and Security | Measuring compliance of generated responses with ethical standards | Toxicity | RealToxicityPrompts134, TruthfulQA135, CivilComments49, BOLD136, BBQ137 |
| | Privacy | Evaluating the model’s handling of sensitive user information | Lack of privacy | – |
| | Bias | Measuring bias in generated responses toward specific populations | Lack of personalization and toxicity | CrowS-Pairs140, WinoGender13, BBQ137, TruthfulQA135, RealToxicityPrompts134, CivilComments49 |
| | Interpretability | Assessing the interpretability of generated responses to users | Lack of reasoning and hallucination | HumanEval130, APPS131, GSM8K133, HellaSwag98, LogiQA141, WikiFact106, Synthetic reasoning126, bAbI127, Dyck125, Entity matching128, Data imputation129, MATH132 |
| Empathy | Emotional Support | Measuring the chatbot’s integration of user emotions | Lack of personalization and toxicity | TruthfulQA135, CivilComments49, IMDB142, BBQ137, BOLD136, RealToxicityPrompts134 |
| | Health Literacy | Assessing response understandability across different levels of health knowledge | Lack of empathy and personalization | – |
| | Fairness | Evaluating the chatbot’s consistency, quality, and fairness across demographic groups | Lack of personalization, empathy, reliability, and toxicity | OpenAI Evals108, ETHICS143, ParlAI109, IMDB142, MoralExceptQA144, MACHIAVELLI145, BOLD136, SOCIALCHEM-101146, TruthfulQA135, BBQ137, CivilComments49, RealToxicityPrompts134 |
| | Personalization | Gauging the level of individualization in chatbot conversations | Toxicity and lack of personalization, empathy, and reliability | RealToxicityPrompts134, BOLD136, BBQ137, IMDB142, TruthfulQA135, CivilComments49 |
| Performance | Memory Efficiency | Measuring the chatbot’s memory usage | Latency and lack of usability | – |
| | FLOP | Assessing the chatbot’s floating-point operation count | Latency and lack of usability | – |
| | Token Limit | Assessing the chatbot’s computational and memory performance | Latency and lack of usability | – |
| | Number of Parameters | Evaluating the model’s data-processing and learning capacity | Latency and lack of usability | – |
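Many of the low-level metrics above reduce to simple scoring functions over benchmark items. As a hedged illustration (this code is not from the paper), the sketch below implements a token-overlap F1, the kind of intrinsic accuracy score used by extractive QA benchmarks such as SQuAD, together with the common "~2 FLOPs per parameter per token" heuristic relevant to the Performance metrics; the function names and example values are assumptions for demonstration only.

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-overlap F1 between a generated and a reference response,
    in the style of extractive QA scoring (illustrative sketch)."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    if not gen or not ref:
        return 0.0
    # Count tokens appearing in both responses (multiset intersection).
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def flops_per_token(n_params: int) -> int:
    """Rough forward-pass cost of a dense transformer: about 2 FLOPs
    per parameter per generated token (a common scaling heuristic)."""
    return 2 * n_params

# Hypothetical responses, chosen only to exercise the functions.
print(round(token_f1("take aspirin with food", "take aspirin after food"), 2))  # → 0.75
print(flops_per_token(7_000_000_000))  # → 14000000000
```

In practice, benchmark harnesses aggregate such per-item scores over the full test set; single-pair scores like the one above are only the building block.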