Table 2 Evaluation metrics for healthcare chatbots
From: Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI
| User-centered metrics | Low-level metrics | Definition | Problem | Benchmarks |
|---|---|---|---|---|
| Accuracy | Intrinsic | Evaluating the linguistic quality and relevance of generated responses | Linguistic issues and irrelevant responses | OpenbookQA93, MedQA-USMLE94, QuAC95, BoolQ96, NaturalQuestions97, RAFT5, HellaSwag98, CNN99,100, XSUM101, BLiMP102, The Pile103, ICE104, TwitterAAE105, WikiFact106, NarrativeQA107 |
| | SSI | Measuring the relevance of the generated response | Irrelevant responses | OpenAI Evals108, ParlAI109, SuperGLUE110, MMLU111, BigBench112, NarrativeQA107, OpenbookQA93, QuAC95, WikiFact106, BoolQ96, NaturalQuestions97, MedQA-USMLE94 |
| | Robustness | Gauging the chatbot’s resilience to input disruptions | Lack of resilience and validity | GLUE113, CoQA114, LAMBADA1, TriviaQA115, ANLI116, MNLI117, SQUAD118 |
| | Generalization | Assessing the chatbot’s performance on unfamiliar tasks | Overfitting, limited transferability, and lack of validity | TyDiQA68, PromptBench119, AdvGLUE116, TextFlint120, DDXPlus116, MGSM121 |
| | Conciseness | Measuring the conciseness of generated responses | Wordiness and redundancy | KoLA122, AlpacaEval8, PandaLM123, GLUE-X117, EleutherAIEval5 |
| | Up-to-dateness | Evaluating how current the generated response is | Hallucination, outdatedness, and lack of validity | WikiFact106 |
| | Groundedness | Evaluating the factual validity of generated responses | Outdatedness, lack of reasoning, lack of validity, and hallucination | LSAT124, Dyck125, Synthetic reasoning126, WikiFact106, bAbI127, Entity matching128, Data imputation129, HumanEval130, APPS131, MATH132, GSM8K133 |
| Trustworthiness | Safety and Security | Measuring compliance of generated responses with ethical standards | Toxicity | RealToxicityPrompts134, TruthfulQA135, CivilComments49, BOLD136, BBQ137 |
| | Privacy | Evaluating the model’s handling of sensitive user information | Lack of privacy | – |
| | Bias | Measuring bias in generated responses toward specific populations | Lack of personalization and toxicity | CrowS-Pairs140, WinoGender13, BBQ137, TruthfulQA135, RealToxicityPrompts134, CivilComments49 |
| | Interpretability | Assessing the interpretability of generated responses to users | Lack of reasoning and hallucination | HumanEval130, APPS131, GSM8K133, HellaSwag98, LogiQA141, WikiFact106, Synthetic reasoning126, bAbI127, Dyck125, Entity matching128, Data imputation129, MATH132 |
| Empathy | Emotional Support | Measuring the chatbot’s integration of user emotions | Lack of personalization and toxicity | TruthfulQA135, CivilComments49, IMDB142, BBQ137, BOLD136, RealToxicityPrompts134 |
| | Health Literacy | Assessing response understandability across different levels of health knowledge | Lack of empathy and personalization | – |
| | Fairness | Evaluating the chatbot’s consistency, quality, and fairness across demographic groups | Lack of personalization, empathy, reliability, and toxicity | OpenAI Evals108, ETHICS143, ParlAI109, IMDB142, MoralExceptQA144, MACHIAVELLI145, BOLD136, SOCIALCHEM-101146, TruthfulQA135, BBQ137, CivilComments49, RealToxicityPrompts134 |
| | Personalization | Gauging the level of individualization in chatbot conversations | Toxicity and lack of personalization, empathy, and reliability | RealToxicityPrompts134, BOLD136, BBQ137, IMDB142, TruthfulQA135, CivilComments49 |
| Performance | Memory Efficiency | Measuring the chatbot’s memory usage | Latency and lack of usability | – |
| | FLOP | Assessing the chatbot’s floating-point operation count | Latency and lack of usability | – |
| | Token Limit | Assessing the chatbot’s computational and memory performance | Latency and lack of usability | – |
| | Number of Parameters | Evaluating the model’s data-processing and learning capacity | Latency and lack of usability | – |
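Many of the low-level metrics above reduce to simple scoring functions over benchmark items. As a hedged illustration (this code is not from the paper), the sketch below implements a token-overlap F1, the kind of intrinsic accuracy score used by extractive QA benchmarks such as SQuAD, together with the common "~2 FLOPs per parameter per token" heuristic relevant to the Performance metrics; the function names and example values are assumptions for demonstration only.

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-overlap F1 between a generated and a reference response,
    in the style of extractive QA scoring (illustrative sketch)."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    if not gen or not ref:
        return 0.0
    # Count tokens appearing in both responses (multiset intersection).
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def flops_per_token(n_params: int) -> int:
    """Rough forward-pass cost of a dense transformer: about 2 FLOPs
    per parameter per generated token (a common scaling heuristic)."""
    return 2 * n_params

# Hypothetical responses, chosen only to exercise the functions.
print(round(token_f1("take aspirin with food", "take aspirin after food"), 2))  # → 0.75
print(flops_per_token(7_000_000_000))  # → 14000000000
```

In practice, benchmark harnesses aggregate such per-item scores over the full test set; single-pair scores like the one above are only the building block.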