Table 1 Evaluation dimensions and indicators of AI agents in healthcare

Dimension	Primary indicator	Representative metrics	Operational example (typical agent)
Basic indicators	Objective correctness	Accuracy, Precision, Recall, F1-score, ROC-AUC: measure the correctness of the model’s predictions	ClinicalAgent⁶², MedAgents³⁸
	Semantic correctness	BLEU, ROUGE, METEOR, BERTScore: assess the semantic correctness of the model’s output	MedAide⁴⁰, MedReAct’N’MedReFlex⁴³
	Task completion	Completion rate, success rate, (tool use): examine how well the model achieves a specific medical task	Agent for oncology⁴¹, MMedAgent³⁴
Developmental indicators	Efficiency level	Response time, number of interaction rounds: capture how quickly the agent responds and how many interaction rounds it needs	DiagGPT⁸¹, MDAgents³⁹
	Content & presentation quality	Richness, usefulness, safety, ethical compliance, readability, coherence: ensure that the output meets requirements for text quality and content value	Polaris⁵⁵, CheXagent⁴⁴
	Humanistic care	Humanistic care, confidence, adherence, satisfaction: assess the appropriateness of humanistic considerations and user acceptability in the interaction	AgentClinic⁷⁹, Chat Ella⁹⁶

The “Operational example” column lists typical AI agents for each evaluation dimension, but these agents are not limited to that dimension alone.
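
As an illustration of the "Objective correctness" and "Task completion" rows, the following minimal Python sketch computes accuracy, precision, recall and F1-score for binary labels, together with a simple completion rate. The function names (`classification_metrics`, `completion_rate`) and the toy data are hypothetical and are not taken from any of the agents cited in the table; the formulas are the standard textbook definitions.

```python
def classification_metrics(y_true, y_pred):
    """Standard binary-classification metrics (labels are 0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def completion_rate(outcomes):
    """Fraction of tasks marked as completed (True) in a list of booleans."""
    return sum(1 for done in outcomes if done) / len(outcomes) if outcomes else 0.0


# Toy data (made up for illustration only).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
rate = completion_rate([True, True, False, True])
```

Metrics such as ROC-AUC, BLEU, ROUGE or BERTScore require score distributions or reference texts and are usually computed with dedicated libraries, so they are left out of this sketch.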
