Table 1 Evaluation dimensions and indicators of AI agents in healthcare

From: AI agent in healthcare: applications, evaluations, and future directions

| Dimension | Primary indicator | Representative metrics | Operational example (typical agent) |
| --- | --- | --- | --- |
| Basic indicators | Objective correctness | Accuracy, precision, recall, F1-score, ROC-AUC: measure the correctness of the model’s predictions | ClinicalAgent [62], MedAgents [38] |
| | Semantic correctness | BLEU, ROUGE, METEOR, BERTScore: assess the semantic correctness of the model’s output | MedAide [40], MedReAct’N’MedReFlex [43] |
| | Task completion | Completion rate, success rate, (tool use): examine how well the model achieves a specific medical task | Agent for oncology [41], MMedAgent [34] |
| Developmental indicators | Efficiency level | Response time, number of interaction rounds: gauge how quickly the agent responds and how many interaction rounds it needs | DiagGPT [81], MDAgents [39] |
| | Content & presentation quality | Richness, usefulness, safety, ethical compliance, readability, coherence: ensure the output meets requirements for text quality and content value | Polaris [55], CheXagent [44] |
| | Humanistic care | Humanistic care, confidence, adherence, satisfaction: assess the appropriateness of humanistic considerations and user acceptability in the interaction | AgentClinic [79], Chat Ella [96] |

  1. The “Operational example” column lists typical AI agents for each evaluation dimension, but these agents are not limited to that dimension alone.
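
As an illustration of the "Objective correctness" and "Task completion" indicators in Table 1, the following is a minimal Python sketch of how accuracy, precision, recall, F1-score, and completion rate might be computed from logged agent outputs. The function names, the binary label encoding (1 = positive finding), and the toy data are assumptions made for this example; they do not come from the article or from any of the listed agents. ROC-AUC is omitted because it requires predicted scores rather than hard labels.

```python
from typing import Dict, Sequence


def binary_classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def completion_rate(task_completed: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent finished the assigned task."""
    if len(task_completed) == 0:
        raise ValueError("task_completed must be non-empty")
    return sum(1 for done in task_completed if done) / len(task_completed)


if __name__ == "__main__":
    # Hypothetical toy data: reference labels vs. agent predictions.
    reference = [1, 0, 1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 0, 1, 1, 0]
    print(binary_classification_metrics(reference, predicted))

    # Hypothetical per-episode outcomes for a task-completion check.
    print(completion_rate([True, True, False, True]))
```

On this toy data there are three true positives, one false positive, one false negative, and three true negatives, so accuracy, precision, recall, and F1 all equal 0.75, and three of the four episodes are completed, which also gives a completion rate of 0.75. Real evaluations would typically use multi-class or multi-label variants and report the metrics per task.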