Table 3 Summary of unified evaluation constructs

From: A scoping review of large language models for generative tasks in mental health care

| Step | Higher-order construct | Lower-order construct | Definition | Examples | Article references |
| --- | --- | --- | --- | --- | --- |
| 1 | Safety, Privacy, and Fairness | Safety | Prevent worse outcomes for the patient, provider, or health system from occurring as a result of the use of an ML algorithm. | Outcome proxy appropriateness, Data provenance, Harm control, Reducing automation bias, Critical help, Ethics, etc. | 20, 34 |
| 1 | Safety, Privacy, and Fairness | Privacy | Protect privacy according to standards like HIPAA and GDPR, ensuring user autonomy and dignity. | Data exchange, Data collection and storage, Data usage, Privacy Policy, Data protection, etc. | 35 |
| 1 | Safety, Privacy, and Fairness | Fairness and bias management | Ensure the chatbot operates with minimized and acknowledged biases to ensure fair outcomes. | Systemic Bias, Computational and Statistical Bias, Human-cognitive biases, Population bias, etc. | 20 |
| 2 | Trustworthiness and Usefulness | Beneficence | Ensure the chatbot positively impacts its intended outcomes, emphasizing measurable benefits over potential risks. | Health Outcomes, Clinical Evidence, User Behaviors, Intervention, Healthcare System, etc. | 14, 15, 16, 18, 20, 21, 22, 35 |
| 2 | Trustworthiness and Usefulness | Generalizability | Apply learned patterns to new, unseen data. | Contextual Adaptability, Novel Data Performance, etc. | 20, 34 |
| 2 | Trustworthiness and Usefulness | Reliability | Ensure that the chatbot consistently performs as intended under various conditions and maintains dependable operation over time. | Failure Prevention, Robustness, Workflow Integration, Reproducibility, Monitoring, Up-to-dateness, etc. | 19, 48 |
| 2 | Trustworthiness and Usefulness | Validity | Ensure the chatbot performs as expected in real-world conditions. | Data Relevance and Credibility, Language Understanding, Information Retrieval Accuracy, Outcome Accuracy, Task Completion, etc. | 20, 21, 26, 34 |
| 3 | Design and Operational Effectiveness | Accessibility | Ensure those involved in the chatbot’s lifecycle uphold standards of auditability and harm minimization. | Versatile access, User literacy required, User experience, User Interface Design, Simplicity/Ease of Use, etc. | 15, 16, 18, 20, 21, 26, 28, 32, 35 |
| 3 | Design and Operational Effectiveness | Personalized Engagement | Tailor responses based on patient data and preferences. | Personalization, Anthropomorphism/relationship, User Adherence, Feedback Incorporation, Progress awareness, etc. | 18, 20, 23, 31, 32, 33, 34, 35 |
| 3 | Design and Operational Effectiveness | Cost-Effectiveness | Assess whether the chatbot delivers beneficial outcomes at a reasonable cost, providing a better or more economical solution compared to existing methods. | Comparative Effectiveness, Economical Viability, Environmental Viability, Task Efficiency, Workflow Considerations, etc. | 20, 26, 34 |

  1. Table 3 summarizes the mapped primary and second-level constructs across the reviewed studies. Examples of sub-constructs are included for each mapped second-level construct to help readers interpret it. Further details of the evaluation subjects, evaluation methods, sample sizes, scale names, original constructs, mapped second-level constructs, and levels associated with each article can be found in Supplementary Table 3. Practical evaluation questions related to each construct can be found in the original article.
  2. Constructs have been mapped to the second level to avoid excessive scarcity and granularity.