Fig. 2: Semi-automated pipeline for LLM and LVLM evaluation.
From: Evaluating the performance of large language & visual-language models in cervical cytology screening

Questions and system prompts are fed into each model, the answers are collected and formatted, and finally the answers are assessed using statistical metrics, LLM-based metrics, and expert evaluation.
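
The flow in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `run_pipeline`, `format_answer`, and `stub_model` are hypothetical, each model is assumed to be a plain callable that takes a system prompt and a question and returns text, and exact-match accuracy stands in for the statistical metrics. The LLM-based metrics and expert evaluation are outside the sketch.

```python
from typing import Callable, Dict, List

# Assumption: a "model" is any callable (system_prompt, question) -> raw answer text.
Model = Callable[[str, str], str]


def format_answer(raw: str) -> str:
    """Normalize a raw reply so answers can be compared (trim whitespace, lowercase)."""
    return raw.strip().lower()


def run_pipeline(
    models: Dict[str, Model],
    questions: List[str],
    references: List[str],
    system_prompt: str,
) -> Dict[str, float]:
    """Feed every question to every model, format the answers, and score them.

    Returns exact-match accuracy per model, used here as a stand-in
    for the statistical metrics mentioned in the caption.
    """
    if len(questions) != len(references):
        raise ValueError("questions and references must have the same length")
    if not questions:
        raise ValueError("at least one question is required")

    scores: Dict[str, float] = {}
    for name, model in models.items():
        # Step 1 and 2: query the model, then collect and format its answers.
        answers = [format_answer(model(system_prompt, q)) for q in questions]
        # Step 3: compare formatted answers with formatted reference answers.
        correct = sum(
            a == format_answer(r) for a, r in zip(answers, references)
        )
        scores[name] = correct / len(questions)
    return scores


def stub_model(system_prompt: str, question: str) -> str:
    """Hypothetical stand-in for a real model call; always gives the same reply."""
    return "  HSIL \n"
```

In practice each entry in `models` would wrap a call to a real LLM or LVLM, and the formatted answers would also be passed on to the LLM-based scoring and the expert reviewers.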