Table 1 A brief overview of intrinsic metrics for LLMs
From: Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI
Name | Focus | Measure | Model |
---|---|---|---|
BLEU | General | Calculates precision based on the number of shared n-grams (n consecutive words) between the reference and generated text. | |
ROUGE | General | Calculates the F1-score based on the number of shared n-grams between the reference and generated text. | |
Perplexity | General | Measures the likelihood of the model generating the reference text. | |
BERTScore | General | Builds a similarity matrix between the reference and generated text and computes the weighted sum of the maximum similarities in the matrix. | BERT, RoBERTa [20] |
METEOR | General | Calculates an F1-score (with more weight on recall) based on the number of matched words, considering synonyms, in the reference and generated text. | GPT-3.5 [69] |
Precision | General | Divides the number of correctly generated relevant words by the total number of generated words. | |
Recall | General | Divides the number of correctly generated relevant words by the total number of relevant words. | |
F1-score | General | The harmonic mean of precision and recall. | |
TER [84] | General | Computed from the minimum number of edits required to transform the generated text into the reference text. | GPT-4 [85] |
MoverScore [86] | General | Like BERTScore, builds a similarity matrix, but allows many-to-one word relationships. | GPT-3.5 [69] |
NIST [84] | General | Similar to BLEU, but gives higher weight to more informative shared n-grams. | |
Diagnosis success rate | Dialog | The percentage of dialogues that end in a successful diagnosis. | |
Symptom recall | Dialog | The chatbot's ability to accurately inquire about relevant symptoms. | |
Average turns | Dialog | The average number of turns (interactions) between the user and the chatbot. | |
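To make the overlap arithmetic behind the first rows concrete, here is a minimal sketch of clipped n-gram precision (the BLEU-style quantity), the symmetric recall, and their harmonic-mean F1. The function names and the example sentences are our own illustration, not code from the paper or from any metric library.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n consecutive-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(reference, generated, n=1):
    """Shared n-grams (clipped to reference counts) / n-grams generated."""
    ref_counts = Counter(ngrams(reference, n))
    gen_counts = Counter(ngrams(generated, n))
    overlap = sum(min(count, ref_counts[g]) for g, count in gen_counts.items())
    total = sum(gen_counts.values())
    return overlap / total if total else 0.0

def ngram_recall(reference, generated, n=1):
    """Shared n-grams / n-grams in the reference (roles swapped)."""
    return ngram_precision(generated, reference, n)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Hypothetical reference/generated pair for illustration.
reference = "the patient reports a mild headache".split()
generated = "the patient has a headache".split()

p = ngram_precision(reference, generated)  # 4 shared unigrams / 5 generated = 0.8
r = ngram_recall(reference, generated)     # 4 shared unigrams / 6 in reference
print(p, r, f1(p, r))
```

Production metrics add refinements this sketch omits: BLEU combines several n-gram orders with a brevity penalty, ROUGE is conventionally recall-oriented, and METEOR adds synonym matching, so a library implementation should be preferred for reported results.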