Table 5 Comparison of RAG (Retrieval-Augmented Generation) results between database 1 and database 2.
From: Medical QA dialogue datasets in RAG systems performance evaluation and ChatGPT optimization
| Metric | Average RAG (D1) | Average RAG (D2) | t-statistic | p-value |
|---|---|---|---|---|
| ROUGE-1-r | 0.180514 | 0.196569 | 2.51839 | 0.012016 |
| ROUGE-2-r | 0.028287 | 0.035278 | 2.282786 | 0.022748 |
| ROUGE-L-r | 0.128812 | 0.139418 | 2.03938 | 0.041795 |
| ROUGE-1-p | 0.238408 | 0.244621 | 0.807963 | 0.419392 |
| ROUGE-2-p | 0.039806 | 0.046591 | 1.50975 | 0.131568 |
| ROUGE-L-p | 0.170528 | 0.174567 | 0.608478 | 0.543072 |
| ROUGE-1-f | 0.19621 | 0.209108 | 2.198792 | 0.028226 |
| ROUGE-2-f | 0.031278 | 0.038199 | 2.094711 | 0.036563 |
| ROUGE-L-f | 0.139866 | 0.148446 | 1.736565 | 0.082913 |
| BLEU | 0.133574 | 0.148997 | 2.563567 | 0.010572 |
| bert_score_P | 0.717491 | 0.720673 | 0.975277 | 0.329766 |
| bert_score_R | 0.688262 | 0.696385 | 2.529947 | 0.011631 |
| bert_score_F1 | 0.702098 | 0.707819 | 1.965018 | 0.049815 |
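
As a reading aid, the following minimal Python sketch flags which rows of Table 5 fall below a significance threshold and reports the relative change in the average from database 1 to database 2. The threshold of 0.05 is an assumption, since the table does not state the significance level, the test variant, or the sample size. The names `TABLE_5`, `ALPHA`, `significant_metrics`, and `relative_gain` are hypothetical and used only for illustration.

```python
# Values copied from Table 5: metric -> (average D1, average D2, p-value).
TABLE_5 = {
    "ROUGE-1-r": (0.180514, 0.196569, 0.012016),
    "ROUGE-2-r": (0.028287, 0.035278, 0.022748),
    "ROUGE-L-r": (0.128812, 0.139418, 0.041795),
    "ROUGE-1-p": (0.238408, 0.244621, 0.419392),
    "ROUGE-2-p": (0.039806, 0.046591, 0.131568),
    "ROUGE-L-p": (0.170528, 0.174567, 0.543072),
    "ROUGE-1-f": (0.19621, 0.209108, 0.028226),
    "ROUGE-2-f": (0.031278, 0.038199, 0.036563),
    "ROUGE-L-f": (0.139866, 0.148446, 0.082913),
    "BLEU": (0.133574, 0.148997, 0.010572),
    "bert_score_P": (0.717491, 0.720673, 0.329766),
    "bert_score_R": (0.688262, 0.696385, 0.011631),
    "bert_score_F1": (0.702098, 0.707819, 0.049815),
}

# Assumed significance threshold; the table itself does not specify one.
ALPHA = 0.05


def significant_metrics(table, alpha=ALPHA):
    """Return the metrics whose p-value is below alpha, in table order."""
    return [name for name, (_, _, p) in table.items() if p < alpha]


def relative_gain(table, metric):
    """Relative change of the D2 average with respect to the D1 average."""
    d1, d2, _ = table[metric]
    return (d2 - d1) / d1


if __name__ == "__main__":
    for name in significant_metrics(TABLE_5):
        gain = relative_gain(TABLE_5, name)
        print(f"{name}: relative change D1 -> D2 = {gain:+.1%}")
```

Under this assumed threshold, the recall variants of ROUGE, BLEU, and BERTScore recall and F1 are flagged, while the precision rows and ROUGE-L-f are not. In every row the database 2 average is higher than the database 1 average.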