Table 5 Comparison of RAG (Retrieval-Augmented Generation) results between database 1 and database 2.

From: Medical QA dialogue datasets in RAG systems performance evaluation and ChatGPT optimization

| Metric | Average_Rag_D1 | Average_Rag_D2 | t-statistic | p-value |
|---|---|---|---|---|
| ROUGE-1-r | 0.180514 | 0.196569 | 2.51839 | 0.012016 |
| ROUGE-2-r | 0.028287 | 0.035278 | 2.282786 | 0.022748 |
| ROUGE-L-r | 0.128812 | 0.139418 | 2.03938 | 0.041795 |
| ROUGE-1-p | 0.238408 | 0.244621 | 0.807963 | 0.419392 |
| ROUGE-2-p | 0.039806 | 0.046591 | 1.50975 | 0.131568 |
| ROUGE-L-p | 0.170528 | 0.174567 | 0.608478 | 0.543072 |
| ROUGE-1-f | 0.19621 | 0.209108 | 2.198792 | 0.028226 |
| ROUGE-2-f | 0.031278 | 0.038199 | 2.094711 | 0.036563 |
| ROUGE-L-f | 0.139866 | 0.148446 | 1.736565 | 0.082913 |
| BLEU | 0.133574 | 0.148997 | 2.563567 | 0.010572 |
| bert_score_P | 0.717491 | 0.720673 | 0.975277 | 0.329766 |
| bert_score_R | 0.688262 | 0.696385 | 2.529947 | 0.011631 |
| bert_score_F1 | 0.702098 | 0.707819 | 1.965018 | 0.049815 |
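As a reading aid, the sketch below recomputes the difference between the two database averages for each metric and flags the rows whose reported p-value falls under a significance threshold. The values are copied from Table 5. The threshold `ALPHA = 0.05` and the helper name `summarize` are assumptions made for illustration; the table itself states neither the threshold nor the type of t-test used.

```python
# Rows copied from Table 5:
# (metric, Average_Rag_D1, Average_Rag_D2, t-statistic, p-value)
ROWS = [
    ("ROUGE-1-r", 0.180514, 0.196569, 2.518390, 0.012016),
    ("ROUGE-2-r", 0.028287, 0.035278, 2.282786, 0.022748),
    ("ROUGE-L-r", 0.128812, 0.139418, 2.039380, 0.041795),
    ("ROUGE-1-p", 0.238408, 0.244621, 0.807963, 0.419392),
    ("ROUGE-2-p", 0.039806, 0.046591, 1.509750, 0.131568),
    ("ROUGE-L-p", 0.170528, 0.174567, 0.608478, 0.543072),
    ("ROUGE-1-f", 0.196210, 0.209108, 2.198792, 0.028226),
    ("ROUGE-2-f", 0.031278, 0.038199, 2.094711, 0.036563),
    ("ROUGE-L-f", 0.139866, 0.148446, 1.736565, 0.082913),
    ("BLEU", 0.133574, 0.148997, 2.563567, 0.010572),
    ("bert_score_P", 0.717491, 0.720673, 0.975277, 0.329766),
    ("bert_score_R", 0.688262, 0.696385, 2.529947, 0.011631),
    ("bert_score_F1", 0.702098, 0.707819, 1.965018, 0.049815),
]

ALPHA = 0.05  # assumed significance threshold; not stated in the table


def summarize(rows, alpha=ALPHA):
    """Return per-metric differences (D2 minus D1) and a significance flag."""
    summary = []
    for metric, d1, d2, _t_stat, p_value in rows:
        summary.append({
            "metric": metric,
            "abs_diff": d2 - d1,                     # absolute change
            "rel_diff_pct": 100.0 * (d2 - d1) / d1,  # change relative to D1
            "significant": p_value < alpha,
        })
    return summary


summary = summarize(ROWS)
significant = [row["metric"] for row in summary if row["significant"]]

for row in summary:
    flag = "*" if row["significant"] else " "
    print(f"{flag} {row['metric']:<14} {row['abs_diff']:+.6f} "
          f"({row['rel_diff_pct']:+.1f}%)")
```

Under the assumed 0.05 threshold, 8 of the 13 metrics are flagged: all three ROUGE recall scores, ROUGE-1-f, ROUGE-2-f, BLEU, bert_score_R, and bert_score_F1. Database 2 has the higher average on every metric, but the three ROUGE precision scores, ROUGE-L-f, and bert_score_P do not reach the threshold.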