Table 7 Retrieval strategies in title_plus_ask (GPT-4o, RAG).
From: Medical QA dialogue datasets in RAG systems performance evaluation and ChatGPT optimization
| Label | ROUGE-L F1 | Δ vs. baseline | 95% CI | p-value | BERTScore-F1 | BLEU-2 | Cosine |
|---|---|---|---|---|---|---|---|
| baseline_vector_gpt4o | 0.1491 | – | – | – | 0.7038 | 0.0300 | 0.4261 |
| hybrid_rerank_gpt4o | 0.1551 | +0.0061 | [+0.0018, +0.0104] | 0.003 | 0.7015 | 0.0325 | 0.4593 |
| hybrid_rrf_gpt4o | 0.1566 | +0.0076 | [+0.0032, +0.0121] | 0.001 | 0.7020 | 0.0319 | 0.4824 |
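
The Δ, 95% CI, and p-value columns appear to refer to ROUGE-L F1, since each reported Δ matches the difference between that row's ROUGE-L F1 and the baseline's to within four-decimal rounding. The following is a minimal sketch of that sanity check, not the authors' evaluation code. The numbers are copied from the table; the `check_row` helper, the dictionary layout, and the rounding tolerance are assumptions introduced here for illustration.

```python
# Values copied from Table 7. The structure and names below are illustrative only.
ROWS = {
    "baseline_vector_gpt4o": {"rouge_l": 0.1491},
    "hybrid_rerank_gpt4o": {"rouge_l": 0.1551, "delta": 0.0061, "ci": (0.0018, 0.0104)},
    "hybrid_rrf_gpt4o": {"rouge_l": 0.1566, "delta": 0.0076, "ci": (0.0032, 0.0121)},
}

BASELINE = ROWS["baseline_vector_gpt4o"]["rouge_l"]

# Assumed tolerance: three numbers, each rounded to four decimals, can
# together drift by up to 1.5e-4; a small epsilon absorbs float error.
ROUNDING_TOLERANCE = 1.5e-4 + 1e-9


def check_row(label):
    """Compare a reported delta with the one recomputed from ROUGE-L F1."""
    row = ROWS[label]
    recomputed = row["rouge_l"] - BASELINE
    ci_low, ci_high = row["ci"]
    return {
        "recomputed_delta": recomputed,
        # True when reported and recomputed deltas agree within rounding.
        "consistent": abs(recomputed - row["delta"]) <= ROUNDING_TOLERANCE,
        # A 95% CI that excludes zero agrees with p < 0.05.
        "ci_excludes_zero": ci_low > 0 or ci_high < 0,
        # Reported gain as a fraction of the baseline ROUGE-L F1.
        "relative_gain": row["delta"] / BASELINE,
    }


if __name__ == "__main__":
    for label in ("hybrid_rerank_gpt4o", "hybrid_rrf_gpt4o"):
        result = check_row(label)
        print(label, result)
```

Under these assumptions, both hybrid rows are consistent with the baseline and both intervals exclude zero. The relative gains are roughly 4% and 5% of the baseline ROUGE-L F1, so the improvements are statistically detectable but small in absolute terms.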