Table 4 Retrieval performance metrics for different models at K = 5 and K = 10.

From: A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation

| K | Method | Overall Recall | Overall MAP | Overall NDCG | Multiple Recall | Multiple MAP | Multiple NDCG | Single Recall | Single MAP | Single NDCG |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Native RAG | 54.6 ± 1.1 | 52.5 ± 0.9 | 62.9 ± 1.3 | 26.9 ± 1.5 | 42.5 ± 1.8 | 49.5 ± 2.1 | 70.6 ± 1.0 | 58.3 ± 0.8 | 70.6 ± 1.0 |
| 5 | Temporal Filter | 49.0 ± 1.4 | 45.1 ± 1.2 | 56.8 ± 1.5 | 15.5 ± 1.9 | 23.2 ± 2.2 | 30.8 ± 2.5 | 68.5 ± 1.2 | 57.8 ± 1.1 | 70.9 ± 0.9 |
| 5 | Query Rewrite | 55.7† ± 1.0 | 53.3† ± 0.8 | 64.0† ± 1.1 | 29.0† ± 1.4 | 44.2† ± 1.6 | 51.6† ± 1.9 | 71.3† ± 0.9 | 58.6 ± 0.7 | 71.3† ± 0.9 |
| 5 | Query Decomposition | 61.9† ± 0.9 | 56.4† ± 0.7 | 71.8† ± 1.0 | 45.2† ± 1.2 | 52.4† ± 1.4 | 72.1† ± 1.5 | 71.6† ± 0.9 | 58.6 ± 0.7 | 71.6† ± 0.9 |
| 10 | Native RAG | 61.8 ± 1.0 | 53.5 ± 0.9 | 71.8 ± 1.1 | 35.6 ± 1.6 | 43.7 ± 1.7 | 62.6 ± 1.9 | 77.1 ± 0.9 | 59.1 ± 0.8 | 77.1 ± 0.9 |
| 10 | Temporal Filter | 51.7 ± 1.3 | 43.7 ± 1.3 | 60.7 ± 1.6 | 17.5 ± 2.0 | 20.6 ± 2.4 | 35.5 ± 2.7 | 71.6 ± 1.1 | 57.2 ± 1.2 | 74.3 ± 1.0 |
| 10 | Query Rewrite | 62.6† ± 0.9 | 53.8 ± 0.8 | 72.3† ± 1.0 | 36.7† ± 1.5 | 44.3† ± 1.6 | 63.2† ± 1.8 | 77.7† ± 0.8 | 59.3 ± 0.7 | 77.7† ± 0.8 |
| 10 | Query Decomposition | 68.2† ± 0.8 | 57.6† ± 0.7 | 78.9† ± 0.9 | 53.9† ± 1.3 | 53.1† ± 1.5 | 83.2† ± 1.4 | 76.5 ± 0.9 | 59.3 ± 0.7 | 76.5 ± 0.9 |

  1. All metrics are reported as percentages (%), given as mean ± standard deviation. The best result in each column is bolded. † indicates a statistically significant improvement over the Native RAG baseline (p < 0.05).
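
For readers less familiar with the three metrics in this table, the following is a minimal sketch of how Recall, average precision (averaged over queries to give MAP) and NDCG at a cutoff K are commonly computed for a single query with binary relevance. It is an illustration under stated assumptions, not the paper's evaluation code: the function names, the example document IDs, the binary relevance setting, and the choice to normalize average precision by min(|relevant|, K) are all assumptions, and the paper's exact definitions may differ.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked, relevant, k):
    """Average of precision values at each rank where a relevant doc appears.

    Assumption: normalized by min(|relevant|, k); other conventions exist.
    Assumes `ranked` contains no duplicate document IDs.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: two relevant documents, retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
recall = recall_at_k(ranked, relevant, 5)
ap = average_precision_at_k(ranked, relevant, 5)
ndcg = ndcg_at_k(ranked, relevant, 5)
```

Averaging these per-query values over all questions, and then over repeated runs, would yield figures in the mean ± standard deviation form used in the table, reported as percentages.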