Table 3 Performance comparison of models on the GVC dataset.

From: Argument centric causal intervention for cross document event coreference resolution

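| Model | MUC P | MUC R | MUC F1 | B\(^3\) P | B\(^3\) R | B\(^3\) F1 | CEAF\(_e\) P | CEAF\(_e\) R | CEAF\(_e\) F1 | LEA P | LEA R | LEA F1 | CoNLL F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CDLM-based | | | | | | | | | | | | | |
| Caciularu2021 | 90.4 | 85.0 | 87.6 | 83.8 | 80.8 | 82.3 | 63.5 | 74.7 | 68.6 | - | - | - | 79.5 |
| LLM-based | | | | | | | | | | | | | |
| Llama | 84.3 | 93.9 | 88.8 | 38.1 | 89.5 | 53.4 | 54.9 | 28.9 | 37.9 | - | - | - | 60.0 |
| GPT-3.5-Turbo | 81.9 | 88.6 | 85.1 | 35.4 | 82.6 | 49.6 | 41.1 | 27.1 | 32.7 | - | - | - | 55.8 |
| Nath2024 | 94.2 | 91.6 | 92.9 | 82.1 | 86.7 | 84.3 | 68.1 | 75.8 | 71.7 | - | - | - | 83.0 |
| Graph-based | | | | | | | | | | | | | |
| Chen2025 | 91.4 | 92.6 | 92.0 | 81.4 | 89.4 | 85.2 | 81.9 | 78.7 | 80.3 | - | - | - | 85.8 |
| Pairwise-based | | | | | | | | | | | | | |
| Barhom2019 | - | - | - | 66.0 | 81.0 | 72.7 | - | - | - | - | - | - | - |
| Held2021 | 91.2 | 91.8 | 91.5 | 83.8 | 82.2 | 83.0 | 77.9 | 75.5 | 76.7 | 82.3 | 79.0 | 80.6 | 83.7 |
| Yu2022 | 88.5 | 92.9 | 90.6 | 80.3 | 82.1 | 81.2 | 71.8 | 79.5 | 75.5 | - | - | - | 82.4 |
| Ahmed2023 | 91.1 | 84.0 | 87.4 | 76.4 | 79.0 | 77.7 | 52.5 | 69.6 | 59.9 | 63.9 | 74.1 | 68.6 | 75.0 |
| Ding2024 | 92.1 | 90.4 | 91.3 | 86.8 | 84.8 | 85.8 | 73.2 | 78.9 | 76.0 | - | - | - | 84.4 |
| ACCI | 92.8 | 92.4 | 92.6 | 85.0 | 88.3 | 86.6 | 75.6 | 77.2 | 76.4 | 84.0 | 79.6 | 81.7 | 85.2 |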

Note: A dash (-) indicates that performance scores were not reported for the corresponding evaluation metric.
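
The table does not say how its columns relate to one another. Under the standard definitions, each F1 is the harmonic mean of P and R, and CoNLL F1 is the unweighted mean of the MUC, B\(^3\), and CEAF\(_e\) F1 scores. The sketch below checks this against the ACCI row. The function names `f1` and `conll_f1` are illustrative and not taken from the paper. Because the published values are rounded to one decimal, recomputed figures for other rows may differ in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """CoNLL F1: unweighted mean of the MUC, B^3 and CEAF_e F1 scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# (P, R) pairs copied from the ACCI row of the table.
acci = {"MUC": (92.8, 92.4), "B3": (85.0, 88.3), "CEAF_e": (75.6, 77.2)}

# Recompute each F1, rounded to one decimal as in the table.
acci_f1 = {metric: round(f1(p, r), 1) for metric, (p, r) in acci.items()}

# Average the three rounded F1 scores to obtain the CoNLL F1.
acci_conll = round(conll_f1(acci_f1["MUC"], acci_f1["B3"], acci_f1["CEAF_e"]), 1)

print(acci_f1)     # {'MUC': 92.6, 'B3': 86.6, 'CEAF_e': 76.4}
print(acci_conll)  # 85.2
```

The recomputed values match the F1 and CoNLL F1 entries reported for ACCI. LEA is reported separately and does not enter the CoNLL average.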