Table 3 Performance of different models on datasets with various labeled imbalance ratios γ

From: Unveiling the power of language models in chemical research question answering

Settings (labeled/unlabeled training cases, labeled imbalance ratio γ):

- Setting 1: 500/40k, γ = 5
- Setting 2: 2k/20k, γ = 23
- Setting 3: 2k/40k, γ = 23
- Setting 4: 4k/40k, γ = 48

| Model           | Setting 1 Accuracy | Setting 1 F1 | Setting 2 Accuracy | Setting 2 F1 | Setting 3 Accuracy | Setting 3 F1 | Setting 4 Accuracy | Setting 4 F1 |
|-----------------|--------------------|--------------|--------------------|--------------|--------------------|--------------|--------------------|--------------|
| Supervised      | 66.84              | 66.71        | 69.80              | 68.57        | 69.80              | 68.57        | 70.62              | 68.59        |
| PubMedQA        | 67.56              | 67.30        | 71.20              | 69.37        | 72.12              | 69.45        | 72.30              | 67.72        |
| FixMatch        | 67.64              | 64.74        | 71.40              | 69.46        | 72.34              | 69.14        | 72.98              | 68.96        |
| SoftMatch       | 70.16              | 67.38        | 71.53              | 69.71        | 72.24              | 69.75        | 73.54              | 68.99        |
| FreeMatch       | 69.56              | 66.42        | 72.14              | 70.23        | 72.60              | 69.72        | 72.68              | 68.13        |
| ChemMatch       | 71.36              | 68.55        | 73.12              | 70.84        | 73.84              | 70.93        | 74.28              | 71.06        |
| Improvement (%) | +2.59%             | +3.20%       | +1.36%             | +0.87%       | +1.71%             | +1.74%       | +2.20%             | +4.30%       |

  1. The numbers in brackets are the numbers of supervised and unsupervised cases in the training set, respectively. Numbers in bold denote significant improvements over the FreeMatch baseline, as determined by a two-tailed paired t-test with a p-value < 0.05; this notation is used consistently throughout the tables. The improvement percentage is computed relative to the overall best baseline, FreeMatch.
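
The two quantities in the footnote, the relative improvement over FreeMatch and the two-tailed paired t-test, are straightforward to reproduce. Below is a minimal Python sketch: the per-setting accuracy means are taken directly from Table 3, while the per-run scores fed to the t-test (`freematch_runs`, `chemmatch_runs`) are hypothetical placeholders, since the individual run results behind the significance test are not reported in this section.

```python
# Sketch: improvement percentages and the paired t-test from Table 3's footnote.
from scipy import stats

# Mean accuracies from Table 3 (Settings 1-4).
freematch_acc = [69.56, 72.14, 72.60, 72.68]
chemmatch_acc = [71.36, 73.12, 73.84, 74.28]

# Relative improvement over the best baseline, FreeMatch:
# (ChemMatch - FreeMatch) / FreeMatch * 100.
for i, (base, ours) in enumerate(zip(freematch_acc, chemmatch_acc), start=1):
    print(f"Setting {i}: +{(ours - base) / base * 100:.2f}%")
# -> +2.59%, +1.36%, +1.71%, +2.20%, matching the table's improvement row.

# Two-tailed paired t-test, as referenced in the footnote.
# The per-run scores below are hypothetical placeholders (e.g., repeated
# training runs on one setting); real values would come from the experiments.
freematch_runs = [69.2, 69.7, 69.5, 69.8, 69.6]
chemmatch_runs = [71.1, 71.5, 71.3, 71.6, 71.3]
t_stat, p_value = stats.ttest_rel(chemmatch_runs, freematch_runs)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # "significant" here means p < 0.05
```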