Table 2 Accuracy of LLM-only and LLM-assisted risk-of-bias assessments

From: Language models for data extraction and risk of bias assessment in complementary medicine

| Domains of ROB assessment | LLM-only: No. of correct assessments | LLM-only: No. of total assessments | LLM-only: Correct assessment rate (95% CI) | LLM-assisted: Mean No. of correct assessments per reviewer | LLM-assisted: Mean No. of total assessments per reviewer | LLM-assisted: Correct assessment rate (95% CI) | Rate difference between LLM-assisted and LLM-only ROB assessments (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sequence generation | 94 | 107 | 87.85% (80.12% to 93.37%) | 103 | 107 | 96.26% (90.70% to 98.97%) | 8.41% (1.25% to 15.57%) |
| Allocation sequence concealment | 103 | 107 | 96.26% (90.70% to 98.97%) | 105 | 107 | 98.13% (93.41% to 99.77%) | 1.87% (-2.55% to 6.29%) |
| Blinding: patients | 102 | 107 | 95.33% (89.43% to 98.47%) | 104 | 107 | 97.20% (92.02% to 99.42%) | 1.87% (-3.21% to 6.95%) |
| Blinding: healthcare providers | 103 | 107 | 96.26% (90.70% to 98.97%) | 105 | 107 | 98.13% (93.41% to 99.77%) | 1.87% (-2.55% to 6.29%) |
| Blinding: data collectors | 101 | 107 | 94.39% (88.19% to 97.91%) | 103 | 107 | 96.26% (90.70% to 98.97%) | 1.87% (-3.78% to 7.52%) |
| Blinding: outcome assessors | 103 | 107 | 96.26% (90.70% to 98.97%) | 103 | 107 | 96.26% (90.70% to 98.97%) | 0.00% (-5.08% to 5.08%) |
| Blinding: data analysts | 103 | 107 | 96.26% (90.70% to 98.97%) | 104 | 107 | 97.20% (92.02% to 99.42%) | 0.93% (-3.83% to 5.70%) |
| Missing outcome data | 107 | 107 | 100.00% (96.61% to 100.00%) | 106 | 107 | 99.07% (94.90% to 99.98%) | -0.93% (-3.49% to 1.62%) |
| Selective outcome reporting | 106 | 107 | 99.07% (94.90% to 99.98%) | 106 | 107 | 99.07% (94.90% to 99.98%) | 0.00% (-2.58% to 2.58%) |
| Other bias | 102 | 107 | 95.33% (89.43% to 98.47%) | 103 | 107 | 96.26% (90.70% to 98.97%) | 0.93% (-4.44% to 6.31%) |
| Overall | 1024 | 1070 | 95.70% (94.31% to 96.84%) | 1041 | 1070 | 97.29% (96.13% to 98.18%) | 1.59% (0.03% to 3.15%) |
LLM: large language model.
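
The table does not state how its intervals were computed. The sketch below is one plausible reconstruction, not the authors' documented method. It assumes an exact (Clopper-Pearson) interval for each correct assessment rate and a simple Wald interval for the rate difference. The function names `binom_upper_tail`, `clopper_pearson`, and `wald_difference` are hypothetical helpers introduced only for illustration.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return total


def _solve(k, n, target, iters=50):
    """Bisect for p in (0, 1) with P(X >= k) = target (the tail increases with p)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_upper_tail(k, n, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided interval for a proportion k/n (assumed method)."""
    lower = 0.0 if k == 0 else _solve(k, n, alpha / 2)
    # P(X <= k) = alpha/2 is the same condition as P(X >= k + 1) = 1 - alpha/2.
    upper = 1.0 if k == n else _solve(k + 1, n, 1 - alpha / 2)
    return lower, upper


def wald_difference(k1, n1, k2, n2, z=Z_95):
    """Difference p2 - p1 with an unpooled Wald interval (assumed method)."""
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Sequence generation row: LLM-only 94/107, LLM-assisted 103/107.
    print(clopper_pearson(94, 107))
    print(wald_difference(94, 107, 103, 107))
    # Overall row: 1024/1070 versus 1041/1070.
    print(wald_difference(1024, 1070, 1041, 1070))
```

Under these assumptions, the exact lower limit for 107 of 107 correct is 0.025^(1/107), about 96.61%, which agrees with the Missing outcome data row. The Wald formula agrees with the Sequence generation and Overall rate differences to two decimal places. It does not reproduce the Missing outcome data rate difference, where one rate is 100%, so the original analysis may have handled that case differently.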