Table 1 Fleiss Kappa of different prompts in different models

From: Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

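| Model | Prompt | Fleiss Kappa | 95% CI lower | 95% CI upper |
|---|---|---|---|---|
| gpt-4-Web | IO | 0.525 | 0.523 | 0.527 |
| gpt-4-Web | 0-COT | 0.450 | 0.448 | 0.452 |
| gpt-4-Web | P-COT | 0.334 | 0.332 | 0.337 |
| gpt-4-Web | ROT | 0.467 | 0.465 | 0.470 |
| gpt-4-API | IO | 0.288 | 0.286 | 0.290 |
| gpt-4-API | 0-COT | 0.067 | 0.065 | 0.069 |
| gpt-4-API | P-COT | 0.331 | 0.330 | 0.333 |
| gpt-4-API | ROT | 0.205 | 0.203 | 0.206 |
| gpt-4-API-0 | IO | 0.525 | 0.523 | 0.526 |
| gpt-4-API-0 | 0-COT | 0.285 | 0.283 | 0.287 |
| gpt-4-API-0 | P-COT | 0.660 | 0.658 | 0.661 |
| gpt-4-API-0 | ROT | 0.451 | 0.449 | 0.453 |
| Bard | IO | 0.374 | 0.372 | 0.376 |
| Bard | 0-COT | 0.355 | 0.353 | 0.357 |
| Bard | P-COT | 0.323 | 0.321 | 0.326 |
| Bard | ROT | 0.180 | 0.178 | 0.182 |
| gpt-3.5-Web | IO | 0.409 | 0.407 | 0.411 |
| gpt-3.5-Web | 0-COT | −0.002 | −0.004 | 0.000 |
| gpt-3.5-Web | P-COT | 0.276 | 0.274 | 0.278 |
| gpt-3.5-Web | ROT | 0.016 | 0.014 | 0.018 |
| gpt-3.5-API | IO | 0.188 | 0.186 | 0.190 |
| gpt-3.5-API | 0-COT | 0.004 | 0.002 | 0.006 |
| gpt-3.5-API | P-COT | 0.031 | 0.029 | 0.033 |
| gpt-3.5-API | ROT | 0.014 | 0.012 | 0.016 |
| gpt-3.5-API-0 | IO | 0.984 | 0.983 | 0.986 |
| gpt-3.5-API-0 | 0-COT | 0.461 | 0.459 | 0.464 |
| gpt-3.5-API-0 | P-COT | 0.533 | 0.531 | 0.535 |
| gpt-3.5-API-0 | ROT | 0.581 | 0.578 | 0.583 |
| gpt-3.5-ft | IO | 0.162 | 0.160 | 0.164 |
| gpt-3.5-ft | 0-COT | 0.021 | 0.020 | 0.023 |
| gpt-3.5-ft | P-COT | 0.065 | 0.063 | 0.067 |
| gpt-3.5-ft | ROT | 0.033 | 0.032 | 0.035 |
| gpt-3.5-ft-0 | IO | 0.982 | 0.980 | 0.984 |
| gpt-3.5-ft-0 | 0-COT | 0.412 | 0.410 | 0.414 |
| gpt-3.5-ft-0 | P-COT | 0.355 | 0.353 | 0.356 |
| gpt-3.5-ft-0 | ROT | 0.398 | 0.396 | 0.400 |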
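
For readers who want to see how a Fleiss kappa value of the kind reported in Table 1 is obtained, the following is a minimal sketch of the standard formula, not the authors' code. It assumes (an assumption, not stated in the table) that each repeated response to the same question is treated as one "rater" and that every question receives the same number of responses. The function name `fleiss_kappa` and the toy counts are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a subjects x categories matrix of rating counts.

    counts[i][j] is the number of raters (here assumed to be repeated
    model responses) that assigned subject i to category j. Every row
    must sum to the same number of raters.
    """
    n_subjects = len(counts)
    if n_subjects == 0:
        raise ValueError("need at least one subject")

    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per subject")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject must have the same number of ratings")

    n_categories = len(counts[0])
    total = n_subjects * n_raters

    # Proportion of all ratings that fall in each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement for each subject.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_subjects    # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance

    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical toy data: 4 questions, 5 repeated responses each,
    # 3 possible rating categories. Not taken from the study.
    toy = [
        [5, 0, 0],
        [0, 5, 0],
        [3, 2, 0],
        [0, 0, 5],
    ]
    print(round(fleiss_kappa(toy), 3))
```

A value near 1 indicates that repeated responses almost always land in the same category, a value near 0 indicates agreement no better than chance, and slightly negative values (as for gpt-3.5-Web with 0-COT) indicate agreement marginally below chance.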