Table 3 Comparison of zero-shot performance between the Me-LLaMA models and their LLaMA2 backbone models

From: Medical foundation large language models for comprehensive text analysis and beyond

| Dataset | Metric | LLaMA2 13B (backbone) | Me-LLaMA 13B (backbone + pre-train only) | LLaMA2 13B-instruct (backbone + instruction tuning only) | Me-LLaMA-13B-chat (backbone + pre-train + instruction tuning) | LLaMA2 70B (backbone) | Me-LLaMA 70B (backbone + pre-train only) | LLaMA2 70B-instruct (backbone + instruction tuning only) | Me-LLaMA-70B-chat (backbone + pre-train + instruction tuning) |
|---|---|---|---|---|---|---|---|---|---|
| PubMedQA | Acc | 0.216 | 0.266 | 0.436 | 0.700 | 0.132 | 0.682 | 0.764 | 0.768 |
| PubMedQA | Macro-F1 | 0.177 | 0.250 | 0.416 | 0.504 | 0.152 | 0.520 | 0.531 | 0.557 |
| MedQA | Acc | 0.000 | 0.000 | 0.013 | 0.427 | 0.005 | 0.281 | 0.499 | 0.523 |
| MedQA | Macro-F1 | 0.000 | 0.000 | 0.024 | 0.422 | 0.009 | 0.350 | 0.493 | 0.521 |
| MedMCQA | Acc | 0.003 | 0.003 | 0.014 | 0.449 | 0.012 | 0.447 | 0.501 | 0.539 |
| MedMCQA | Macro-F1 | 0.006 | 0.005 | 0.029 | 0.440 | 0.024 | 0.396 | 0.493 | 0.538 |
| EmrQA | Acc | 0.000 | 0.005 | 0.050 | 0.048 | 0.000 | 0.021 | 0.181 | 0.119 |
| EmrQA | F1 | 0.038 | 0.122 | 0.286 | 0.307 | 0.000 | 0.172 | 0.399 | 0.346 |
| i2b2 | Macro-F1 | 0.008 | 0.030 | 0.232 | 0.263 | 0.181 | 0.224 | 0.245 | 0.329 |
| DDI | Macro-F1 | 0.035 | 0.036 | 0.164 | 0.214 | 0.034 | 0.118 | 0.121 | 0.283 |
| HoC | Macro-F1 | 0.253 | 0.210 | 0.194 | 0.335 | 0.255 | 0.252 | 0.563 | 0.544 |
| MTsample | Macro-F1 | 0.042 | 0.072 | 0.176 | 0.229 | 0.066 | 0.226 | 0.364 | 0.384 |
| PubMed | R-L | 0.170 | 0.168 | 0.183 | 0.116 | 0.167 | 0.119 | 0.112 | 0.169 |
| PubMed | BERTS | 0.654 | 0.654 | 0.667 | 0.445 | 0.654 | 0.654 | 0.601 | 0.678 |
| MIMIC-CXR | R-L | 0.051 | 0.172 | 0.360 | 0.400 | 0.059 | 0.137 | 0.367 | 0.418 |
| MIMIC-CXR | BERTS | 0.566 | 0.697 | 0.791 | 0.797 | 0.577 | 0.649 | 0.784 | 0.787 |
| BioNLI | Macro-F1 | 0.109 | 0.060 | 0.185 | 0.195 | 0.285 | 0.499 | 0.345 | 0.436 |
| MedNLI | Macro-F1 | 0.172 | 0.206 | 0.457 | 0.472 | 0.265 | 0.256 | 0.657 | 0.675 |
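
For reference, the metrics reported above fall into two groups: classification-style scores (Acc, F1, Macro-F1) for the QA, extraction, and NLI tasks, and summarization scores (R-L, i.e. ROUGE-L, and BERTS, i.e. BERTScore) for PubMed and MIMIC-CXR. The sketch below shows one conventional way such scores are computed, using scikit-learn, the rouge-score package, and bert-score; the toy labels and texts are hypothetical and this is not the paper's own evaluation pipeline.

```python
# Minimal sketch of the metric types in Table 3 (illustrative, not the authors' code).
# Assumed dependencies: scikit-learn, rouge-score, bert-score.
from sklearn.metrics import accuracy_score, f1_score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# --- Classification-style tasks (e.g., PubMedQA, MedQA): Acc and Macro-F1 ---
y_true = ["yes", "no", "maybe", "yes"]   # toy gold labels
y_pred = ["yes", "no", "no", "yes"]      # toy model outputs
print("Acc:", accuracy_score(y_true, y_pred))
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))

# --- Summarization tasks (PubMed, MIMIC-CXR): ROUGE-L (R-L) and BERTScore (BERTS) ---
reference = "No acute cardiopulmonary abnormality."   # toy reference summary
prediction = "No acute cardiopulmonary process."      # toy generated summary
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L F:", scorer.score(reference, prediction)["rougeL"].fmeasure)

# BERTScore returns precision, recall, and F1 tensors; a table would typically report F1.
P, R, F1 = bert_score([prediction], [reference], lang="en")
print("BERTScore F1:", float(F1.mean()))
```

Note that macro-F1 averages the per-class F1 scores with equal weight, which is why models whose zero-shot outputs rarely match the expected answer format can score near zero on the multi-class QA tasks.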