Table 1 The supervised fine-tuning performance of various open-source LLMs on six tasks

From: Medical foundation large language models for comprehensive text analysis and beyond

| Task | Dataset | Metric | LLaMA2 13B | PMC-LLaMA 13B | Me-LLaMA 13B | LLaMA2 70B | Meditron 70B | Me-LLaMA 70B |
|---|---|---|---|---|---|---|---|---|
| Question answering | PubMedQA | Acc | 0.800 | 0.778 | 0.802 | 0.800 | 0.800 | 0.814 |
| | | Macro-F1 | 0.560 | 0.544 | 0.562 | 0.560 | – | 0.572 |
| | MedQA | Acc | 0.467 | 0.456 | 0.493 | 0.598 | 0.607 | 0.623 |
| | | Macro-F1 | 0.465 | 0.454 | 0.487 | 0.595 | – | 0.621 |
| | MedMCQA | Acc | 0.527 | 0.548 | 0.557 | 0.626 | 0.651 | 0.643 |
| | | Macro-F1 | 0.524 | 0.545 | 0.551 | 0.625 | – | 0.640 |
| | EmrQA | Acc | 0.789 | 0.810 | 0.857 | 0.847 | 0.850 | 0.854 |
| | | F1 | 0.730 | 0.738 | 0.751 | 0.751 | 0.751 | 0.751 |
| Named entity recognition | i2b2 | Macro-F1 | 0.904 | 0.901 | 0.906 | 0.913 | 0.908 | 0.910 |
| Relation extraction | DDI | Macro-F1 | 0.622 | 0.622 | 0.559 | 0.746 | 0.737 | 0.779 |
| Classification | HoC | Macro-F1 | 0.696 | 0.422 | 0.684 | 0.818 | 0.702 | 0.841 |
| | MTsample | Macro-F1 | 0.430 | 0.345 | 0.451 | 0.458 | 0.284 | 0.544 |
| Summarization | PubMed | R-L | 0.191 | 0.091 | 0.197 | 0.211 | 0.197 | 0.209 |
| | | BERTS | 0.663 | 0.516 | 0.679 | 0.689 | 0.677 | 0.700 |
| | MIMIC-CXR | R-L | 0.437 | 0.139 | 0.453 | 0.440 | 0.458 | 0.476 |
| | | BERTS | 0.816 | 0.694 | 0.821 | 0.813 | 0.824 | 0.828 |
| Natural language inference | BioNLI | Macro-F1 | 0.409 | 0.332 | 0.447 | 0.447 | 0.444 | 0.566 |
| | MedNLI | Macro-F1 | 0.881 | 0.868 | 0.903 | 0.884 | 0.897 | 0.916 |

  1. BERTS means BERTScore (ref. 28). Acc, accuracy; R-L, ROUGE-L.
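
For reference, the sketch below shows how the four metric families in the table (Acc, Macro-F1, R-L, BERTS) are conventionally computed. This is not the authors' evaluation code; it is a minimal illustration assuming the `scikit-learn`, `rouge-score`, and `bert-score` packages, and the label/text examples in it are hypothetical.

```python
# Minimal sketch of the metrics reported in Table 1 (not the paper's own code).
# Assumes: pip install scikit-learn rouge-score bert-score
from sklearn.metrics import accuracy_score, f1_score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# --- Label-prediction tasks (QA, NER, RE, classification, NLI) ---
# Hypothetical gold labels and model outputs for a PubMedQA-style task.
y_true = ["yes", "no", "maybe", "yes"]
y_pred = ["yes", "no", "yes", "yes"]
acc = accuracy_score(y_true, y_pred)                   # "Acc" column
macro_f1 = f1_score(y_true, y_pred, average="macro")   # "Macro-F1" column

# --- Summarization tasks (PubMed, MIMIC-CXR) ---
# Hypothetical reference/candidate pair.
reference = "chest x-ray shows no acute cardiopulmonary process"
candidate = "no acute cardiopulmonary process on chest x-ray"

# ROUGE-L F-measure, the usual "R-L" value.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BERTScore matches candidate and reference token embeddings;
# its F1 component is what is typically reported as "BERTS".
p, r, f = bert_score([candidate], [reference], lang="en")
berts = f.mean().item()

print(f"Acc={acc:.3f}  Macro-F1={macro_f1:.3f}  "
      f"R-L={rouge_l:.3f}  BERTS={berts:.3f}")
```

Macro-F1 averages per-class F1 scores with equal weight, so unlike accuracy it is not dominated by the majority class; this is why the two columns can diverge on imbalanced datasets such as PubMedQA.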