Table 4 Comparison of GatorTronS with existing transformer-based LLMs for semantic textual similarity, natural language inference, and question answering.

From: A study of generative large language model for medical research and healthcare

 

                  Semantic textual     Natural language   Question answering
                  similarity           inference
                  2019 n2c2 [23]       MedNLI [24]        emrQA Medication [25]      emrQA Relation [25]
Transformer       Pearson correlation  Accuracy           F1 score    Exact Match    F1 score    Exact Match
ClinicalBERT      0.879                0.827              0.691       0.241          0.931       0.853
GatorTron, 90B    0.881                0.867              0.718       0.298          0.954       0.903
GatorTronS, 1B    0.853                0.851              0.702       0.288          0.965       0.924
GatorTronS, 5B    0.888                0.882              0.726       0.305          0.968       0.926
GatorTronS, 10B   0.893                0.886*             0.728*      0.311*         0.972       0.929*
GatorTronS, 20B   0.898*               0.885              0.726       0.307          0.973*      0.927

B: billion words of text. The best evaluation score in each column is marked with an asterisk (*).
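The best score per metric noted in the footnote can be recomputed directly from the table values. The following is a minimal Python sketch, assuming the scores are transcribed by hand from Table 4; the names `METRICS`, `SCORES`, and `best_per_metric` are illustrative and do not come from the paper.

```python
# Scores transcribed from Table 4. Column order follows the table header.
# All names here (METRICS, SCORES, best_per_metric) are illustrative assumptions.
METRICS = [
    "2019 n2c2 Pearson correlation",
    "MedNLI accuracy",
    "emrQA Medication F1",
    "emrQA Medication exact match",
    "emrQA Relation F1",
    "emrQA Relation exact match",
]

SCORES = {
    "ClinicalBERT": [0.879, 0.827, 0.691, 0.241, 0.931, 0.853],
    "GatorTron, 90B": [0.881, 0.867, 0.718, 0.298, 0.954, 0.903],
    "GatorTronS, 1B": [0.853, 0.851, 0.702, 0.288, 0.965, 0.924],
    "GatorTronS, 5B": [0.888, 0.882, 0.726, 0.305, 0.968, 0.926],
    "GatorTronS, 10B": [0.893, 0.886, 0.728, 0.311, 0.972, 0.929],
    "GatorTronS, 20B": [0.898, 0.885, 0.726, 0.307, 0.973, 0.927],
}


def best_per_metric(scores, metrics):
    """Return {metric: (model, score)} with the highest score in each column."""
    best = {}
    for i, metric in enumerate(metrics):
        # Pick the model whose i-th score is the largest.
        model = max(scores, key=lambda name: scores[name][i])
        best[metric] = (model, scores[model][i])
    return best


if __name__ == "__main__":
    for metric, (model, value) in best_per_metric(SCORES, METRICS).items():
        print(f"{metric}: {model} ({value:.3f})")
```

All six metrics in the table are higher-is-better, so a plain column maximum is sufficient here; a metric where lower is better would need `min` instead.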