Table 4 Benchmarking Ci-SSGAN against state-of-the-art large language models for automated glaucoma classification

From: Clinically informed semi-supervised learning improves disease annotation and equity from electronic health records: a glaucoma case study

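| Model     | Class   | Samples | Accuracy | F1 score | AUROC | AUCPR |
|-----------|---------|---------|----------|----------|-------|-------|
| GPT-4o    | Non-GL  | 38      | 0.474    | 0.451    | 0.689 | 0.273 |
| GPT-4o    | OAG/S   | 60      | 0.667    | 0.537    | 0.741 | 0.362 |
| GPT-4o    | ACG/S   | 49      | 0.601    | 0.625    | 0.766 | 0.461 |
| GPT-4o    | XFG/S   | 55      | 0.752    | 0.793    | 0.858 | 0.679 |
| GPT-4o    | PDG/S   | 45      | 0.801    | 0.816    | 0.883 | 0.701 |
| GPT-4o    | SGL     | 47      | 0.552    | 0.709    | 0.775 | 0.613 |
| GPT-4o    | Overall | 294     | 0.641    | 0.655    | 0.785 | 0.515 |
| Med-Gemma | Non-GL  | 38      | 0.184    | 0.187    | 0.534 | 0.141 |
| Med-Gemma | OAG/S   | 60      | 0.451    | 0.274    | 0.491 | 0.201 |
| Med-Gemma | ACG/S   | 49      | 0.122    | 0.135    | 0.492 | 0.165 |
| Med-Gemma | XFG/S   | 55      | 0.073    | 0.098    | 0.488 | 0.184 |
| Med-Gemma | PDG/S   | 45      | 0.045    | 0.062    | 0.486 | 0.151 |
| Med-Gemma | SGL     | 47      | 0.085    | 0.101    | 0.484 | 0.157 |
| Med-Gemma | Overall | 294     | 0.160    | 0.143    | 0.496 | 0.167 |
| LLaMA-3.2 | Non-GL  | 38      | 0.421    | 0.552    | 0.703 | 0.412 |
| LLaMA-3.2 | OAG/S   | 60      | 0.834    | 0.565    | 0.774 | 0.391 |
| LLaMA-3.2 | ACG/S   | 49      | 0.184    | 0.305    | 0.590 | 0.302 |
| LLaMA-3.2 | XFG/S   | 55      | 0.364    | 0.421    | 0.640 | 0.301 |
| LLaMA-3.2 | PDG/S   | 45      | 0.712    | 0.736    | 0.836 | 0.586 |
| LLaMA-3.2 | SGL     | 47      | 0.660    | 0.554    | 0.761 | 0.369 |
| LLaMA-3.2 | Overall | 294     | 0.529    | 0.522    | 0.717 | 0.394 |
| Ci-SSGAN  | Non-GL  | 38      | 0.818    | 0.750    | 0.915 | 0.722 |
| Ci-SSGAN  | OAG/S   | 60      | 0.955    | 0.857    | 0.955 | 0.863 |
| Ci-SSGAN  | ACG/S   | 49      | 1.000    | 0.970    | 0.994 | 0.962 |
| Ci-SSGAN  | XFG/S   | 55      | 0.750    | 0.828    | 0.942 | 0.880 |
| Ci-SSGAN  | PDG/S   | 45      | 0.667    | 0.800    | 0.939 | 0.835 |
| Ci-SSGAN  | SGL     | 47      | 0.853    | 0.855    | 0.949 | 0.847 |
| Ci-SSGAN  | Overall | 294     | 0.840    | 0.843    | 0.949 | 0.852 |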

  1. LLMs (GPT-4o, Med-Gemma, LLaMA-3.2) were evaluated in a zero-shot setting, without fine-tuning on our glaucoma dataset, representing realistic deployment scenarios. Ci-SSGAN was trained on our institutional data. The bold values indicate the best-performing results within each class and metric across the evaluated models.
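
The source does not state how the "Overall" rows are aggregated from the per-class rows. As a reading aid, the sketch below checks one plausible interpretation: that each Overall value is the unweighted (macro) average of the six per-class values. The numbers are transcribed from Table 4; the names `PER_CLASS`, `REPORTED_OVERALL`, `SAMPLES`, `macro_average`, `weighted_average`, and `max_deviation` are illustrative and do not come from the paper. This is a consistency check under that assumption, not a description of the authors' evaluation code.

```python
# Per-class values transcribed from Table 4, in the order
# Non-GL, OAG/S, ACG/S, XFG/S, PDG/S, SGL.
SAMPLES = [38, 60, 49, 55, 45, 47]  # sums to the reported 294

PER_CLASS = {
    "GPT-4o": {
        "Accuracy": [0.474, 0.667, 0.601, 0.752, 0.801, 0.552],
        "F1 score": [0.451, 0.537, 0.625, 0.793, 0.816, 0.709],
        "AUROC": [0.689, 0.741, 0.766, 0.858, 0.883, 0.775],
        "AUCPR": [0.273, 0.362, 0.461, 0.679, 0.701, 0.613],
    },
    "Med-Gemma": {
        "Accuracy": [0.184, 0.451, 0.122, 0.073, 0.045, 0.085],
        "F1 score": [0.187, 0.274, 0.135, 0.098, 0.062, 0.101],
        "AUROC": [0.534, 0.491, 0.492, 0.488, 0.486, 0.484],
        "AUCPR": [0.141, 0.201, 0.165, 0.184, 0.151, 0.157],
    },
    "LLaMA-3.2": {
        "Accuracy": [0.421, 0.834, 0.184, 0.364, 0.712, 0.660],
        "F1 score": [0.552, 0.565, 0.305, 0.421, 0.736, 0.554],
        "AUROC": [0.703, 0.774, 0.590, 0.640, 0.836, 0.761],
        "AUCPR": [0.412, 0.391, 0.302, 0.301, 0.586, 0.369],
    },
    "Ci-SSGAN": {
        "Accuracy": [0.818, 0.955, 1.000, 0.750, 0.667, 0.853],
        "F1 score": [0.750, 0.857, 0.970, 0.828, 0.800, 0.855],
        "AUROC": [0.915, 0.955, 0.994, 0.942, 0.939, 0.949],
        "AUCPR": [0.722, 0.863, 0.962, 0.880, 0.835, 0.847],
    },
}

# "Overall" rows as reported in Table 4.
REPORTED_OVERALL = {
    "GPT-4o": {"Accuracy": 0.641, "F1 score": 0.655, "AUROC": 0.785, "AUCPR": 0.515},
    "Med-Gemma": {"Accuracy": 0.160, "F1 score": 0.143, "AUROC": 0.496, "AUCPR": 0.167},
    "LLaMA-3.2": {"Accuracy": 0.529, "F1 score": 0.522, "AUROC": 0.717, "AUCPR": 0.394},
    "Ci-SSGAN": {"Accuracy": 0.840, "F1 score": 0.843, "AUROC": 0.949, "AUCPR": 0.852},
}


def macro_average(values):
    """Unweighted mean: every class counts equally, regardless of size."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Sample-weighted mean, shown only for contrast with the macro average."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def max_deviation():
    """Largest |macro average - reported Overall| across all models and metrics."""
    worst = 0.0
    for model, metrics in PER_CLASS.items():
        for metric, values in metrics.items():
            diff = abs(macro_average(values) - REPORTED_OVERALL[model][metric])
            worst = max(worst, diff)
    return worst


if __name__ == "__main__":
    for model, metrics in PER_CLASS.items():
        for metric, values in metrics.items():
            print(
                f"{model:10s} {metric:9s} "
                f"macro={macro_average(values):.4f} "
                f"weighted={weighted_average(values, SAMPLES):.4f} "
                f"reported={REPORTED_OVERALL[model][metric]:.3f}"
            )
    print(f"max |macro - reported| = {max_deviation():.4f}")
```

Under this assumption, every reported Overall value lies within 0.001 of the macro average of its per-class rows, which is the agreement one would expect from three-decimal rounding. A sample-weighted average fits less well in at least one case (GPT-4o accuracy), so the Overall rows probably treat the six classes equally rather than in proportion to their sample counts.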