Table 1. Summary of studies comparing the performance of large language models and classical machine learning methods in medicine using structured data.

From: Large language models versus classical machine learning performance in COVID-19 mortality prediction using high-dimensional tabular data

| First author; year | Aim and task | Dataset (sample size; #features) | Transformation techniques | Model, experiment, and metric | Zero-shot performance | Training size: performance |
|---|---|---|---|---|---|---|
| Hegselmann; 2023 (TabLLM) [10] | To transform tables to text for binary classification of coronary artery disease and diabetes | Diabetes (768; #7); Heart (918; #11) | Template; billion-parameter LLM; million-parameter LLM | TabLLM – Diabetes (AUC) | 0.82 | 32: 0.68; 512: 0.78 |
| | | | | XGBoost – Diabetes (AUC) | - | 32: 0.69; 512: 0.80 |
| | | | | TabLLM – Heart (AUC) | 0.54 | 32: 0.87; 512: 0.92 |
| | | | | XGBoost – Heart (AUC) | - | 32: 0.88; 512: 0.92 |
| Wang; 2024 (MediTab) [11] | To evaluate MediTab (GPT-3.5) on seven medical classification tasks and compare it with TabLLM and CML | Seven breast, lung, and colorectal cancer datasets from clinical trials (average 1,451; range 53–2,968; on average #3 categorical, #15 binary, #7 numerical) | BioBERT-based model fine-tuned for the transformation, with GPT-3.5 as a sanity check | MediTab (average AUC) | 0: 0.82 | 200: 0.84 |
| | | | | XGBoost (average AUC) | - | 10: 0.64; 200: 0.79 |
| Cui; 2024 (EHR-CoAgent) [12] | To investigate the efficacy of LLM-based disease prediction using structured EHR data from clinical encounters (MIMIC: acute care condition at the next hospital visit; CRADLE: CVD in diabetic patients) | MIMIC-III (11,353; #?); CRADLE (34,404; #?) | Disease, medication, and procedure codes mapped from code values to code names (+ prompt-engineering techniques) | EHR-CoAgent-GPT4 – MIMIC (accuracy; F1) | 0.79; 0.73 | - |
| | | | | GPT-4 – MIMIC (accuracy; F1) | ZSC: 0.51; 0.52 | Prompt-engineered: 0.62; 0.58 / few-shot (N = 6): 0.65; 0.64 |
| | | | | GPT-3.5 – MIMIC (accuracy; F1) | ZSC: 0.78; 0.68 | Prompt-engineered: 0.72; 0.42 / few-shot (N = 6): 0.76; 0.63 |
| | | | | RF – MIMIC (accuracy; F1) | - | N = 6: 0.69; 0.63 / 11,353: 0.78; 0.70 |
| | | | | LR – MIMIC (accuracy; F1) | - | N = 6: 0.48; 0.56 / 11,353: 0.79; 0.73 |
| | | | | DT – MIMIC (accuracy; F1) | - | N = 6: 0.71; 0.51 / 11,353: 0.81; 0.76 |
| | | | | EHR-CoAgent-GPT4 – CRADLE (accuracy; F1) | 0.70; 0.60 | - |
| | | | | GPT-4 – CRADLE (accuracy; F1) | ZSC: 0.21; 0.22 | Prompt-engineered: 0.30; 0.29 / few-shot (N = 6): 0.41; 0.40 |
| | | | | GPT-3.5 – CRADLE (accuracy; F1) | ZSC: 0.56; 0.52 | Prompt-engineered: 0.62; 0.54 / few-shot (N = 6): 0.40; 0.40 |
| | | | | RF – CRADLE (accuracy; F1) | - | N = 6: 0.66; 0.51 / 34,404: 0.80; 0.57 |
| | | | | LR – CRADLE (accuracy; F1) | - | N = 6: 0.54; 0.48 / 34,404: 0.80; 0.59 |
| | | | | DT – CRADLE (accuracy; F1) | - | N = 6: 0.31; 0.31 / 34,404: 0.80; 0.52 |
| Nazary; 2024 (XAI4LLM) [13] | To evaluate diagnostic accuracy and risk factors, including gender bias and false-negative rates, using an LLM, and to compare with CML approaches | Heart Disease dataset (920; #11) | Feature name–value pairs, or transformation into a simple manual textual template | Best XAI4LLM: LLM + RF (F1) | ZSC: 0.741 | - |
| | | | | XGBoost (F1) | - | 920: 0.91 |
| | | | | RF (F1) | - | 920: 0.74 |

1. Cui et al.'s and Nazary et al.'s studies were preprint publications.

Abbreviations: CML, classical machine learning; ZSC, zero-shot classification; RF, random forest; LR, logistic regression; DT, decision tree; EHR, electronic health record; CVD, cardiovascular disease; AUC, area under the receiver operating characteristic curve; #?, number of features not reported.

Illustrative sketches of the three implementation steps that recur across these studies (table-to-text serialization, EHR code-to-name mapping, and small-sample CML baselines) follow below.
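The first step, used by TabLLM [10] and the manual templates in XAI4LLM [13], turns one tabular record into a sentence-like prompt before it reaches the LLM. A minimal sketch follows; the feature names, values, and question wording are illustrative assumptions, not the exact templates or the real Heart-dataset schema used in those papers.

```python
def serialize_row(row: dict) -> str:
    """Turn one tabular record into a natural-language description."""
    return " ".join(
        f"The {name.replace('_', ' ')} is {value}." for name, value in row.items()
    )

# Hypothetical patient record; field names are illustrative only.
patient = {
    "age": 63,
    "sex": "male",
    "resting_blood_pressure": 145,
    "cholesterol": 233,
    "max_heart_rate": 150,
}

prompt = (
    serialize_row(patient)
    + " Does this patient have coronary artery disease? Answer yes or no."
)
print(prompt)
# The serialized prompt is sent to the LLM; the answer token ("yes"/"no")
# is mapped back to the binary label for zero- or few-shot classification.
```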
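For the EHR-CoAgent-style pipeline [12], structured visit codes are first mapped to human-readable names and then assembled into a few-shot prompt. The sketch below shows that flow under stated assumptions: the code_book entries, task wording, and prompt layout are hypothetical stand-ins for the paper's actual vocabularies and prompt-engineering templates.

```python
# Hypothetical code book: a real pipeline would load full ICD/NDC
# vocabularies; these three entries are illustrative only.
code_book = {
    "ICD9:250.00": "type 2 diabetes mellitus",
    "ICD9:401.9": "essential hypertension",
    "NDC:00093-7214": "metformin 500 mg tablet",
}

def encounter_to_text(codes):
    """Replace raw codes with readable names; fall back to the raw code."""
    names = [code_book.get(c, c) for c in codes]
    return "Recorded during the visit: " + "; ".join(names) + "."

# Assemble a few-shot prompt (the study used N = 6 examples; one shown here).
examples = [(["ICD9:250.00", "NDC:00093-7214"], "no")]
lines = ["Task: predict whether the patient develops cardiovascular disease."]
for codes, label in examples:
    lines.append(encounter_to_text(codes) + f" Answer: {label}")
lines.append(encounter_to_text(["ICD9:401.9", "ICD9:250.00"]) + " Answer:")
print("\n".join(lines))  # this prompt would then be sent to GPT-3.5 or GPT-4
```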
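On the classical side, the comparisons in [10] train gradient-boosted trees on small labeled subsets (32 and 512 rows) and score AUC on a held-out split. A self-contained sketch of that evaluation loop, assuming the xgboost and scikit-learn packages and using a synthetic stand-in sized like the Heart dataset rather than the real data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in shaped like the Heart dataset (918 rows, 11 features).
X, y = make_classification(n_samples=918, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

for n in (32, 512):  # the labeled-subset sizes reported in the comparison
    model = XGBClassifier(n_estimators=100, eval_metric="logloss")
    model.fit(X_train[:n], y_train[:n])
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"train size {n}: AUC = {auc:.2f}")
```

Unlike the LLMs, a tree ensemble has no zero-shot mode, which is why the CML rows in the table carry a dash in the zero-shot column.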