Table 1 Summary of studies comparing the performance of large language models and classical machine learning methods in medicine using structured data.
| First author; year | Aim and task | Dataset (sample size; #features) | Transformation techniques | Model, experiment, and metric | Zero-shot performance | Training size: performance |
|---|---|---|---|---|---|---|
| Hegselmann; 2023 (TabLLM)10 | To transform tables to text for binary classification of coronary artery disease and diabetes | Diabetes (768; #7); Heart (918; #11) | Template; billion-parameter LLM; million-parameter LLM | | | |
| | | | | TabLLM-Diabetes (AUC) | 0.82 | 32: 0.68; 512: 0.78 |
| | | | | XGBoost-Diabetes (AUC) | - | 32: 0.69; 512: 0.80 |
| | | | | TabLLM-Heart (AUC) | 0.54 | 32: 0.87; 512: 0.92 |
| | | | | XGBoost-Heart (AUC) | - | 32: 0.88; 512: 0.92 |
| Wang; 2024 (MediTab)11 | To evaluate MediTab (GPT-3.5) on seven medical classification tasks and compare it with TabLLM and CML | Seven breast, lung, and colorectal cancer datasets from clinical trials (average 1451, range 53–2968; on average #3 categorical, #15 binary, #7 numerical) | BioBERT-based model fine-tuned for the transformation, with GPT-3.5 as a sanity check | | | |
| | | | | MediTab (average AUC) | 0: 0.82 | 200: 0.84 |
| | | | | XGBoost (average AUC) | 10: 0.64 | 200: 0.79 |
| Cui; 2024 (EHR-CoAgent)12 | To investigate the efficacy of LLM-based disease prediction from structured EHR data generated during clinical encounters (MIMIC: acute care conditions at the next hospital visit; CRADLE: CVD in diabetic patients) | MIMIC-III (11,353; #?); CRADLE (34,404; #?) | Disease, medication, and procedure codes mapped from code values to code names (plus prompt-engineering techniques) | | | |
| | | | | EHR-CoAgent-GPT4 – MIMIC (Accuracy; F1) | 0.79; 0.73 | - |
| | | | | GPT-4 – MIMIC (Accuracy; F1) | ZSC: 0.51; 0.52. Prompt-engineered: 0.62; 0.58 | Few-shot (N = 6): 0.65; 0.64 |
| | | | | GPT-3.5 – MIMIC (Accuracy; F1) | ZSC: 0.78; 0.68. Prompt-engineered: 0.72; 0.42 | Few-shot (N = 6): 0.76; 0.63 |
| | | | | RF – MIMIC (Accuracy; F1) | N = 6: 0.69; 0.63 | 11,353: 0.78; 0.70 |
| | | | | LR – MIMIC (Accuracy; F1) | N = 6: 0.48; 0.56 | 11,353: 0.79; 0.73 |
| | | | | DT – MIMIC (Accuracy; F1) | N = 6: 0.71; 0.51 | 11,353: 0.81; 0.76 |
| | | | | EHR-CoAgent-GPT4 – CRADLE (Accuracy; F1) | 0.70; 0.60 | - |
| | | | | GPT-4 – CRADLE (Accuracy; F1) | ZSC: 0.21; 0.22. Prompt-engineered: 0.30; 0.29 | Few-shot (N = 6): 0.41; 0.40 |
| | | | | GPT-3.5 – CRADLE (Accuracy; F1) | ZSC: 0.56; 0.52. Prompt-engineered: 0.62; 0.54 | Few-shot (N = 6): 0.40; 0.40 |
| | | | | RF – CRADLE (Accuracy; F1) | N = 6: 0.66; 0.51 | 34,404: 0.80; 0.57 |
| | | | | LR – CRADLE (Accuracy; F1) | N = 6: 0.54; 0.48 | 34,404: 0.80; 0.59 |
| | | | | DT – CRADLE (Accuracy; F1) | N = 6: 0.31; 0.31 | 34,404: 0.80; 0.52 |
| Nazary; 2024 (XAI4LLM)13 | To evaluate diagnostic accuracy and risks, including gender bias and false-negative rates, of an LLM-based approach, and to compare it with CML approaches | Heart Disease Dataset (920; #11) | Feature name-value pairs, or transformation into a simple manual textual template | | | |
| | | | | Best XAI4LLM: LLM + RF (F1) | ZSC: 0.741 | - |
| | | | | XGBoost (F1) | - | 920: 0.91 |
| | | | | RF (F1) | - | 920: 0.74 |
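Every study in Table 1 depends on the same preprocessing step before prompting: serializing a structured record into text, either through a template over feature name-value pairs (TabLLM, XAI4LLM) or by mapping structured EHR codes to their human-readable names (EHR-CoAgent). The sketch below illustrates both ideas under stated assumptions; the feature names, template wording, and code-name map are illustrative stand-ins, not the exact serializations used in the cited papers.

```python
# Minimal sketch of table-to-text serialization for LLM prompting.
# All feature names, values, templates, and code mappings here are
# illustrative assumptions, not the pipelines from the cited studies.

def serialize_row(row: dict) -> str:
    """Render one structured record as a list of template sentences
    (TabLLM-style "The <feature> is <value>." serialization)."""
    return " ".join(f"The {name} is {value}." for name, value in row.items())

# Toy code-to-name map; real pipelines would draw on full ICD/CPT/NDC
# vocabularies rather than a hand-written dictionary.
ICD9_NAMES = {
    "250.00": "type 2 diabetes mellitus",
    "401.9": "essential hypertension",
}

def codes_to_text(codes: list[str]) -> str:
    """Replace raw diagnosis codes with human-readable names,
    falling back to the raw code when no name is known."""
    return "Diagnoses: " + ", ".join(ICD9_NAMES.get(c, c) for c in codes) + "."

if __name__ == "__main__":
    # Hypothetical patient loosely modeled on the Heart Disease dataset (#11).
    patient = {
        "age": 54,
        "sex": "male",
        "chest pain type": "asymptomatic",
        "resting blood pressure (mm Hg)": 130,
        "serum cholesterol (mg/dL)": 246,
    }
    prompt = (
        serialize_row(patient)
        + " " + codes_to_text(["250.00", "401.9"])
        + " Does this patient have coronary artery disease? Answer yes or no."
    )
    print(prompt)
```

In zero-shot use, a prompt like this is sent to the LLM as-is; the few-shot settings in the table correspond to prepending N serialized, labeled examples to the same prompt.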