Table 2 Multi-label classification model performance, presented using AUROC, with and without 0–3 month data

From: Enhancing EHR-based pancreatic cancer prediction with LLM-derived embeddings

 

**Including 0–3 months data**

| Model | CUMC 0–3 m (566) | CUMC 3–6 m (22) | CUMC 6–12 m (23) | CUMC 12–36 m (35) | CUMC 36–60 m (14) | CSMC 0–3 m (283) | CSMC 3–6 m (23) | CSMC 6–12 m (18) | CSMC 12–36 m (23) | CSMC 36–60 m (7) |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline fine_tune 32 | 0.916 [0.901, 0.931] | 0.814 [0.718, 0.895] | 0.598 [0.479, 0.726] | 0.663 [0.572, 0.750] | 0.579 [0.434, 0.735] | 0.912 [0.891, 0.931] | 0.842 [0.752, 0.921] | 0.817 [0.707, 0.920] | 0.786 [0.720, 0.845] | 0.557 [0.381, 0.733] |
| Baseline fine_tune 1536 | **0.931 [0.918, 0.943]** | **0.879 [0.825, 0.926]** | 0.592 [0.466, 0.720] | 0.694 [0.609, 0.770] | 0.532 [0.373, 0.696] | 0.915 [0.896, 0.932] | 0.826 [0.761, 0.882] | 0.819 [0.695, 0.919] | 0.744 [0.635, 0.836] | 0.764 [0.635, 0.878] |
| GPT fine_tune 32 | 0.929 [0.916, 0.942] | 0.801 [0.714, 0.882] | 0.615 [0.494, 0.738] | **0.731 [0.648, 0.805]** | 0.591 [0.436, 0.751] | | | | | |
| GPT fine_tune 1536 | 0.930 [0.918, 0.943] | 0.856 [0.794, 0.915] | **0.673\* [0.559, 0.776]** | 0.724 [0.650, 0.792] | **0.604\* [0.459, 0.761]** | **0.932\* [0.917, 0.947]** | **0.884\* [0.823, 0.928]** | **0.858\* [0.761, 0.946]** | **0.824\* [0.722, 0.901]** | **0.787 [0.618, 0.907]** |

 

**Excluding 0–3 months data**

| Model | CUMC 3–6 m (344) | CUMC 6–12 m (24) | CUMC 12–36 m (35) | CUMC 36–60 m (13) | CSMC 3–6 m (216) | CSMC 6–12 m (18) | CSMC 12–36 m (23) | CSMC 36–60 m (7) |
|---|---|---|---|---|---|---|---|---|
| Baseline fine_tune 32 | 0.742 [0.716, 0.767] | 0.687 [0.564, 0.788] | 0.668 [0.569, 0.760] | 0.644 [0.522, 0.758] | | | | |
| Baseline fine_tune 1536 | 0.757 [0.729, 0.782] | 0.756 [0.661, 0.838] | 0.737 [0.661, 0.807] | **0.781 [0.679, 0.876]** | 0.828 [0.800, 0.854] | 0.880 [0.849, 0.893] | 0.771 [0.695, 0.843] | **0.750 [0.675, 0.841]** |
| GPT fine_tune 32 | 0.755 [0.731, 0.780] | 0.779 [0.696, 0.854] | **0.764 [0.696, 0.827]** | 0.665 [0.585, 0.746] | | | | |
| GPT fine_tune 1536 | **0.763 [0.738, 0.787]** | **0.819\* [0.741, 0.888]** | 0.761 [0.699, 0.818] | 0.706 [0.598, 0.811] | **0.870\* [0.849, 0.893]** | **0.893 [0.813, 0.954]** | **0.811 [0.733, 0.881]** | 0.725 [0.596, 0.839] |

1. Numbers in parentheses indicate the number of cases in the hold-out test set, which constitutes 20% of the entire dataset; the remaining 20% was used for validation and 60% for training. The number of controls (157,067 for CUMC and 92,073 for CSMC) remained consistent across evaluations. The best-performing outcome at each prediction interval is bolded. A complete list of model evaluation results, including comparisons of different pre-trained models (OpenAI GPT, RGCN, and Mistral), assessments of different embedding sizes (e.g., 32 vs. 1536), and the impact of freezing or fine-tuning these embeddings on model performance, is provided in Supplementary Table 2.
2. \* indicates p < 0.05 based on bootstrap testing between the Baseline and GPT embedding models.
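The significance test in note 2 can be illustrated with a minimal paired-bootstrap sketch: resample the hold-out set with replacement, recompute the AUROC difference between the two models on each resample, and report the fraction of resamples in which the GPT model fails to beat the Baseline as a one-sided p-value. This assumes the standard recipe; the paper's exact resampling scheme, replicate count, and sidedness are not stated here, and all names (`auroc`, `bootstrap_pvalue`, `scores_base`, `scores_gpt`) are illustrative.

```python
import numpy as np

def auroc(y, scores):
    """AUROC via the Mann-Whitney statistic (ties count half)."""
    pos, neg = scores[y == 1], scores[y == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def bootstrap_pvalue(y_true, scores_base, scores_gpt, n_boot=1000, seed=0):
    """One-sided paired bootstrap p-value for 'GPT AUROC > Baseline AUROC'."""
    rng = np.random.default_rng(seed)
    n, diffs = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:    # AUROC undefined with one class
            continue
        diffs.append(auroc(y_true[idx], scores_gpt[idx])
                     - auroc(y_true[idx], scores_base[idx]))
    # fraction of resamples where GPT does not outperform the Baseline
    return float(np.mean(np.asarray(diffs) <= 0))
```

The same per-resample AUROC differences also yield the bracketed percentile confidence intervals shown in the table (e.g., via `np.percentile(diffs, [2.5, 97.5])`).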