Table 2 Performances of Fu-LLM with different strategies for adjudications of clinical events

From: A large language model for clinical outcome adjudication from telephone follow-up interviews: a secondary analysis of a multicenter randomized clinical trial

 

| Adjudicated item | Raw agreement, % (95% CI) | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Positive predictive value, % (95% CI) | Negative predictive value, % (95% CI) |
| --- | --- | --- | --- | --- | --- |
| Performances of finetune_qwen2_7b | | | | | |
| Whether the information came from the participant himself/herself | 97.2 (96.1–98.2) | 97.9 (96.7–98.9) | 96.2 (94.5–97.9) | 97.4 (96.1–98.7) | 96.9 (95.2–98.6) |
| Whether the participant died | 99.8 (99.5–100.0) | 100.0 (100.0–100.0) | 99.8 (99.5–100.0) | 91.7 (79.3–100.0) | 100.0 (100.0–100.0) |
| Whether the participant was hospitalizedᵃ | 82.7 (80.4–85.0) | 88.9 (84.7–93.3) | 91.3 (88.9–93.6) | 79.1 (73.7–83.9) | 95.7 (94.0–97.3) |
| Whether the participant underwent surgeryᵃ | 92.1 (90.3–93.8) | 95.3 (91.8–100.0) | 89.8 (87.3–92.5) | 75.1 (69.5–80.5) | 98.4 (97.0–99.4) |
| Whether the participant took medicationᵃ | 96.4 (95.2–97.5) | 99.8 (99.4–100.0) | 91.0 (86.1–95.5) | 98.5 (97.6–99.3) | 98.5 (96.2–100.0) |
| Total | 93.7 (93.1–94.3) | 97.5 (96.7–98.2) | 95.0 (94.2–95.8) | 93.1 (91.9–94.2) | 98.2 (97.8–98.7) |
| Performances of finetune_qwen2_7b_wo_aug | | | | | |
| Whether the information came from the participant himself/herself | 92.7 (91.2–94.3) | 95.5 (93.8–97.1) | 88.7 (85.6–91.5) | 92.5 (90.5–94.6) | 93.1 (90.2–95.3) |
| Whether the participant died | 99.6 (99.2–99.9) | 100.0 (100.0–100.0) | 99.6 (99.2–99.9) | 84.6 (70.0–96.8) | 100.0 (100.0–100.0) |
| Whether the participant was hospitalizedᵃ | 69.5 (66.5–72.3) | 77.9 (72.6–83.1) | 89.8 (87.0–92.2) | 73.6 (67.9–79.8) | 91.7 (89.4–94.0) |
| Whether the participant underwent surgeryᵃ | 87.5 (85.5–89.5) | 87.7 (82.8–92.4) | 85.7 (82.8–88.7) | 66.4 (59.7–72.8) | 95.6 (93.7–97.4) |
| Whether the participant took medicationᵃ | 94.8 (93.3–96.2) | 98.9 (98.2–99.5) | 81.4 (75.2–87.6) | 96.8 (95.4–98.0) | 92.9 (87.9–96.9) |
| Total | 88.9 (88.1–89.7) | 94.4 (93.3–95.4) | 92.1 (91.1–93.1) | 89.1 (87.7–90.5) | 96.0 (95.2–96.7) |
| Performances of zero_shot_qwen2_7b | | | | | |
| Whether the information came from the participant himself/herself | 87.2 (85.0–89.0) | 91.2 (88.8–93.2) | 81.4 (77.7–85.0) | 87.8 (85.3–90.1) | 86.3 (82.6–89.3) |
| Whether the participant died | 99.4 (98.9–99.8) | 100.0 (100.0–100.0) | 99.4 (98.9–99.8) | 78.6 (63.0–91.7) | 100.0 (100.0–100.0) |
| Whether the participant was hospitalizedᵃ | 25.0 (22.4–27.5) | 51.4 (45.0–57.9) | 10.6 (8.0–13.1) | 17.5 (14.5–20.5) | 37.3 (29.9–44.9) |
| Whether the participant underwent surgeryᵃ | 67.3 (64.3–70.2) | 76.6 (69.9–82.6) | 64.4 (60.6–68.3) | 40.9 (35.7–46.3) | 89.5 (86.3–92.3) |
| Whether the participant took medicationᵃ | 82.9 (80.5–85.2) | 83.6 (81.0–85.9) | 86.9 (81.3–92.1) | 97.3 (96.1–98.4) | 48.1 (42.3–54.2) |
| Total | 72.5 (71.3–73.7) | 82.1 (80.2–83.6) | 70.3 (68.5–71.9) | 65.5 (63.5–67.4) | 85.1 (83.5–86.5) |

  1. CI, confidence interval; GPT, generative pretrained transformer.
  2. ᵃ For participants reported as dead during follow-up, the follow-up staff, for humanitarian reasons, did not ask about hospitalization, surgery, or medication; these three events were therefore not evaluated for the death cases (22 recording vignettes reported death events).
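
The five metrics reported in Table 2 can be illustrated with a minimal sketch, assuming each adjudicated item is coded as a binary label (1 = event, 0 = no event) for both the reference adjudication and the model output. The function names (`confusion_counts`, `metrics`, `bootstrap_ci`), the toy labels, and the percentile bootstrap used for the interval are all assumptions made for illustration; the table does not state how raw agreement was defined or how the 95% CIs were obtained, so this is not a reproduction of the authors' analysis.

```python
import random
from typing import Dict, Sequence, Tuple


def confusion_counts(truth: Sequence[int], pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded 1 = event, 0 = no event."""
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def ratio(num: int, den: int) -> float:
    """Percentage num/den, or NaN when the denominator is zero."""
    return 100.0 * num / den if den else float("nan")


def metrics(truth: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Point estimates (in %) of the five metrics, treating raw agreement
    as the share of matching labels (an assumption, see lead-in)."""
    tp, fp, tn, fn = confusion_counts(truth, pred)
    return {
        "raw_agreement": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def bootstrap_ci(truth: Sequence[int], pred: Sequence[int], name: str,
                 n_boot: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap 95% interval for one metric (hypothetical CI method)."""
    rng = random.Random(seed)
    n = len(truth)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metrics([truth[i] for i in idx], [pred[i] for i in idx])[name]
        if value == value:  # drop NaN resamples (zero denominator)
            values.append(value)
    if not values:
        return float("nan"), float("nan")
    values.sort()
    last = len(values) - 1
    return values[int(0.025 * last)], values[int(0.975 * last)]


# Toy labels, invented for illustration only (not data from the study).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
result = metrics(truth, pred)
interval = bootstrap_ci(truth, pred, "sensitivity")
```

With these toy labels there are 3 true positives, 1 false negative, 4 true negatives, and 2 false positives, so sensitivity is 3/4 and positive predictive value is 3/5. A resampling interval can collapse to a single value when every resample yields the same estimate, which is one way an interval such as 100.0–100.0 can arise, though whether that applies to Table 2 is not stated in the source.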