Table 5 Evaluation of other responses.

From: Evaluating the appropriateness and safety of generative AI in delivering lifestyle guidance for atrial fibrillation patients

| Item | DB GPT | PubMed GPT | GPT-4o | E.P. | Odds ratio (95% CI)^a |
|---|---|---|---|---|---|
| Scientific consensus, number (%) | | | | | |
| Yes | 16 (64.0) | 16 (64.0) | 25 (100) | 21 (84.0) | DB: 0.31 (0.06–1.73); PM: 0.31 (0.06–1.73); 4o: not estimable |
| No | 9 (36.0) | 9 (36.0) | 0 (0) | 4 (16.0) | |
| Extent of possible harm, number (%) | | | | | |
| No/mild | 19 (76.0) | 21 (84.0) | 24 (100) | 21 (84.0) | DB^b: 0.60 (0.11–3.42); PM^b: 1.00 (0.16–6.43); 4o^b: not estimable |
| Moderate | 5 (20.0) | 3 (12.0) | 0 (0) | 3 (12.0) | |
| Severe/fatal | 1 (4.0) | 1 (4.0) | 0 (0) | 1 (4.0) | |
| Evidence of incorrect comprehension, number (%) | | | | | |
| Yes | 9 (36.0) | 10 (40.0) | 1 (4.0) | 6 (24.0) | DB: 0.48 (0.09–2.63); PM: 0.39 (0.07–2.10); 4o: 9.27 (0.57–151.30) |
| No | 16 (64.0) | 15 (60.0) | 24 (96.0) | 19 (76.0) | |
| Evidence of incorrect retrieval, number (%) | | | | | |
| Yes | 7 (28.0) | 8 (32.0) | 0 (0) | 6 (24.0) | DB: 0.80 (0.17–3.91); PM: 0.66 (0.43–3.13); 4o: not estimable |
| No | 18 (72.0) | 17 (68.0) | 25 (100) | 19 (76.0) | |
| Evidence of incorrect reasoning, number (%) | | | | | |
| Yes | 7 (28.0) | 7 (28.0) | 1 (4.0) | 3 (12.0) | DB: 0.30 (0.04–2.09); PM: 0.30 (0.04–2.09); 4o: 3.50 (0.19–64.35) |
| No | 18 (72.0) | 18 (72.0) | 24 (96.0) | 22 (88.0) | |
| Inappropriate content, number (%) | | | | | |
| Yes | 5 (20.0) | 6 (24.0) | 1 (4.0) | 4 (16.0) | DB: 0.67 (0.08–5.86); PM: 0.45 (0.05–3.96); 4o: 6.36 (0.31–129.57) |
| No | 18 (76.7) | 21 (70.0) | 24 (80.0) | 28 (93.3) | |
| Incorrect content, number (%) | | | | | |
| Yes | 7 (28.0) | 6 (24.0) | 1 (4.0) | 7 (28.0) | DB: 1.00 (0.18–5.56); PM: 1.30 (0.23–7.41); 4o: 11.97 (0.74–192.90) |
| No | 18 (72.0) | 19 (72.0) | 24 (96.0) | 18 (72.0) | |
| Specialized content, number (%) | | | | | |
| Yes | 8 (32.0) | 12 (48.0) | 21 (84.0) | 12 (48.0) | DB: 0.40 (0.08–2.11); PM: 1.00 (0.21–4.75); 4o: 8.30 (1.33–51.71) |
| No | 17 (68.0) | 13 (52.0) | 4 (16.0) | 13 (52.0) | |
| Bedside manner (empathy to the user), mean (SD) | 3.76 (1.23) | 3.68 (1.41) | 1.48 (0.65) | 3.12 (1.30) | DB^c: 0.39 (0.09–1.77); PM^c: 0.49 (0.11–2.11); 4o^c: 15.41 (2.00–118.94) |
| Helpfulness of the answer, mean (SD) | 3.12 (0.88) | 2.96 (1.02) | 1.20 (0.50) | 2.72 (0.84) | DB^d: 0.40 (0.09–1.78); PM^d: 0.72 (0.18–2.90); 4o^d: 30.54 (2.17–430.35) |

All evaluation dimensions were analyzed using a generalized linear mixed model (GLMM) with post-hoc multiple comparisons. DB: DB GPT; PM: PubMed GPT; 4o: GPT-4o; E.P.: electrophysiologist.
a: Dunnett's test results with the electrophysiologist as the reference group.
b: In the analysis, the moderate and severe/fatal categories were combined to form a binary variable.
c: Bedside manner: 5-point scale (1 = very empathetic, 5 = not empathetic); scores of 1–2 were treated as "1" and scores of 3–5 as "0".
d: Helpfulness of the answer: 4-rank scale (1 = best, 4 = worst); ranks of 1–2 were treated as "1" and ranks of 3–4 as "0".
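
As a rough illustration of how the binary rows relate to an odds ratio, the sketch below computes a crude (unadjusted) odds ratio with a Wald 95% CI from a 2×2 count table, and applies the dichotomization described in notes c and d. It is not the authors' analysis: the table's estimates come from a GLMM with Dunnett's test, so crude values will generally differ from the published ones. The function names are hypothetical, and treating a zero cell as "not estimable" is an assumption that happens to match the rows where GPT-4o has a zero count.

```python
import math

def crude_odds_ratio(yes_grp, no_grp, yes_ref, no_ref, z=1.96):
    """Unadjusted odds ratio of a group vs. the reference, with a Wald CI.

    Returns None when any cell is zero (assumed here to correspond to
    'not estimable'); otherwise returns (odds_ratio, lower, upper).
    """
    if 0 in (yes_grp, no_grp, yes_ref, no_ref):
        return None
    odds_ratio = (yes_grp * no_ref) / (no_grp * yes_ref)
    # Standard error of log(OR) for a 2x2 table
    se = math.sqrt(1 / yes_grp + 1 / no_grp + 1 / yes_ref + 1 / no_ref)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

def dichotomize_bedside(score):
    """Note c: scores 1-2 on the 5-point scale become 1, scores 3-5 become 0."""
    return 1 if score <= 2 else 0

def dichotomize_helpfulness(rank):
    """Note d: ranks 1-2 on the 4-rank scale become 1, ranks 3-4 become 0."""
    return 1 if rank <= 2 else 0

# Scientific consensus, DB GPT (16 yes, 9 no) vs. E.P. (21 yes, 4 no)
print(crude_odds_ratio(16, 9, 21, 4))
# Scientific consensus, GPT-4o (25 yes, 0 no) vs. E.P.: zero cell
print(crude_odds_ratio(25, 0, 21, 4))
```

For the DB GPT consensus row, the crude estimate is (16 × 4) / (9 × 21) ≈ 0.34, close to but not identical to the model-based 0.31 reported in the table.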