Table 5 Evaluation of other responses.

From: Evaluating the appropriateness and safety of generative AI in delivering lifestyle guidance for atrial fibrillation patients

		DB GPT	PubMed GPT	GPT-4o	E.P.	Odds ratio (95% CI)^a
Scientific consensus, number (%)	Yes	16 (64.0)	16 (64.0)	25 (100)	21 (84.0)	DB: 0.31 (0.06–1.73) PM: 0.31 (0.06–1.73) 4o: not estimable
Scientific consensus, number (%)	No	9 (36.0)	9 (36.0)	0 (0)	4 (16.0)
Extent of possible harm, number (%)	No/mild	19 (76.0)	21 (84.0)	24 (100)	21 (84.0)	DB^b: 0.60 (0.11–3.42) PM^b: 1.00 (0.16–6.43) 4o^b: not estimable
Extent of possible harm, number (%)	Moderate	5 (20.0)	3 (12.0)	0 (0)	3 (12.0)
Extent of possible harm, number (%)	Severe/fatal	1 (4.0)	1 (4.0)	0 (0)	1 (4.0)
Evidence of incorrect comprehension, number (%)	Yes	9 (36.0)	10 (40.0)	1 (4.0)	6 (24.0)	DB: 0.48 (0.09–2.63) PM: 0.39 (0.07–2.10) 4o: 9.27 (0.57–151.30)
Evidence of incorrect comprehension, number (%)	No	16 (64.0)	15 (60.0)	24 (96.0)	19 (76.0)
Evidence of incorrect retrieval, number (%)	Yes	7 (28.0)	8 (32.0)	0 (0)	6 (24.0)	DB: 0.80 (0.17–3.91) PM: 0.66 (0.43–3.13) 4o: not estimable
Evidence of incorrect retrieval, number (%)	No	18 (72.0)	17 (68.0)	25 (100)	19 (76.0)
Evidence of incorrect reasoning, number (%)	Yes	7 (28.0)	7 (28.0)	1 (4.0)	3 (12.0)	DB: 0.30 (0.04–2.09) PM: 0.30 (0.04–2.09) 4o: 3.50 (0.19–64.35)
Evidence of incorrect reasoning, number (%)	No	18 (72.0)	18 (72.0)	24 (96.0)	22 (88.0)
Inappropriate content, number (%)	Yes	5 (20.0)	6 (24.0)	1 (4.0)	4 (16.0)	DB: 0.67 (0.08–5.86) PM: 0.45 (0.05–3.96) 4o: 6.36 (0.31–129.57)
Inappropriate content, number (%)	No	18 (76.7)	21 (70.0)	24 (80.0)	28 (93.3)
Incorrect content, number (%)	Yes	7 (28.0)	6 (24.0)	1 (4.0)	7 (28.0)	DB: 1.00 (0.18–5.56) PM: 1.30 (0.23–7.41) 4o: 11.97 (0.74–192.90)
Incorrect content, number (%)	No	18 (72.0)	19 (76.0)	24 (96.0)	18 (72.0)
Specialized content, number (%)	Yes	8 (32.0)	12 (48.0)	21 (84.0)	12 (48.0)	DB: 0.40 (0.08–2.11) PM: 1.00 (0.21–4.75) 4o: 8.30 (1.33–51.71)
Specialized content, number (%)	No	17 (68.0)	13 (52.0)	4 (16.0)	13 (52.0)
Bedside manner (empathy to the user), mean (SD)		3.76 (1.23)	3.68 (1.41)	1.48 (0.65)	3.12 (1.30)	DB^c: 0.39 (0.09–1.77) PM^c: 0.49 (0.11–2.11) 4o^c: 15.41 (2.00–118.94)
Helpfulness of the answer, mean (SD)		3.12 (0.88)	2.96 (1.02)	1.20 (0.50)	2.72 (0.84)	DB^d: 0.40 (0.09–1.78) PM^d: 0.72 (0.18–2.90) 4o^d: 30.54 (2.17–430.35)

All evaluation dimensions were analyzed using a generalized linear mixed model (GLMM) with post-hoc multiple comparisons. DB: DB GPT; PM: PubMed GPT; 4o: GPT-4o; E.P.: electrophysiologist.
a: Dunnett’s test results with the electrophysiologist as the reference group.
b: In the analysis, the moderate and severe/fatal categories were combined, giving a binary variable.
c: Bedside manner: 5-point scale (1 = very empathetic, 5 = not empathetic); scores of 1–2 were coded as “1” and scores of 3–5 as “0”.
d: Helpfulness of the answer: 4-rank scale (1 = best, 4 = worst); ranks of 1–2 were coded as “1” and ranks of 3–4 as “0”.
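
The binary coding described in footnotes c and d, and the reason some comparisons are reported as "not estimable", can be illustrated with a minimal Python sketch. It is not the authors' analysis: the table's odds ratios come from a GLMM with Dunnett's test, whereas the sketch computes a crude (unadjusted) odds ratio with a Wald interval, so its numbers will not match the table. The function names `dichotomize` and `crude_odds_ratio` are hypothetical and introduced here only for illustration.

```python
from math import exp, log, sqrt
from typing import Optional, Tuple


def dichotomize(score: int, favorable_max: int = 2) -> int:
    """Code an ordinal score as 1 if it is at most favorable_max, else 0.

    With favorable_max=2 this mirrors footnotes c and d, where scores or
    ranks of 1-2 are coded as 1 and all higher values as 0.
    """
    return 1 if score <= favorable_max else 0


def crude_odds_ratio(
    a: int, b: int, c: int, d: int
) -> Optional[Tuple[float, float, float]]:
    """Unadjusted odds ratio and Wald 95% CI for a 2x2 table.

    a, b: counts with / without the outcome in the comparison group
    c, d: counts with / without the outcome in the reference group
    Returns None when any cell is zero, because the ratio or its
    standard error is then undefined (assumption: no continuity
    correction is applied).
    """
    if 0 in (a, b, c, d):
        return None
    ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(ratio) - 1.96 * se)
    upper = exp(log(ratio) + 1.96 * se)
    return ratio, lower, upper


# Bedside-manner scores on the 5-point scale, coded as in footnote c.
coded = [dichotomize(s) for s in [1, 2, 3, 5]]  # [1, 1, 0, 0]

# Scientific consensus, DB GPT (16 yes, 9 no) vs E.P. (21 yes, 4 no):
# the crude ratio is 64/189, roughly 0.34; the table reports 0.31.
db_vs_ep = crude_odds_ratio(16, 9, 21, 4)

# GPT-4o (25 yes, 0 no) vs E.P.: a zero cell leaves the ratio undefined.
gpt4o_vs_ep = crude_odds_ratio(25, 0, 21, 4)  # None
```

The zero-cell case is consistent with the rows where GPT-4o has no responses in one category and the table lists the comparison as not estimable. Differences between the crude ratio and the tabulated one are expected, since the mixed model adjusts for structure that a single 2×2 table ignores.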
