Table 2 Qualitative evaluation of human-written vs. AI-generated discharge summaries.

From: Accurate discharge summary generation using fine tuned large language models with self evaluation

| Dimension | Human-Written (Mean ± SD) | AI-Generated (Best Model, Mean ± SD) | Fleiss’ κ (AI) |
| --- | --- | --- | --- |
| Accuracy | 4.8 ± 0.11 | 4.5 ± 0.19 | 0.81 |
| Completeness | 4.9 ± 0.13 | 4.6 ± 0.17 | 0.83 |
| Relevance & Clarity | 4.8 ± 0.15 | 4.4 ± 0.26 | 0.78 |
| Consistency | 4.7 ± 0.11 | 4.3 ± 0.20 | 0.80 |
| Utility | 4.8 ± 0.16 | 4.4 ± 0.18 | 0.82 |
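
For readers unfamiliar with the agreement statistic in the last column, the following is a minimal sketch of how Fleiss’ κ is computed from a table of rating counts. The table does not state the number of raters, the number of summaries, or the rating scale, so the function name `fleiss_kappa` and the small `example_counts` matrix are hypothetical and purely illustrative; they are not the study's data or code.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a subjects x categories matrix of rating counts.

    counts[i][j] is the number of raters who assigned subject i to
    category j. Every subject must be rated by the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per subject are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")

    # Per-subject agreement: share of rater pairs that agree on subject i.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_subject) / n_subjects

    # Chance agreement from the overall proportion of each category.
    n_categories = len(counts[0])
    total = n_subjects * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 4 summaries, 3 raters, scores binned into 3 categories.
example_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 2, 1],
    [0, 0, 3],
]
print(f"kappa = {fleiss_kappa(example_counts):.2f}")
```

Under the commonly cited Landis and Koch benchmarks, values between 0.61 and 0.80 are read as substantial agreement and values above 0.80 as almost perfect agreement, which is where the reported values of 0.78 to 0.83 fall.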