Table 9 Expert evaluation results
From: Large language models learning to write rhyming Tang poetry: A Xunzi Yayun R1 case study
| Model | Type | Fluency | Coherence | Poeticness | Average |
|---|---|---|---|---|---|
| Human | / | 5.86 | 5.65 | 5.70 | 5.74 |
| Xunzi-Yayun-R1-32B | RAG | 5.97 | 5.67 | 5.79 | 5.81 |
| QwQ-32B | RAG | 5.93 | 5.80 | 5.59 | 5.77 |
| DeepSeek-R1-671B | RAG | 5.86 | 5.81 | 5.58 | 5.75 |
| Qwen2.5-32B-Instruct-poem-GRPO | RAG | 5.94 | 5.83 | 5.46 | 5.74 |
| Qwen2.5-32B-Instruct | RAG | 5.77 | 5.65 | 5.44 | 5.62 |
| internlm2-5-7b-chat | RAG | 5.77 | 5.46 | 5.45 | 5.56 |
| Qwen2.5-7B-Instruct-poem-GRPO | RAG | 5.76 | 5.37 | 5.27 | 5.47 |
| glm-4-9b-chat | RAG | 5.69 | 5.35 | 5.30 | 5.45 |
| Qwen2.5-14B-Instruct | RAG | 5.68 | 5.21 | 5.39 | 5.42 |
| Qwen2.5-7B-Instruct | RAG | 5.27 | 5.04 | 5.13 | 5.15 |
| Qwen2.5-32B-Instruct-poem-GRPO | reasoning | 5.78 | 5.41 | 5.36 | 5.52 |
| Xunzi-Yayun-R1-32B | reasoning | 5.70 | 5.35 | 5.24 | 5.43 |
| Qwen2.5-32B-Instruct | reasoning | 5.62 | 5.32 | 5.32 | 5.42 |
| glm-4-9b-chat | reasoning | 5.68 | 5.28 | 5.30 | 5.42 |
| Qwen2.5-7B-Instruct-poem-GRPO | reasoning | 5.55 | 5.35 | 5.25 | 5.38 |
| Qwen2.5-14B-Instruct | reasoning | 5.46 | 5.28 | 5.28 | 5.34 |
| Qwen2.5-7B-Instruct | reasoning | 5.02 | 4.82 | 4.67 | 4.84 |
| internlm2-5-7b-chat | reasoning | 4.75 | 4.69 | 4.76 | 4.73 |
| Qwen2.5-32B-Instruct | SFT | 5.79 | 5.64 | 5.27 | 5.57 |
| glm-4-9b-chat | SFT | 5.65 | 5.49 | 5.35 | 5.50 |
| Qwen2.5-7B-Instruct | SFT | 5.30 | 5.18 | 5.15 | 5.21 |
| internlm2-5-7b-chat | SFT | 5.27 | 5.10 | 5.12 | 5.16 |
| Qwen2.5-14B-Instruct | SFT | 5.21 | 4.98 | 4.91 | 5.04 |
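
The table does not say how the Average column is derived. A minimal sketch, assuming it is the unweighted mean of Fluency, Coherence, and Poeticness, checks that assumption on a few rows copied from the table. The names `ROWS`, `mean_score`, and `max_deviation` are illustrative and do not come from the source.

```python
# Sample rows copied from Table 9:
# (model, type, fluency, coherence, poeticness, reported_average)
ROWS = [
    ("Human", "/", 5.86, 5.65, 5.70, 5.74),
    ("Xunzi-Yayun-R1-32B", "RAG", 5.97, 5.67, 5.79, 5.81),
    ("Qwen2.5-14B-Instruct", "RAG", 5.68, 5.21, 5.39, 5.42),
    ("Xunzi-Yayun-R1-32B", "reasoning", 5.70, 5.35, 5.24, 5.43),
    ("Qwen2.5-14B-Instruct", "SFT", 5.21, 4.98, 4.91, 5.04),
]


def mean_score(fluency, coherence, poeticness):
    """Unweighted mean of the three criteria (an assumption, not stated in the table)."""
    return (fluency + coherence + poeticness) / 3


def max_deviation(rows):
    """Largest absolute gap between the recomputed mean and the reported Average."""
    return max(abs(mean_score(f, c, p) - avg) for _, _, f, c, p, avg in rows)


if __name__ == "__main__":
    for model, kind, f, c, p, avg in ROWS:
        print(f"{model} ({kind}): recomputed {mean_score(f, c, p):.3f}, reported {avg:.2f}")
    print(f"max deviation: {max_deviation(ROWS):.4f}")
```

On these sample rows the recomputed means stay within 0.01 of the reported values. The small gaps in the two Qwen2.5-14B-Instruct rows would be consistent with averages computed from unrounded per-criterion scores, though the table alone cannot confirm that.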