Table 9 Expert evaluation results

From: Large language models learning to write rhyming Tang poetry A Xunzi Yayun R1 case study

| Model | Type | Fluency | Coherence | Poeticness | Average |
| --- | --- | --- | --- | --- | --- |
| Human | / | 5.86 | 5.65 | 5.70 | 5.74 |
| Xunzi-Yayun-R1-32B | RAG | **5.97** | 5.67 | **5.79** | **5.81** |
| QwQ-32B | RAG | 5.93 | 5.80 | 5.59 | 5.77 |
| DeepSeek-R1-671B | RAG | 5.86 | 5.81 | 5.58 | 5.75 |
| Qwen2.5-32B-Instruct-poem-GRPO | RAG | 5.94 | **5.83** | 5.46 | 5.74 |
| Qwen2.5-32B-Instruct | RAG | 5.77 | 5.65 | 5.44 | 5.62 |
| internlm2-5-7b-chat | RAG | 5.77 | 5.46 | 5.45 | 5.56 |
| Qwen2.5-7B-Instruct-poem-GRPO | RAG | 5.76 | 5.37 | 5.27 | 5.47 |
| glm-4-9b-chat | RAG | 5.69 | 5.35 | 5.30 | 5.45 |
| Qwen2.5-14B-Instruct | RAG | 5.68 | 5.21 | 5.39 | 5.42 |
| Qwen2.5-7B-Instruct | RAG | 5.27 | 5.04 | 5.13 | 5.15 |
| Qwen2.5-32B-Instruct-poem-GRPO | reasoning | 5.78 | 5.41 | 5.36 | 5.52 |
| Xunzi-Yayun-R1-32B | reasoning | 5.70 | 5.35 | 5.24 | 5.43 |
| Qwen2.5-32B-Instruct | reasoning | 5.62 | 5.32 | 5.32 | 5.42 |
| glm-4-9b-chat | reasoning | 5.68 | 5.28 | 5.30 | 5.42 |
| Qwen2.5-7B-Instruct-poem-GRPO | reasoning | 5.55 | 5.35 | 5.25 | 5.38 |
| Qwen2.5-14B-Instruct | reasoning | 5.46 | 5.28 | 5.28 | 5.34 |
| Qwen2.5-7B-Instruct | reasoning | 5.02 | 4.82 | 4.67 | 4.84 |
| internlm2-5-7b-chat | reasoning | 4.75 | 4.69 | 4.76 | 4.73 |
| Qwen2.5-32B-Instruct | SFT | 5.79 | 5.64 | 5.27 | 5.57 |
| glm-4-9b-chat | SFT | 5.65 | 5.49 | 5.35 | 5.50 |
| Qwen2.5-7B-Instruct | SFT | 5.30 | 5.18 | 5.15 | 5.21 |
| internlm2-5-7b-chat | SFT | 5.27 | 5.10 | 5.12 | 5.16 |
| Qwen2.5-14B-Instruct | SFT | 5.21 | 4.98 | 4.91 | 5.04 |

  1. Bold values represent the best results for each metric.
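
The table does not state how the Average column is derived. The sketch below checks one plausible reading, namely that Average is the arithmetic mean of Fluency, Coherence, and Poeticness. That reading is an assumption, not something the source confirms, and the helper names (`mean_of_metrics`, `max_deviation`) are hypothetical. Only a handful of rows are copied in for illustration.

```python
# A few rows copied from Table 9:
# (model, type, fluency, coherence, poeticness, reported average)
ROWS = [
    ("Human", "/", 5.86, 5.65, 5.70, 5.74),
    ("Xunzi-Yayun-R1-32B", "RAG", 5.97, 5.67, 5.79, 5.81),
    ("Qwen2.5-14B-Instruct", "RAG", 5.68, 5.21, 5.39, 5.42),
    ("Xunzi-Yayun-R1-32B", "reasoning", 5.70, 5.35, 5.24, 5.43),
    ("Qwen2.5-32B-Instruct", "SFT", 5.79, 5.64, 5.27, 5.57),
]


def mean_of_metrics(fluency, coherence, poeticness):
    """Arithmetic mean of the three metric scores (assumed definition of Average)."""
    return (fluency + coherence + poeticness) / 3


def max_deviation(rows):
    """Largest absolute gap between the recomputed mean and the reported Average."""
    return max(
        abs(mean_of_metrics(f, c, p) - avg) for _, _, f, c, p, avg in rows
    )


if __name__ == "__main__":
    for model, kind, f, c, p, avg in ROWS:
        print(f"{model} ({kind}): recomputed {mean_of_metrics(f, c, p):.4f}, reported {avg:.2f}")
    print(f"max deviation: {max_deviation(ROWS):.4f}")
```

For these rows the recomputed means stay within 0.01 of the reported averages. The small residual gaps (for example Qwen2.5-14B-Instruct with RAG) would be consistent with the averages having been computed from unrounded scores, though the table alone cannot confirm this.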