Table 2 Comparison of text-to-video model performance on seven evaluation measures.
From: Automatic background animation generation aligned with LLM-generated lyrics for children’s songs
| Group | Evaluation measure | BAGen (ours) | CogVideoX | HotShotXL | Pyramid Flow | WAN |
|---|---|---|---|---|---|---|
| CLIP | Pairwise CLIP | 0.986 ± 0.015 | 0.912 ± 0.054 | 0.985 ± 0.010 | 0.964 ± 0.026 | 0.968 ± 0.027 |
| CLIP | Text Alignment | 0.313 ± 0.029 | 0.244 ± 0.043 | 0.265 ± 0.03 | 0.272 ± 0.038 | 0.253 ± 0.035 |
| VR Bench | Subject Consistency | 0.972 ± 0.029 | 0.937 ± 0.038 | 0.987 ± 0.007 | 0.947 ± 0.041 | 0.946 ± 0.072 |
| VR Bench | Background Consistency | 0.985 ± 0.013 | 0.928 ± 0.041 | 0.975 ± 0.011 | 0.968 ± 0.016 | 0.971 ± 0.025 |
| VR Bench | Motion Smoothness | 0.992 ± 0.006 | 0.986 ± 0.011 | 0.980 ± 0.014 | 0.995 ± 0.003 | 0.988 ± 0.008 |
| VR Bench | Aesthetic Quality | 0.779 ± 0.073 | 0.355 ± 0.113 | 0.646 ± 0.089 | 0.679 ± 0.046 | 0.654 ± 0.068 |
| VR Bench | Imaging Quality | 0.651 ± 0.102 | 0.551 ± 0.115 | 0.622 ± 0.185 | 0.575 ± 0.098 | 0.652 ± 0.078 |