Table 2 Comparison of text-to-video model performance on seven evaluation measures.

From: Automatic background animation generation aligned with LLM-generated lyrics for children’s songs

 

|          | Evaluation measure     | BAGen (ours)  | CogVideoX     | HotShotXL     | Pyramid Flow  | WAN           |
|----------|------------------------|---------------|---------------|---------------|---------------|---------------|
| CLIP     | Pairwise CLIP          | 0.986 ± 0.015 | 0.912 ± 0.054 | 0.985 ± 0.010 | 0.964 ± 0.026 | 0.968 ± 0.027 |
| CLIP     | Text Alignment         | 0.313 ± 0.029 | 0.244 ± 0.043 | 0.265 ± 0.03  | 0.272 ± 0.038 | 0.253 ± 0.035 |
| VR Bench | Subject Consistency    | 0.972 ± 0.029 | 0.937 ± 0.038 | 0.987 ± 0.007 | 0.947 ± 0.041 | 0.946 ± 0.072 |
| VR Bench | Background Consistency | 0.985 ± 0.013 | 0.928 ± 0.041 | 0.975 ± 0.011 | 0.968 ± 0.016 | 0.971 ± 0.025 |
| VR Bench | Motion Smoothness      | 0.992 ± 0.006 | 0.986 ± 0.011 | 0.980 ± 0.014 | 0.995 ± 0.003 | 0.988 ± 0.008 |
| VR Bench | Aesthetic Quality      | 0.779 ± 0.073 | 0.355 ± 0.113 | 0.646 ± 0.089 | 0.679 ± 0.046 | 0.654 ± 0.068 |
| VR Bench | Imaging Quality        | 0.651 ± 0.102 | 0.551 ± 0.115 | 0.622 ± 0.185 | 0.575 ± 0.098 | 0.652 ± 0.078 |