Fig. 2: Representations on a hypersphere.

The student’s lower layer and the teacher’s output are projected onto the hypersphere. The intermediate self-distillation loss explicitly aligns the representations of the student’s lower layer with those of the teacher’s output, effectively guiding the lower-layer representations toward the output representations and enhancing feature consistency.