Figure 1

Overview of our method. The model is trained in five stages. In the first stage (Frame Sampling), frame similarity is computed with the MS-SSIM algorithm, and frames with low semantic similarity to their neighbors are selected as keyframes. In the second stage (Feature Extraction), 2D and 3D CNN features are extracted from the video, and natural-language captions are generated for the keyframe images. In the third stage (ViT Encoder), the encoder block of the Transformer architecture serves as the feature encoder of the model. In the fourth stage (Decoder), an LSTM serves as the decoder of the model. In the fifth stage (Non-parametric Metric Learning), a reverse auxiliary learning module reinforces the learning of video captions by computing the loss Generated Caption \(\rightarrow \) Ground Truth.
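As a concrete illustration of the first stage, the sketch below selects keyframes by thresholding MS-SSIM between each frame and the most recent keyframe. The `pytorch_msssim` package, the `select_keyframes` helper, and the 0.6 threshold are illustrative assumptions, not details specified by the method.

```python
# Hypothetical sketch of Stage 1 (Frame Sampling); names and threshold are
# assumptions for illustration, not the paper's reference implementation.
import torch
from pytorch_msssim import ms_ssim  # assumed third-party MS-SSIM implementation


def select_keyframes(frames: torch.Tensor, threshold: float = 0.6) -> list[int]:
    """Select keyframe indices from a video tensor of shape (T, C, H, W).

    Frame values are assumed normalized to [0, 1]. A frame becomes a new
    keyframe when its MS-SSIM similarity to the previous keyframe falls
    below `threshold`, i.e., when the visual content differs substantially.
    """
    keyframes = [0]  # always keep the first frame
    for t in range(1, frames.shape[0]):
        prev = frames[keyframes[-1]].unsqueeze(0)  # (1, C, H, W)
        curr = frames[t].unsqueeze(0)
        sim = ms_ssim(curr, prev, data_range=1.0, size_average=True).item()
        if sim < threshold:  # large dissimilarity -> new keyframe
            keyframes.append(t)
    return keyframes
```

Comparing each frame against the last accepted keyframe (rather than the immediately preceding frame) avoids selecting runs of near-duplicate frames during slow transitions.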