Figure 3

The system pipeline for training. A CNN and the Edge Feature Extraction (EFE) module encode the reference image x, and an LSTM extracts the text features. The model is then trained via TATLF. (Created with ‘Microsoft Office Visio 2013’, https://www.microsoft.com/zh-cn/microsoft-365/previous-versions/microsoft-vision-2013).