Fig. 2
From: Intelligent recognition of counterfeit goods text based on BERT and multimodal feature fusion

Overview of the CDANet architecture. The auxiliary position detection network (right) explicitly supervises the positions of erroneous characters, reducing interference with the downstream correction task. Its hidden states are integrated with semantic features from the BERT encoder (Equation 6), together with features from the pinyin extractor (Equation 8) and the character shape extractor (Equation 9); these are consolidated through multimodal feature fusion (Equation 10) before being fed into the Transformer encoder for correction. In the example input, to identify and locate the erroneous character ‘午’ (wu3, meaning noon), we need not only contextual information for assistance but also the visual or phonetic characteristics of the character itself to make a judgment.
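The fusion step described above can be sketched minimally as follows. This is an illustrative assumption, not the paper's exact formulation: the dimensions, the concatenation-plus-linear-projection form, and the random stand-in weights `W_f`, `b_f` are all hypothetical; Equation 10 in the paper defines the actual fusion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-character feature streams for a 4-character sentence.
# All dimensions are illustrative, not those used in the paper.
seq_len, d_sem, d_pin, d_shape, d_model = 4, 768, 64, 64, 768

H_sem   = rng.normal(size=(seq_len, d_sem))    # BERT semantic features (Eq. 6)
H_pin   = rng.normal(size=(seq_len, d_pin))    # pinyin features (Eq. 8)
H_shape = rng.normal(size=(seq_len, d_shape))  # character-shape features (Eq. 9)

# Multimodal fusion sketched as concatenation followed by a learned
# linear projection; W_f and b_f stand in for trained parameters.
W_f = rng.normal(size=(d_sem + d_pin + d_shape, d_model)) * 0.01
b_f = np.zeros(d_model)

H_cat   = np.concatenate([H_sem, H_pin, H_shape], axis=-1)
H_fused = np.tanh(H_cat @ W_f + b_f)  # then fed to the Transformer encoder

print(H_fused.shape)  # (4, 768)
```

The key property is that each character position carries a single fused vector combining semantic, phonetic, and glyph evidence, which is what lets the downstream corrector judge a character like ‘午’ by its sound and shape as well as its context.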