Fig. 3: Overview of the proposed end-to-end trainable multi-task architecture based on deep learning.

The input image is processed by two branches, which produce the recognition and detection results. GAP denotes global average pooling, a standard deep learning operation. The detection subnetwork's architecture is illustrated above. An endoscopic image passes through a six-layer deep network that generates six predictions at different scales, which are then fused to produce the final prediction.
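As an illustration of the two operations named in this caption, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that feature and prediction maps are nested lists of floats, that the side outputs are resized by nearest-neighbour sampling, and that fusion is a simple average. The caption does not specify the fusion rule, so the averaging step is an assumption, and the function names `global_average_pool`, `upsample_nearest`, and `fuse_predictions` are hypothetical.

```python
from typing import List

Map = List[List[float]]  # one 2-D feature or prediction map (rows of floats)


def global_average_pool(channels: List[Map]) -> List[float]:
    """GAP: collapse each channel's H x W map to its mean value."""
    pooled = []
    for fmap in channels:
        total = sum(sum(row) for row in fmap)
        count = len(fmap) * len(fmap[0])
        pooled.append(total / count)
    return pooled


def upsample_nearest(pred: Map, height: int, width: int) -> Map:
    """Resize a map to height x width by nearest-neighbour sampling."""
    src_h, src_w = len(pred), len(pred[0])
    return [
        [pred[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def fuse_predictions(side_outputs: List[Map], height: int, width: int) -> Map:
    """Fuse multi-scale predictions by averaging (assumed rule, see lead-in)."""
    resized = [upsample_nearest(p, height, width) for p in side_outputs]
    n = len(resized)
    return [
        [sum(r[y][x] for r in resized) / n for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # Six hypothetical side outputs at resolutions 1x1, 2x2, ..., 32x32.
    sides = [[[0.5] * (2 ** k) for _ in range(2 ** k)] for k in range(6)]
    fused = fuse_predictions(sides, 32, 32)
    print(len(fused), len(fused[0]))
```

In a trained network the fusion weights are often learned rather than fixed; the uniform average here only shows how six predictions at different scales can be brought to a common resolution and combined into one map.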