Fig. 3

DPD module diagram. The module comprises the Gaze-Step2 Transformer and the Transformer Con-Decoder, which predict the nouns and bounding boxes corresponding to the semantic roles. The gray box retains conditional spatial queries similar to those in Conditional DETR.