Fig. 2
From: MSSA: memory-driven and simplified scaled attention for enhanced image captioning

Overview of the multimodal feature extraction pipeline, which computes geometric, color, texture, edge, and frequency-domain features for regions of interest (ROIs) in images from the COCO dataset.
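
As a rough illustration of the five feature families named in the caption, the following minimal sketch computes one simple descriptor per family for a single ROI. It is not the authors' pipeline: the function name `extract_roi_features`, the box format `(x0, y0, x1, y1)`, and the specific descriptors chosen (relative area and aspect ratio, per-channel mean, grayscale standard deviation, mean gradient magnitude, non-DC spectral energy fraction) are all assumptions made for illustration only.

```python
import numpy as np


def extract_roi_features(image, box):
    """Hypothetical per-ROI feature vector (illustrative, not the paper's method).

    image: H x W x 3 float array with values in [0, 1].
    box:   (x0, y0, x1, y1) pixel coordinates; the ROI is assumed to be
           at least 2 x 2 pixels so that gradients are defined.
    """
    x0, y0, x1, y1 = box
    roi = image[y0:y1, x0:x1].astype(np.float64)
    h, w = roi.shape[:2]
    gray = roi.mean(axis=2)  # simple grayscale: average of the three channels

    # Geometric: ROI area relative to the image, and width/height aspect ratio.
    geometric = np.array([(w * h) / (image.shape[0] * image.shape[1]), w / h])

    # Color: mean value of each channel inside the ROI.
    color = roi.reshape(-1, 3).mean(axis=0)

    # Texture: standard deviation of grayscale intensity (a crude contrast proxy).
    texture = np.array([gray.std()])

    # Edge: mean gradient magnitude from finite differences.
    gy, gx = np.gradient(gray)
    edge = np.array([np.hypot(gx, gy).mean()])

    # Frequency domain: fraction of spectral energy outside the DC component.
    power = np.abs(np.fft.fft2(gray)) ** 2
    total = power.sum()
    frequency = np.array([1.0 - power[0, 0] / total if total > 0 else 0.0])

    return np.concatenate([geometric, color, texture, edge, frequency])


# Usage: a synthetic image stands in for a COCO image and its annotated box.
rng = np.random.default_rng(0)
demo_image = rng.random((64, 64, 3))
demo_features = extract_roi_features(demo_image, (8, 8, 40, 32))
```

The result is an 8-element vector (2 geometric, 3 color, 1 texture, 1 edge, 1 frequency). A real pipeline would likely use richer descriptors per family and read the ROI boxes from the COCO annotations.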