Fig. 2: Architecture of UMedPT.
From: Overcoming data scarcity in biomedical imaging with a foundational multi-task model

a, Features are extracted from an image of size H × W by a shared encoder. b, Classification heads take the neural image representation of an image and apply a single linear layer. c, A shared pixel-wise decoder processes the multi-scale feature maps and returns an embedding per pixel; segmentation heads, which follow the popular U-Net spatial decoding strategy to handle the segmentation features, generate the prediction. d, The multi-scale decoder transforms the feature maps into features for box regression and box classification, and the FCOS-based detection head generates the final prediction.
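The shared-encoder/task-head pattern of panels a–c can be sketched minimally as follows. This is an illustrative toy, not the paper's implementation: the array sizes, single-projection "encoder", and plain-NumPy linear heads are all assumptions chosen to make the shapes and the weight-sharing pattern explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D, C = 8, 8, 16, 3  # toy sizes (hypothetical): image H x W, feature dim D, C classes

# (a) Shared encoder: one set of weights reused by every task head.
W_enc = rng.standard_normal((1, D)) * 0.1

def shared_encoder(image):
    """Map each pixel to a D-dim feature; pool to a global image representation."""
    pixel_feats = np.tanh(image[..., None] @ W_enc)  # (H, W, D) per-pixel embedding
    global_feat = pixel_feats.mean(axis=(0, 1))      # (D,) image-level feature
    return pixel_feats, global_feat

# (b) Classification head: a single linear layer on the image representation.
W_cls = rng.standard_normal((D, C)) * 0.1

def classification_head(global_feat):
    return global_feat @ W_cls                       # (C,) class logits

# (c) Segmentation head: a linear layer applied to every pixel embedding,
# standing in for the U-Net-style spatial decoder.
W_seg = rng.standard_normal((D, C)) * 0.1

def segmentation_head(pixel_feats):
    return pixel_feats @ W_seg                       # (H, W, C) per-pixel logits

image = rng.random((H, W))
pixel_feats, global_feat = shared_encoder(image)
cls_logits = classification_head(global_feat)        # shape (3,)
seg_logits = segmentation_head(pixel_feats)          # shape (8, 8, 3)
```

The point of the sketch is the sharing structure: both heads consume outputs of the same encoder, so gradients from every task would update `W_enc`, while each head keeps its own small set of task-specific weights.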