Extended Data Table 3 Instructions for pretraining tasks and the corresponding output format


Here, <img> denotes an image token drawn from the VQ-GAN vocabulary and <loc> denotes a location token. For the VQA task, the instruction is the question itself from the dataset.
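As a rough illustration of the footnote above, the sketch below shows one way such discrete targets could be serialized as token strings. The <img_k>/<loc_k> naming, the 1,000-bin coordinate quantization, and the helper functions are illustrative assumptions only, not the model's actual tokenization.

```python
# Hypothetical sketch: serializing image and location targets as discrete tokens.
# Token names, bin counts, and helpers are assumptions for illustration,
# not the paper's implementation.

def image_to_tokens(codebook_indices):
    """Render VQ-GAN codebook indices as <img_k> tokens (assumed format)."""
    return " ".join(f"<img_{i}>" for i in codebook_indices)

def box_to_tokens(box, num_bins=1000):
    """Discretize a normalized (x1, y1, x2, y2) box into <loc_k> tokens
    (assumed binning scheme)."""
    return " ".join(f"<loc_{int(round(v * (num_bins - 1)))}>" for v in box)

# Example targets: a detection-style output and an image-generation-style output.
detection_target = box_to_tokens((0.12, 0.30, 0.58, 0.74))
generation_target = image_to_tokens([417, 88, 902, 13])  # truncated for brevity

print(detection_target)   # <loc_120> <loc_300> <loc_579> <loc_739>
print(generation_target)  # <img_417> <img_88> <img_902> <img_13>
```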