Table 2 Pre-training data

From: Efficient GPT-4V level multimodal large language model for deployment on edge devices

| Category | Language | Sources | Size |
| --- | --- | --- | --- |
| Image Captioning | English | COCO^47, VG^48, CC3M^49, CC12M^50, LAION-COCO^43, COYO^44, LAION-2B^43 | 410M |
| Image Captioning | Chinese | AIC^51, LAION-2B-Chinese^43, WuKong^52, Zero-Chinese^53, etc. | 110M |
| OCR+Knowledge | English | WIT^54, IDL^55, SynthText^56, SynthDoG-en^57, SynthDoG-zh^57, ArxivCap^58, etc. | 39M |
| OCR+Knowledge | Chinese | WIT^54, LAION-2B-OCR | 11M |

The pre-training data consists of image captioning and OCR data in English and Chinese. LAION-2B-OCR is generated by applying OCR tools to LAION-2B images.
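The LAION-2B-OCR construction described above (running OCR tools over LAION-2B images and keeping the extracted text) can be sketched as a simple pipeline. This is a minimal illustration, not the paper's actual implementation: the paper does not name the OCR engine, so `run_ocr` below is a hypothetical stub standing in for whatever OCR library was used.

```python
def run_ocr(image_path: str) -> str:
    """Hypothetical OCR engine (assumption): a real pipeline would call an
    OCR library here and return the text detected in the image."""
    return "sample extracted text"


def build_ocr_caption_dataset(image_paths):
    """Pair each image with its OCR transcript, keeping only images
    where the engine actually detected some text."""
    dataset = []
    for path in image_paths:
        text = run_ocr(path).strip()
        if text:  # drop images with no recoverable text
            dataset.append({"image": path, "text": text})
    return dataset


samples = build_ocr_caption_dataset(["img_000.jpg", "img_001.jpg"])
print(len(samples))
```

The filtering step matters at LAION-2B scale: most web images contain no legible text, so discarding empty OCR results keeps the derived set (11M pairs here) far smaller than the source corpus.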