Datasets

a) Qwen-VL: Visual Genome, RefCOCO, RefCOCO+, RefCOCOg
b) CogVLM: Visual7W, Flickr30K-Entities
c) Kosmos-2: GRIT

OFA

Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Unifies multimodal tasks as seq2seq; the largest model has 900M parameters.

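Text, images, and objects are discretized into a unified vocabulary:
Text is converted to subwords with BPE. Images are simply split into patches and converted to image codes via image quantization. Object labels and bounding boxes are extracted from each image, and the bounding boxes are discretized into location tokens. The unified vocabulary is the union of the text subwords, the image codes, and the object location tokens.
Box representation: coordinates are mapped to the range 1 to 1000, corresponding to 1000 location tokens in the vocabulary; one box is written as <x1><y1><x2><y2>.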
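
A minimal sketch of the box discretization described above, assuming coordinates are normalized by image width and height and then bucketed uniformly into 1000 bins. The token spelling `<loc_N>`, the helper names, and the exact rounding rule are hypothetical choices for illustration; the notes do not specify how OFA implements this step.

```python
NUM_BINS = 1000  # number of location tokens, as stated in the notes


def quantize(value, size, num_bins=NUM_BINS):
    """Map a pixel coordinate in [0, size] to an integer bin in [1, num_bins]."""
    frac = min(max(value / size, 0.0), 1.0)  # normalize and clamp to [0, 1]
    return min(int(frac * num_bins) + 1, num_bins)


def box_to_tokens(box, width, height):
    """Turn (x1, y1, x2, y2) in pixels into four location tokens."""
    x1, y1, x2, y2 = box
    bins = [
        quantize(x1, width),
        quantize(y1, height),
        quantize(x2, width),
        quantize(y2, height),
    ]
    # "<loc_N>" is an assumed spelling, not taken from the source
    return [f"<loc_{b}>" for b in bins]


def tokens_to_box(tokens, width, height, num_bins=NUM_BINS):
    """Approximate inverse: decode each token to the center of its bin."""
    bins = [int(t[len("<loc_"):-1]) for t in tokens]
    sizes = [width, height, width, height]
    return tuple((b - 0.5) / num_bins * s for b, s in zip(bins, sizes))


if __name__ == "__main__":
    tokens = box_to_tokens((320, 240, 640, 480), width=640, height=480)
    print("".join(tokens))
    print(tokens_to_box(tokens, width=640, height=480))
```

Because the box becomes four ordinary vocabulary entries, a seq2seq decoder can emit it like any other text, at the cost of a quantization error of at most one bin width per coordinate.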