[A Pure Transformer Can Also Replace CNNs in CV] Vision Transformer (ViT): A Close Reading of the Paper

Original title	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Chinese title (literal)	An image is equivalent to 16x16 words: using Transformers for large-scale image recognition
Published	October 22, 2020
Venue	ICLR 2021
Source	Google Brain
Paper	https://arxiv.org/abs/2010.11929
Code	https://github.com/google-research/vision_transformer
Video walkthrough	https://www.youtube.com/watch?v=TQ0UGjFlkuA

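Abstract

What does the paper say?

The Transformer architecture dominates NLP, but it has had limited success in CV. Previously, attention in CV was mostly used together with CNNs, or to replace individual components of a CNN while the overall structure stayed convolutional.

The paper's claim: a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.

The catch is that the model has to be pre-trained on large amounts of data. Once it is, and is then transferred to mid-sized (ImageNet) or small image recognition benchmarks, it achieves excellent results.

Well, ImageNet already feels pretty large to me.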

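1. Introduction

In NLP, what is the dominant way to use Transformers?

Pre-train on a large text corpus, then fine-tune on a smaller task-specific dataset.

Why can Transformers have so many parameters? Does performance saturate?

Transformers' computational efficiency and scalability make it possible to train very large models. As models and datasets grow, there is still no sign of performance saturating.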

Can self-attention be applied to CV? Which models currently dominate large-scale image recognition?

Yes. Two approaches have been tried:

combining CNN-like architectures with self-attention
replacing the convolutions entirely (these models have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns)

ResNet-like architectures are still the mainstream.

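Can a standard Transformer be applied directly to images? How?

Yes.

Split the image into patches.
Feed the Transformer the sequence of linear embeddings of these patches (image patches are treated the same way as tokens in NLP).
Train with supervision.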

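When a standard Transformer is trained on a mid-sized dataset such as ImageNet without strong regularization, how does its accuracy compare with ResNets of comparable size? What is the reason? Why do Transformers need large datasets to learn well?

Experiments show that its accuracy is a few percentage points below that of ResNets of comparable size.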

原因：Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

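inductive biases: essentially prior knowledge or assumptions built into the model. CNNs, for example, have two:

locality: a CNN convolves with a sliding window, which assumes that neighboring regions of an image have related features.

translation equivariance: f(g(x)) = g(f(x)), where f is the convolution and g is a translation. Translating first or convolving first gives the same result. A convolution kernel acts like a template: wherever an object sits in the image, the same input meeting the same kernel produces the same output.

With these two priors, a CNN can learn a reasonably good model from relatively little data. Transformers do not have these priors, so they need more data in order to learn them.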
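
The property f(g(x)) = g(f(x)) can be checked numerically. The following is a minimal sketch, not taken from the paper: it uses a 1D signal instead of an image, and it assumes circular (wrap-around) boundaries so that the equality is exact. With zero padding, as in most real CNNs, the equality only holds away from the borders. The helper names `circular_conv1d` and `shift` are made up for this illustration.

```python
def circular_conv1d(signal, kernel):
    """Cross-correlate `signal` with `kernel`, wrapping around at the edges."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Circularly translate `signal` by k positions to the right."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


signal = [0, 1, 3, 2, 0, 0, 5, 1]
kernel = [1, -2, 1]  # an arbitrary 3-tap filter playing the role of the "template"

conv_then_shift = shift(circular_conv1d(signal, kernel), 2)  # g(f(x))
shift_then_conv = circular_conv1d(shift(signal, 2), kernel)  # f(g(x))

# Equivariance: both orders of operations give the same list.
print(conv_then_shift == shift_then_conv)  # prints: True
```

A self-attention layer has no such constraint wired in, which is one way to read the claim that a Transformer has to learn this behavior from data.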

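How large are the datasets used to pre-train Vision Transformer (ViT)? Can training at large scale beat inductive bias?

Pre-training uses ImageNet-21k (14 million images) and JFT-300M, and ViT reaches SOTA on image recognition benchmarks. The experiments show that more data can beat inductive bias.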

2. Related Work

What pre-training tasks do BERT and GPT use?

BERT uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task

denoising self-supervised: essentially a cloze (fill-in-the-blank) task
language modeling: predicting the next word
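
To make the two objectives concrete, here is a small sketch of how training examples could be built from one sentence. It is a deliberate simplification: the masked positions are passed in by hand, whereas real BERT-style pre-training picks them at random and uses a more involved corruption scheme. The function names `cloze_example` and `next_token_examples` and the `[MASK]` placeholder string are assumptions for this illustration, not any library's API.

```python
def cloze_example(tokens, mask_positions, mask_token="[MASK]"):
    """Denoising-style example: hide some tokens, predict them from context."""
    corrupted = [mask_token if i in mask_positions else tok
                 for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return corrupted, targets


def next_token_examples(tokens):
    """Language-modeling examples: predict each token from the ones before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = ["an", "image", "is", "worth", "16x16", "words"]

corrupted, targets = cloze_example(sentence, mask_positions={1, 4})
print(corrupted)  # the sentence with two blanks
print(targets)    # position -> hidden token the model must recover

for context, target in next_token_examples(sentence)[:3]:
    print(context, "->", target)
```

The cloze model sees context on both sides of each blank, while the language model only ever sees the left context.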

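How can self-attention be applied to images?

Pixel level: the simplest option is to run self-attention over every pixel of the image. But this has quadratic cost in the number of pixels, so the sequence is far too long and a Transformer cannot be used at the pixel level on realistic image sizes.
Local neighborhoods: apply the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self-attention blocks can completely replace convolutions.
Sparse Transformers: attend only over a sparse set of positions, as an approximation to global attention.
Blocks and axes: another way to scale attention is to apply it in blocks of varying sizes, in the extreme case only along individual axes (axial attention: self-attention first along the horizontal axis, then along the vertical axis).

These specialized attention architectures all give decent results, but they need complex engineering of the operators to run efficiently on hardware accelerators.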
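
A quick back-of-the-envelope calculation shows why pixel-level global attention is impractical. The sketch below counts tokens and query-key pairs for one global self-attention layer; the 224×224 input size and the 16×16 patch size are the values discussed in these notes, and the function name `attention_pairs` is made up for this example. It ignores heads, channels, and batch size, so it is only a rough measure of cost.

```python
def attention_pairs(height, width, patch=1):
    """Return (tokens, query-key pairs) for one global self-attention layer.

    patch=1 means every pixel is a token; patch=16 means 16x16 patches.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens


for label, patch in [("pixels", 1), ("16x16 patches", 16)]:
    tokens, pairs = attention_pairs(224, 224, patch)
    print(f"{label}: {tokens} tokens, {pairs} query-key pairs")
# pixels: 50176 tokens, 2517630976 query-key pairs
# 16x16 patches: 196 tokens, 38416 query-key pairs
```

Going from pixels to 16×16 patches shortens the sequence by a factor of 256 and shrinks the attention matrix by a factor of 256² = 65536.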

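What is the work closest to this paper? How does ViT differ from it?

An ICLR 2020 paper that extracts 2×2 patches from the input image. The patches are 2×2 because the authors ran their experiments on CIFAR-10 (32×32 images), where 16×16 patches would be too large. Self-attention is then applied over the 2×2 patches.

Difference: ViT demonstrates that large scale pre-training makes vanilla transformers competitive with (or even better than) state-of-the-art CNNs. With 2×2 patches the earlier model can only handle small-resolution (32×32) images, whereas ViT also works well on 224×224 images.

The key is still big data plus big patches.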

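What can be done by combining convolutional neural networks (CNNs) with forms of self-attention?

Augmenting feature maps, or further processing the output of a CNN using self-attention.

What does image GPT (iGPT) do?

It applies Transformers to image pixels after reducing image resolution and color space. The model is trained in an unsupervised fashion as a generative model, and the resulting representation is then fine-tuned or probed linearly (used directly as a feature extractor) for image classification, reaching 72% accuracy on ImageNet.

Can more data improve performance for CNN architectures?

Yes, for example through the use of additional data sources such as ImageNet-21k and JFT-300M.

However, the experiments show that with pre-training on these larger datasets and evaluation on ImageNet-1k, ViT does better than ResNet-based models.

This paper mainly trains Transformers, rather than ResNet-based models, on ImageNet-21k and JFT-300M.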

3. Method

Why does the ViT model follow the original Transformer as closely as possible?

An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.

Describe the ViT model

The standard Transformer receives as input a 1D sequence of token embeddings.

The input image \(\mathbf{x}\in\mathbb{R}^{H\times W\times C}\) is reshaped into a sequence of flattened 2D patches \(\mathbf{x}_p\in\mathbb{R}^{N\times(P^2\cdot C)}\).

Here \((H, W)\) is the resolution of the original image, \(C\) is the number of channels, \((P, P)\) is the resolution of each image patch, and \(N = HW/P^2\) is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.
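
The reshape from \(H\times W\times C\) to \(N\times(P^2\cdot C)\) can be sketched in plain Python. This is an illustrative version on nested lists, assuming row-major patch order and an image size divisible by \(P\). It is not the code from the official repository, which works on arrays, and `patchify` is a name chosen for this example.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into N flattened patches.

    Returns N = (H * W) / patch**2 vectors, each of length patch * patch * C,
    with patches visited left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


# Toy input: a 4 x 4 image with C = 3 channels; each pixel stores 10*row + col.
H, W, C, P = 4, 4, 3, 2
image = [[[10 * r + c] * C for c in range(W)] for r in range(H)]

x_p = patchify(image, P)
print(len(x_p), len(x_p[0]))  # N and P^2 * C; prints: 4 12
```

For a 224×224 RGB image with \(P = 16\), the same formula gives \(N = 196\) patches, each a vector of length \(16\cdot 16\cdot 3 = 768\).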

...