
{%hackmd SybccZ6XD %}

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

tags: paper

ABSTRACT

  • Image classification tasks
    • Previous work: CNN-based models
    • This paper: a pure Transformer
  • How it is done
    • A Transformer applied directly to sequences of image patches
  • Experiment
    • Pre-train and transfer to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.)

INTRODUCTION

  • Step
    • split an image into patches
    • embeddings
    • Transformer
    • MLP
  • When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies, a few percentage points below ResNets of comparable size
    • Transformers lack some of the inductive biases of CNNs, such as translation equivariance and locality. (Note to self: what if self-attention were applied to each patch independently?)
  • When trained on larger datasets, the results are excellent
    • large-scale training trumps inductive bias.

METHOD

The model design follows the original Transformer (Vaswani et al., 2017) as closely as possible.

VISION TRANSFORMER (VIT)

Reshape and Unroll

The Transformer takes a 1D sequence as input, so the image $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2 C)}$, where $N = HW/P^2$; a sketch follows below.
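A minimal sketch of this reshape, assuming a batch of shape (B, C, H, W) and a patch size P that divides H and W (the function name and shapes are illustrative, not taken from this repo):

```python
import torch

def image_to_patches(x: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Reshape (B, C, H, W) images into a sequence of flattened patches (B, N, P*P*C)."""
    B, C, H, W = x.shape
    P = patch_size
    x = x.reshape(B, C, H // P, P, W // P, P)      # split H and W into patch grids
    x = x.permute(0, 2, 4, 1, 3, 5)                # (B, H/P, W/P, C, P, P)
    return x.reshape(B, (H // P) * (W // P), C * P * P)

patches = image_to_patches(torch.randn(2, 3, 224, 224), patch_size=16)
print(patches.shape)  # torch.Size([2, 196, 768]): N = 196 patches of dim P^2*C = 768
```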

Linear Projection and Embedding

$z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \ldots;\, x_p^N E] + E_{pos}$, where $E \in \mathbb{R}^{(P^2 C) \times D}$, $E_{pos} \in \mathbb{R}^{(N+1) \times D}$, and $N = 9$ in this example.

Embedding code

self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))  # learnable position embedding, includes the class-token slot
x += self.pos_embedding[:, :(n + 1)]  # add position embeddings to the length-(n + 1) token sequence

class token

A learnable classification token, similar to BERT's [class] token, is prepended to the patch sequence (zero-initialized); see the sketch below.
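Putting the projection, class token, and position embedding together, a hedged sketch of how $z_0$ could be built (module and argument names are assumptions, not this repo's actual code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_dim: int, dim: int, num_patches: int):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                                    # E: (P^2*C) x D
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                    # class token, zero-initialized as in BERT
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))  # E_pos

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        B, N, _ = patches.shape
        x = self.proj(patches)                      # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)      # (B, 1, D)
        x = torch.cat([cls, x], dim=1)              # (B, N + 1, D)
        return x + self.pos_embedding[:, : N + 1]   # z_0
```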

Transformer Encoder

Alternating multi-head self-attention and MLP blocks, with LayerNorm applied before every block and residual connections after every block.

MLP

Two fully connected layers with a GELU activation function; see the encoder-block sketch below.
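A minimal sketch of one encoder block following the paper's description (LayerNorm before each block, residual connections after, and a two-layer MLP with GELU); the dropout and dimension names are assumptions:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim: int, heads: int, mlp_dim: int, dropout: float = 0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(            # fully connected -> GELU -> fully connected
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # multi-head self-attention + residual
        x = x + self.mlp(self.norm2(x))                    # MLP + residual
        return x
```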

FINE-TUNING AND HIGHER RESOLUTION

Pre-train on large datasets, then fine-tune to smaller downstream tasks. Steps:

  • Pre-train
    • prediction head: an MLP with one hidden layer
  • Remove the pre-trained prediction head
  • Attach a zero-initialized D x K feedforward layer, where K is the number of downstream classes (see the sketch below)
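A hedged sketch of the head swap, assuming a hypothetical `vit` model whose classification head is stored in an attribute named `head`:

```python
import torch.nn as nn

def replace_head(vit: nn.Module, dim: int, num_classes: int) -> nn.Module:
    """Replace the pre-training head with a zero-initialized D x K feedforward layer."""
    head = nn.Linear(dim, num_classes)   # K = num_classes for the downstream task
    nn.init.zeros_(head.weight)
    nn.init.zeros_(head.bias)
    vit.head = head                      # attribute name is an assumption; adapt to the model
    return vit
```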

EXPERIMENTS

SETUP

==Datasets.==

Pre-train

  • ILSVRC-2012 ImageNet dataset: 1k classes and 1.3M images
  • ImageNet-21k: 21k classes and 14M images
  • JFT: 18k classes and 303M high resolution images

Benchmark tasks (preprocessed as in Big Transfer (BiT): General Visual Representation Learning):

  • ImageNet on the original validation labels and the cleaned-up ReaL labels
  • CIFAR-10/100
  • Oxford-IIIT Pets
  • Oxford Flowers-102

19-task VTAB classification suite:

  • Natural: tasks like the above, Pets, CIFAR
  • Specialized: medical and satellite imagery
  • Structured: tasks that require geometric understanding, like localization

==Model Variants.==

  • Layers: number of encoder blocks
  • Hidden size D: output dimension of the linear projection
  • MLP size: hidden dimension of the MLP
  • Heads: number of heads in Multi-Head Attention
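For reference, the three variant configurations reported in Table 1 of the paper, written as a small Python dict (a convenience sketch, not this repo's config format):

```python
# Layers / Hidden size D / MLP size / Heads from Table 1 of the ViT paper
VIT_VARIANTS = {
    "ViT-Base":  {"layers": 12, "hidden_dim": 768,  "mlp_dim": 3072, "heads": 12},
    "ViT-Large": {"layers": 24, "hidden_dim": 1024, "mlp_dim": 4096, "heads": 16},
    "ViT-Huge":  {"layers": 32, "hidden_dim": 1280, "mlp_dim": 5120, "heads": 16},
}
```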

==Training & Fine-tuning.==

Optimization: Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$, batch size 4096, and a high weight decay of 0.1

Fine-tuning: SGD with momentum, batch size 512
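A minimal sketch of the two optimizer setups in PyTorch; the learning rates and the placeholder `model` are illustrative assumptions (the paper uses a warmup-and-decay schedule, and batch size is handled by the data loader):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 1000)  # placeholder module standing in for the ViT

# Pre-training: Adam with beta1 = 0.9, beta2 = 0.999 and a high weight decay of 0.1
pretrain_opt = torch.optim.Adam(model.parameters(), lr=1e-3,
                                betas=(0.9, 0.999), weight_decay=0.1)

# Fine-tuning: SGD with momentum (learning rate is illustrative)
finetune_opt = torch.optim.SGD(model.parameters(), lr=3e-2, momentum=0.9)
```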

==Metrics.==

Results are reported as:

  • few-shot accuracy: obtained by solving a regularized least-squares regression problem that maps the (frozen) representations of a subset of training images to $\{-1, 1\}^K$ target vectors (see the sketch after the note below)
  • fine-tuning accuracy: accuracy after fine-tuning on the respective dataset

:::warning
Note (shot): n-shot means each class has n samples.
:::
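A hedged sketch of the few-shot linear probe: closed-form ridge regression from frozen features to one-vs-rest $\pm 1$ targets (the regularization strength `lam` and the function name are assumptions):

```python
import torch

def few_shot_probe(feats: torch.Tensor, labels: torch.Tensor, num_classes: int, lam: float = 1.0):
    """Regularized least squares mapping frozen features (M, D) to {-1, 1}^K target vectors."""
    M, D = feats.shape
    targets = -torch.ones(M, num_classes)
    targets[torch.arange(M), labels] = 1.0
    # Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y
    W = torch.linalg.solve(feats.T @ feats + lam * torch.eye(D), feats.T @ targets)
    return W  # classify test features with (test_feats @ W).argmax(dim=1)
```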

COMPARISON TO STATE OF THE ART

PRE-TRAINING DATA REQUIREMENTS

Comparison between pre-training on smaller and larger datasets:

  • when pre-trained on the smallest dataset (ImageNet), large ViT models perform worse than BiT ResNets
  • large ViT models shine when pre-trained on larger datasets

    (In the referenced figure, the number after the model name denotes the patch size P.)

Comparison between different sizes of pre-training subsets:

  • use early stopping, and report the best validation accuracy achieved during training

SCALING STUDY

INSPECTING VISION TRANSFORMER

SELF-SUPERVISION

My result

| Accuracy | Training time |
| -------- | ------------- |
| 0.9874   | 2:20:00       |
