
{%hackmd SybccZ6XD %}

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

tags: paper

ABSTRACT

  • Image classification tasks
    • Previous work: CNN-based models
    • This paper: a pure Transformer
  • How it is done
    • A Transformer applied directly to sequences of image patches
  • Experiment
    • Pre-train and transfer to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.)

INTRODUCTION

  • Step
    • split an image into patches
    • embeddings
    • Transformer
    • MLP
  • When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies, a few percentage points below ResNets of comparable size
    • Transformers lack some of the inductive biases of CNNs, such as translation equivariance and locality. (Note to self: what if self-attention were applied to each patch independently?)
  • When trained on larger datasets, the results are excellent
    • large-scale training trumps inductive bias.

METHOD

The model design follows the original Transformer (Vaswani et al., 2017) as closely as possible.

VISION TRANSFORMER (VIT)

Reshape and Unroll

The Transformer takes a 1D sequence as input, so the image $x \in \mathbb{R}^{H \times W \times C}$ is reshaped into a sequence of flattened patches $x_p \in \mathbb{R}^{N \times (P^2 C)}$, where $N = HW/P^2$; a sketch follows below.
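A minimal sketch of this reshape, assuming a batch of shape (B, C, H, W) and a patch size P that divides H and W (the function name and shapes are illustrative, not taken from this repo):

```python
import torch

def image_to_patches(x: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Reshape (B, C, H, W) images into a sequence of flattened patches (B, N, P*P*C)."""
    B, C, H, W = x.shape
    P = patch_size
    x = x.reshape(B, C, H // P, P, W // P, P)      # split H and W into patch grids
    x = x.permute(0, 2, 4, 1, 3, 5)                # (B, H/P, W/P, C, P, P)
    return x.reshape(B, (H // P) * (W // P), C * P * P)

patches = image_to_patches(torch.randn(2, 3, 224, 224), patch_size=16)
print(patches.shape)  # torch.Size([2, 196, 768]): N = 196 patches of dim P^2*C = 768
```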

Linear Projection and Embedding

$z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \ldots;\, x_p^N E] + E_{pos}$, where $E \in \mathbb{R}^{(P^2 C) \times D}$, $E_{pos} \in \mathbb{R}^{(N+1) \times D}$, and $N = 9$ in this example.

Embedding code

self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))  # learnable position embedding, includes the class-token slot
x += self.pos_embedding[:, :(n + 1)]  # add position embeddings to the length-(n + 1) token sequence

class token

A learnable classification token, similar to BERT's [class] token, is prepended to the patch sequence (zero-initialized); see the sketch below.
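Putting the projection, class token, and position embedding together, a hedged sketch of how $z_0$ could be built (module and argument names are assumptions, not this repo's actual code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_dim: int, dim: int, num_patches: int):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                                    # E: (P^2*C) x D
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                    # class token, zero-initialized as in BERT
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))  # E_pos

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        B, N, _ = patches.shape
        x = self.proj(patches)                      # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)      # (B, 1, D)
        x = torch.cat([cls, x], dim=1)              # (B, N + 1, D)
        return x + self.pos_embedding[:, : N + 1]   # z_0
```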

Transformer Encoder

Alternating multi-head self-attention and MLP blocks, with LayerNorm applied before every block and residual connections after every block.

MLP

Two fully connected layers with a GELU activation function; see the encoder-block sketch below.
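A minimal sketch of one encoder block following the paper's description (LayerNorm before each block, residual connections after, and a two-layer MLP with GELU); the dropout and dimension names are assumptions:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim: int, heads: int, mlp_dim: int, dropout: float = 0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(            # fully connected -> GELU -> fully connected
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # multi-head self-attention + residual
        x = x + self.mlp(self.norm2(x))                    # MLP + residual
        return x
```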

FINE-TUNING AND HIGHER RESOLUTION

Pre-train on large datasets, then fine-tune to smaller downstream tasks. Steps:

  • Pre-train
    • prediction head: an MLP with one hidden layer
  • Remove the pre-trained prediction head
  • Attach a zero-initialized D x K feedforward layer, where K is the number of downstream classes (see the sketch below)
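A hedged sketch of the head swap, assuming a hypothetical `vit` model whose classification head is stored in an attribute named `head`:

```python
import torch.nn as nn

def replace_head(vit: nn.Module, dim: int, num_classes: int) -> nn.Module:
    """Replace the pre-training head with a zero-initialized D x K feedforward layer."""
    head = nn.Linear(dim, num_classes)   # K = num_classes for the downstream task
    nn.init.zeros_(head.weight)
    nn.init.zeros_(head.bias)
    vit.head = head                      # attribute name is an assumption; adapt to the model
    return vit
```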

EXPERIMENTS

SETUP

==Datasets.==

Pre-train

  • ILSVRC-2012 ImageNet dataset: 1k classes and 1.3M images
  • ImageNet-21k: 21k classes and 14M images
  • JFT: 18k classes and 303M high resolution images

Benchmark tasks (preprocessed as in Big Transfer (BiT): General Visual Representation Learning):

  • ImageNet on the original validation labels and the cleaned-up ReaL labels
  • CIFAR-10/100
  • Oxford-IIIT Pets
  • Oxford Flowers-102

19-task VTAB classification suite:

  • Natural: tasks like the above, Pets, CIFAR
  • Specialized: medical and satellite imagery
  • Structured: tasks that require geometric understanding, like localization

==Model Variants.==

  • Layers: number of encoder blocks
  • Hidden size D: output dimension of the linear projection
  • MLP size: hidden dimension of the MLP
  • Heads: number of heads in Multi-Head Attention
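For reference, the three variant configurations reported in Table 1 of the paper, written as a small Python dict (a convenience sketch, not this repo's config format):

```python
# Layers / Hidden size D / MLP size / Heads from Table 1 of the ViT paper
VIT_VARIANTS = {
    "ViT-Base":  {"layers": 12, "hidden_dim": 768,  "mlp_dim": 3072, "heads": 12},
    "ViT-Large": {"layers": 24, "hidden_dim": 1024, "mlp_dim": 4096, "heads": 16},
    "ViT-Huge":  {"layers": 32, "hidden_dim": 1280, "mlp_dim": 5120, "heads": 16},
}
```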

==Training & Fine-tuning.==

Optimization: Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$, batch size 4096, and a high weight decay of 0.1

Fine-tuning: SGD with momentum, batch size 512
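A minimal sketch of the two optimizer setups in PyTorch; the learning rates and the placeholder `model` are illustrative assumptions (the paper uses a warmup-and-decay schedule, and batch size is handled by the data loader):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 1000)  # placeholder module standing in for the ViT

# Pre-training: Adam with beta1 = 0.9, beta2 = 0.999 and a high weight decay of 0.1
pretrain_opt = torch.optim.Adam(model.parameters(), lr=1e-3,
                                betas=(0.9, 0.999), weight_decay=0.1)

# Fine-tuning: SGD with momentum (learning rate is illustrative)
finetune_opt = torch.optim.SGD(model.parameters(), lr=3e-2, momentum=0.9)
```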

==Metrics.==

Results are reported as:

  • few-shot accuracy: obtained by solving a regularized least-squares regression problem that maps the (frozen) representations of a subset of training images to $\{-1, 1\}^K$ target vectors (see the sketch after the note below)
  • fine-tuning accuracy: accuracy after fine-tuning on the respective dataset

:::warning
Note (shot): n-shot means each class has n samples.
:::
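A hedged sketch of the few-shot linear probe: closed-form ridge regression from frozen features to one-vs-rest $\pm 1$ targets (the regularization strength `lam` and the function name are assumptions):

```python
import torch

def few_shot_probe(feats: torch.Tensor, labels: torch.Tensor, num_classes: int, lam: float = 1.0):
    """Regularized least squares mapping frozen features (M, D) to {-1, 1}^K target vectors."""
    M, D = feats.shape
    targets = -torch.ones(M, num_classes)
    targets[torch.arange(M), labels] = 1.0
    # Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y
    W = torch.linalg.solve(feats.T @ feats + lam * torch.eye(D), feats.T @ targets)
    return W  # classify test features with (test_feats @ W).argmax(dim=1)
```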

COMPARISON TO STATE OF THE ART

PRE-TRAINING DATA REQUIREMENTS

Comparison between pre-training on smaller and larger datasets:

  • when pre-trained on the smallest dataset (ImageNet), large ViT models perform worse than BiT ResNets
  • large ViT models shine when pre-trained on larger datasets

    (In the referenced figure, the number after the model name denotes the patch size P.)

Comparison between different sizes of pre-training subsets:

  • use early stopping, and report the best validation accuracy achieved during training

SCALING STUDY

INSPECTING VISION TRANSFORMER

SELF-SUPERVISION

My result

| Accuracy | Training time |
| -------- | ------------- |
| 0.9874   | 2:20:00       |
