{%hackmd SybccZ6XD %}
- Image classification tasks
  - Previous: CNN
  - This paper: pure Transformer
How to do that
- Transformer

Experiment
- Pre-train and transfer to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.)
- Steps
  - split an image into patches
  - linearly project each patch to an embedding
  - Transformer encoder
  - MLP head for classification
- When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies, a few percentage points below ResNets of comparable size
- Transformers lack some of the inductive biases of CNNs, such as translation equivariance and locality (what if each patch did self-attention on its own?)
- When trained on larger datasets, ViT achieves excellent results
  - large-scale training trumps inductive bias
The model follows the original Transformer (Vaswani et al., 2017) as closely as possible.
The input of a Transformer is a sequence, so the image is reshaped into a sequence of patches.
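A minimal NumPy sketch of this reshape, assuming 224×224 RGB images and 16×16 patches (the function name `image_to_patches` is mine, not from the paper):

```python
import numpy as np

def image_to_patches(img, patch_size=16):
    """Split (B, C, H, W) images into a sequence of flattened patches
    of shape (B, N, P*P*C), where N = (H/P) * (W/P)."""
    B, C, H, W = img.shape
    P = patch_size
    # (B, C, H/P, P, W/P, P) -> (B, H/P, W/P, C, P, P) -> (B, N, P*P*C)
    x = img.reshape(B, C, H // P, P, W // P, P)
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(B, (H // P) * (W // P), C * P * P)
```

For a 224×224 RGB image with P = 16, this yields N = 196 patches, each a 768-dimensional vector, which are then linearly projected to the model's hidden size.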
Embedding code

```python
self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
x += self.pos_embedding[:, :(n + 1)]
```
- class token: similar to BERT's [CLS] token (initialized to zero)
- MLP head: fully connected layers and an activation function
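The class-token and position-embedding steps above can be sketched as follows (a NumPy illustration; parameter initialization and the helper name `add_cls_and_pos` are my choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 768

# "Learnable" parameters, initialized here only for illustration:
cls_token = np.zeros((1, 1, dim))                         # class token (zero-init, like BERT)
pos_embedding = rng.standard_normal((1, num_patches + 1, dim))

def add_cls_and_pos(x):
    """Prepend the class token to the patch-embedding sequence and
    add position embeddings. x: (B, num_patches, dim)."""
    B, n, _ = x.shape
    cls = np.broadcast_to(cls_token, (B, 1, dim))
    x = np.concatenate([cls, x], axis=1)                  # (B, n + 1, dim)
    return x + pos_embedding[:, : n + 1]
```

The output of this step, shape (B, 197, 768) for ViT-Base, is what the Transformer encoder consumes; the final hidden state at the class-token position feeds the MLP head.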
Pre-train on large datasets, then fine-tune to smaller tasks.
==Datasets.==
Pre-train
- ILSVRC-2012 ImageNet dataset: 1k classes and 1.3M images
- ImageNet-21k: 21k classes and 14M images
- JFT: 18k classes and 303M high resolution images
Benchmark tasks (preprocessed as in Big Transfer (BiT): General Visual Representation Learning):
- ImageNet on the original validation labels and the cleaned-up ReaL labels
- CIFAR-10/100
- Oxford-IIIT Pets
- Oxford Flowers-102
19-task VTAB classification suite:
- Natural: tasks like the above, e.g. Pets, CIFAR
- Specialized: medical and satellite imagery
- Structured: tasks that require geometric understanding, such as localization
==Model Variants.==
- Layers: number of encoder blocks
- Hidden size D: output dimension of the linear projection
- MLP size: hidden dimension of the MLP
- Heads: number of heads in Multi-Head Attention
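The three variants in the paper's Table 1 can be captured in a small config dict (the dict itself is my sketch; the numbers are the paper's Base/Large/Huge sizes):

```python
# ViT model variants (Layers / Hidden size D / MLP size / Heads),
# following Table 1 of the ViT paper.
VIT_CONFIGS = {
    "ViT-Base":  {"layers": 12, "hidden_dim": 768,  "mlp_dim": 3072, "heads": 12},
    "ViT-Large": {"layers": 24, "hidden_dim": 1024, "mlp_dim": 4096, "heads": 16},
    "ViT-Huge":  {"layers": 32, "hidden_dim": 1280, "mlp_dim": 5120, "heads": 16},
}
```

Note that MLP size is 4× the hidden size in every variant, matching the original Transformer convention.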
==Training & Fine-tuning.==
Optimization
- Pre-training: Adam
- Fine-tuning: SGD with momentum
==Metrics.==
results
- few-shot accuracy: obtained by solving a regularized least-squares regression problem that maps the (frozen) representations of a subset of training images to $\{-1, 1\}^K$ target vectors
- fine-tuning accuracy: accuracy after fine-tuning the model on the respective dataset
:::warning
Note (shot): n-shot means each class has n samples.
:::
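The few-shot metric above has a closed-form solution; a NumPy sketch (the function name `few_shot_eval` and the regularization value are my assumptions):

```python
import numpy as np

def few_shot_eval(train_feats, train_labels, test_feats, test_labels,
                  num_classes, lam=1e-3):
    """Regularized least squares mapping frozen features to {-1, +1}^K
    one-vs-rest targets; prediction is the argmax over class scores."""
    N, D = train_feats.shape
    # Build {-1, +1} target vectors: +1 for the true class, -1 elsewhere.
    Y = -np.ones((N, num_classes))
    Y[np.arange(N), train_labels] = 1.0
    # Ridge solution: W = (X^T X + lam * I)^{-1} X^T Y
    W = np.linalg.solve(train_feats.T @ train_feats + lam * np.eye(D),
                        train_feats.T @ Y)
    preds = (test_feats @ W).argmax(axis=1)
    return (preds == test_labels).mean()
```

Because the backbone stays frozen and the regression is closed-form, this metric is much cheaper to compute than full fine-tuning accuracy.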
Comparison of pre-training on smaller vs. larger datasets:
- large ViT models perform worse than BiT when pre-trained on smaller datasets
- large ViT models shine when pre-trained on larger datasets

The number in the figure above denotes P
Comparison between different sizes of the training subset:

| acc | time |
| --- | --- |
| 0.9874 | 2:20:00 |