The paper proposes two architectural variants of CAST: one for image classification and one for segmentation. Both architectures can be pre-trained with self-supervised learning. In the paper, we use the MoCo-v3 framework for all self-supervised learning experiments.
We provide bash scripts for running the self-supervised experiments. By default, we use CAST-S. You can use a larger model, e.g. CAST-B, by replacing `-a cast_small` with `-a cast_base` in the bash scripts. A sketch of the kind of command these scripts wrap appears after the list below.
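For example, the substitution can be made in place with `sed`; this is just a sketch, using one of the scripts below as the target:

```bash
# Sketch: switch the backbone from CAST-S to CAST-B by swapping the
# architecture flag inside a training script (pick the one you run).
sed -i 's/-a cast_small/-a cast_base/' scripts/moco/train_imagenet1k_cast.sh
```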
- Self-supervised learning of CAST on ImageNet-1K:
> bash scripts/moco/train_imagenet1k_cast.sh
- Self-supervised learning of CAST on ImageNet-100:
> bash scripts/moco/train_imagenet100_cast.sh
- Self-supervised learning of ViT on ImageNet-1K:
> bash scripts/moco/train_imagenet1k.sh
- Self-supervised learning of ViT on ImageNet-100:
> bash scripts/moco/train_imagenet100.sh
- In the paper, we ablate the efficacy of our `Graph Pooling` module by replacing it with the `Token Merging` module. Both models use `superpixel` tokens. Run the following bash script to reproduce our ablation study of the `Token Merging` module on ImageNet-100 (see the diff sketch after this list):
> bash scripts/moco/train_imagenet100_tome.sh
- Self-supervised learning of CAST on COCO:
> bash scripts/moco/train_coco_cast.sh
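The scripts above wrap a MoCo-v3-style training launch. As a rough sketch of the kind of command inside them, assuming the standard MoCo-v3 entry point `main_moco.py` and its usual flags (the hyperparameter values here are placeholders, not the paper's settings; consult the scripts for the real ones):

```bash
# Sketch only: a MoCo-v3-style launch, assuming the standard
# main_moco.py entry point; the values below are placeholders, and
# the actual settings live in the scripts under scripts/moco/.
python main_moco.py \
  -a cast_small -b 1024 \
  --epochs=100 --warmup-epochs=10 \
  --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  /path/to/imagenet
```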
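For the `Token Merging` ablation, a quick way to see exactly how the ablation run differs from the CAST baseline is to diff the two ImageNet-100 scripts:

```bash
# Sketch: compare the Token Merging ablation script against the CAST
# baseline to see which settings (e.g. the pooling module) differ.
diff scripts/moco/train_imagenet100_cast.sh scripts/moco/train_imagenet100_tome.sh
```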